Skip to content
This repository was archived by the owner on Aug 11, 2021. It is now read-only.

New API proposal #53

Open
Gozala opened this issue Mar 25, 2019 · 15 comments
Open

New API proposal #53

Gozala opened this issue Mar 25, 2019 · 15 comments

Comments

@Gozala
Copy link

Gozala commented Mar 25, 2019

I've being asked to provide feedback on API changes / proposals & sadly that usually means I go and challenge peoples efforts. I think it might be better / only fare to propose API myself and let others challenge it instead. So here is what IMO would be a best API:

Using flow syntax as it removes ambiguity

I would also like to use this opportunity to challenge naming conventions as I have following issues with it:

  • serialize is awfully long and easy to misspell, encode is shorter and lot harder to get wrong.
  • deserialize is even longer, decode isn't.
  • format - Is the way in which something is arranged or set out, in other words it corresponds to how and not what. codec on the other hands is what that knows how to encode / decode to a specific format. There for I would argue that codec is much better term to use as it represents implementation that can encode data into binary of specific format and decode binary in that format back to the data. It also make more sense for codec to provide operations named encode / decode than any other term.

We'll use opaque type alias BinaryEncoded<a> to refer to binary encoded data of type a.

export opaque type BinaryEncoded<data>:Uint8Array = Uint8Array

We'll also use CID<a> interface to refer to CIDs for data of type a.

export interface CID<a> {
  toString(): string;
  toBaseEncodedString(base?: string): string;
  buffer: BinaryEncoded<a>;
  equals(mixed): boolean;
}

With that out of the way I would propose that valid IPLD Codec be an implementation that complies with a IPLDCodec<a> interface as defined below:

interface IPLDCodec<a> {
  // https://github.com/multiformats/multicodec/blob/master/table.csv
  format:Multihash;
  defaultHashAlgorithm:HashAlgorithm;

  encode(a):BinaryEncoded<a>|Promise<BinaryEncoded<a>>;
  decode(BinaryEncoded<a>):a|Promise<a>;
  
   links <b>(BinaryEncoded<a>):AsyncIterable<CID<b>>;
   paths <b>(BinaryEncoded<a>):AsyncIterable<string>;
   get <b>(BinaryEncoded<a>, path):IPLDEntry<b>|Promise<IPLDEntry<b>>;
}

type IPLDEntry<b> =
  | IPLDInlineEntry<b>
  | IPLDLinkEntry<b>

type IPLDLinkEntry<b> = { type:"link", target:CID<b>, path:string }
type IPLDInlineEntry<b> = { type:"inline", data:b }

Additional notes:

  1. I have removed cid function, I am convinced that it's a historical artifact. All the implementations I've seen use provided hashAlgoritm (or default if not provided) to create multihash of the encoded data, and then use passed version & multicodec format to instantiate CID.

    In other words I do not see how codec could do anything else, maybe use different serialization method ? Seems unlikely, but if so maybe instead implementation should be required to provide that instead. Another thing I can imagine is a different way to hash ? If so, again should probably codec should just be required to provide optional method for that otherwise cid is method requirement is nothing but a way to introduce bugs.

  2. Anything returning Promise could also return value that it would resolve to. Consumer still can await without any impact & is also free to avoid await when relevant. From implementer perspective there is no need to wrap anything in promise.

  3. This incorporates links API proposed The Next API #52

  4. tree is replaced by paths and is scoped to immediate paths, meaning it's not codec implement's job to follow links etc... Generalized tree traversal can be implemented outside and isn't codec implementer concern.

  5. get(blob, path) returns (eventual) "entry" that is either some inline data that was encoded into the blob, or a link with-in the other IPLD node. Note that link has path that corresponds to path to the data with-in that node link is targeting.

  6. Different value for entry type is used ("link", "inline") that enables type checkers (flow, typescript) to refine type with in the union and in general makes it obvious how one figures what kind of entry is back (what if data is null ? what if for some reason returned entry also has being passed a target feild ?)

  7. @mikeal was suggesting in The Next API #52 to embrace IPLD Block API which is I'm all up for, however I don't think it should be made codec implementer concern. API proposed in The Next API #52 can trivially be layered on top of proposed API and be agnostic of codec implementation.

Concerns

I have brought up in the other thread ipld/ipld#64 (comment) the fact that support for passing parameters should be a fundamental design concern to ensure that things like IPLD Explorer can be just as useful with parametric IPLD Nodes. That however (at least as proposed) conflicts with links(a) which assumes links can be revealed without any parameters. For that reason I would prefer to omit links and replace paths with something like list (Edit: After thinking bit more I don't think list as proposed below is even going help much, I'll get back on this once I'll have more time to think through):

--- a/Users/gozala/Desktop/before.js
+++ b/Users/gozala/Desktop/after.js
@@ -6,8 +6,7 @@ interface IPLDCodec<a> {
   encode(a): BinaryEncoded<a> | Promise<BinaryEncoded<a>>;
   decode(BinaryEncoded<a>): a | Promise<a>;
 
-  links<b>(BinaryEncoded<a>): AsyncIterable<CID<b>>;
-  paths<b>(BinaryEncoded<a>): AsyncIterable<string>;
+  list<b>(BinaryEncoded<a>): AsyncIterable<{ isLink: boolean, path: string }>;
   get<b>(BinaryEncoded<a>, path): IPLDEntry<b> | Promise<IPLDEntry<b>>;
 }
@rvagg
Copy link
Member

rvagg commented Mar 26, 2019

This seems to be roughly the same as @mikeal's proposal except for naming changes and having everything as a static method exposed by the codec/format.

I'd like to see some use-cases from @mikeal but I'm suspecting that the ergonimics of passing around a Reader instance might be much nicer than having to have both the opaque BinaryEncoded<a> as well as the format/codec itself. Maybe a better name than Reader could be found though because it's essentially making a BinaryEncodedWrapper.

function doSomethingWithBlock(blockReader) vs function doSomethingWithBlock(codec, block) - the two items are (presumably) inseparable, one doesn't make sense without the other, so tying them together might be both safer and make it less clunky to use.

Regarding naming: I agree on codec (although the distinction seems less obvious than you suggest, they're all referring to formats still) and on encode / decode but mainly because I'm a non-American English speaker and the z's always irk me. I think @mikeal's remaining makes more sense than path in your IPLDLinkEntry<b>, however. It needs little or no clarification.

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

This seems to be roughly the same as @mikeal's proposal except for naming changes and having everything as a static method exposed by the codec/format.

Yes, except this does not operates on IPLDBlock which was one of the major changes in there

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

I'd like to see some use-cases from @mikeal but I'm suspecting that the ergonimics of passing around a Reader instance might be much nicer than having to have both the opaque BinaryEncoded as well as the format/codec itself. Maybe a better name than Reader could be found though because it's essentially making a BinaryEncodedWrapper.

I am sure there are cases where passing something like Reader is both convenient and necessary, that however doesn’t mean it should be part of core API. In fact making that separate layer would allow generelaize Reader freeing codec implemetera from dealing with it and potential bugs

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

I think @mikeal's remaining makes more sense than path in your IPLDLinkEntry<b>, however. It needs little or no clarification.

Depends on the context IMO. If you think about link target node + path makes a lot of sense. Just like in URL host + path makes sense.

If you think about trying to resolve something but not quite making it and there for having remaining piece left probably remaining makes more sense. I don’t like that analogy however, I think it’s awkward way to think about it

I should have made that more clear though

@vmx
Copy link
Member

vmx commented Mar 26, 2019

I prefer "remaining", as "path" to me is to ambiguous it's not clear to me if it's the path to the CID or the path to the value after following the CID ("path to the CID" doesn't make sense when you think about it, I would find it non-intuitive though).

I'm fine with "encoding/decoding".

On my TODO list is still a write-up about the different IPLD layers. I need to put more thought into this, but here's the rough idea. Currently we have IPLD Formats and JS-IPLD, but we need more. I like this proposal, this would be about the lowest level where there's only binary data and the IPLD Data Model. One layer on top could then be more about what @mikeal's proposal (#52) is about, where it is about Blocks, hence also about CIDs. The next layer would then be about IPLD Format agnostic APIs, you could e.g. call decode() on a Block and it will automatically use the right format etc. The top layer would then also be about storing and retrieving things.

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

I prefer "remaining", as "path" to me is to ambiguous it's not clear to me if it's the path to the CID or the path to the value after following the CID ("path to the CID" doesn't make sense when you think about it, I would find it non-intuitive though).

But it’s not the path to CID it’s path to whatever link is so if you have a link with shape {target, path} it’s pointer to a values at the given target and path.

I also would like to extend that further with inline blocks (if we get there) where link target could also be an inline block rather thatn CID. It also would be very composable with general access. Right now resolver needs to manually traverse decoded object, with all the proposed pieces you could instead return link with in the data structure and let other resolver take over {target: new Block(decodedData, "dag-json"), path: remainingPath) }

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

I think “remaining” makes sence if you’re thinking about in terms of reaching a destination. But if you think of link to something, or a poiner then “remaining” starts to sound little confusing.

Maybe {target, at} would be a more clear than either of two regardless of the way you think about it ?

@mikeal
Copy link
Contributor

mikeal commented Mar 26, 2019

serialize is awfully long and easy to misspell, encode is shorter and lot harder to get wrong.
deserialize is even longer, decode isn't.

+1, I’ll probably go and update my proposal with this as well.

format - Is the way in which something is arranged or set out, in other words it corresponds to how and not what. codec on the other hands is what that knows how to encode / decode to a specific format. There for I would argue that codec is much better term to use as it represents implementation that can encode data into binary of specific format and decode binary in that format back to the data. It also make more sense for codec to provide operations named encode / decode than any other term.

Not too long ago, we had a variety of terms being used to describe the codec, format being one of them. We also had format and codec being used to describe other things which only piled onto the confusion.

I ended up writing up our terminology in the spec repo to try and standardize the language we use.

I went with codec for consistency, as the multicodec project already had their naming locked in and I wasn’t going to be able to change it, and our “codecs” are in-fact references to registered codecs in multicodec.

I then specified the term “format” in our terminology to refer to the existing format we’re leveraging, as that was the thing we were most often getting confused in the terminology.

dag-json is a “codec” that uses the json “format.”

I’m sympathetic to your rational for wanting to see this somewhat reversed but I think it would end up causing confusion given the dependency we have on multicodec 😢

@mikeal
Copy link
Contributor

mikeal commented Mar 26, 2019

One last thing to note, the encode interface should be optional. We have codecs that we need to read but will probably never need to write with this interface (git, eth, and bitcoin blocks for instance). In fact, there’s a whole world of content addressed data that we want to be able to address but never modify or author. This is why we wrote the “Data Model” spec, to distinguish between the codecs we’re using for building all of these higher level data-structures and those that we only want to be able to read and link into.

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

One last thing to note, the encode interface should be optional. We have codecs that we need to read but will probably never need to write with this interface (git, eth, and bitcoin blocks for instance). In fact, there’s a whole world of content addressed data that we want to be able to address but never modify or author. This is why we wrote the “Data Model” spec, to distinguish between the codecs we’re using for building all of these higher level data-structures and those that we only want to be able to read and link into.

That's fare, but should not such codec just throw exception from encode ?

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

dag-json is a “codec” that uses the json “format.”

Wait is this a desired terminology or undesired ? I feel like this exactly the right terminology which I see as aligned with proposed naming no ?

@Gozala
Copy link
Author

Gozala commented Mar 26, 2019

I went with codec for consistency, as the multicodec project already had their naming locked in and I wasn’t going to be able to change it, and our “codecs” are in-fact references to registered codecs in multicodec.

I'm failing to see conflict here.

Codec encodes data into specific format. Multicodec tags encoded data with corresponding codec identifier to describe it's format.

I think you register codecs with corresponding codec identifier.

Could just terminology be refined to use "codec identifier" where codec is being used now and "codec" used in place where format is today ?

@mikeal
Copy link
Contributor

mikeal commented Mar 26, 2019

Wait is this a desired terminology or undesired ?

It’s the current terminology :)

@vmx
Copy link
Member

vmx commented Mar 27, 2019

For me a Format implements a Codec.

@mikeal
Copy link
Contributor

mikeal commented Apr 4, 2019

Several of these ideas ended up in https://github.com/ipld/js-ipld-stack.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants