Data-systems messaging #306

0xterminator · 2024-11-08T10:26:45Z

0xterminator
Nov 8, 2024
Collaborator

RFC Goals

The RFC goals are well summarized in the diagram below:

We are investigating a serialization/deserialization mechanism that will allow us to serialize all needed fuel-core data structures (Rust) to a binary vector which will be sent over the wire to NATS. The same binary vector is to be read by all sorts of clients, which will deserialize the data and construct typed native structures out of it: objects/interfaces/types in JavaScript, structs in Rust, typed structs in Go. We are searching for a mechanism that lets different clients and languages derive ready-to-use typed objects from globally shared schemas, so that deserializing the data is easy. The opposite is also true: a client must be able to serialize data using structures derived from the global schemas and push them to NATS. This way we want to ensure complete convertibility between data sent and received from e.g. Rust / TypeScript / Go without compromising its integrity or causing breaking changes in the current fuel-core functionality.

Overview

In this RFC we are examining two very fast and commonly used serialization/deserialization formats, Borsh and Protobuf, in order to determine whether either of them is feasible/suitable for implementing a global messaging system within fuel based on shared schemas. We need to consider the following points before evaluating any possible solutions:

suitability of each algorithm within the fuel-ecosystem
available libraries in Rust, Typescript and eventually other languages
complexity of conversion between types generated in each of these languages
Speed and optimizations

Suitability

Our restriction is that we are bound to the types defined in fuel-core, as these are the main data types that the blockchain relies on and uses internally. All fuel-core data types have serde::Serialize and serde::Deserialize implemented, which makes them easily convertible from and into JSON on demand. Having said that, we have realized that JSON is quite slow in terms of transfer over the wire, hence we are exploring the borsh vs. protobuf capabilities in this RFC. Whichever of these formats is to be used, it needs to be suitable for the morphology of the data types we have in fuel and its ecosystem, meaning that the serializers/deserializers need to be compatible with those types and able to serialize/deserialize them without much effort.

Libraries availability

There are official libraries for each of the two investigated algorithms:

Borsh

Typescript: borsh-js
Rust: borsh, borsh-derive and borsh-cli
Go: borsh-go

Protobuf

Rust: prost
Typescript: grpc-js and proto-loader (https://www.npmjs.com/package/@grpc/proto-loader), protobuf, protobuf-ts
Go: protobuf

Based on the research conducted, these are the most comprehensive libraries available and also the best maintained ones. Further research in the RFC will focus on them.

Complexity

It is important to note that we are seeking a solution in which data mapping between fuel-core structures and their wire representation is achieved with minimum effort, can be automated as much as possible and always produces deterministic outputs.

Speed and optimizations

The choice between borsh and protobuf also has to reflect the performance of serializing/deserializing any data retrieved from fuel, including CPU and memory load. These figures, if needed, are to be benchmarked. Another important factor is that the serialized data needs to be in a format on top of which standard compression algorithms can easily be applied if it is to be used for intensive messaging.

Borsh

Description

Borsh JS is an implementation of the Borsh binary serialization format for JavaScript and TypeScript projects.
Borsh stands for Binary Object Representation Serializer for Hashing. It is meant to be used in security-critical projects as it prioritizes consistency, safety, speed, and comes with a strict specification.

Borsh with Rust

Borsh with Rust works by deriving the BorshSerialize and BorshDeserialize traits on a data structure:

        use borsh::{BorshDeserialize, BorshSerialize};

        #[derive(BorshSerialize, BorshDeserialize, PartialEq, Debug, Clone)]
        struct MyExampleStruct {
            name: String,
            age: u8,
        }

        let my_example_struct = MyExampleStruct { name: "Albert".to_string(), age: 9 };

        // serialize into a byte vector
        let serialized: Vec<u8> = borsh::to_vec(&my_example_struct).unwrap();
        // deserialize, either via the trait method
        // let deserialized = MyExampleStruct::try_from_slice(&serialized).unwrap();
        // or via the free function
        let deserialized = borsh::from_slice::<MyExampleStruct>(&serialized).unwrap();
        // the round trip must reproduce the original value
        assert_eq!(deserialized, my_example_struct);

The output is a binary vector (Vec<u8>) holding the serialized data.
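To make the shape of that binary vector concrete, here is a minimal sketch that writes the bytes for MyExampleStruct by hand, following the Borsh specification (a String is a little-endian u32 byte length followed by the UTF-8 bytes, a u8 is a single byte). The functions encode_example and decode_example are hypothetical helpers introduced only for illustration; real code would rely on the borsh crate as shown above.

```rust
/// Hand-rolled encoder mirroring the Borsh layout of
/// `MyExampleStruct { name: String, age: u8 }`.
/// Illustration only: production code should use the `borsh` crate.
fn encode_example(name: &str, age: u8) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + name.len() + 1);
    // String: u32 little-endian byte length, followed by the UTF-8 bytes.
    out.extend_from_slice(&(name.len() as u32).to_le_bytes());
    out.extend_from_slice(name.as_bytes());
    // u8: a single byte.
    out.push(age);
    out
}

/// Inverse of `encode_example`; returns `None` on malformed input.
fn decode_example(bytes: &[u8]) -> Option<(String, u8)> {
    if bytes.len() < 4 {
        return None;
    }
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let name_end = 4usize.checked_add(len)?;
    let name = std::str::from_utf8(bytes.get(4..name_end)?).ok()?.to_owned();
    let age = *bytes.get(name_end)?;
    // Reject trailing bytes so that the input is consumed exactly.
    if bytes.len() != name_end + 1 {
        return None;
    }
    Some((name, age))
}
```

For name "Albert" and age 9 this yields 11 bytes: four length bytes (6, 0, 0, 0), the six name bytes and one byte for the age. Nothing in the payload describes the field names or types, which is why both sides need to agree on the structure beforehand.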

Borsh can also generate a schema that is prepended to the serialized u8 vector. Schemas are usually needed for correct decoding and are less error-prone than schema-less Borsh decoding, since ser/deserialization could otherwise end up interpreting the values of an object in different ways depending on the platform/language where the process takes place. Schemas increase the size of the serialized output but bring the advantages mentioned above. Encoding and decoding with schemas is usually safer and also allows flexibility if e.g. new members are added to a given struct or removed subsequently.

Here is an example with schema:

        use borsh::{schema::BorshSchemaContainer, BorshDeserialize, BorshSchema, BorshSerialize};
        
        #[derive(BorshSerialize, BorshDeserialize, BorshSchema, PartialEq, Debug, Clone)]
        struct MyExampleStruct {
            name: String,
            age: u8,
        }

        let my_example_struct = MyExampleStruct { name: "Albert".to_string(), age: 9 };
        // Serialize object into a vector of bytes and prefix with the schema serialized as vector
        // of bytes in Borsh format.
        let serialized = borsh::try_to_vec_with_schema(&my_example_struct).unwrap();
        // Deserialize this instance from a slice of bytes, but assume that at the beginning we have
        // bytes describing the schema of the type
        let deserialized =
            borsh::try_from_slice_with_schema::<MyExampleStruct>(&serialized).unwrap();
        // assert
        assert_eq!(deserialized, my_example_struct);

Going further, schemas can be extracted as such:

     let container: BorshSchemaContainer = BorshSchemaContainer::for_type::<MyExampleStruct>();

and converted to Vec<u8> as such:

     let schema = borsh::to_vec(&container).unwrap();

which gives us the possibility to persist schemas in binary format in files. Analogously, schemas could be read back from files and used if needed. The byte representation of a schema is of little use on its own, however, as the Rust library does not make direct use of it. Unfortunately there is no human-readable representation of a Borsh schema, as reflected in this ticket. In general, if a payload is serialized with a schema, it can only be deserialized with that schema. If it is serialized with a schema but an attempt is made to deserialize it without one, deserialization fails.

There are tools, however, such as borsh-schema-utils, that allow converting a schema to a JSON file by recursively extracting data from the BorshSchemaContainer:

        use std::fs::File;
        use std::io::BufReader;

        use borsh::{BorshDeserialize, BorshSchema, BorshSerialize};

        #[derive(Debug, Default, BorshSerialize, BorshDeserialize, BorshSchema)]
        pub struct Person {
            first_name: String,
            last_name: String,
        }

        // dump the schema of Person as JSON, then read it back for inspection
        write_schema_as_json(Person::default(), "./person_schema0.dat".to_string()).unwrap();
        let file = File::open("./person_schema0.dat").unwrap();
        let reader = BufReader::new(file);
        let result: serde_json::Value = serde_json::from_reader(reader).expect("Deserialization failed");
        println!("Schema as json: {}", result);

The latter would result in the following json:

{
  "declaration": "Person",
  "definitions": [
    [
      "Person",
      {
        "Struct": {
          "fields": {
            "NamedFields": [
              [
                [
                  "first_name",
                  "String"
                ],
                [
                  "last_name",
                  "String"
                ]
              ]
            ]
          }
        }
      }
    ],
    [
      "String",
      {
        "Sequence": {
          "elements": "u8",
          "length_range": {
            "end": 4294967295,
            "start": 0
          },
          "length_width": 4
        }
      }
    ],
    [
      "u8",
      {
        "Primitive": [
          1
        ]
      }
    ]
  ]
}

In theory that data reflects the structure schema design and could be used by e.g. typescript to extract necessary interfaces or types. Of course, the latter is subject to us being able to implement the BorshDeserialize, BorshSchema, BorshSerialize traits on our rust structures in fuel-core.

Borsh with Typescript

Borsh with typescript takes a slightly different approach to what we have with Rust. Let us take the Person structure again:

import * as borsh from "borsh";

class Person {
    public first_name: string;
    public last_name: string;

    constructor() {
        this.first_name = "jon";
        this.last_name = "doe";
    }
}

const person = new Person();
const schema = { struct: { first_name: "string", last_name: "string" }};
const encoded = borsh.serialize(schema, person);
const decoded = borsh.deserialize(schema, encoded);

As we see, with typescript we always need a schema to perform the serialization/deserialization procedures. The schema describes the morphology of the data with its types. However, the schema that typescript needs is quite different from what we generated above with Rust:

{
  "struct": {
    "first_name": "string",
    "last_name": "string"
  }
}
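To illustrate the gap between the two schema shapes, the following is a minimal sketch of a mapping from the field list found in the Rust-side JSON schema to the schema object borsh-js expects. The function names to_borsh_js_type and to_borsh_js_schema are hypothetical, only a handful of primitive types are covered, and field names are assumed to be plain identifiers that need no JSON escaping.

```rust
/// Maps a type name as it appears in the Rust-side schema dump to the
/// name used in borsh-js schema objects. Only a few primitives are
/// handled in this sketch; everything else is reported as unsupported.
fn to_borsh_js_type(rust_type: &str) -> Option<&'static str> {
    Some(match rust_type {
        "String" => "string",
        "bool" => "bool",
        "u8" => "u8",
        "u16" => "u16",
        "u32" => "u32",
        "u64" => "u64",
        _ => return None,
    })
}

/// Renders a borsh-js style struct schema as a JSON string from a list
/// of `(field_name, rust_type)` pairs, e.g. the "NamedFields" entries
/// of the schema shown earlier.
fn to_borsh_js_schema(fields: &[(&str, &str)]) -> Option<String> {
    let mut entries = Vec::with_capacity(fields.len());
    for (name, rust_type) in fields {
        let js_type = to_borsh_js_type(rust_type)?;
        entries.push(format!("\"{}\":\"{}\"", name, js_type));
    }
    Some(format!("{{\"struct\":{{{}}}}}", entries.join(",")))
}
```

Nested structs, enums, sequences and options would each need their own rule, which is where most of the effort of such a converter would go.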

Discussion

Now, having explored how the Rust and TypeScript Borsh libraries work, one can conclude the following.

To ensure compatibility between Rust and TypeScript when a payload (Vec<u8>/Uint8Array) is received over the wire, the following prerequisites must be met:

1. All fuel-core data structures need to implement BorshSchema, BorshSerialize and BorshDeserialize to ensure that we can fetch their schemas when needed.

For example:

#[derive(Clone, Debug, PartialEq, Eq)]
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize, borsh::BorshSerialize, borsh::BorshDeserialize, borsh::BorshSchema))]
pub enum Block<TransactionRepresentation = Transaction> {
    /// V1 Block
    V1(BlockV1<TransactionRepresentation>),
}

2. Schemas produced in 1. could be persisted in JSON format and updated on demand whenever some of the fuel-core models change.
3. A converter needs to be implemented to map the schemas produced in 2. to the JSON schemas that the TypeScript library expects. The new schemas could be packaged in e.g. an npm package.
4. An extractor needs to be set up so that we can extract typed structures from the JSON schemas generated in 2. The newly extracted types could be packaged in e.g. an npm package.

The flow could be summarized as follows:

        write_schema_as_json(Block::default(), "./block_schema.json".to_string()).unwrap();
        let file = File::open("./block_schema.json").unwrap();
        let reader = BufReader::new(file);
        let result: serde_json::Value = serde_json::from_reader(reader).expect("Deserialization failed");
        // mapToTsSchema and generateTsInterfaces stand for the converter and the
        // extractor described above; neither exists yet
        let ts_schema = mapToTsSchema(&result).unwrap();
        generateTsInterfaces(&ts_schema);

or the diagram here:

The flow requires 4 steps, which could be bundled in some form of a pipeline or packaging system. Steps 2 to 4 could be achieved easily provided that 1. is doable.
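The last step of that flow, extracting typed structures, could look roughly like the sketch below. It takes the fields of a borsh-js style struct schema and renders a TypeScript interface as text. The names ts_type_for and generate_ts_interface are hypothetical, and the type table is deliberately limited to a few primitives that map to TypeScript without ambiguity.

```rust
/// Maps a borsh-js schema type name to a TypeScript type.
/// Only a small, unambiguous subset is covered in this sketch.
fn ts_type_for(borsh_js_type: &str) -> Option<&'static str> {
    match borsh_js_type {
        "string" => Some("string"),
        "bool" => Some("boolean"),
        "u8" | "u16" | "u32" => Some("number"),
        _ => None,
    }
}

/// Renders `export interface <name> { ... }` from `(field, borsh_js_type)`
/// pairs. Returns `None` as soon as an unsupported type is encountered.
fn generate_ts_interface(name: &str, fields: &[(&str, &str)]) -> Option<String> {
    let mut out = format!("export interface {} {{\n", name);
    for (field, borsh_js_type) in fields {
        let ts_type = ts_type_for(borsh_js_type)?;
        out.push_str(&format!("  {}: {};\n", field, ts_type));
    }
    out.push_str("}\n");
    Some(out)
}
```

Called with "Person" and the two string fields from the earlier example, it produces an interface with first_name and last_name typed as string. A real generator would also have to emit the matching schema object next to each interface so the two cannot drift apart.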

Complications around borsh and fuel-core

After further explorations, borsh seems to have difficulties with some commonly used Rust structures in fuel-core. For example, the following mock code poses complications:

#[derive(Clone, Debug, BorshSerialize, BorshDeserialize, BorshSchema)]
pub struct BlockHeaderV1 {
    pub application: ApplicationHeader,
    pub consensus: ConsensusHeader,
    metadata: Option<Box<BlockHeaderMetadata>>,
}

#[derive(Clone, Debug, BorshSerialize, BorshDeserialize, BorshSchema)]
pub enum BlockHeader {
    /// V1 BlockHeader
    V1(BlockHeaderV1),
    V2(BlockHeaderV1),
}

The error we are getting is:

recursive type <mocktypes::BlockHeader as BorshSchema>::add_definitions_recursively::BlockHeaderV1 has infinite size

The error stems from what the compiler sees as a recursive definition: the type it names lives inside the derived add_definitions_recursively of BlockHeader and is reported as having infinite size. One possible cause, not verified here, is a name clash between the helper type that the BorshSchema derive generates for the V1 variant of BlockHeader and the existing BlockHeaderV1 struct. Since practically all of our structures in fuel-core are designed around nested enums, making this work would require changes in fuel-core itself, for example indirection techniques such as Box or Rc, to help Borsh derive its schemas properly. In our experiments Borsh was unable to behave in a polymorphic way, meaning a message of type:

enum MyEnum {
    V1(Type1),
    V2(Type2),
}

could not be serialized to an equivalent Borsh schema. The same behaviour was also verified with TypeScript.

Another thing to keep an eye on is type differences between implementations (e.g. link), which might lead to substantial differences in the serialization.

Conclusions

We saw that if a Borsh-serialized message is to be sent over the wire to e.g. a TypeScript client, the client needs access to the schemas in order to deserialize the received type, which requires the cumbersome process described in steps 1-4 above. In addition, Borsh and the morphology of the Rust fuel-core data structures seem to have interoperability issues by design. There are ways around these complications, but they might mean introducing substantial changes to fuel-core, which is undesirable.

Protobuf

Description

Protocol Buffers (protobuf) is a method developed by Google for serializing structured data, making it easier to share data across different platforms and languages. It’s an efficient, language-neutral, and platform-neutral format that’s widely used for defining and exchanging structured information in applications and services.

Key Aspects of Protobuf:

Serialization Format:

Protobuf converts structured data into a compact binary format that can be transmitted over a network or stored, and then deserialized back into its original form. The binary format is more efficient than text-based formats like JSON or XML.

Cross-Language Compatibility:

Protobuf is designed to be language-neutral. You define your data structure in a .proto file, which is then compiled into code for various programming languages (e.g., Java, Python, C++, Go). This makes protobuf an excellent choice for communication between services written in different languages.

Schema-Driven

With Protobuf, data structures are defined in a schema file (a .proto file) where you specify the data types, field names, and field numbers. For example, here’s a simple .proto file:

syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

The schema ensures that data is structured consistently, and field numbers help protobuf maintain backward compatibility as the schema evolves over time.
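To show what the field numbers do on the wire, here is a minimal hand-written encoder for the Person message above, following the documented protobuf encoding: each field starts with a tag of (field_number << 3) | wire_type, integers are base-128 varints, and strings are length-delimited. The helper names put_varint, put_string, put_int32 and encode_person are hypothetical and exist only for this sketch; real code would use generated types, e.g. from prost.

```rust
/// Appends `value` as a base-128 varint: 7 bits per byte, least
/// significant group first, high bit set on all bytes but the last.
fn put_varint(out: &mut Vec<u8>, mut value: u64) {
    while value >= 0x80 {
        out.push(((value as u8) & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Writes a `string` field. proto3 omits fields holding the default value.
fn put_string(out: &mut Vec<u8>, field_number: u32, value: &str) {
    if value.is_empty() {
        return;
    }
    // Wire type 2 = length-delimited.
    put_varint(out, u64::from((field_number << 3) | 2));
    put_varint(out, value.len() as u64);
    out.extend_from_slice(value.as_bytes());
}

/// Writes an `int32` field. Negative values are sign-extended to 64 bits.
fn put_int32(out: &mut Vec<u8>, field_number: u32, value: i32) {
    if value == 0 {
        return;
    }
    // Wire type 0 = varint.
    put_varint(out, u64::from(field_number << 3));
    put_varint(out, value as i64 as u64);
}

/// Encodes `message Person { string name = 1; int32 id = 2; string email = 3; }`.
fn encode_person(name: &str, id: i32, email: &str) -> Vec<u8> {
    let mut out = Vec::new();
    put_string(&mut out, 1, name);
    put_int32(&mut out, 2, id);
    put_string(&mut out, 3, email);
    out
}
```

Because every field carries its number, a reader can skip fields it does not know and treat absent ones as defaults, which is the basis for the backward compatibility mentioned above. Field names never appear in the payload, so renaming a field is harmless while renumbering it is a breaking change.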

Efficiency:

Protobuf messages are typically smaller and faster to encode/decode than JSON or XML. This efficiency is useful in scenarios where bandwidth and speed are critical, such as IoT devices, mobile applications, and real-time systems.

Overall, Protocol Buffers offer a compact, fast, and language-agnostic way to handle data serialization, making them well-suited for microservices, general messaging, and real-time systems where efficiency is a priority.

Approach

A good idea is to think about using our already inherited serde::Serialize and serde::Deserialize that most of the fuel-core types exhibit. Unfortunately there is currently no meaningful way or tool to map serde_json schemas to protobuf structures with types and tags. There are however tools that might become helpful in the process:

e.g. json to proto, link1 and another implementation, link2, which allow mapping a rendered JSON (meaning key-values) to a protobuf schema. Of course that is fine in the context of a simple un-nested morphology which could be mapped 1:1 to a protobuf structure, but in cases like fuel-core, e.g.

 pub enum Transaction {
     Script(Script),
     Mint(Mint),
     ...
 }

a direct mapping from enum Transaction to message Transaction won't work, as Transaction is polymorphic and can take different variants. Having an example JSON for each of them and obtaining an analogous structure in protobuf is quite cumbersome and might not lead to automatic, fully consistent data generation with a single protobuf schema.

Even with some self-implemented parsers, as suggested here, a fully comprehensive mapping won't be achieved easily, as fuel data is quite densely packed in terms of its Rust representation, which makes a protobuf normalization quite difficult.

The syntax = "proto3" schema language, together with the well-known types shipped under google.protobuf, offers some quite interesting features such as:

bytes data type - equivalent to a Vec<u8> in Rust or Uint8Array in typescript
google.protobuf.Timestamp data type for a timestamp
google.protobuf.Duration data type for time duration
google.protobuf.Int64Value data type for special i64 values
google.protobuf.Any polymorphic data type that could represent any data and be named
oneof value { .. } polymorphic data type

in addition to the standard ones. Here is the full list

The most important of all these for our use case is probably the oneof data type, which allows complex polymorphic data messages with similar schemas:

message Message {
  oneof sum {
    PubKeyRequest     pub_key_request          = 1;
    PubKeyResponse  pub_key_response         = 2;
    PingRequest          ping_request             = 3;
    PingResponse       ping_response            = 4;
  }
}

oneof is important for us as it easily allows us to mimic the nature of the fuel-core structures in a 1:1 manner. Protobuf also has enum types, but they are quite simple in terms of schema: each variant is mapped to a constant index.

For example:

enum CheckTxType {
  NEW = 0;
  RECHECK = 1;
}

With oneof, complex data messages can easily be mapped from:

 pub enum Transaction {
     Script(Script),
     Mint(Mint),
     ...
 }

to:

message Transaction {
  oneof tx {
    Script   script        = 1;
    Mint     mint         = 2;
  }
}

The advantage we get from protobuf here is that we can easily match on each variant when we receive a message, and that this kind of branching can be nested very deeply (up to Integer.MAX_VALUE levels - link).

proto3 has no required fields, which gives us enormous flexibility and mimics quite well some of the optional fuel-core data functionalities. If a message-typed member is missing, Rust (with prost) yields None, whereas typescript results in null | undefined on the interface side. Scalar members fall back to their default value unless they are declared optional.

As we see, thanks to oneof, we can match on the variant in Rust:

// prost represents a oneof as an Option around an enum nested in a module named after the message
match message.tx {
  Some(proto::transaction::Tx::Script(script)) => { ... }
  Some(proto::transaction::Tx::Mint(mint)) => { ... }
  None => { ... }
}

and in typescript (the exact shape depends on the code generator used):

if (message.tx instanceof Script) { ... } else if (message.tx instanceof Mint) { ... } else { ... }

With these built-in features, the default optionality of members and the additional complex types imported under google.protobuf, one could in fact represent the fuel-core morphology quite precisely in a 1:1 linear manner with protobuf messages.

Realization

The compatibility outlined above allows us to go forward and create the following process:

1. Define all our .proto schema definitions in a single crate (library) following the morphology of the fuel-core and fuel-vm data structures. In the diagram below I have depicted this as fuel-protos.
2. Both crates, fuel-core and fuel-vm, import the fuel-protos crate, generate the Rust protobuf definitions and export them. The way to achieve this is to add a build.rs to each crate or sub-crate inside fuel-core and fuel-vm. This triggers the prost-build process, which generates the message structures in Rust whenever one of these crates is built.
3. In each crate, define the corresponding From and TryFrom conversions between the native structures and their protobuf counterparts. The idea is to keep types and traits in the same place in order not to violate the orphan rule.
4. Introduce a new crate under fuel-core, e.g. fuel-message-types, which collects and re-exports the Rust proto structures produced in 2. along with all traits needed for converting between native fuel-core types and wire message types.
5. The messaging crate from 4. is to be used as a dependency by our data-messaging system for easy in-and-out conversion between wire and native Rust data types.
6. Finally, the fuel-protos crate could also be published as an npm package to be imported by our data-systems-ts-sdk and fuel-ts-sdk if needed. The protobuf generator will automatically create fully typed typescript interfaces and definitions for direct use when consuming or sending messages. Other clients such as fuel-go could also easily benefit from the .proto schemas and make direct use of data structures that are 100% compatible with Rust and typescript.
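The From and TryFrom conversions between native and wire types could be sketched as follows. Everything here is a simplified assumption: Script, Mint and their fields are stand-ins for the real fuel-core types, and the proto module is written by hand in the shape prost typically generates for a message with a oneof (an Option around an enum nested in a module named after the message), rather than being generated by prost-build.

```rust
// Native types: simplified, hypothetical stand-ins for the fuel-core originals.
#[derive(Debug, Clone, PartialEq)]
pub struct Script { pub gas_limit: u64 }
#[derive(Debug, Clone, PartialEq)]
pub struct Mint { pub amount: u64 }
#[derive(Debug, Clone, PartialEq)]
pub enum Transaction { Script(Script), Mint(Mint) }

// Wire types: hand-written here, mimicking what prost would generate for
// `message Transaction { oneof tx { Script script = 1; Mint mint = 2; } }`.
pub mod proto {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Script { pub gas_limit: u64 }
    #[derive(Debug, Clone, PartialEq)]
    pub struct Mint { pub amount: u64 }
    #[derive(Debug, Clone, PartialEq)]
    pub struct Transaction { pub tx: Option<transaction::Tx> }
    pub mod transaction {
        #[derive(Debug, Clone, PartialEq)]
        pub enum Tx { Script(super::Script), Mint(super::Mint) }
    }
}

// Native -> wire never fails: every native variant has a oneof counterpart.
impl From<Transaction> for proto::Transaction {
    fn from(native: Transaction) -> Self {
        let tx = match native {
            Transaction::Script(s) => {
                proto::transaction::Tx::Script(proto::Script { gas_limit: s.gas_limit })
            }
            Transaction::Mint(m) => {
                proto::transaction::Tx::Mint(proto::Mint { amount: m.amount })
            }
        };
        proto::Transaction { tx: Some(tx) }
    }
}

// Wire -> native can fail: the oneof may be unset on the wire.
impl std::convert::TryFrom<proto::Transaction> for Transaction {
    type Error = &'static str;

    fn try_from(wire: proto::Transaction) -> Result<Self, Self::Error> {
        match wire.tx {
            Some(proto::transaction::Tx::Script(s)) => {
                Ok(Transaction::Script(Script { gas_limit: s.gas_limit }))
            }
            Some(proto::transaction::Tx::Mint(m)) => {
                Ok(Transaction::Mint(Mint { amount: m.amount }))
            }
            None => Err("transaction variant missing on the wire"),
        }
    }
}
```

The asymmetry is deliberate: going to the wire is infallible, while coming back has to handle an unset oneof and, in real code, any field validation the native types require. A round-trip test per type in CI would catch a native type changing without its .proto counterpart.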

The steps above are all depicted on the diagram below:

Conclusions

We have explored a very feasible architecture where we can make full use of protobuf3's latest features and create a centralized repository with schemas that could be shared by ts-clients, go-clients etc. The latter will be used as a single source of truth. The conversion between wire types and native rust types should be easy to achieve and can be easily checked on every CI/CD run for correctness against the native types. I feel we should proceed with the protobuf approach which will give us a sustainable and flexible solution for the entire networking of all fuel-systems.
