diff --git a/README.md b/README.md
index 6fddb64..942ba9b 100644
--- a/README.md
+++ b/README.md
@@ -59,7 +59,7 @@ The current multibase table is [here](multibase.csv):
 encoding, code, description, status
 identity, 0x00, 8-bit binary (encoder and decoder keeps data unmodified), default
 base2, 0, binary (01010101), candidate
-base8, 7, octal, draft
+base8, 7, octal (see RFC), draft
 base10, 9, decimal, draft
 base16, f, hexadecimal, default
 base16upper, F, hexadecimal, default
@@ -80,10 +80,27 @@ base64, m, rfc4648 no padding,
 base64pad, M, rfc4648 with padding - MIME encoding, candidate
 base64url, u, rfc4648 no padding, default
 base64urlpad, U, rfc4648 with padding, default
-proquint, p, PRO-QUINT https://arxiv.org/html/0901.4016, draft
+proquint, p, pro-quint https://arxiv.org/html/0901.4016 (see RFC), draft
 ```
 
-**NOTE:** Multibase-prefixes are encoding agnostic. "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). For example, in UTF-32, "z" would be `[0x7a, 0x00, 0x00, 0x00]`.
+**NOTE:** Multibase-prefixes are encoding agnostic: "z" is "z", not 0x7a ("z" encoded as ASCII/UTF-8). For example, in UTF-32, "z" would be `[0x7a, 0x00, 0x00, 0x00]`. In particular, the multibase code 0x00 listed for the identity encoding is the non-printable ASCII/UTF-8 character with codepoint 0x00, while the multibase code 0 listed for base2 is the ASCII/UTF-8 character "0" (which has codepoint 0x30).
+
+## Specifications
+
+Below is a list of specs for the underlying base encodings:
+
+- `identity` [identity RFC](rfcs/identity.md)
+- `base2` [base2 RFC](rfcs/Base2.md)
+- `base8` [base8 RFC](rfcs/Base8.md), similar to [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base10` [base10 RFC](rfcs/Base10.md)
+- `base36` [base36 RFC](rfcs/Base36.md)
+- `base16*` [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base32*` (except for `base32z`) [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base32z` [human-oriented base32 spec](https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt)
+- `base64*` [rfc4648](https://datatracker.ietf.org/doc/html/rfc4648.html)
+- `base58btc` https://datatracker.ietf.org/doc/html/draft-msporny-base58-02
+- `base58flickr` https://datatracker.ietf.org/doc/html/draft-msporny-base58-02, but using alphabet `123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ`
+- `proquint` [proquint RFC](rfcs/PRO-QUINT.md), which is the [original spec](https://arxiv.org/html/0901.4016) with an added prefix for legibility
 
 ## Reserved
 
@@ -160,6 +177,7 @@ Yes, but we already have to agree on base encodings, so this is not hard. The ta
 - [scala-multibase](//github.com/fluency03/scala-multibase)
 - [cpp-multibase](//github.com/cpp-ipfs/cpp-multibase)
 - [ruby-multibase](//github.com/sleeplessbyte/ruby-multibase)
+- `multibase` sub-module of Python module [multiformats](//github.com/hashberg-io/multiformats)
 - [Add yours here!](//github.com/multiformats/multibase/edit/master/README.md)
 
 
diff --git a/multibase.csv b/multibase.csv
index 33f4f09..6a934fc 100644
--- a/multibase.csv
+++ b/multibase.csv
@@ -1,7 +1,7 @@
 encoding, code, description, status
 identity, 0x00, 8-bit binary (encoder and decoder keeps data unmodified), default
 base2, 0, binary (01010101), candidate
-base8, 7, octal, draft
+base8, 7, octal (see RFC), draft
 base10, 9, decimal, draft
 base16, f, hexadecimal, default
 base16upper, F, hexadecimal, default
@@ -22,4 +22,4 @@ base64, m, rfc4648 no padding,
 base64pad, M, rfc4648 with padding - MIME encoding, candidate
 base64url, u, rfc4648 no padding, default
 base64urlpad, U, rfc4648 with padding, default
-proquint, p, PRO-QUINT https://arxiv.org/html/0901.4016, draft
+proquint, p, pro-quint https://arxiv.org/html/0901.4016 (see RFC), draft
diff --git a/rfcs/Base2.md b/rfcs/Base2.md
index 5352f83..df510c1 100644
--- a/rfcs/Base2.md
+++ b/rfcs/Base2.md
@@ -16,7 +16,7 @@ order, where each byte of the array is set to the character `1`, if the
 corresponding bit in the byte is set, and the character `0` if the corresponding
 bit is unset.
 
-For example, `[0x58, 0x59, 0x60]` can be converted to multibase base2 as
+For example, `[0x58, 0x59, 0x5a]` can be converted to multibase base2 as
 follows:
 
 ```
diff --git a/rfcs/PRO-QUINT.md b/rfcs/PRO-QUINT.md
index 5d59aa4..64de275 100644
--- a/rfcs/PRO-QUINT.md
+++ b/rfcs/PRO-QUINT.md
@@ -1,7 +1,16 @@
 # PRO-QUINT
 
-See: https://arxiv.org/html/0901.4016 ([/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa](https://dweb.link/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa)).
+For the original proquint specification, see: https://arxiv.org/html/0901.4016 ([/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa](https://dweb.link/ipfs/bafybeib5jsyi5igjwhi7hzkfebpvnq2ykbwpxeaaxlkyfyxqvcecoao4qa)).
 
-While the multibase prefix is `p`, the "full" prefix is actually `pro-`. This way, proquints are always easily pronouncable. For example
+The multibase prefix for proquints is the character `p`. The base-encoded data is the data encoded according to the original specification, with an additional `ro-` prefix:
 
-`127.0.0.1`, as a multibase proquint encoded number, is `pro-lusab-babad`.
+```
+
+```
+
+The resulting full prefix for the actual proquint encoded data is `pro-`, making multibase-encoded proquints easily pronounceable.
+For example, the proquint encoding of the bytestring `[127, 0, 0, 1]` (the data for the IPv4 address `127.0.0.1`) is `lusab-babad`, so the corresponding multibase-encoded proquint bytestring is:
+
+```
+pro-lusab-babad
+```
diff --git a/rfcs/identity.md b/rfcs/identity.md
new file mode 100644
index 0000000..7880bba
--- /dev/null
+++ b/rfcs/identity.md
@@ -0,0 +1,41 @@
+# Identity
+
+The multibase identity prefix is the non-printable ASCII/UTF-8 character with codepoint 0x00. Note that this is different from the multibase prefix 0 listed for base2, which is the ASCII/UTF-8 character "0" with codepoint 0x30.
+
+
+## Encoding
+
+A byte array `b` is encoded by converting it to the Unicode string `s` whose UTF-8 bytes are the byte array `b` prefixed with a single zero byte.
+
+Below is a minimal implementation in Python, for clarification:
+
+```py
+def encode_identity(b: bytes) -> str:
+    utf8_bytes = b"\x00"+b
+    return utf8_bytes.decode("utf-8")
+```
+
+## Decoding
+
+A Unicode string `s` is decoded by obtaining its UTF-8 bytes and dropping the leading byte. The UTF-8 byte array must be non-empty and the leading byte must be zero.
+
+Below is a minimal implementation in Python, for clarification:
+
+```py
+def decode_identity(s: str) -> bytes:
+    utf8_bytes = s.encode("utf-8")
+    if not utf8_bytes or utf8_bytes[0] != 0:
+        raise ValueError("String not identity-encoded.")
+    return utf8_bytes[1:]
+```
+
+## Examples
+
+```py
+>>> encode_identity(bytes([0x31, 0x63, 0x57]))
+'\x001cW'
+>>> decode_identity("\x001cW")
+b'1cW'
+>>> list(decode_identity("\x001cW"))
+[49, 99, 87] # [0x31, 0x63, 0x57]
+```
diff --git a/tests/case_insensitivity.csv b/tests/case_insensitivity.csv
index 3037d9c..e824ff5 100644
--- a/tests/case_insensitivity.csv
+++ b/tests/case_insensitivity.csv
@@ -1,4 +1,4 @@
-non-canonical encoding, "yes mani !"
 base16, "f68656c6c6f20776F726C64"
 base16upper, "F68656c6c6f20776F726C64"
 base32, "bnbswy3dpeB3W64TMMQ"
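
The `pro-` prefix rule described in the `rfcs/PRO-QUINT.md` hunk above can be sketched in Python, the language the identity RFC already uses. This is an illustrative sketch and not part of the patch. The names `proquint_encode` and `multibase_proquint` are hypothetical, and the consonant and vowel tables follow the original proquint specification linked above. Rejecting odd-length input is an assumption, since the text does not say how a trailing single byte should be handled.

```python
CONSONANTS = "bdfghjklmnprstvz"  # 16 symbols, 4 bits each
VOWELS = "aiou"                  # 4 symbols, 2 bits each


def proquint_encode(data: bytes) -> str:
    """Encode an even-length byte string as dash-separated proquints."""
    if len(data) % 2 != 0:
        # Assumption: odd-length input is rejected rather than padded.
        raise ValueError("this sketch only handles an even number of bytes")
    words = []
    for i in range(0, len(data), 2):
        n = int.from_bytes(data[i:i + 2], "big")  # one 16-bit word
        # Bit layout, most significant first: 4-2-4-2-4
        # (consonant, vowel, consonant, vowel, consonant).
        words.append(
            CONSONANTS[(n >> 12) & 0xF]
            + VOWELS[(n >> 10) & 0x3]
            + CONSONANTS[(n >> 6) & 0xF]
            + VOWELS[(n >> 4) & 0x3]
            + CONSONANTS[n & 0xF]
        )
    return "-".join(words)


def multibase_proquint(data: bytes) -> str:
    """Prepend the multibase prefix `p` and the additional `ro-` prefix."""
    return "p" + "ro-" + proquint_encode(data)


print(multibase_proquint(bytes([127, 0, 0, 1])))  # pro-lusab-babad
```

The printed value matches the `127.0.0.1` example given in the RFC text, which is the only test vector this sketch has been checked against.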