From bf17bc0b7a711be38c4ac9e41808a21ae62da810 Mon Sep 17 00:00:00 2001 From: Daira Hopwood Date: Fri, 22 Feb 2019 23:29:09 +0000 Subject: [PATCH] Add ZIP 173. Signed-off-by: Daira Hopwood --- zip-0173.rst | 469 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 469 insertions(+) create mode 100644 zip-0173.rst diff --git a/zip-0173.rst b/zip-0173.rst new file mode 100644 index 000000000..0b8f10080 --- /dev/null +++ b/zip-0173.rst @@ -0,0 +1,469 @@ +:: + + ZIP: 173 + Title: Bech32 Format + Author: Daira Hopwood + Credits: Pieter Wuille + Greg Maxwell + Rusty Russell + Mark Friedenbach + Category: Wallet + Created: 2018-06-13 + License: MIT + + +Terminology +=========== + +The key words "MUST", "MUST NOT", "SHOULD", and "SHOULD NOT" in this document are +to be interpreted as described in RFC 2119. [#RFC2119]_ + +The term "network upgrade" in this document is to be interpreted as described in +ZIP 200. [#zip-0200]_ + +The term "Sapling" in this document is to be interpreted as described in ZIP 205. +[#zip-0205]_ + + +Abstract +======== + +This document proposes a checksummed base32 format, "Bech32", and a standard for +Sapling addresses and keys using it. + + +Motivation +========== + +Since launch, Zcash has relied on Base58 addresses with a truncated double-SHA256 +checksum, inherited from Bitcoin. They were part of the original Bitcoin software +and their scope was extended in BIP 13 [#bip-0013]_ for P2SH (Pay-to-Script-Hash) +addresses. However, both the character set and the checksum algorithm have +limitations: + + * Base58 needs a lot of space in QR codes, as it cannot use the + ''alphanumeric mode''. + + * The mixed case in Base58 makes it inconvenient to reliably write down, type on + mobile keyboards, or read out loud. + + * The double-SHA256 checksum is slow and has no error-detection guarantees. + + * Most of the research on error-detecting codes only applies to character-set + sizes that are a `prime power `_, + which 58 is not. + + * Base58 decoding is complicated and relatively slow. + +To address these issues, Bitcoin adopted a new encoding called Bech32 [#bip-0173]_, +for use with address types added by its Segregated Witness proposal. Zcash does +not implement Segregated Witness, but it reuses Bech32 with address and key types +introduced by the Sapling network upgrade [#zip-0205]_. + +Since the description of Bech32 in [#bip-0173]_ is partially entangled with +Segregated Witness address formats, we have found it clearer to write this ZIP to +specify Bech32 separately. This specification should be read in conjunction with +section 5.6 ("Encodings of Addresses and Keys") of the Zcash Sapling protocol +specification [#protocol]_, and with ZIP 32 which specifies additional key types +for support of shielded hierarchical deterministic wallets [#zip-0032]_. + + +Specification +============= + +We describe the general checksummed base32 format called ''Bech32''. Its use for +Sapling payment addresses and keys is defined in the Sapling protocol specification +[#protocol]_. + +A Bech32 string consists of: + + * The *human-readable part*, which is intended to convey the type of data, or + anything else that is relevant to the reader. This part MUST contain 1 to 83 + US-ASCII characters, with each character having a value in the range [33-126]. + HRP validity may be further restricted by specific applications. + + * The *separator*, which is always "1". In case "1" is allowed inside the + human-readable part, the last one in the string is the separator. + + * The *data part*, which is at least 6 characters long and only consists of + alphanumeric characters excluding "1", "b", "i", and "o". + ++-----+---+---+---+---+---+---+---+---+ +| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ++=====+===+===+===+===+===+===+===+===+ +| +0 | q | p | z | r | y | 9 | x | 8 | ++-----+---+---+---+---+---+---+---+---+ +| +8 | g | f | 2 | t | v | d | w | 0 | ++-----+---+---+---+---+---+---+---+---+ +| +16 | s | 3 | j | n | 5 | 4 | k | h | ++-----+---+---+---+---+---+---+---+---+ +| +24 | c | e | 6 | m | u | a | 7 | l | ++-----+---+---+---+---+---+---+---+---+ + + +Checksum +-------- + +The last six characters of the data part form a checksum and contain no information. +Valid strings MUST pass the criteria for validity specified by the Python3 code +snippet below. The checksum is defined so that the function ``bech32_verify_checksum`` +returns true when its arguments are: + + * ``hrp``: the human-readable part as a string; + * ``data``: the data part as a list of integers representing the characters after + conversion using the table above. + +:: + + def bech32_polymod(values): + GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3] + chk = 1 + for v in values: + b = (chk >> 25) + chk = (chk & 0x1ffffff) << 5 ^ v + for i in range(5): + chk ^= GEN[i] if ((b >> i) & 1) else 0 + return chk + + def bech32_hrp_expand(s): + return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s] + + def bech32_verify_checksum(hrp, data): + return bech32_polymod(bech32_hrp_expand(hrp) + data) == 1 + + +This implements a `BCH code `_; +in the case where the encoded string is at most 90 characters, this code +guarantees detection of *any error affecting at most 4 characters* and has +less than a 1 in 109 chance of failing to detect more errors. +More details about the properties can be found in the Checksum Design appendix. +The human-readable part is processed by first feeding the higher bits of each +character's US-ASCII value into the checksum calculation followed by a zero +and then the lower bits of each US-ASCII value. + +To construct a valid checksum given the human-readable part and (non-checksum) +values of the data-part characters, the code below can be used: + +:: + + def bech32_create_checksum(hrp, data): + values = bech32_hrp_expand(hrp) + data + polymod = bech32_polymod(values + [0,0,0,0,0,0]) ^ 1 + return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)] + + +Error correction +'''''''''''''''' + +One of the properties of these BCH codes is that they can be used for error +correction. An unfortunate side effect of error correction is that it erodes +error detection: correction changes invalid inputs into valid inputs, but if +more than a few errors were made then the valid input may not be the correct +input. Use of an incorrect but valid input can cause funds to be lost +irrecoverably. Because of this, implementations SHOULD NOT implement correction +beyond potentially suggesting to the user where in the string an error might be +found, without suggesting the correction to make. + +Uppercase/lowercase +''''''''''''''''''' + +The lowercase form is used when determining a character's value for checksum +purposes. + +Encoders MUST always output an all-lowercase Bech32 string. If an uppercase +version of the encoding result is desired (e.g. for presentation purposes, or +QR code use), then an uppercasing procedure can be performed external to the +encoding process. + +Decoders MUST NOT accept strings where some characters are uppercase and some +are lowercase (such strings are referred to as mixed-case strings). + +For presentation, lowercase is usually preferable, but inside QR codes +uppercase SHOULD be used, as those permit the use of `alphanumeric mode +`_, +which is 45% more compact than the `byte mode +`_ that would +otherwise be used. + +Encoding +'''''''' + + * Start with the bits of the raw encoding of the appropriate address or key + type, most significant bit per byte first. + + * Re-arrange those bits into groups of 5, and pad with zeroes at the end if + needed. + + * Translate those bits, most significant bit first, to characters using the + table above. + +Decoding +'''''''' + +Software interpreting a Bech32-encoded address MUST first verify that the +human-readable part matches that of a specified address type, and similarly +for keys. + +If this check passes, convert the rest of the data to bytes: + + * Translate the values using the table above to 5 bits, most significant bit + first. + + * Re-arrange those bits into groups of 8 bits. Any incomplete group at the + end MUST be 4 bits or fewer, MUST be all zeroes, and is discarded. + + * The resulting groups are interpreted as the raw encoding of the appropriate + address or key type, with the most significant bit first in each byte. + +Decoders SHOULD enforce known-length restrictions on address and key types. + +For example, [#protocol]_ specifies that the length of the raw encoding of a +Sapling payment address is 43 bytes (88 + 256 bits). This results in a +Bech32-encoded Sapling payment address being 78 characters long. + + +Compatibility +============= + +Only software supporting the Sapling network upgrade is able to use these +addresses. + +There is no effect on support for transparent addresses and keys, or for Sprout +z-addresses and keys; these will continue to be encoded using Base58Check, and +MUST NOT be encoded using Bech32. + + +Rationale +========= + +Why use base32 at all? +---------------------- + +The lack of mixed case makes it more efficient to read out loud or to put into +QR codes. It does come with a 15% length increase. However, the length of a +Bech32-encoded Sapling payment address remains below 80 characters, which +reduces the likelihood of line splitting in terminals or email. Thus, +cutting-and-pasting of addresses is still reliable. + +Why call it Bech32? +------------------- + +"Bech" contains the characters BCH (the error detection algorithm used) and +sounds a bit like "base". The term Bech32 is established for Bitcoin and there +was no reason to use a different name for it in Zcash Sapling. + +Why not support Bech32 encoding of Sprout or transparent addresses? +------------------------------------------------------------------- + +This was not considered to be sufficiently well-motivated given the +compatibility issues that would arise from having two formats for these +address types, with pre-Sapling software not supporting the new format. + +Why include a separator in addresses? +------------------------------------- + +That way the human-readable part is unambiguously separated from the data part, +avoiding potential collisions with other human-readable parts that share a +prefix. It also allows us to avoid having character-set restrictions on the +human-readable part. The separator is ''1'' because using a non-alphanumeric +character would complicate copy-pasting of addresses (with no double-click +selection in several applications). Therefore an alphanumeric character outside +the normal character set was chosen. + +Why not use an existing base32 character set? +--------------------------------------------- + +Existing character sets for base-32 encodings include +`RFC 3548 `_ and +`z-base-32 `_. +The set used for Bech32 was chosen to minimize ambiguity according to +`this `_ visual similarity data, and the +ordering is chosen to minimize the number of pairs of similar characters +(according to the same data) that differ in more than 1 bit. As the checksum is +chosen to maximize detection capabilities for low numbers of bit errors, this +choice improves its performance under some error models. + +Why are the high bits of the human-readable part processed first? +----------------------------------------------------------------- + +This design choice had a rationale for Bitcoin Segregated Witness addresses +(see [#bip-0173]_) that does not apply to Zcash Sapling addresses. It is +retained for compatibility with Bech32 encoders/decoders written for Bitcoin. + + +Reference implementations +========================= + +* The encoder/decoder used by ``zcashd``: + + * `In C++ `_ + +* Encoders and decoders written for Bitcoin: + + * `For C `_ + * `For C++ `_ + * `For Javascript `_ + * `For Go `_ + * `For Python `_ + * `For Haskell `_ + * `For Ruby `_ + * `For Rust `_ + +* Fancy decoder written for Bitcoin that localizes errors: + + * `Fancy decoder for Javascript `_ + (`demo website `_) + +Note that the encoders written for Bitcoin may make assumptions specific to +Segregated Witness address formats that do not apply to Zcash. Only the Python +one has been tested with Zcash at the time of writing. + + +Examples +======== + +TODO: add valid Sapling payment addresses and keys, and their corresponding raw encodings. + + +Test vectors +============ + +The following strings are valid Bech32: + +* ``A12UEL5L`` +* ``a12uel5l`` +* ``an83characterlonghumanreadablepartthatcontainsthenumber1andtheexcludedcharactersbio1tt5tgs`` +* ``abcdef1qpzry9x8gf2tvdw0s3jn54khce6mua7lmqqqxw`` +* ``11qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqc8247j`` +* ``split1checkupstagehandshakeupstreamerranterredcaperred2y9e3w`` +* ``?1ezyfcl`` + +The following strings are not valid Bech32 (with reason for invalidity): + +* 0x20 + ``1nwldj5``: HRP character out of range +* 0x7F + ``1axkwrx``: HRP character out of range +* 0x80 + ``1eym55h``: HRP character out of range +* ``pzry9x0s0muk``: No separator character +* ``1pzry9x0s0muk``: Empty HRP +* ``x1b4n0q5v``: Invalid data character +* ``li1dgmt3``: Too short checksum +* ``de1lg7wt`` + 0xFF: Invalid character in checksum +* ``A1G7SGD8``: checksum calculated with uppercase form of HRP +* ``10a06t8``: empty HRP +* ``1qzzfhee``: empty HRP +* ``bc1qw508d6qejxtdg4y5r3zarvary0c5xw7kv8f3t5``: Invalid checksum +* ``tb1qrp33g0q5c5txsp9arysrx4k6zdkfs4nce4xj0gdcccefvpysxf3q0sL5k7``: Mixed case +* ``bc1zw508d6qejxtdg4y5r3zarvaryvqyzf3du``: Zero padding of more than 4 bits +* ``tb1qrp33g0q5c5txsp9arysrx4k6zdkfs4nce4xj0gdcccefvpysxf3pjxtptv``: Non-zero padding in 8-to-5 conversion + + +Checksum design +=============== + +Design choices +-------------- + +BCH codes can be constructed over any prime-power alphabet and can be chosen to +have a good trade-off between size and error-detection capabilities. Unlike +`Reed-Solomon codes `_, +they are not restricted in length to one less than the alphabet size. While they +also support efficient error correction, the implementation of just error detection +is very simple. + +We pick 6 checksum characters as a trade-off between length of the addresses and +the error-detection capabilities, as 6 characters is the lowest number sufficient +for a random failure chance below 1 per billion. For the length of data we're +most interested in protecting (up to 77 bytes excluding the separator for a +Sapling payment address), BCH codes can be constructed that guarantee detecting +up to 4 errors. Longer data is also supported with slightly weaker error detection. + +Selected properties +''''''''''''''''''' + +Many of these codes perform badly when dealing with more errors than they are +designed to detect, but not all. For that reason, we consider codes that are +designed to detect only 3 errors as well as 4 errors, and analyse how well they +perform in practice. + +The specific code chosen here is the result of: + + * Starting with an exhaustive list of 159605 BCH codes designed to detect 3 or 4 + errors up to length 93, 151, 165, 341, 1023, and 1057. + + * From those, requiring the detection of 4 errors up to length 71, resulting in + 28825 remaining codes. + + * From those, choosing the codes with the best worst-case window for 5-character + errors, resulting in 310 remaining codes. + + * From those, picking the code with the lowest chance for not detecting small + numbers of ''bit'' errors. + +As a naive search would require over 6.5 \* 10\ :sup:`19` checksum evaluations, +a collision-search approach was used for analysis. The code can be found +`here `_. + +Properties +'''''''''' + +The following table summarizes the chances for detection failure (as multiples of +1 in 10\ :sup:`9`). + ++-------------------------------------+-----------------------------------------------+ +| Window length | Number of wrong characters | ++--------+----------------------------+-------+-------+-------+-------+-------+-------+ +| Length | Description | ≤4 | 5 | 6 | 7 | 8 | ≥9 | ++--------+----------------------------+-------+-------+-------+-------+-------+-------+ +| 8 | Longest detecting 6 errors | 0 | 1.127 | 0.909 | n/a | ++--------+----------------------------+---------------+-------+-------+-------+-------+ +| 18 | Longest detecting 5 errors | 0 | 0.965 | 0.929 | 0.932 | 0.931 | ++--------+----------------------------+-------+-------+-------+-------+-------+-------+ +| 19 | Worst case for 6 errors | 0 | 0.093 | 0.972 | 0.928 | 0.931 | ++--------+----------------------------+-------+-------+-------+-------+---------------+ +| 39 | Length ≤ 39 characters | 0 | 0.756 | 0.935 | 0.932 | 0.931 | ++--------+----------------------------+-------+-------+-------+-------+---------------+ +| 59 | Length ≤ 59 characters | 0 | 0.805 | 0.933 | 0.931 | ++--------+----------------------------+-------+-------+-------+-----------------------+ +| 71 | Length ≤ 71 characters | 0 | 0.830 | 0.934 | 0.931 | ++--------+----------------------------+-------+-------+-------+-----------------------+ +| 89 | Longest detecting 4 errors | 0 | 0.867 | 0.933 | 0.931 | ++--------+----------------------------+-------+-------+-------+-----------------------+ + +TODO: fill in table for a Sapling payment address, and check the following paragraph. + +This means that when 5 changed characters occur randomly distributed in the 77 +characters (excluding the separator) of a Sapling payment address, there is a chance +of at most 0.867 per billion that it will go undetected. When those 5 changes occur +randomly within a 19-character window, that chance goes down to 0.093 per billion. +As the number of errors goes up, the chance converges towards 1 in 2\ :sup:`30` = +0.931 per billion. + +The chosen code performs reasonably well up to 1023 characters, but is best for +lengths up to 89 characters (excluding the separator). Since the search for suitable +codes was based on the requirements for Bitcoin P2WPKH and P2WSH addresses, it is +not quite optimal for the address lengths used by Sapling, but the advantages of +compatibility with existing Bitcoin libraries were considered to outweigh this +consideration. + + +Acknowledgements +================ + +This document is closely based on BIP 173 written by Pieter Wuille and Greg Maxwell, +which was inspired by the `address proposal `_ by +Rusty Russell and the `base32 +`_ +proposal by Mark Friedenbach. BIP 173 also had input from Luke Dashjr, Johnson Lau, +Eric Lombrozo, Peter Todd, and various other reviewers. + + +References +========== + +.. [#RFC2119] `Key words for use in RFCs to Indicate Requirement Levels `_ +.. [#protocol] `Zcash Protocol Specification, Version 2018.0-beta-37 [Overwinter+Sapling] `_ +.. [#zip-0032] `ZIP 32: Shielded Hierarchical Deterministic Wallets `_ +.. [#zip-0200] `ZIP 200: Network Upgrade Mechanism `_ +.. [#zip-0205] `ZIP 205: Deployment of the Sapling Network Upgrade `_ +.. [#bip-0013] `BIP 13: Address Format for pay-to-script-hash `_ +.. [#bip-0173] `BIP 173: Base32 address format for native v0-16 witness outputs `_