However, both the character set and the checksum algorithm have limitations:
* Base58 needs a lot of space in QR codes, as it cannot use the ''alphanumeric mode''.
* The mixed case in base58 makes it inconvenient to reliably write down, type on mobile keyboards, or read out loud.
* The double SHA256 checksum is slow and has no error-detection guarantees.
* Most of the research on error-detecting codes only applies to character-set sizes that are a [https://en.wikipedia.org/wiki/Prime_power prime power], which 58 is not.
* Base58 decoding is complicated and relatively slow.
Included in the Segregated Witness proposal are a new class of outputs
* The '''human-readable part''', which is intended to convey the type of data, or anything else that is relevant to the reader. This part MUST contain 1 to 83 US-ASCII characters, with each character having a value in the range [33-126]. HRP validity may be further restricted by specific applications.
* The '''separator''', which is always "1". In case "1" is allowed inside the human-readable part, the last one in the string is the separator<ref>'''Why include a separator in addresses?''' That way the human-readable
part is unambiguously separated from the data part, avoiding potential
collisions with other human-readable parts that share a prefix. It also
allows us to avoid having character-set restrictions on the human-readable part. The
separator is ''1'' because using a non-alphanumeric character would
complicate copy-pasting of addresses (with no double-click selection in
several applications). Therefore an alphanumeric character outside the normal character set
was chosen.</ref>.
* The '''data part''', which is at least 6 characters long and only consists of alphanumeric characters excluding "1", "b", "i", and "o"<ref>'''Why not use an existing character set like [http://www.faqs.org/rfcs/rfc3548.html RFC3548] or [https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt z-base-32]'''?
The character set is chosen to minimize ambiguity according to
[https://hissa.nist.gov/~black/GTLD/ this] visual similarity data, and
the ordering is chosen to minimize the number of pairs of similar
characters (according to the same data) that differ in more than 1 bit.
As the checksum is chosen to maximize detection capabilities for low
numbers of bit errors, this choice improves its performance under some
error models.</ref>.
{| class="wikitable"
|-
!
!0
!1
!2
!3
!4
!5
!6
!7
|-
!+0
|q||p||z||r||y||9||x||8
|-
!+8
|g||f||2||t||v||d||w||0
|-
!+16
|s||3||j||n||5||4||k||h
|-
!+24
|c||e||6||m||u||a||7||l
|}
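
As a brief illustration of how the table is read, the 32 characters can be written as a single string whose index is the character's value. This is only a sketch: the names <tt>CHARSET</tt>, <tt>chars_to_values</tt> and <tt>values_to_chars</tt> are illustrative choices, and the input is assumed to be lowercase already.

```python
# Characters of the table above, read row by row; the index of a
# character in this string is its 5-bit value.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"


def chars_to_values(data_part):
    """Map lowercase data-part characters to integers in 0..31.

    Raises ValueError for any character outside the set
    (for example "1", "b", "i" or "o").
    """
    return [CHARSET.index(c) for c in data_part]


def values_to_chars(values):
    """Inverse mapping: integers in 0..31 back to characters."""
    return "".join(CHARSET[v] for v in values)


print(chars_to_values("qpzry9x8"))  # [0, 1, 2, 3, 4, 5, 6, 7]
```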
'''Checksum'''
The last six characters of the data part form a checksum and contain no
information. Valid strings MUST pass the criteria for validity specified
by the Python3 code snippet below. The function
<tt>bech32_verify_checksum</tt> must return true when its arguments are:
* <tt>hrp</tt>: the human-readable part as a string
* <tt>data</tt>: the data part as a list of integers representing the characters after conversion using the table above
<pre>
def bech32_polymod(values):
  GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]
  chk = 1
  for v in values:
    b = (chk >> 25)
    chk = (chk & 0x1ffffff) << 5 ^ v
    for i in range(5):
      chk ^= GEN[i] if ((b >> i) & 1) else 0
  return chk

def bech32_hrp_expand(s):
  return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]

def bech32_verify_checksum(hrp, data):
  return bech32_polymod(bech32_hrp_expand(hrp) + data) == 1
</pre>
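The human-readable part is processed by first feeding the higher bits of each character's US-ASCII value into the checksum calculation, followed by a zero and then the lower bits of each<ref>'''Why are the high bits of the human-readable part processed first?'''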
This results in the actually checksummed data being ''[high hrp] 0 [low hrp] [data]''. This means that under the assumption that errors to the
human-readable part only change the low 5 bits (like changing an alphabetical character into another), errors are restricted to the ''[low hrp] [data]''
part, which is at most 89 characters, and thus all error detection properties (see appendix) remain applicable.</ref>.
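
To make the ordering concrete, the expansion can be applied to the human-readable part "bc". The sketch below repeats the expansion function so that it runs on its own; the variable name <tt>expanded</tt> is illustrative.

```python
def bech32_hrp_expand(s):
    # High 3 bits of every character, a zero separator, then the low 5 bits.
    return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]


# 'b' is 98 (high bits 3, low bits 2) and 'c' is 99 (high bits 3, low bits 3).
expanded = bech32_hrp_expand("bc")
print(expanded)  # [3, 3, 0, 2, 3]
```

For lowercase letters the high bits are always 3, so an error that turns one letter into another only touches the second half of the expanded list.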
To construct a valid checksum given the human-readable part and (non-checksum) values of the data-part characters, the code below can be used:
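
A minimal sketch of that construction, assuming the convention of the verification routine that a valid string yields a polymod value of 1. The function name <tt>bech32_create_checksum</tt> is chosen here for illustration, and the two helpers are repeated so that the block runs on its own.

```python
GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]


def bech32_polymod(values):
    chk = 1
    for v in values:
        b = chk >> 25
        chk = (chk & 0x1ffffff) << 5 ^ v
        for i in range(5):
            chk ^= GEN[i] if ((b >> i) & 1) else 0
    return chk


def bech32_hrp_expand(s):
    return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]


def bech32_create_checksum(hrp, data):
    """Return the six 5-bit checksum values for hrp and non-checksum data."""
    values = bech32_hrp_expand(hrp) + data
    # Append six zero placeholders, then flip the lowest bit so that the
    # completed string evaluates to 1 under bech32_polymod.
    polymod = bech32_polymod(values + [0, 0, 0, 0, 0, 0]) ^ 1
    # Split the 30-bit result into six 5-bit groups, most significant first.
    return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]


# An empty data part with human-readable part "a" gives the values of "2uel5l".
print(bech32_create_checksum("a", []))  # [10, 28, 25, 31, 20, 31]
```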
A segwit address<ref>'''Why not make an address format that is generic for all scriptPubKeys?'''
That would lead to confusion about addresses for
existing scriptPubKey types. Furthermore, if addresses that do not have a one-to-one mapping with scriptPubKeys (such as ECDH-based
addresses) are ever introduced, having a fully generic old address type available would
permit reinterpreting the resulting scriptPubKeys using the old address
format, with lost funds as a result if bitcoins are sent to them.</ref> is a Bech32 encoding of:
* The human-readable part "bc"<ref>'''Why use 'bc' as human-readable part and not 'btc'?''' 'bc' is shorter.</ref> for mainnet, and "tb"<ref>'''Why use 'tb' as human-readable part for testnet?''' It was chosen to
be of the same length as the mainnet counterpart (to simplify
implementations' assumptions about lengths), but still be visually
distinct.</ref> for testnet.
* The data-part values:
** 1 value: the witness version
** A conversion of the 2-to-40-byte witness program (as defined by [https://github.com/bitcoin/bips/blob/master/bip-0141.mediawiki BIP141]) to base32:
*** Start with the bits of the witness program, most significant bit per byte first.
*** Re-arrange those bits into groups of 5, and pad with zeroes at the end if needed.
*** Translate those bits to characters using the table above.
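
The encoding steps above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the names <tt>to_base32</tt> and <tt>encode_segwit_address</tt> are hypothetical, the checksum helpers are repeated so the block is self-contained, and no validation of the witness version or program length is performed.

```python
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]


def bech32_polymod(values):
    chk = 1
    for v in values:
        b = chk >> 25
        chk = (chk & 0x1ffffff) << 5 ^ v
        for i in range(5):
            chk ^= GEN[i] if ((b >> i) & 1) else 0
    return chk


def bech32_hrp_expand(s):
    return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]


def bech32_create_checksum(hrp, data):
    polymod = bech32_polymod(bech32_hrp_expand(hrp) + data + [0] * 6) ^ 1
    return [(polymod >> 5 * (5 - i)) & 31 for i in range(6)]


def to_base32(program):
    """Regroup bytes (most significant bit first) into 5-bit values,
    padding the final group with zero bits if it is incomplete."""
    acc, bits, out = 0, 0, []
    for byte in program:
        acc = (acc << 8) | byte
        bits += 8
        while bits >= 5:
            bits -= 5
            out.append((acc >> bits) & 31)
    if bits:
        out.append((acc << (5 - bits)) & 31)
    return out


def encode_segwit_address(hrp, witver, program):
    """Hypothetical helper: witness version + base32 program + checksum."""
    data = [witver] + to_base32(program)
    combined = data + bech32_create_checksum(hrp, data)
    return hrp + "1" + "".join(CHARSET[d] for d in combined)


program = bytes.fromhex("751e76e8199196d454941c45d1b3a323f1433bd6")
print(encode_segwit_address("bc", 0, program))
```

A 20-byte program occupies exactly 32 five-bit groups, so no padding is needed in that case; a 32-byte program leaves one bit over and is padded with four zero bits.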
'''Decoding'''
Software interpreting a segwit address:
* MUST verify that the human-readable part is "bc" for mainnet and "tb" for testnet.
* MUST verify that the first decoded data value (the witness version) is between 0 and 16, inclusive.
* Convert the rest of the data to bytes:
** Translate the values to 5 bits, most significant bit first.
** Re-arrange those bits into groups of 8 bits. Any incomplete group at the end MUST be 4 bits or less, MUST be all zeroes, and is discarded.
** There MUST be between 2 and 40 groups, which are interpreted as the bytes of the witness program.
Decoders SHOULD enforce known-length restrictions on witness programs.
For example, BIP141 specifies ''If the version byte is 0, but the witness
program is neither 20 nor 32 bytes, the script must fail.''
As a result of the previous rules, addresses are always between 14 and 74 characters long, and their length modulo 8 cannot be 0, 3, or 5.
Version 0 witness addresses are always 42 or 62 characters, but implementations MUST allow the use of any version.
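
A minimal sketch of a decoder that applies the rules above, including the BIP141 length restriction for version 0. The function name <tt>decode_segwit_address</tt> and its return convention (a tuple, or <tt>None</tt> on failure) are assumptions made for illustration, and the input is assumed to be lowercase already.

```python
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]


def bech32_polymod(values):
    chk = 1
    for v in values:
        b = chk >> 25
        chk = (chk & 0x1ffffff) << 5 ^ v
        for i in range(5):
            chk ^= GEN[i] if ((b >> i) & 1) else 0
    return chk


def bech32_hrp_expand(s):
    return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]


def decode_segwit_address(expected_hrp, addr):
    """Return (witness_version, program) or None if any rule fails."""
    pos = addr.rfind("1")  # the last "1" is the separator
    # Need a non-empty hrp, a version character and six checksum characters.
    if pos < 1 or len(addr) - pos - 1 < 7:
        return None
    hrp, data_part = addr[:pos], addr[pos + 1:]
    if hrp != expected_hrp or any(c not in CHARSET for c in data_part):
        return None
    values = [CHARSET.index(c) for c in data_part]
    if bech32_polymod(bech32_hrp_expand(hrp) + values) != 1:
        return None
    witver, payload = values[0], values[1:-6]
    if witver > 16:
        return None
    # Regroup 5-bit values into bytes, most significant bit first.
    acc, bits, program = 0, 0, bytearray()
    for v in payload:
        acc = (acc << 5) | v
        bits += 5
        if bits >= 8:
            bits -= 8
            program.append((acc >> bits) & 0xFF)
    # The incomplete trailing group must be at most 4 bits and all zero.
    if bits > 4 or (acc & ((1 << bits) - 1)) != 0:
        return None
    if not 2 <= len(program) <= 40:
        return None
    if witver == 0 and len(program) not in (20, 32):
        return None
    return witver, bytes(program)


print(decode_segwit_address("bc", "bc1qw508d6qejxtdg4y5r3zarvary0c5xw7kv8f3t4"))
```

Passing the expected human-readable part explicitly keeps the network check in one place: a mainnet address offered to a testnet decoder is rejected before the payload is interpreted.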
* <tt>?1ezyfcl</tt> WARNING: During conversion to US-ASCII some encoders may set unmappable characters to a valid US-ASCII character, such as '?'. For example:
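
The substitution described in the warning can be reproduced with Python's built-in <tt>'replace'</tt> error handler, as in the sketch below. The checksum helpers are repeated so the block is self-contained, and <tt>CHARSET</tt> is an illustrative name for the character table.

```python
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
GEN = [0x3b6a57b2, 0x26508e6d, 0x1ea119fa, 0x3d4233dd, 0x2a1462b3]


def bech32_polymod(values):
    chk = 1
    for v in values:
        b = chk >> 25
        chk = (chk & 0x1ffffff) << 5 ^ v
        for i in range(5):
            chk ^= GEN[i] if ((b >> i) & 1) else 0
    return chk


def bech32_hrp_expand(s):
    return [ord(x) >> 5 for x in s] + [0] + [ord(x) & 31 for x in s]


def bech32_verify_checksum(hrp, data):
    return bech32_polymod(bech32_hrp_expand(hrp) + data) == 1


# U+0080 has no US-ASCII mapping; the 'replace' handler silently turns it
# into "?", which is a permitted human-readable-part character.
hrp = "\x80".encode("ascii", "replace").decode("ascii")
print(hrp)  # ?

# The resulting string "?1ezyfcl" then passes checksum verification.
data = [CHARSET.index(c) for c in "ezyfcl"]
print(bech32_verify_checksum(hrp, data))  # True
```

An encoder that wants to avoid this could use the <tt>'strict'</tt> error handler instead, so that an unmappable character raises an error rather than producing a different, valid human-readable part.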
This means that when 5 changed characters occur randomly distributed in
the 39 characters of a P2WPKH address, there is a chance of
''0.756 per billion'' that it will go undetected. When those 5 changes
occur randomly within a 19-character window, that chance goes down to
''0.093 per billion''. As the number of errors goes up, the chance
converges towards ''1 in 2<sup>30</sup>'' = ''0.931 per billion''.
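
The limiting figure can be checked with a short calculation: the checksum carries 30 bits, so a heavily corrupted string passes verification with probability of about 1 in 2<sup>30</sup>. The variable name below is illustrative.

```python
# Chance that a random 30-bit checksum matches, expressed per billion.
limit_per_billion = 1e9 / 2**30
print(round(limit_per_billion, 3))  # 0.931
```
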
Even though the chosen code performs reasonably well up to 1023 characters,
other designs are preferable for lengths above 89 characters (excluding the
separator).
==Acknowledgements==
This document is inspired by the [https://rusty.ozlabs.org/?p=578 address proposal] by Rusty Russell, the
[https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2014-February/004402.html base32] proposal by Mark Friedenbach, and had input from Luke Dashjr,
Johnson Lau, Eric Lombrozo, Peter Todd, and various other reviewers.