Merge bitcoin/bitcoin#27432: contrib: add tool to convert compact-serialized UTXO set to SQLite database

4080b66cbe test: add test for utxo-to-sqlite conversion script (Sebastian Falbesoner)
ec99ed7380 contrib: add tool to convert compact-serialized UTXO set to SQLite database (Sebastian Falbesoner)

Pull request description:

  ## Problem description

  There is demand from users to get the UTXO set in the form of a SQLite database (#24628). Bitcoin Core currently only supports dumping the UTXO set in a binary _compact-serialized_ format, which was crafted specifically for AssumeUTXO snapshots (see PR #16899), with the primary goal of being as compact as possible. Previous PRs tried to extend the `dumptxoutset` RPC with new formats, either in human-readable form (e.g. #18689, #24202), or most recently, directly as a SQLite database (#24952). Neither approach is optimal: due to the huge size of the ever-growing UTXO set with already more than 80 million entries on mainnet, human-readable formats are practically useless, and very likely one of the first steps would be to put them in some form of database anyway. Directly adding SQLite3 dumping support on the other hand introduces an additional dependency to the non-wallet part of bitcoind and the risk of increased maintenance burden (see e.g. https://github.com/bitcoin/bitcoin/pull/24952#issuecomment-1163551060, https://github.com/bitcoin/bitcoin/issues/24628#issuecomment-1108469715).

  ## Proposed solution

  This PR follows the "external tooling" route by adding a simple Python script for achieving the same goal in a two-step process (first create a compact-serialized UTXO set via `dumptxoutset`, then convert it to SQLite via the new script). Executive summary:
  - single file, no extra dependencies (sqlite3 is included in Python's standard library [1])
  - ~150 LOC, mostly deserialization/decompression routines ported from the Core codebase and (probably the most difficult part) a little elliptic curve / finite field math to decompress pubkeys (essentially solving the secp256k1 curve equation y^2 = x^3 + 7 for y given x, respecting the proper polarity as indicated by the compression tag)
  - creates a database with only one table `utxos` with the following schema:
    ```(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT)```
  - the resulting file has roughly 2x the size of the compact-serialized UTXO set (this is mostly due to encoding txids and scriptpubkeys as hex-strings rather than bytes)

  [1] note that there are some rare cases of operating systems like FreeBSD though, where the sqlite3 module has to be installed explicitly (see #26819)
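
The pubkey decompression step mentioned in the summary can be illustrated with a small worked example. This is only a sketch, not the script itself: the helper name `lift_x` is hypothetical, and the input is assumed to be the well-known secp256k1 generator point, whose compressed encoding carries the tag 0x02 (even y).

```python
# Illustrative sketch only: recover y from x on secp256k1 (y^2 = x^3 + 7 mod P).
P = 2**256 - 2**32 - 977  # secp256k1 field prime


def lift_x(x, want_odd):
    """Return the y coordinate for x with the requested parity (hypothetical helper)."""
    rhs = (pow(x, 3, P) + 7) % P
    # P % 4 == 3, so a square root (if one exists) is rhs^((P+1)/4) mod P
    y = pow(rhs, (P + 1) // 4, P)
    if pow(y, 2, P) != rhs:
        raise ValueError("x is not the coordinate of a curve point")
    if (y & 1) != int(want_odd):
        y = P - y  # the other root always has the opposite parity
    return y


# Coordinates of the secp256k1 generator point G
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

y_even = lift_x(GX, want_odd=False)  # corresponds to compression tag 0x02
y_odd = lift_x(GX, want_odd=True)    # corresponds to compression tag 0x03
print(hex(y_even))
```

Because P is odd, the two square roots `y` and `P - y` always differ in parity, which is why a single tag bit is enough to select the right one.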

  A functional test is also added that creates UTXO set entries with various output script types (standard and also non-standard, e.g. large scripts) and verifies that the UTXO sets of both formats match by comparing corresponding MuHashes. One MuHash is supplied by the bitcoind instance via `gettxoutsetinfo muhash`, the other is calculated in the test by reading back the created SQLite database entries and hashing them with the test framework's `MuHash3072` module.
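
The per-UTXO serialization that is fed into MuHash can be sketched without the test framework. The following is a minimal sketch assuming the layout used by the functional test (outpoint, then `height * 2 + coinbase` as a 4-byte little-endian integer, then the transaction output). The names `compact_size` and `serialize_utxo` and the sample row are made up for illustration, and the MuHash computation itself is not reproduced here.

```python
import struct


def compact_size(n):
    """Encode n in Bitcoin's CompactSize format."""
    if n < 253:
        return bytes([n])
    if n <= 0xffff:
        return b"\xfd" + struct.pack("<H", n)
    if n <= 0xffffffff:
        return b"\xfe" + struct.pack("<I", n)
    return b"\xff" + struct.pack("<Q", n)


def serialize_utxo(txid_hex, vout, value, coinbase, height, spk_hex):
    """Serialize one database row the way the test prepares it for hashing (sketch)."""
    spk = bytes.fromhex(spk_hex)
    ser = bytes.fromhex(txid_hex)[::-1]              # txids are displayed byte-reversed
    ser += struct.pack("<I", vout)                   # output index
    ser += struct.pack("<I", height * 2 + coinbase)  # height and coinbase flag
    ser += struct.pack("<q", value)                  # amount in satoshis
    ser += compact_size(len(spk)) + spk              # length-prefixed scriptPubKey
    return ser


# made-up row: coinbase output of 50 BTC at height 100 with a one-byte script
row = ("00" * 31 + "01", 0, 5_000_000_000, 1, 100, "51")
ser = serialize_utxo(*row)
print(ser.hex())
```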

  ## Manual test instructions
  I'd suggest also doing manual tests by comparing MuHashes. For that, I wrote a Go tool some time ago which calculates the MuHash of a SQLite database in the created format (I tried to write a similar tool in Python, but it's painfully slow).
  ```
  $ [run bitcoind instance with -coinstatsindex]
  $ ./src/bitcoin-cli dumptxoutset ~/utxos.dat
  $ ./src/bitcoin-cli gettxoutsetinfo muhash <block height returned in previous call>
  (outputs MuHash calculated from node)

  $ ./contrib/utxo-tools/utxo_to_sqlite.py ~/utxos.dat ~/utxos.sqlite
  $ git clone https://github.com/theStack/utxo_dump_tools
  $ cd utxo_dump_tools/calc_utxo_hash
  $ go run calc_utxo_hash.go ~/utxos.sqlite
  (outputs MuHash calculated from the SQLite UTXO set)

  => verify that both MuHashes are equal
  ```
  For a demonstration of what can be done with the resulting database, see https://github.com/bitcoin/bitcoin/pull/24952#pullrequestreview-956290477 for some example queries. Thanks go to LarryRuane who gave me the idea of rewriting this script in Python and adding it to `contrib`.
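
As a rough idea of how such queries look from Python, here is a minimal sketch using the standard library's `sqlite3` module. It assumes the `utxos` schema given above; the rows are made-up sample data in an in-memory database, whereas a real run would open the converted file instead.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a real run would pass the path of the converted database
con.execute("CREATE TABLE utxos(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT)")
sample_rows = [  # made-up data for illustration only
    ("aa" * 32, 0, 5_000_000_000, 1, 1, "51"),
    ("bb" * 32, 0, 1_000, 0, 2, "0014" + "11" * 20),
    ("bb" * 32, 1, 2_000, 0, 2, "0014" + "22" * 20),
]
con.executemany("INSERT INTO utxos VALUES(?, ?, ?, ?, ?, ?)", sample_rows)

# total number of coins and total amount in satoshis
total_coins, total_value = con.execute("SELECT COUNT(*), SUM(value) FROM utxos").fetchone()

# P2WPKH scripts are 22 bytes: 0x00 0x14 followed by a 20-byte hash (44 hex characters)
p2wpkh_count = con.execute(
    "SELECT COUNT(*) FROM utxos WHERE scriptpubkey LIKE '0014%' AND LENGTH(scriptpubkey) = 44"
).fetchone()[0]

# amount held in coinbase outputs
coinbase_value = con.execute("SELECT SUM(value) FROM utxos WHERE coinbase = 1").fetchone()[0]
con.close()
print(total_coins, total_value, p2wpkh_count, coinbase_value)
```

Since txids and scriptpubkeys are stored as hex strings, prefix matches with `LIKE` are a simple way to filter by script type.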

ACKs for top commit:
  ajtowns:
    ACK 4080b66cbe - light review
  achow101:
    ACK 4080b66cbe
  romanz:
    tACK 4080b66cbe on signet (using [calc_utxo_hash](8981aa3e85/calc_utxo_hash/calc_utxo_hash.go)):
  tdb3:
    ACK 4080b66cbe

Tree-SHA512: be8aa0369a28c8421a3ccdf1402e106563dd07c082269707311ca584d1c4c8c7b97d48c4fcd344696a36e7ab8cdb64a1d0ef9a192a15cff6d470baf21e46ee7b
Ava Chow 2025-02-14 15:22:10 -08:00
commit 43e71f7498
GPG key ID: 17565732E08E5E41
4 changed files with 321 additions and 0 deletions

@@ -43,3 +43,11 @@ Command Line Tools
### [Completions](/contrib/completions) ###
Shell completions for bash and fish.

UTXO Set Tools
--------------

### [UTXO-to-SQLite](/contrib/utxo-tools/utxo_to_sqlite.py) ###
This script converts a compact-serialized UTXO set (as generated by Bitcoin Core with `dumptxoutset`)
to a SQLite3 database. For more details, such as the created table name and schema, refer to the
module docstring on top of the script, which is also contained in the command's `--help` output.

@@ -0,0 +1,195 @@
#!/usr/bin/env python3
# Copyright (c) 2024-present The Bitcoin Core developers
# Distributed under the MIT software license, see the accompanying
# file COPYING or http://www.opensource.org/licenses/mit-license.php.
"""Tool to convert a compact-serialized UTXO set to a SQLite3 database.

The input UTXO set can be generated by Bitcoin Core with the `dumptxoutset` RPC:
$ bitcoin-cli dumptxoutset ~/utxos.dat

The created database contains a table `utxos` with the following schema:
(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT)
"""
import argparse
import os
import sqlite3
import sys
import time
UTXO_DUMP_MAGIC = b'utxo\xff'
UTXO_DUMP_VERSION = 2
NET_MAGIC_BYTES = {
    b"\xf9\xbe\xb4\xd9": "Mainnet",
    b"\x0a\x03\xcf\x40": "Signet",
    b"\x0b\x11\x09\x07": "Testnet3",
    b"\x1c\x16\x3f\x28": "Testnet4",
    b"\xfa\xbf\xb5\xda": "Regtest",
}


def read_varint(f):
    """Equivalent of `ReadVarInt()` (see serialization module)."""
    n = 0
    while True:
        dat = f.read(1)[0]
        n = (n << 7) | (dat & 0x7f)
        if (dat & 0x80) > 0:
            n += 1
        else:
            return n


def read_compactsize(f):
    """Equivalent of `ReadCompactSize()` (see serialization module)."""
    n = f.read(1)[0]
    if n == 253:
        n = int.from_bytes(f.read(2), "little")
    elif n == 254:
        n = int.from_bytes(f.read(4), "little")
    elif n == 255:
        n = int.from_bytes(f.read(8), "little")
    return n


def decompress_amount(x):
    """Equivalent of `DecompressAmount()` (see compressor module)."""
    if x == 0:
        return 0
    x -= 1
    e = x % 10
    x //= 10
    n = 0
    if e < 9:
        d = (x % 9) + 1
        x //= 9
        n = x * 10 + d
    else:
        n = x + 1
    while e > 0:
        n *= 10
        e -= 1
    return n


def decompress_script(f):
    """Equivalent of `DecompressScript()` (see compressor module)."""
    size = read_varint(f)  # sizes 0-5 encode compressed script types
    if size == 0:  # P2PKH
        return bytes([0x76, 0xa9, 20]) + f.read(20) + bytes([0x88, 0xac])
    elif size == 1:  # P2SH
        return bytes([0xa9, 20]) + f.read(20) + bytes([0x87])
    elif size in (2, 3):  # P2PK (compressed)
        return bytes([33, size]) + f.read(32) + bytes([0xac])
    elif size in (4, 5):  # P2PK (uncompressed)
        compressed_pubkey = bytes([size - 2]) + f.read(32)
        return bytes([65]) + decompress_pubkey(compressed_pubkey) + bytes([0xac])
    else:  # others (bare multisig, segwit etc.)
        size -= 6
        assert size <= 10000, f"too long script with size {size}"
        return f.read(size)


def decompress_pubkey(compressed_pubkey):
    """Decompress pubkey by calculating y = sqrt(x^3 + 7) % p
       (see functions `secp256k1_eckey_pubkey_parse` and `secp256k1_ge_set_xo_var`).
    """
    P = 2**256 - 2**32 - 977  # secp256k1 field size
    assert len(compressed_pubkey) == 33 and compressed_pubkey[0] in (2, 3)
    x = int.from_bytes(compressed_pubkey[1:], 'big')
    rhs = (x**3 + 7) % P
    y = pow(rhs, (P + 1)//4, P)  # get sqrt using Tonelli-Shanks algorithm (for p % 4 = 3)
    assert pow(y, 2, P) == rhs, f"pubkey is not on curve ({compressed_pubkey.hex()})"
    tag_is_odd = compressed_pubkey[0] == 3
    y_is_odd = (y & 1) == 1
    if tag_is_odd != y_is_odd:  # fix parity (even/odd) if necessary
        y = P - y
    return bytes([4]) + x.to_bytes(32, 'big') + y.to_bytes(32, 'big')


def main():
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('infile', help='filename of compact-serialized UTXO set (input)')
    parser.add_argument('outfile', help='filename of created SQLite3 database (output)')
    parser.add_argument('-v', '--verbose', action='store_true', help='show details about each UTXO')
    args = parser.parse_args()

    if not os.path.exists(args.infile):
        print(f"Error: provided input file '{args.infile}' doesn't exist.")
        sys.exit(1)

    if os.path.exists(args.outfile):
        print(f"Error: provided output file '{args.outfile}' already exists.")
        sys.exit(1)

    # create database table
    con = sqlite3.connect(args.outfile)
    con.execute("CREATE TABLE utxos(txid TEXT, vout INT, value INT, coinbase INT, height INT, scriptpubkey TEXT)")

    # read metadata (magic bytes, version, network magic, block hash, UTXO count)
    f = open(args.infile, 'rb')
    magic_bytes = f.read(5)
    version = int.from_bytes(f.read(2), 'little')
    network_magic = f.read(4)
    block_hash = f.read(32)
    num_utxos = int.from_bytes(f.read(8), 'little')
    if magic_bytes != UTXO_DUMP_MAGIC:
        print(f"Error: provided input file '{args.infile}' is not an UTXO dump.")
        sys.exit(1)
    if version != UTXO_DUMP_VERSION:
        print(f"Error: provided input file '{args.infile}' has unknown UTXO dump version {version} "
              f"(only version {UTXO_DUMP_VERSION} supported)")
        sys.exit(1)
    network_string = NET_MAGIC_BYTES.get(network_magic, f"unknown network ({network_magic.hex()})")
    print(f"UTXO Snapshot for {network_string} at block hash "
          f"{block_hash[::-1].hex()[:32]}..., contains {num_utxos} coins")

    start_time = time.time()
    write_batch = []
    coins_per_hash_left = 0
    prevout_hash = None
    max_height = 0

    for coin_idx in range(1, num_utxos+1):
        # read key (COutPoint)
        if coins_per_hash_left == 0:  # read next prevout hash
            prevout_hash = f.read(32)[::-1].hex()
            coins_per_hash_left = read_compactsize(f)
        prevout_index = read_compactsize(f)
        # read value (Coin)
        code = read_varint(f)
        height = code >> 1
        is_coinbase = code & 1
        amount = decompress_amount(read_varint(f))
        scriptpubkey = decompress_script(f).hex()
        write_batch.append((prevout_hash, prevout_index, amount, is_coinbase, height, scriptpubkey))
        if height > max_height:
            max_height = height
        coins_per_hash_left -= 1

        if args.verbose:
            print(f"Coin {coin_idx}/{num_utxos}:")
            print(f"    prevout = {prevout_hash}:{prevout_index}")
            print(f"    amount = {amount}, height = {height}, coinbase = {is_coinbase}")
            print(f"    scriptPubKey = {scriptpubkey}\n")

        if coin_idx % (16*1024) == 0 or coin_idx == num_utxos:
            # write utxo batch to database
            con.executemany("INSERT INTO utxos VALUES(?, ?, ?, ?, ?, ?)", write_batch)
            con.commit()
            write_batch.clear()

        if coin_idx % (1024*1024) == 0:
            elapsed = time.time() - start_time
            print(f"{coin_idx} coins converted [{coin_idx/num_utxos*100:.2f}%], " +
                  f"{elapsed:.3f}s passed since start")
    con.close()

    print(f"TOTAL: {num_utxos} coins written to {args.outfile}, snapshot height is {max_height}.")
    if f.read(1) != b'':  # EOF should be reached by now
        print(f"WARNING: input file {args.infile} has not reached EOF yet!")
        sys.exit(1)


if __name__ == '__main__':
    main()

@@ -289,6 +289,7 @@ BASE_SCRIPTS = [
    'mempool_package_onemore.py',
    'mempool_package_limits.py',
    'mempool_package_rbf.py',
    'tool_utxo_to_sqlite.py',
    'feature_versionbits_warning.py',
    'feature_blocksxor.py',
    'rpc_preciousblock.py',

@@ -0,0 +1,117 @@
#!/usr/bin/env python3
# Copyright (c) 2024-present The Bitcoin Core developers
# Distributed under the MIT software license, see the accompanying
# file COPYING or http://www.opensource.org/licenses/mit-license.php.
"""Test utxo-to-sqlite conversion tool"""
import os.path
try:
    import sqlite3
except ImportError:
    pass
import subprocess
import sys

from test_framework.key import ECKey
from test_framework.messages import (
    COutPoint,
    CTxOut,
)
from test_framework.crypto.muhash import MuHash3072
from test_framework.script import (
    CScript,
    CScriptOp,
)
from test_framework.script_util import (
    PAY_TO_ANCHOR,
    key_to_p2pk_script,
    key_to_p2pkh_script,
    key_to_p2wpkh_script,
    keys_to_multisig_script,
    output_key_to_p2tr_script,
    script_to_p2sh_script,
    script_to_p2wsh_script,
)
from test_framework.test_framework import BitcoinTestFramework
from test_framework.util import (
    assert_equal,
)
from test_framework.wallet import MiniWallet


def calculate_muhash_from_sqlite_utxos(filename):
    muhash = MuHash3072()
    con = sqlite3.connect(filename)
    cur = con.cursor()
    for (txid_hex, vout, value, coinbase, height, spk_hex) in cur.execute("SELECT * FROM utxos"):
        # serialize UTXO for MuHash (see function `TxOutSer` in the coinstats module)
        utxo_ser = COutPoint(int(txid_hex, 16), vout).serialize()
        utxo_ser += (height * 2 + coinbase).to_bytes(4, 'little')
        utxo_ser += CTxOut(value, bytes.fromhex(spk_hex)).serialize()
        muhash.insert(utxo_ser)
    con.close()
    return muhash.digest()[::-1].hex()


class UtxoToSqliteTest(BitcoinTestFramework):
    def set_test_params(self):
        self.num_nodes = 1
        # we want to create some UTXOs with non-standard output scripts
        self.extra_args = [['-acceptnonstdtxn=1']]

    def skip_test_if_missing_module(self):
        self.skip_if_no_py_sqlite3()

    def run_test(self):
        node = self.nodes[0]
        wallet = MiniWallet(node)
        key = ECKey()

        self.log.info('Create UTXOs with various output script types')
        for i in range(1, 10+1):
            key.generate(compressed=False)
            uncompressed_pubkey = key.get_pubkey().get_bytes()
            key.generate(compressed=True)
            pubkey = key.get_pubkey().get_bytes()

            # add output scripts for compressed script type 0 (P2PKH), type 1 (P2SH),
            # types 2-3 (P2PK compressed), types 4-5 (P2PK uncompressed) and
            # for uncompressed scripts (bare multisig, segwit, etc.)
            output_scripts = (
                key_to_p2pkh_script(pubkey),
                script_to_p2sh_script(key_to_p2pkh_script(pubkey)),
                key_to_p2pk_script(pubkey),
                key_to_p2pk_script(uncompressed_pubkey),

                keys_to_multisig_script([pubkey]*i),
                keys_to_multisig_script([uncompressed_pubkey]*i),
                key_to_p2wpkh_script(pubkey),
                script_to_p2wsh_script(key_to_p2pkh_script(pubkey)),
                output_key_to_p2tr_script(pubkey[1:]),
                PAY_TO_ANCHOR,
                CScript([CScriptOp.encode_op_n(i)]*(1000*i)),  # large script (up to 10000 bytes)
            )

            # create outputs and mine them in a block
            for output_script in output_scripts:
                wallet.send_to(from_node=node, scriptPubKey=output_script, amount=i, fee=20000)
            self.generate(wallet, 1)

        self.log.info('Dump UTXO set via `dumptxoutset` RPC')
        input_filename = os.path.join(self.options.tmpdir, "utxos.dat")
        node.dumptxoutset(input_filename, "latest")

        self.log.info('Convert UTXO set from compact-serialized format to sqlite format')
        output_filename = os.path.join(self.options.tmpdir, "utxos.sqlite")
        base_dir = self.config["environment"]["SRCDIR"]
        utxo_to_sqlite_path = os.path.join(base_dir, "contrib", "utxo-tools", "utxo_to_sqlite.py")
        subprocess.run([sys.executable, utxo_to_sqlite_path, input_filename, output_filename],
                       check=True, stderr=subprocess.STDOUT)

        self.log.info('Verify that both UTXO sets match by comparing their MuHash')
        muhash_sqlite = calculate_muhash_from_sqlite_utxos(output_filename)
        muhash_compact_serialized = node.gettxoutsetinfo('muhash')['muhash']
        assert_equal(muhash_sqlite, muhash_compact_serialized)


if __name__ == "__main__":
    UtxoToSqliteTest(__file__).main()