Extracting Blockchain Data with Cryo
A few months ago, noted EVM enthusiast storm released an EVM data extraction tool called Cryo. Cryo is a Rust tool that performs native JSON-RPC queries and stores the extracted data in Apache Parquet format.
His initial announcement on Twitter certainly caught my attention. 16x performance over native Python? I’m in!
Readers may remember my exploration of liquidity snapshots for Uniswap V3 pools.
The end result of that effort is a fairly compact JSON file (roughly 12 MB for Ethereum mainnet) that contains a complete snapshot of all V3 liquidity information up to a certain block.
The most painful part of the process was the data extraction. Using web3.py to retrieve blockchain data via JSON-RPC is straightforward but slow. Out of curiosity, I ran a complete liquidity event retrieval last night against my Geth node, from the deployment block of Uniswap V3 (12,369,621) to block height 18,448,500. The process took about an hour, and most of that time was spent waiting on results from get_logs calls through web3.py.
I could have improved the process by doing concurrent requests via the new async web3 provider, but I was drawn to Cryo for a different reason: persistent data storage.
The Apache Parquet format is very powerful, and it solves a particularly difficult problem for indexed data. Before the Cryo release, storm wrote a fascinating Twitter thread about Parquet.
I also recommend this Parquet overview. A key benefit of Parquet files is that you can evolve a schema over time. Say that you initially want to extract only the block numbers, topics, and data for a given set of event logs. Later as your analytics requirement changes, you decide that you need the transaction hashes of those events. You can run a separate Cryo extraction for that info, then bundle the two separate Parquet files together as a dataset, which is merged by common indices. It’s very similar to combining tables with SQL JOIN, but without the need to operate a database or predefine schema.
The takeaway is that Parquet data storage is high performance, compressible, and natively queryable with built-in filters. For blockchain data that is recorded once and then unchanged, Parquet is a fantastic format for cold storage. Node clients do a good job of exposing this data, but they are primarily designed for fast execution, efficient storage, and maintaining pace with the live chain. A node client optimized for heavy historical block queries would need to make performance compromises elsewhere, or impose such a hardware load as to make it impractical.
Cryo solves that problem by taking a different approach: extract the data from your node client, persist it in Parquet format, and then run your queries from that efficient offline data set.
Nothing is free though! The cost you pay for this is additional storage space. Paradigm published a handful of Cryo data sets up to block 16,799,999 earlier this year. Storing them on your system would require roughly 100 GB.
Banteg reported on Twitter that a full Cryo extraction of every block from Erigon consumed 430 GB with zstd level 3 compression. Interestingly, this is nearly the storage difference between running a full node (Geth) and an archive node (Reth, Erigon).
Perhaps some hybrid model will emerge where searchers primarily use full nodes and a regularly-updated Cryo data set for historical queries?
With the preamble done, let’s install Cryo and test it out. The Cryo Github lists several options:
Install Rust-native client from source
Install Rust-native client from crates.io
Install Python bindings from PyPI
Install Python bindings from source
I tried them all, but unfortunately the two Python methods and the Rust-native installation from crates.io fail with various build errors. I will spend more time looking at this, because I’d love to script Cryo from Python without directly managing processes.
The compilation of the Rust-native client from source works (for now), so let’s do that.
First, you need to have Rust installed. Install the appropriate package for your distribution and confirm it works with
[btd@main ~]$ rustc --version
rustc 1.73.0 (cc66ad468 2023-10-03) (Fedora 1.73.0-1.fc38)
Make sure that the cargo binary location (~/.cargo/bin) is added to your PATH by adjusting .profile. I use Fedora, so I edit .bashrc, adding one line at the bottom:
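A typical line (assuming the default Cargo home of ~/.cargo) is:

```shell
# Append the Cargo binary directory to PATH (added at the bottom of ~/.bashrc)
export PATH="$PATH:$HOME/.cargo/bin"
```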
Then open a new terminal session and confirm it is active:
[btd@main ~]$ env | grep PATH
PATH=/home/btd/.local/bin:/home/btd/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/btd/.foundry/bin:/home/btd/.npm-global/bin:/home/btd/.foundry/bin:/home/btd/.cargo/bin
If yours looks different, don’t worry, just make sure that the cargo location is included.
Now clone the repo and build Cryo per the instructions:
[btd@main code]$ git clone https://github.com/paradigmxyz/cryo
[btd@main code]$ cd cryo
[btd@main cryo]$ cargo install --path ./crates/cli
After that completes, test that cryo is available:
[btd@main cryo]$ cryo --version
cryo 0.2.0-143-g322d665
I prefer to keep my extracted data in a separate location. In the following examples I will use the top directory /home/btd/code/cryo_data.
I’m interested in extracting the same data that took so long with web3.py. To get these events, I ask Cryo to extract all PoolCreated events from block 12,369,621 until now. Cryo has no built-in method to convert an event prototype to a topic0 hash, so I need to provide it.
Using the familiar WBTC-WETH V3 pool as an example, we find the PoolCreated event has topic0 = 0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118. This topic0 will be the same for all V3 pools. I also want to filter for logs emitted only by the Uniswap V3 Factory contract address (0x1F98431c8aD98523631AE4a59f267346ea31F984).
I also specify an output directory so that this data set is kept separate from others.
I’m running all queries to my local Reth node at port 8543.
Run the query:
[btd@main cryo_data]$ cryo logs \
    --rpc http://localhost:8543 \
    --blocks 12369621:latest \
    --contract 0x1F98431c8aD98523631AE4a59f267346ea31F984 \
    --event 0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118 \
    --output-dir uniswapv3_poolcreated

cryo parameters
───────────────
- version: 0.2.0-143-g322d665
- data:
    - datatypes: logs
    - blocks: n=6,080,542 min=12,369,621 max=18,450,162 align=no reorg_buffer=0
    - contracts: n=6,081 min=0x1f98431c8ad98523631ae4a59f267346ea31f984 max=0x1f98431c8ad98523631ae4a59f267346ea31f984
    - topic0s: n=6,081 min=0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118 max=0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118
- source:
    - network: ethereum
    - rpc url: http://localhost:8543
    - max requests per second: unlimited
    - max concurrent requests: unlimited
    - max concurrent chunks: 4
    - inner request size: 1
- output:
    - chunk size: 1,000
    - chunks to collect: 6,081 / 6,081
    - output format: parquet
    - output dir: /home/btd/code/cryo_data/uniswapv3_poolcreated
    - report file: $OUTPUT_DIR/.cryo/reports/2023-10-28_10-24-58.271654.json

schema for logs
───────────────
- chain_id: uint64
- log_index: uint32
- address: binary
- topic1: binary
- data: binary
- transaction_index: uint32
- topic0: binary
- topic2: binary
- transaction_hash: binary
- block_number: uint32
- topic3: binary

sorting logs by: block_number, log_index

other available columns: [none]

collecting data
───────────────
[...]
At the time of writing, Cryo extracted 6,081 chunks of 1,000 blocks each, taking 3 minutes, 58 seconds and writing 46 MB of data.
I also want to get the Mint and Burn events for all V3 pools. Following a similar process, I remove the contract address (since I want events from all pools), supply a topic0 for each event, and specify a different output directory:
[btd@main cryo_data]$ cryo logs \
    --rpc http://localhost:8543 \
    --blocks 12369621:latest \
    --event 0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c \
    --event 0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde \
    --output-dir v3_liquidity_events/

cryo parameters
───────────────
- version: 0.2.0-143-g322d665
- data:
    - datatypes: logs
    - blocks: n=6,080,733 min=12,369,621 max=18,450,353 align=no reorg_buffer=0
    - topic0s: n=12,162 min=0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c max=0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde
- source:
    - network: ethereum
    - rpc url: http://localhost:8543
    - max requests per second: unlimited
    - max concurrent requests: unlimited
    - max concurrent chunks: 4
    - inner request size: 1
- output:
    - chunk size: 1,000
    - chunks to collect: 6,081 / 6,081
    - output format: parquet
    - output dir: /home/btd/code/cryo_data/v3_liquidity_events
    - report file: $OUTPUT_DIR/.cryo/reports/2023-10-28_11-06-55.944645.json

schema for logs
───────────────
- topic1: binary
- topic2: binary
- chain_id: uint64
- log_index: uint32
- transaction_hash: binary
- address: binary
- data: binary
- topic0: binary
- topic3: binary
- transaction_index: uint32
- block_number: uint32

sorting logs by: block_number, log_index

other available columns: [none]

collecting data
───────────────
started at 2023-10-28 11:06:55.944
done at 2023-10-28 11:17:00.962

collection summary
──────────────────
- total duration: 605.017 seconds
- total chunks: 6,081
- chunks errored: 0 / 6,081 (0.0%)
- chunks skipped: 0 / 6,081 (0.0%)
- chunks collected: 6,081 / 6,081 (100.0%)
- blocks collected: 6,080,733
- blocks per second: 10,050.5
- blocks per minute: 603,030.3
- blocks per hour: 36,181,818.0
- blocks per day: 868,363,632.6
Now that the two data sets are ready, let’s get into Python and pull some data.