UPDATE: After writing this guide and using Cryo more thoroughly, I have refined the extraction method and processing script. The guide remains below as-is, but I recommend reviewing the updated Cryo extraction script in the Base Chain Data Extraction post.
A few months ago, noted EVM enthusiast storm released an EVM data extraction tool called Cryo. Cryo is a Rust tool that performs native JSON-RPC queries and stores the extracted data in Apache Parquet format.
His initial announcement on Twitter certainly caught my attention. 16x performance over native Python? I’m in!
Readers may remember my exploration of liquidity snapshots for Uniswap V3 pools.
The end result of that effort is a fairly compact JSON file (roughly 12 MB for Ethereum mainnet) that contains a complete snapshot of all V3 liquidity information up to a certain block.
The most painful part of the process was the data extraction. Using web3.py to retrieve blockchain data via JSON-RPC is straightforward but slow. Out of curiosity, I ran a complete liquidity event retrieval last night against my Geth node, from the deployment block of Uniswap V3 (12,369,621) to block height 18,448,500. The process took about an hour, and most of that time was spent waiting on get_logs calls through web3.py.
I could have improved the process by doing concurrent requests via the new async web3 provider, but I was drawn to Cryo for a different reason: persistent data storage.
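For reference, here is a rough sketch of the kind of get_logs loop I was running with web3.py. It is a reconstruction, not my original script: the endpoint and batch size are illustrative, and the topic0 hashes are the same Mint/Burn values passed to Cryo later in this post.

# Rough sketch of the slow web3.py approach: fetch Uniswap V3 liquidity
# event logs in fixed-size block batches, one synchronous eth_getLogs call
# at a time. Endpoint and batch size are illustrative.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))

START_BLOCK = 12_369_621  # Uniswap V3 factory deployment block
END_BLOCK = 18_448_500
BATCH = 10_000            # blocks per eth_getLogs call

# Mint and Burn topic0 hashes (the same values used with Cryo below),
# passed as a nested list so either topic matches
LIQUIDITY_EVENT_TOPICS = [
    [
        "0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde",
        "0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c",
    ]
]

events = []
for from_block in range(START_BLOCK, END_BLOCK + 1, BATCH):
    to_block = min(from_block + BATCH - 1, END_BLOCK)
    # one blocking RPC round trip per batch — this is the bottleneck
    events.extend(
        w3.eth.get_logs(
            {
                "fromBlock": from_block,
                "toBlock": to_block,
                "topics": LIQUIDITY_EVENT_TOPICS,
            }
        )
    )

print(f"retrieved {len(events)} logs")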
The Apache Parquet format is very powerful, and it solves a particularly difficult problem for indexed data. Before the Cryo release, storm wrote a fascinating Twitter thread about Parquet.
I also recommend this Parquet overview. A key benefit of Parquet files is that you can evolve a schema over time. Say that you initially want to extract only the block numbers, topics, and data for a given set of event logs. Later, as your analytics requirements change, you decide that you need the transaction hashes of those events. You can run a separate Cryo extraction for that info, then bundle the two separate Parquet files together as a dataset, merged on their common indices. It's very similar to combining tables with SQL JOIN, but without the need to operate a database or predefine a schema.
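As an illustration (this is not Cryo's own API), here is a minimal sketch of that kind of merge using polars. The file names are hypothetical; the join columns match the log schema shown later in this post.

# Illustrative sketch: join two separately extracted Parquet files on their
# shared index columns, assuming polars is installed. File names are hypothetical.
import polars as pl

events = pl.read_parquet("logs__topics_and_data.parquet")
tx_hashes = pl.read_parquet("logs__transaction_hashes.parquet")

# merge on the common indices, much like a SQL JOIN
combined = events.join(tx_hashes, on=["block_number", "log_index"], how="inner")
print(combined.head())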
The takeaway is that Parquet data storage is high performance, compressible, and natively queryable with built-in filters. For blockchain data, which is recorded once and never changes, Parquet is a fantastic format for cold storage. Node clients do a good job of exposing this data, but they are primarily designed for fast execution, efficient storage, and keeping pace with the live chain. A node client optimized for heavy historical block queries would need to make performance compromises elsewhere, or impose such a hardware load as to make it impractical.
Cryo solves that problem by taking a different approach: extract the data from your node client, persist it in Parquet format, and then run your queries from that efficient offline data set.
Nothing is free though! The cost you pay for this is additional storage space. Paradigm published a handful of Cryo data sets up to block 16,799,999 earlier this year. Storing them on your system would require roughly 100 GB.
Banteg reported on Twitter that a full Cryo extraction of every block from Erigon consumed 430 GB with zstd level 3 compression. Interestingly, this is nearly the storage difference between running a full node (Geth) and an archive node (Reth, Erigon).
Perhaps some hybrid model will emerge where searchers primarily use full nodes and a regularly-updated Cryo data set for historical queries?
Installing Cryo
With the preamble done, let’s install Cryo and test it out. The Cryo GitHub repo lists several options:
Install Rust-native client from source
Install Rust-native client from crates.io
Install Python bindings from PyPI
Install Python bindings from source
I tried them all, but unfortunately both Python methods and the Rust-native installation from crates.io failed with various build errors. I will spend more time looking at this, because I’d love to script Cryo from Python without directly managing processes.
The compilation of the Rust-native client from source works (for now), so let’s do that.
First, you need to have Rust installed. Install the appropriate package for your distribution and confirm it works with rustc --version:
[btd@main ~]$ rustc --version
rustc 1.73.0 (cc66ad468 2023-10-03) (Fedora 1.73.0-1.fc38)
Make sure that the cargo binary location (~/.cargo/bin) is added to your PATH by adjusting .bashrc or .profile. I use Fedora, so I edit .bashrc, adding one line at the bottom:
export PATH="$PATH:$HOME/.cargo/bin"
Then open a new terminal session and confirm it is active:
[btd@main ~]$ env| grep PATH
PATH=/home/btd/.local/bin:/home/btd/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/var/lib/snapd/snap/bin:/home/btd/.foundry/bin:/home/btd/.npm-global/bin:/home/btd/.foundry/bin:/home/btd/.cargo/bin
If yours looks different, don’t worry, just make sure that the cargo location is included.
Now clone the repo and build Cryo per the instructions:
[btd@main code]$ git clone https://github.com/paradigmxyz/cryo
[btd@main code]$ cd cryo
[btd@main cryo]$ cargo install --path ./crates/cli
After that completes, test that cryo is available:
[btd@main cryo]$ cryo --version
cryo 0.2.0-143-g322d665
Extracting Data
I prefer to keep my extracted data in a separate location. In the following examples I will use the top directory ~/code/cryo_data.
I’m interested in extracting the same data that took so long with web3.py.
To get these, I ask Cryo to extract all PoolCreated events from block 12,369,621 until now. Cryo has no built-in method to convert an event prototype to a topic0 hash, so I need to provide it.
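Computing the hash is easy enough with web3.py: topic0 is just the keccak256 hash of the canonical event signature. A quick sketch:

# topic0 = keccak256 of the canonical event signature, computed with web3.py
from web3 import Web3

pool_created = Web3.keccak(text="PoolCreated(address,address,uint24,int24,address)")
print(pool_created.hex())
# 0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118

The same approach gives the Mint and Burn topic0 hashes used further below.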
Using the familiar WBTC-WETH V3 pool as an example, we find the PoolCreated event has topic0 = 0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118. This topic0 will be the same for all V3 pools. I also want to filter for logs emitted only by the Uniswap V3 Factory contract address (0x1F98431c8aD98523631AE4a59f267346ea31F984).
I also specify an output directory so that this data set is kept separate from others.
I’m running all queries to my local Reth node at port 8543.
Run the query:
[btd@main cryo_data]$ cryo logs --rpc http://localhost:8543 --blocks 12369621:latest --contract 0x1F98431c8aD98523631AE4a59f267346ea31F984 --event 0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118 --output-dir uniswapv3_poolcreated
cryo parameters
───────────────
- version: 0.2.0-143-g322d665
- data:
- datatypes: logs
- blocks: n=6,080,542 min=12,369,621 max=18,450,162 align=no reorg_buffer=0
- contracts: n=6,081 min=0x1f98431c8ad98523631ae4a59f267346ea31f984 max=0x1f98431c8ad98523631ae4a59f267346ea31f984
- topic0s: n=6,081 min=0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118 max=0x783cca1c0412dd0d695e784568c96da2e9c22ff989357a2e8b1d9b2b4e6b7118
- source:
- network: ethereum
- rpc url: http://localhost:8543
- max requests per second: unlimited
- max concurrent requests: unlimited
- max concurrent chunks: 4
- inner request size: 1
- output:
- chunk size: 1,000
- chunks to collect: 6,081 / 6,081
- output format: parquet
- output dir: /home/btd/code/cryo_data/uniswapv3_poolcreated
- report file: $OUTPUT_DIR/.cryo/reports/2023-10-28_10-24-58.271654.json
schema for logs
───────────────
- chain_id: uint64
- log_index: uint32
- address: binary
- topic1: binary
- data: binary
- transaction_index: uint32
- topic0: binary
- topic2: binary
- transaction_hash: binary
- block_number: uint32
- topic3: binary
sorting logs by: block_number, log_index
other available columns: [none]
collecting data
───────────────
[...]
At the time of writing, Cryo extracted 6,081 chunks of 1,000 blocks each, taking 3 minutes, 58 seconds and writing 46 MB of data.
I also want to get the Mint and Burn events for all V3 pools. Following a similar process, removing the contract address (since I want to get all pools), adding an additional topic0, and specifying a different output directory:
[btd@main cryo_data]$ cryo logs --rpc http://localhost:8543 --blocks 12369621:latest --event 0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c --event 0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde --output-dir v3_liquidity_events/
cryo parameters
───────────────
- version: 0.2.0-143-g322d665
- data:
- datatypes: logs
- blocks: n=6,080,733 min=12,369,621 max=18,450,353 align=no reorg_buffer=0
- topic0s: n=12,162 min=0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c max=0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde
- source:
- network: ethereum
- rpc url: http://localhost:8543
- max requests per second: unlimited
- max concurrent requests: unlimited
- max concurrent chunks: 4
- inner request size: 1
- output:
- chunk size: 1,000
- chunks to collect: 6,081 / 6,081
- output format: parquet
- output dir: /home/btd/code/cryo_data/v3_liquidity_events
- report file: $OUTPUT_DIR/.cryo/reports/2023-10-28_11-06-55.944645.json
schema for logs
───────────────
- topic1: binary
- topic2: binary
- chain_id: uint64
- log_index: uint32
- transaction_hash: binary
- address: binary
- data: binary
- topic0: binary
- topic3: binary
- transaction_index: uint32
- block_number: uint32
sorting logs by: block_number, log_index
other available columns: [none]
collecting data
───────────────
started at 2023-10-28 11:06:55.944
done at 2023-10-28 11:17:00.962
collection summary
──────────────────
- total duration: 605.017 seconds
- total chunks: 6,081
- chunks errored: 0 / 6,081 (0.0%)
- chunks skipped: 0 / 6,081 (0.0%)
- chunks collected: 6,081 / 6,081 (100.0%)
- blocks collected: 6,080,733
- blocks per second: 10,050.5
- blocks per minute: 603,030.3
- blocks per hour: 36,181,818.0
- blocks per day: 868,363,632.6
Querying Parquet
Now that the two data sets are ready, let’s get into Python and pull some data.
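As a starting point, here is a minimal sketch using polars (any library that reads Parquet will do). It assumes the output directories from the extractions above and uses the column names from the log schema that Cryo printed:

# Minimal sketch, assuming polars is installed: load every chunked Parquet
# file from the Cryo output directories into DataFrames
import polars as pl

pools = pl.read_parquet("uniswapv3_poolcreated/*.parquet")
liquidity_events = pl.read_parquet("v3_liquidity_events/*.parquet")

# example: narrow the Mint/Burn events down to a block range
subset = liquidity_events.filter(
    (pl.col("block_number") >= 12_369_621) & (pl.col("block_number") <= 12_400_000)
)
print(pools.shape, subset.shape)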