The influencers on Twitter keep telling me it’s Base Szn, but so far… no Base Szn?
It’s not going to start itself, so I’ll do my part by focusing on the chain over the coming weeks.
My previous article covered setting up a Base node. This one will cover how to extract pool data for the most popular exchanges available on the chain.
I’ve already introduced the Cryo tool for chain data extraction. Since then I have refined my understanding of its features, and will share improved fetching and processing scripts.
Base Chain Exchange Overview
According to Defi Llama, 294 protocols are operating on Base, including the familiar Uniswap and Sushiswap exchanges.
There are others like Curve & Balancer. There is also a very high-volume exchange called Aerodrome, which is described as “inspired by Solidly” on its GitHub repo.
I did a bit of development to support Camelot StableSwap pools on Arbitrum, so perhaps I can re-use some of that code here. As we explore the Base ecosystem, we will determine if developing pool and arbitrage helpers for this exchange is worthwhile.
Ecosystem Support
I have developed scripts to extract and process pools for the following exchanges:
Pancakeswap V2
Pancakeswap V3
Sushiswap V2
Sushiswap V3
Uniswap V2
Uniswap V3
I have also improved the V3 liquidity events extractor and processor, which will build liquidity snapshots for all of the Uniswap V3-derived pools listed above.
Note that Pancakeswap V2 & V3 provide an additional fee tier (0.25%) which is not included in the original Uniswap implementation. There are also some small differences related to pool address hashing, with a dedicated deployer contract separate from the factory. I have rolled these changes into degenbot, but have not pushed an official release yet.
Cryo Extractor Improvements
In my introductory article about Cryo, I was ignorant of a few nice features. I have rolled them into the examples below, but want to cover them in detail.
Specifying Events and Signatures
You can limit the extraction to certain events by specifying either --topic0 or --event followed by the event prototype hash.
For example, Uniswap V3 Mint events are specified by the prototype "Mint(address sender, address indexed owner, int24 indexed tickLower, int24 indexed tickUpper, uint128 amount, uint256 amount0, uint256 amount1)". The keccak hash of this string is emitted as topic0 by the node.
To extract these events with Cryo, use the option --event "0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde".
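Putting it together, a sketch of a full Cryo invocation for these Mint events against a local Base node might look like this (the block range, RPC URL, and output directory are placeholder values for your own setup):

# fetch raw Uniswap V3 Mint logs (example values only)
cryo logs \
    --blocks 1000000:2000000 \
    --rpc http://localhost:8545 \
    --event "0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde" \
    --output-dir ./base_v3_mint_events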
After extraction, this and the other elements of the emitted log will be stored within the Parquet table row in raw form (e.g. as a hex string, binary value, etc).
Decoding this after the fact can be cumbersome. But Cryo includes a translation feature that will decode these events inline into human-readable column names, which saves a lot of time.
To do this, pass the event along with its prototype using --event-signature.
Sticking with the Mint example, Cryo will fetch and decode if passed --event "0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde" --event-signature "Mint(address sender, address indexed owner, int24 indexed tickLower, int24 indexed tickUpper, uint128 amount, uint256 amount0, uint256 amount1)".
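Extending the sketch above, the decoded version of the same fetch might look like this (again, the block range, RPC URL, and output directory are placeholders):

# same fetch, with the prototype supplied so Cryo decodes the log fields into readable columns
cryo logs \
    --blocks 1000000:2000000 \
    --rpc http://localhost:8545 \
    --event "0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde" \
    --event-signature "Mint(address sender, address indexed owner, int24 indexed tickLower, int24 indexed tickUpper, uint128 amount, uint256 amount0, uint256 amount1)" \
    --output-dir ./base_v3_mint_events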
Increasing Default Chunk Size
The default chunk size is 1,000 blocks. For a deep fetch of events on Base, this can result in Cryo creating tens of thousands of files. If the frequency of the events you’re looking for is low, the storage overhead becomes significant relative to the Parquet data within those files.
For example, Uniswap V2 PairCreated events consume 740 kB with a 100,000 block chunk size and 22.9 MB with a 1,000 block chunk size.
If you are extracting sparse data over a large block range, I recommend increasing the chunk size.
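The chunk size is controlled with the --chunk-size option (check cryo --help for your version). A sketch for the PairCreated case above, with placeholder block range, RPC URL, and output directory:

# group 100,000 blocks per Parquet file instead of the default 1,000
# topic0 here is the keccak hash of PairCreated(address,address,address,uint256)
cryo logs \
    --blocks 1000000:2000000 \
    --rpc http://localhost:8545 \
    --event "0x0d3648bd0f6ba80134a33ba9275ac585d9d315f0ad8355cddefde31afa28d0e9" \
    --chunk-size 100000 \
    --output-dir ./base_v2_pairs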
Cryo Gotchas and Workarounds
Cryo is great, but it is still marked as alpha software, so I have run into a few bugs. I have also changed some options away from the defaults to improve performance and storage use.
Extraction Range Overlap
I have discovered some unexpected behavior with Cryo. By default, it extracts 1,000 blocks at a time into “chunks”, each saved to a separate Parquet file. Running the same command again will not re-fetch a completed chunk, which is a nice feature that avoids duplication. However, its handling of partial chunks at the end of the range is unintuitive.
For example, let’s say that I want to fetch information from blocks starting at a fixed point up to the current chain tip.
If the current chain tip is at block 10,500 and I do a full fetch, Cryo will save 10 full chunks (0-999; 1,000-1,999; etc.) and one partial chunk (10,000-10,500). Later, when the chain tip has advanced to 10,600 and I ask Cryo to perform the same fetch, it will skip the full chunks from block 0 to 9,999 and re-fetch blocks 10,000 to 10,600.
All fine, but the output directory will now contain the 10 full chunks and two partial chunks with overlapping ranges (one with range 10,000-10,500 and one with range 10,000-10,600).
I don’t know if the Parquet format will deduplicate the elements in these overlapping chunks, but I strongly suspect not. I will package this into a proper bug report on the Cryo GitHub, so hopefully this will be fixed soon.
In the meantime I have developed a workaround: the extraction script removes the last file in the directory before Cryo begins fetching.
So far this has worked as expected and cleaned up overlapping chunks that might lead to double-counting.
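A minimal sketch of that cleanup step, assuming Cryo’s default zero-padded file naming so that a lexicographic sort puts the newest (possibly partial) chunk last:

# delete the most recent chunk file so Cryo re-fetches that range cleanly
last_chunk=$(ls "$DATA_DIR"/*.parquet 2>/dev/null | sort | tail -n 1)
[ -n "$last_chunk" ] && rm "$last_chunk"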
Reorg Buffer Messes Up Chunk Boundary
Cryo has a nice feature that allows you to specify a re-org buffer, ignoring blocks newer than n blocks from the current tip. This is useful on chains that will re-organize via consensus forking, like Ethereum.
Strangely, setting a re-org buffer will cause Cryo to only extract up to the last full chunk. In the example above, specifying a re-org buffer of 1 block would cause Cryo to stop fetching at the last full chunk boundary (block 9,999), ignoring all partial chunks after that. I have filed a bug about this.
Since Base has no public consensus mechanism and uses a single sequencer, the re-org buffer can and should be zero. It’s not a problem for us here, but be aware of this behavior if you set this value.
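If you do need it on a chain where re-orgs matter, the buffer is set with the --reorg-buffer option in recent Cryo versions (check cryo --help for yours). Everything below is illustrative; on Base you can simply omit the flag:

# keep a 64-block safety margin behind the chain tip (illustrative values)
cryo logs \
    --blocks 1000000:2000000 \
    --rpc http://localhost:8545 \
    --event "0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde" \
    --reorg-buffer 64 \
    --output-dir ./mint_events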
Extraction & Processing
Make sure to edit the DATA_DIR, RPC_URL, and PYTHON variables in the extraction script to specify the location of your data, the RPC endpoint, and the Python executable used to run the post-processing scripts. If using PyEnv, you can point it to a specific location to automatically activate the corresponding virtual env.
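For example, the top of the script might contain something like this (all three values are placeholders for your own environment):

# example values only; adjust for your data directory, node, and Python install
DATA_DIR=/home/user/base_data
RPC_URL=http://localhost:8545
PYTHON=$HOME/.pyenv/versions/3.11.4/bin/python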
Within the processing scripts that follow it, edit the DATA_DIR paths to match yours.
fetch_base.sh