It’s critical to learn how to efficiently store, retrieve, and process bulk data.
Every time a bot starts, it must synchronize its internal state with the external world. At minimum, it must establish a connection to the chain network through an RPC.
From there it might set up a subscription to receive events as the chain is updated.
It might read chain data stored in Parquet or pool data stored in JSON.
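Here's a rough sketch of what that startup sequence might look like, assuming web3.py and polars are available. The RPC endpoint and file names are placeholders, not the actual ones used in degenbot:

```python
# A minimal startup sketch. Endpoint and file names are hypothetical.
import json
import polars as pl
from web3 import Web3

RPC_URL = "http://localhost:8545"  # placeholder endpoint

# Establish a connection to the chain through an RPC
w3 = Web3(Web3.HTTPProvider(RPC_URL))
assert w3.is_connected()

# Set up a filter to be notified as new blocks arrive
new_block_filter = w3.eth.filter("latest")

# Read low-level chain data (Parquet) and high-level pool data (JSON)
chain_data = pl.read_parquet("blocks.parquet")  # hypothetical file
with open("v3_pools.json") as file:             # hypothetical file
    pool_data = json.load(file)

# Later, poll the filter for new block hashes
new_block_hashes = w3.eth.get_filter_changes(new_block_filter.filter_id)
```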
I generally use a hybrid approach that extracts low-level chain data using Cryo and processes it to make high-level JSON files for use in bots.
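Condensed down, the processing step looks roughly like this. It assumes Cryo has already written log data to Parquet, and the file names, column names, and aggregation are simplified stand-ins for the real thing:

```python
# Process low-level Cryo output (Parquet) into a high-level JSON file.
# File names and column names are hypothetical.
import json
import polars as pl

# Cryo writes its extracts as Parquet files
logs = pl.read_parquet("ethereum__logs__*.parquet")

# Aggregate the raw events into a compact, bot-friendly structure
pools = (
    logs
    .group_by("address")
    .agg(pl.col("block_number").max().alias("last_update_block"))
    .to_dicts()
)

# Write the high-level snapshot for the bot to load at startup
with open("pools.json", "w") as file:
    json.dump(pools, file, indent=2)
```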
However, I have some frustrations with this method, particularly with the JSON step.
Friendship Ended With JSON?
Ironically, I’m most often frustrated by JSON’s extreme flexibility.
I’ve used several different Python JSON packages — the Standard Library’s built-in json, ujson, orjson, and jiter via Pydantic. It’s nearly impossible to get a uniform result when comparing two parsers. Some parsers enforce that keys must be strings. Some allow integers. Some support floating point and NaN values. Some crash when encountering integers larger than 64 bits!
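The big-integer case is easy to demonstrate, assuming orjson is installed (the other quirks vary by library and version):

```python
# The standard library handles arbitrary-precision integers,
# but orjson rejects anything outside the 64-bit range.
import json
import orjson

balance = 2**256 - 1  # a typical uint256 value from the chain

print(json.dumps({"balance": balance}))  # works fine

try:
    orjson.dumps({"balance": balance})
except TypeError as exc:
    print(f"orjson failed: {exc}")       # integer exceeds 64-bit range
```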
Basically, it's a crapshoot. Pydantic gives the most consistent results, so I use it for all JSON processing in degenbot.
You should be familiar with my use of liquidity maps for Uniswap V3 pools. The bummer about using JSON for the snapshot is that it has to be read in full before decoding, and must be rewritten from scratch every time it’s updated. Reading and writing a complete 500 MB snapshot over and over is just painful!
And as the V3 approach keeps gaining popularity, the performance of this solution will only get worse.
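To make the cost concrete, a single update looks roughly like this today. The file name and keys are hypothetical, but the pattern is the point: even a tiny change to one pool means decoding and re-encoding the entire snapshot.

```python
# Updating one pool's liquidity map still requires reading and rewriting
# the whole snapshot. File name and keys are hypothetical.
import json

with open("v3_liquidity_snapshot.json") as file:  # ~500 MB, decoded in full
    snapshot = json.load(file)

# Apply a tiny update to a single pool
snapshot["0xPoolAddress"]["tick_bitmap"]["42"] = "0x1"

with open("v3_liquidity_snapshot.json", "w") as file:  # entire file rewritten
    json.dump(snapshot, file)
```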
The JSON format also creates a maintenance burden for me: I've written various versions of the processing script to support new pools. They're generally not backwards-compatible, not portable, and not easy to glue together, and keeping track of them all is difficult.