Blockchains have revolutionized trust and transparency in distributed systems, yet their heavy reliance on key-value (KV) storage for managing immutable, rapidly growing data leads to performance bottlenecks due to I/O inefficiencies. In this paper, we analyze Ethereum's storage workload traces, with billions of KV operations, across four dimensions: storage overhead, KV operation distributions, read correlations, and update correlations. Our study reveals 11 key findings and provides suggestions on the design and optimization of blockchain storage.
Our analysis tool is implemented in Go and requires the following prerequisites:
- Go 1.23 or later.
- Stable Internet connection to download the required packages and libraries during the build process.
- Internet connection without a proxy for synchronizing the Ethereum blockchain and collecting the traces.
To install the required packages and libraries, run the following commands:
sudo apt install -y build-essential golang
# If Go version is lower than 1.22 (required by Geth), update it.
sudo apt remove golang-go
sudo apt remove --auto-remove golang-go
wget https://go.dev/dl/go1.23.2.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.23.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin # Add this line to ~/.bashrc or ~/.zshrc
# Check the Go version
go version
To evaluate the artifacts, you do not need to run the whole analysis process. Instead, you can directly use the provided traces and the analysis tool to reproduce the main findings. Please refer to the AE_README.md for the details of the analysis tool and its usage.
We use the geth execution client and prysm beacon node to synchronize the Ethereum blockchain and collect the traces. Since we have modified the geth client to collect the traces, we provide the modified geth client in the go-ethereum-1.14.11 folder.
Modify the file go-ethereum-1.14.11/common/globalTraceLog.go to enable the trace collection with a specific start and end block number.
var targetStartBlockNumber uint64 = 20500000 // The start block number for trace collection
var targetEndBlockNumber uint64 = 21500000 // The end block number for trace collection
Note that targetStartBlockNumber and targetEndBlockNumber indicate the range of blocks to collect the traces. You can modify these values to collect the traces for a different range of blocks. The default values are set to collect the traces from block 20500000 to block 21500000. If the value is changed, you need to recompile the geth client.
The collected traces will be stored in the path where the geth client is run, with the file name format geth-trace-<year>-<month>-<day>-<hour>-<minute>-<second>. For example, if you run the geth client at 2025-08-01 00:00:00, the trace log file will be named geth-trace-2025-08-01-00-00-00.
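When several trace files accumulate in the run directory, it can help to recover the start time from each file name. The following is a minimal sketch of the naming scheme described above; the helper names `traceFileName` and `parseTraceFileName` are hypothetical and not part of the modified geth client, and zero-padded fields (as in the example name) and UTC are assumptions.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// tracePrefix and traceLayout mirror the documented name format
// geth-trace-<year>-<month>-<day>-<hour>-<minute>-<second>.
// Zero-padded fields are assumed, matching the example name.
const (
	tracePrefix = "geth-trace-"
	traceLayout = "2006-01-02-15-04-05"
)

// traceFileName builds the expected trace file name for a given start time.
func traceFileName(start time.Time) string {
	return tracePrefix + start.Format(traceLayout)
}

// parseTraceFileName recovers the start time encoded in a trace file name.
// The name carries no time zone, so the result is interpreted as UTC.
func parseTraceFileName(name string) (time.Time, error) {
	if !strings.HasPrefix(name, tracePrefix) {
		return time.Time{}, fmt.Errorf("not a trace file name: %q", name)
	}
	return time.Parse(traceLayout, strings.TrimPrefix(name, tracePrefix))
}

func main() {
	start, err := parseTraceFileName("geth-trace-2025-08-01-00-00-00")
	if err != nil {
		panic(err)
	}
	fmt.Println(start, traceFileName(start))
}
```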
- Build the modified geth client:
cd go-ethereum-1.14.11
make
- Get the geth client binary:
cd go-ethereum-1.14.11/build/bin
cp geth <path to store the binary and run the client> # E.g., ethereum/execution/geth, see below
You need to create the following folders to store the Ethereum data and logs:
mkdir ethereum
mkdir -p ethereum/consensus
mkdir -p ethereum/execution
Use the Prysm consensus client to generate the JWT secret.
cd ethereum/consensus
curl https://raw.githubusercontent.com/prysmaticlabs/prysm/master/prysm.sh --output prysm.sh && chmod +x prysm.sh
./prysm.sh beacon-chain generate-auth-secret
# The generated jwt secret is stored in the file `jwt.hex`
cp jwt.hex ../jwt.hex
Run the geth client to synchronize the Ethereum blockchain and collect the traces:
- To collect the BareTrace:
cd ethereum/execution
cp ../../go-ethereum-1.14.11/build/bin/geth . # Copy the modified geth client binary to the current folder
mkdir data
export GOMAXPROCS=1 # Set the number of threads to 1 (default is the number of CPU cores)
./geth --cache 0 --cache.noprefetch --snapshot --mainnet --datadir ./data --syncmode full --http --http.api eth,net,engine,admin --authrpc.jwtsecret ../jwt.hex
# --cache 0 # Disable the cache (set size to 0)
# --snapshot # Enable KV snapshot support (by default is enabled)
# --cache.noprefetch # Disable the cache prefetching to avoid the sequence interference
- To collect the CacheTrace:
cd ethereum/execution
mkdir data
export GOMAXPROCS=1 # Set the number of threads to 1 (default is the number of CPU cores)
./geth --cache.noprefetch --mainnet --datadir ./data --syncmode full --http --http.api eth,net,engine,admin --authrpc.jwtsecret ../jwt.hex
Then, we need to run the prysm beacon node to synchronize the Ethereum blockchain and collect the traces:
cd ethereum/consensus
./prysm.sh beacon-chain --datadir ./data --execution-endpoint=http://localhost:8551 --mainnet --jwt-secret=../jwt.hex --checkpoint-sync-url=https://beaconstate.info --genesis-beacon-api-url=https://beaconstate.info
Note that both geth and prysm clients need to be run on the same machine to collect the traces correctly. The geth client will collect the traces of the KV operations.
To compile the whole project:
cd analysis
./build.sh install # Install the required packages and libraries then build the project
./build.sh build # Build the project without installing the required packages and libraries (i.e., for the second time)
If the compilation is successful, the executable files can be found in the bin folder.
The original traces collected by the geth client record every write as a plain write and do not mark which writes are updates, so we provide a tool that identifies the updates among the writes in the traces.
cd analysis/bin
./filterUpdates <path_to_kv_store_in_geth> <path_to_original_trace_log> <path_to_filtered_output_file>
Note that the filtering process is time-consuming, and the output file will be large. The output file will share the same format as the original trace log file, but it contains both writes and updates. Here's an example of the trace log file and the filter output file:
geth: 2025/02/25 19:41:34 globalTraceLog.go:28: OPType: Get, key: 4fa1f8b2f76206f0ba778b536f8b5c5571b9cf6ee4279b8082d1545b9c5dc0b907010c03, size: 36
geth: 2025/02/25 19:41:34 globalTraceLog.go:28: OPType: Get, key: 4f953282fe85960010a1d6641213a35d224caca8bf4773cab000c3abd26a0c5a340603, size: 35
Note that the size of the key is counted in bytes, but the content of the key (and value) is in hex format. Our tools will automatically convert the hex format to bytes when processing the keys and values.
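To illustrate the hex-to-bytes conversion, here is a minimal sketch of a parser for the Get lines shown above. It is not the code used by our tools: the names `traceOp` and `parseTraceLine` are hypothetical, and the regular expression only covers the `OPType: ..., key: ..., size: ...` form that appears in the example, so lines that carry additional fields would need an extended pattern.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"regexp"
	"strconv"
)

// traceOp is one KV operation recovered from a trace log line.
type traceOp struct {
	OpType string // operation type as logged, e.g. "Get"
	Key    []byte // key decoded from its hex representation
	Size   int    // size field as logged, in bytes
}

// traceLineRe matches the payload that follows the logger prefix
// ("geth: <date> <time> globalTraceLog.go:<line>:").
var traceLineRe = regexp.MustCompile(`OPType: (\w+), key: ([0-9a-fA-F]*), size: (\d+)`)

// parseTraceLine extracts the operation type, the decoded key, and the size.
func parseTraceLine(line string) (traceOp, error) {
	m := traceLineRe.FindStringSubmatch(line)
	if m == nil {
		return traceOp{}, fmt.Errorf("unrecognized trace line: %q", line)
	}
	key, err := hex.DecodeString(m[2])
	if err != nil {
		return traceOp{}, fmt.Errorf("bad hex key: %w", err)
	}
	size, err := strconv.Atoi(m[3])
	if err != nil {
		return traceOp{}, fmt.Errorf("bad size: %w", err)
	}
	return traceOp{OpType: m[1], Key: key, Size: size}, nil
}

func main() {
	line := "geth: 2025/02/25 19:41:34 globalTraceLog.go:28: OPType: Get, key: 4fa1f8b2f76206f0ba778b536f8b5c5571b9cf6ee4279b8082d1545b9c5dc0b907010c03, size: 36"
	op, err := parseTraceLine(line)
	if err != nil {
		panic(err)
	}
	// Two hex characters encode one byte, so the decoded key length can be
	// compared against the logged size.
	fmt.Println(op.OpType, len(op.Key), op.Size)
}
```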
You can analyze the KV sizes of the Ethereum workloads by running the following command:
# After synchronizing the Ethereum blockchain and collecting the traces, stop the `geth` client and `prysm` beacon node first
cd analysis/bin
./countKVSizeDistribution <path to the KV store> # E.g., /path/to/ethereum/execution/data/geth/chaindata
It will generate the output log file pebble-database-KV-count.txt with the following format:
...
DataType: iB
KV pair number: 5249
Average KV size: 46.99
Min size for keys: 7
Max size for keys: 15
Key size distribution (Bucket width: 1 B):
Min size for values: 8
Max size for values: 32
Value size distribution (Bucket width: 1 B):
Min size for KVs: 15
Max size for KVs: 47
KV pair size distribution (Bucket width: 1 B):
...
In addition to the KV size summary of each data type, the tool also provides the distribution of key sizes <data_type>_key_histogram.txt, value sizes <data_type>_value_histogram.txt, and KV pair sizes <data_type>_kv_histogram.txt. Each line in the histogram files is formatted as size: count, where size is the size of the key, value, or KV pair, and count is the number of keys, values, or KV pairs with the corresponding size.
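The histogram files are easy to post-process. As a minimal sketch (the function names `parseHistogramLine` and `meanSize` are hypothetical and not part of the provided tools), the following reads `size: count` lines and computes the count-weighted mean size; the sample data in `main` is made up for illustration and is not a measurement.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseHistogramLine splits one "size: count" line into its two integers.
func parseHistogramLine(line string) (size, count int, err error) {
	sizeStr, countStr, ok := strings.Cut(line, ":")
	if !ok {
		return 0, 0, fmt.Errorf("malformed histogram line: %q", line)
	}
	if size, err = strconv.Atoi(strings.TrimSpace(sizeStr)); err != nil {
		return 0, 0, fmt.Errorf("bad size in %q: %w", line, err)
	}
	if count, err = strconv.Atoi(strings.TrimSpace(countStr)); err != nil {
		return 0, 0, fmt.Errorf("bad count in %q: %w", line, err)
	}
	return size, count, nil
}

// meanSize returns the count-weighted mean size, or 0 for an empty histogram.
func meanSize(hist map[int]int) float64 {
	var total, n int
	for size, count := range hist {
		total += size * count
		n += count
	}
	if n == 0 {
		return 0
	}
	return float64(total) / float64(n)
}

func main() {
	// Made-up sample in the documented format, not real measurements.
	sample := "7: 2\n15: 2\n"
	hist := make(map[int]int)
	sc := bufio.NewScanner(strings.NewReader(sample))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		size, count, err := parseHistogramLine(line)
		if err != nil {
			panic(err)
		}
		hist[size] += count
	}
	fmt.Printf("mean size: %.2f B\n", meanSize(hist))
}
```

To process a real file, replace the `strings.NewReader` source with an opened `<data_type>_key_histogram.txt` (or the value or KV variant).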
To analyze the KV operation distribution, you can use the kvOpDistributionAnalysis.sh script. It reads the provided traces and generates the distribution of KV operations.
cd analysis
./kvOpDistributionAnalysis.sh <path_to_trace>
When the script is finished correctly, you will see the following output:
Results:
KV operation count is saved in ./mergedKVOpCount.txt
KV operation distribution (of each KV class and KV operation type) is saved in ./mergedOpDistribution
- mergedKVOpCount.txt: Contains the count of each KV operation. The content is structured as follows:
Count of KV operations:
Category: HeadHeaderKey
OPType: Update, Count: 21
Category: HeadBlockKey
OPType: Update, Count: 21
Category: TrieNodeAccountPrefix
OPType: Get, Count: 7151
Category: BlockBodyPrefix
OPType: Update, Count: 10
OPType: Get, Count: 96
OPType: Put, Count: 11
...
- mergedKVOpDistribution: Contains the distribution of KV operations categorized by KV class and operation type. Each file in this directory is named as <KV class>_<KV operation type>_with_key_dis.txt and <KV class>_<KV operation type>_without_key_dis.txt, where the former includes the original key in the results and the latter does not. The KV operation number on each key is included, and the results are sorted by the number of operations.
  - Example output in the result file (with key):
ID Key Count
1 4f159e4890cde46d5cadbfcd08a052f495ae791244728e06d910cc743ed05eae12060b01070d 1
2 4f8679e8eda65bd257638cf8cf09b8238888947cc3c0bea2aa2cc3f1c4ac7a3002080f0b0d0a 1
...
  - Example output in the result file (without key):
ID Count
1 1
2 1
...
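For further processing, the operation counts can be loaded into a nested map. The sketch below is one possible way to do this and is not part of the provided scripts; `parseOpCounts` is a hypothetical name, and the code assumes that each `Category:` entry and each `OPType: ..., Count: ...` entry sits on its own line, with operation entries belonging to the most recent category.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOpCounts maps category -> operation type -> count.
// Lines that are neither a category nor an operation entry (the header,
// blank lines, "...") are skipped.
func parseOpCounts(text string) (map[string]map[string]int, error) {
	counts := make(map[string]map[string]int)
	current := ""
	for _, raw := range strings.Split(text, "\n") {
		line := strings.TrimSpace(raw)
		switch {
		case strings.HasPrefix(line, "Category:"):
			current = strings.TrimSpace(strings.TrimPrefix(line, "Category:"))
			if counts[current] == nil {
				counts[current] = make(map[string]int)
			}
		case strings.HasPrefix(line, "OPType:"):
			if current == "" {
				return nil, fmt.Errorf("operation entry before any category: %q", line)
			}
			rest := strings.TrimPrefix(line, "OPType:")
			opPart, countPart, ok := strings.Cut(rest, ", Count:")
			if !ok {
				return nil, fmt.Errorf("malformed operation entry: %q", line)
			}
			n, err := strconv.Atoi(strings.TrimSpace(countPart))
			if err != nil {
				return nil, fmt.Errorf("bad count in %q: %w", line, err)
			}
			counts[current][strings.TrimSpace(opPart)] += n
		}
	}
	return counts, nil
}

func main() {
	sample := `Count of KV operations:
Category: TrieNodeAccountPrefix
OPType: Get, Count: 7151
Category: BlockBodyPrefix
OPType: Update, Count: 10
OPType: Get, Count: 96
OPType: Put, Count: 11
`
	counts, err := parseOpCounts(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(counts["BlockBodyPrefix"]["Get"], counts["TrieNodeAccountPrefix"]["Get"])
}
```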
To analyze the read and update correlations, you can use the readCorrelationAnalysis.sh and updateCorrelationAnalysis.sh scripts. They read the provided traces and generate correlation analysis results based on the specified distances. By default, we analyze the distances of 0, 1, 4, 16, 64, 256, and 1024.
cd analysis
# for read correlation analysis
./readCorrelationAnalysis.sh <path_to_trace> <distances> <starts_blocks> <ends_blocks>
# for update correlation analysis
./updateCorrelationAnalysis.sh <path_to_trace>
Note that the distances, starts_blocks, and ends_blocks parameters are optional; if they are omitted, the values used in AE apply. The distances parameter is a comma-separated list of the distances to analyze (e.g., "0,1,4,16,64,256,1024" selects the distances 0, 1, 4, 16, 64, 256, and 1024). The starts_blocks and ends_blocks parameters specify the range of blocks to analyze (in AE, 20500000 for starts_blocks and 20501000 for ends_blocks).
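If you generate the distances argument programmatically, it can be worth validating it before launching a long analysis. The following sketch shows one way to parse the comma-separated form; `parseDistances` is a hypothetical helper, and rejecting negative values is an assumption rather than documented behavior of the scripts.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDistances turns a string such as "0,1,4,16,64,256,1024" into a slice
// of integers. Negative distances are rejected (an assumption of this sketch).
func parseDistances(arg string) ([]int, error) {
	fields := strings.Split(arg, ",")
	distances := make([]int, 0, len(fields))
	for _, f := range fields {
		d, err := strconv.Atoi(strings.TrimSpace(f))
		if err != nil {
			return nil, fmt.Errorf("bad distance %q: %w", f, err)
		}
		if d < 0 {
			return nil, fmt.Errorf("negative distance: %d", d)
		}
		distances = append(distances, d)
	}
	return distances, nil
}

func main() {
	distances, err := parseDistances("0,1,4,16,64,256,1024")
	if err != nil {
		panic(err)
	}
	fmt.Println(distances)
}
```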
Output for read correlation analysis
When the above scripts are finished correctly, you will see a folder named readCorrelationOutput in the current directory (i.e., analysis). It contains three types of files:
freq-category-<distance>.log: Contains the frequency of read operations between two categories at a specific distance. It indicates how many correlated reads occurred between two KV classes at the specified distance. The content is structured as follows:
HeaderPrefix;SnapshotStoragePrefix: 2
...
Total frequency: 2
freq-sorted-<distance>.log: Contains the pairs of the correlated reads sorted by their frequency at a specific distance. It includes all the pairs from different KV classes that have correlated reads at the specified distance. The content is structured as follows:
key: 68000000000138ce2521b58599e95e181e6e2ea8bb5e5c6e63b27a92f85b57c1d44be626809fd7ce4074-42;6f159e4890cde46d5cadbfcd08a052f495ae791244728e06d910cc743ed05eae12d77d079272dc0eb87abab5b6869360275148b3afe35fff68ce920c6b46ae9e2b-65; Freq: 2; Blocks: 20500006
...
Dist-<distance>-<KV class 1>-<KV class 2>-freq.log: Contains the pairs of the correlated reads sorted by their frequency at a specific distance for a specific pair of KV classes (i.e., <KV class 1> and <KV class 2>). The content is structured as follows:
key: 68000000000138ce2521b58599e95e181e6e2ea8bb5e5c6e63b27a92f85b57c1d44be626809fd7ce4074-42;6f159e4890cde46d5cadbfcd08a052f495ae791244728e06d910cc743ed05eae12d77d079272dc0eb87abab5b6869360275148b3afe35fff68ce920c6b46ae9e2b-65; Freq: 2; Blocks: 20500006
...
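The per-pair lines in freq-sorted-<distance>.log and Dist-<distance>-<KV class 1>-<KV class 2>-freq.log can be split into their fields as sketched below. This is an illustration rather than the tool's own code: the type and function names are hypothetical, the numeric suffix after each key is assumed to be the key length in bytes (the two example keys decode to 42 and 65 bytes, matching their suffixes), and the Blocks field is kept as a raw string because only a single-block example is shown.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// keyRef is one key of a correlated pair: the hex string and its numeric suffix.
type keyRef struct {
	Key  string // hex-encoded key
	Size int    // suffix after the last '-', assumed to be the key size in bytes
}

// correlatedPair is one line of a per-pair frequency log.
type correlatedPair struct {
	First, Second keyRef
	Freq          int
	Blocks        string // raw content of the Blocks field
}

// field strips surrounding spaces and an optional "label" prefix.
func field(s, label string) string {
	return strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(s), label))
}

func parseKeyRef(s string) (keyRef, error) {
	i := strings.LastIndex(s, "-")
	if i < 0 {
		return keyRef{}, fmt.Errorf("missing size suffix in %q", s)
	}
	size, err := strconv.Atoi(s[i+1:])
	if err != nil {
		return keyRef{}, fmt.Errorf("bad size suffix in %q: %w", s, err)
	}
	return keyRef{Key: s[:i], Size: size}, nil
}

// parsePairLine parses "key: <k1>-<n1>;<k2>-<n2>; Freq: <f>; Blocks: <b>".
func parsePairLine(line string) (correlatedPair, error) {
	parts := strings.SplitN(line, ";", 4)
	if len(parts) != 4 {
		return correlatedPair{}, fmt.Errorf("expected 4 fields in %q", line)
	}
	first, err := parseKeyRef(field(parts[0], "key:"))
	if err != nil {
		return correlatedPair{}, err
	}
	second, err := parseKeyRef(field(parts[1], ""))
	if err != nil {
		return correlatedPair{}, err
	}
	freq, err := strconv.Atoi(field(parts[2], "Freq:"))
	if err != nil {
		return correlatedPair{}, fmt.Errorf("bad frequency in %q: %w", line, err)
	}
	return correlatedPair{First: first, Second: second, Freq: freq, Blocks: field(parts[3], "Blocks:")}, nil
}

func main() {
	// Shortened, made-up keys; real lines carry full hex keys.
	p, err := parsePairLine("key: 68aa-2;6fbb-2; Freq: 2; Blocks: 20500006")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```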
Output for update correlation analysis
When the above scripts are finished correctly, you will see a folder named updateCorrelationOutput in the current directory (i.e., analysis). It contains three types of files:
freq-category-<distance>.log: Contains the frequency of update operations between two categories at a specific distance. It indicates how many correlated updates occurred between two KV classes at the specified distance. The content is structured as follows:
HeaderPrefix;SnapshotStoragePrefix: 2
...
Total frequency: 2
freq-sorted-<distance>.log: Contains the pairs of the correlated updates sorted by their frequency at a specific distance. It includes all the pairs from different KV classes that have correlated updates at the specified distance. The content is structured as follows:
key: 68000000000138ce2521b58599e95e181e6e2ea8bb5e5c6e63b27a92f85b57c1d44be626809fd7ce4074-42;6f159e4890cde46d5cadbfcd08a052f495ae791244728e06d910cc743ed05eae12d77d079272dc0eb87abab5b6869360275148b3afe35fff68ce920c6b46ae9e2b-65; Freq: 2; Blocks: 20500006
...
Dist-<distance>-<KV class 1>-<KV class 2>-freq.log: Contains the pairs of the correlated updates sorted by their frequency at a specific distance for a specific pair of KV classes (i.e., <KV class 1> and <KV class 2>). The content is structured as follows:
key: 68000000000138ce2521b58599e95e181e6e2ea8bb5e5c6e63b27a92f85b57c1d44be626809fd7ce4074-42;6f159e4890cde46d5cadbfcd08a052f495ae791244728e06d910cc743ed05eae12d77d079272dc0eb87abab5b6869360275148b3afe35fff68ce920c6b46ae9e2b-65; Freq: 2; Blocks: 20500006
...
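The freq-category-<distance>.log files, which share one format for read and update correlations, can be loaded as sketched below. The names `categoryPair` and `parseCategoryFreq` are hypothetical. The text does not state whether the `Total frequency` line equals the sum of the listed pairs, so the sketch returns both values and leaves any comparison to the caller.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// categoryPair holds the two KV classes of one "<class 1>;<class 2>: n" line.
type categoryPair [2]string

// parseCategoryFreq returns the per-pair frequencies and the value of the
// "Total frequency" line (0 if that line is absent).
func parseCategoryFreq(text string) (map[categoryPair]int, int, error) {
	pairs := make(map[categoryPair]int)
	total := 0
	for _, raw := range strings.Split(text, "\n") {
		line := strings.TrimSpace(raw)
		if line == "" || line == "..." {
			continue
		}
		i := strings.LastIndex(line, ":")
		if i < 0 {
			return nil, 0, fmt.Errorf("malformed line: %q", line)
		}
		n, err := strconv.Atoi(strings.TrimSpace(line[i+1:]))
		if err != nil {
			return nil, 0, fmt.Errorf("bad count in %q: %w", line, err)
		}
		label := strings.TrimSpace(line[:i])
		if label == "Total frequency" {
			total = n
			continue
		}
		a, b, ok := strings.Cut(label, ";")
		if !ok {
			return nil, 0, fmt.Errorf("expected two categories in %q", line)
		}
		pairs[categoryPair{a, b}] += n
	}
	return pairs, total, nil
}

func main() {
	sample := "HeaderPrefix;SnapshotStoragePrefix: 2\nTotal frequency: 2\n"
	pairs, total, err := parseCategoryFreq(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(pairs, total)
}
```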