Single (1 file) Parquet dataset
Apache DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
The benchmark should be completed in under an hour. On-demand pricing is $0.6 per hour while spot pricing is only $0.2 to $0.3 per hour (us-east-2).
- Manually start an AWS EC2 instance. The following environments are included in this directory:
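| Instance Type | OS | Disk | Arch |
|---|---|---|---|
| c6a.xlarge | Ubuntu 24.04 or later | Root 500GB gp2 SSD | AMD64 |
| c6a.2xlarge | Ubuntu 24.04 or later | Root 500GB gp2 SSD | AMD64 |
| c6a.4xlarge | Ubuntu 24.04 or later | Root 500GB gp2 SSD | AMD64 |
| c8g.4xlarge | Ubuntu 24.04 or later | Root 500GB gp2 SSD | ARM64 |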
All with no EBS optimization and no instance store.
- Wait for the status checks to pass, then ssh to EC2:
  ssh ubuntu@{ip}
- git clone https://github.com/ClickHouse/ClickBench
- cd ClickBench/datafusion
- vi benchmark.sh and modify the following line to target the DataFusion version:
  git checkout 47.0.0
- bash benchmark.sh
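The version edit in the steps above can also be scripted instead of done by hand in vi. The following is a minimal sketch, assuming benchmark.sh contains a single line of the form `git checkout <version>` as described above; the function name `set_datafusion_version` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ClickBench): rewrite the version on the
# "git checkout <version>" line of a script.
# Assumes the version contains no "/" or "&" characters (e.g. 47.0.0).
set_datafusion_version() {
    local file="$1" version="$2"
    # Write to a temporary file and move it back, which avoids relying on
    # the non-portable in-place flag of sed.
    sed -E "s/^([[:space:]]*git checkout )[^[:space:]]+/\1${version}/" "$file" > "${file}.tmp" \
        && mv "${file}.tmp" "$file"
}

# Usage (run inside ClickBench/datafusion):
#   set_datafusion_version benchmark.sh 47.0.0
#   grep 'git checkout' benchmark.sh   # confirm the change before running
```

Checking the result with grep before starting the benchmark is worthwhile, since the pattern only matches lines that begin with `git checkout`.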
You can update/preview the results by running:
./make-json.sh <machine-name>  # Example: ./make-json.sh c6a.xlarge
- DataFusion follows the SQL standard with case-sensitive identifiers, so all column names in queries.sql use double-quoted literals (e.g. EventTime -> "EventTime").
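To illustrate the quoting rule above, here is a small sketch that wraps a column name in double quotes when building a query string. The helper names `quote_ident` and `build_min_query` are hypothetical, and the table name `hits` is an assumption about what create.sql defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wrap a column name in double quotes so its
# mixed-case spelling is preserved. Embedded double quotes are doubled,
# following the usual SQL escaping rule.
quote_ident() {
    local escaped
    escaped="$(printf '%s' "$1" | sed 's/"/""/g')"
    printf '"%s"' "$escaped"
}

# Build a query against a table assumed to be named "hits".
build_min_query() {
    printf 'SELECT MIN(%s) FROM hits;\n' "$(quote_ident "$1")"
}

# Usage:
#   build_min_query EventTime
#   # prints: SELECT MIN("EventTime") FROM hits;
```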
- Install/build datafusion-cli.
- Download the parquet file:
  wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet
- Run the queries:
datafusion-cli -f create.sql -f queries.sql
Or use the runner script:
PATH="$(pwd)/arrow-datafusion/target/release:$PATH" ./run.sh
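For running queries one at a time, for example to time each query separately, a loop along the following lines may help. This is only a sketch and not a description of what run.sh does: the function name `run_each_query`, the `RUNNER` override, and the default of three tries are all assumptions. It reuses the `-f create.sql -f <file>` invocation shown above and expects one query per line in the queries file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run every line of a queries file several times.
# RUNNER defaults to datafusion-cli; override it for a dry run.
run_each_query() {
    local queries_file="$1" tries="${2:-3}" runner="${RUNNER:-datafusion-cli}"
    local query tmp i n=0
    tmp="$(mktemp)"
    # Read one query per line; also handle a last line without a newline.
    while IFS= read -r query || [ -n "$query" ]; do
        [ -z "$query" ] && continue          # skip blank lines
        n=$((n + 1))
        printf '%s\n' "$query" > "$tmp"
        for i in $(seq 1 "$tries"); do
            echo "query $n, try $i"
            # Redirect stdin so the runner cannot consume the loop's input.
            "$runner" -f create.sql -f "$tmp" < /dev/null
        done
    done < "$queries_file"
    rm -f "$tmp"
}

# Usage (from the directory containing create.sql and hits.parquet):
#   run_each_query queries.sql 3
```

Re-running create.sql for every query keeps each invocation independent, at the cost of repeating the table registration each time.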