This is the Raphtory benchmarking suite. It is designed to test the performance of Raphtory against other graph processing systems.
There are two benchmark suites: one for Python, with support for multiple systems, and one for Rust, with support for Raphtory only.
The Rust suite only runs a light benchmark of Raphtory, as it is designed to be used standalone.
Raphtory Quick Benchmark
Usage: raphtory-rust-benchmark [OPTIONS]
Options:
--header Set if the file has a header, default is False
--delimiter <DELIMITER> Delimiter of the csv file [default: "\t"]
--file-path <FILE_PATH> Path to a csv file [default: ]
--from-column <FROM_COLUMN> Position of the from column in the csv [default: 0]
--to-column <TO_COLUMN> Position of the to column in the csv [default: 1]
--time-column <TIME_COLUMN> Position of the time column in the csv, default will ignore time [default: -1]
--download Download default files
--debug Debug to print more info to the screen
-h, --help Print help
-V, --version Print version
First download the example file by changing into the raphtory-rust-benchmark folder and running:
cargo run --release -- --download
This will download the example file into a tmp folder on your system and print the file path.
You can then run the benchmark with the file path it gave you:
cargo run --release -- --file-path <file_path>
You can also provide your own file, but please ensure you set the correct arguments, i.e. whether it has a header, what the delimiter is, and which columns hold the from, to and time values.
For example:
cargo run --release -- --file-path="/Users/1337/Documents/dev/Data/lotr.csv" --delimiter="," --from-column=0 --to-column=1
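If you launch the Rust benchmark from a script, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options listed in the help text above; the helper name `build_rust_benchmark_command` and the example path are hypothetical and not part of the benchmark suite.

```python
import shlex


def build_rust_benchmark_command(file_path, delimiter="\t", from_column=0,
                                 to_column=1, time_column=-1, header=False):
    """Assemble the cargo invocation for the Rust benchmark as an argument list.

    Hypothetical helper: the flag names come from the help text above,
    everything else is an illustration.
    """
    cmd = [
        "cargo", "run", "--release", "--",
        f"--file-path={file_path}",
        f"--delimiter={delimiter}",
        f"--from-column={from_column}",
        f"--to-column={to_column}",
    ]
    if time_column >= 0:
        # The documented default of -1 ignores time, so only pass the flag
        # when a real column position is given.
        cmd.append(f"--time-column={time_column}")
    if header:
        # --header is a switch: its presence means the file has a header row.
        cmd.append("--header")
    return cmd


cmd = build_rust_benchmark_command("/path/to/edges.csv", delimiter=",")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the raphtory-rust-benchmark folder.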
This benchmarks the python version of raphtory. Please ensure your python environment has raphtory installed.
Systems currently supported are:
- Raphtory
- GraphTool
- Kuzu
- NetworkX
- Neo4j
- Memgraph
- CozoDB
All benchmarks are run in docker, and this documentation ONLY supports docker. They can be run outside of docker; however, this requires you to install all the systems yourself.
These benchmarks run using a slim version of the pokec dataset, which is a social network dataset. More information available here
- docker
- python 3.10
- python libraries
  - docker
  - requests
  - tqdm
  - pandas
  - raphtory
  - neo4j
- python libraries
pip install networkx scipy matplotlib raphtory kuzu neo4j
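Before starting a run, it can help to confirm that the listed libraries are importable in the active environment. This is a small sketch using only the standard library; it assumes each package's import name matches the name listed above, and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names taken from the library lists above (assumed to match the
# pip package names).
REQUIRED = ["docker", "requests", "tqdm", "pandas", "raphtory", "neo4j"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_modules(REQUIRED))
```

An empty list means every library was found; any names printed still need to be installed with pip.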
- First download the pokec dataset by running the command below. This will download the dataset into a folder called 'data'.
python benchmark_driver.py -b download
- Benchmarks can be run by running the command below with a benchmark name:
python benchmark_driver.py -b <benchmark_name>
Supported benchmark names are:
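| Benchmark name | Description |
|---|---|
| all | Run All |
| download | Download Data |
| r | Run Raphtory Benchmark |
| gt | Run GraphTool Benchmark |
| k | Run Kuzu Benchmark |
| nx | Run NetworkX Benchmark |
| neo | Run Neo4j Benchmark |
| mem | Run Memgraph Benchmark |
| cozo | Run CozoDB Benchmark |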
You will see the following
Welcome to the Raphtory Benchmarking Tool
Running dockerised benchmarking...
** Running dockerized benchmark neo...
** Running for Neo4j...
Starting docker container...
Creating Docker client...
Pulling Docker image...
Defining volumes...
Running Docker container & benchmark...
Running command ...
...
Welcome to the Raphtory Benchmarking Tool
Running local benchmarking...
** Running for XX...
** Running setup...
setup time: xx
** Running degree...
degree time: xx
** Running out_neighbours...
out_neighbours time: xx
** Running page_rank...
page_rank time: xx
** Running connected_components...
connected_components time: xx
setup degree out_neighbours page_rank connected_components
XX xx xx xx xx xx
Completed command...
Benchmark completed, retrieving results...
Removing container...
Docker container exited with code 0
- To run without docker, add the `--no-docker` arg.
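The per-step timings in the output above follow a `<name> time: <value>` pattern, so they can be collected from a captured log. The sketch below assumes real runs print plain numbers where the sample shows `xx`; the function name `parse_timings` is hypothetical, and the sample values are illustrative only, borrowed from the Raphtory row of the results table.

```python
import re

# Matches lines such as "page_rank time: 1.09".
TIMING_LINE = re.compile(r"^\s*(\w+) time:\s*(\S+)\s*$")


def parse_timings(output):
    """Collect '<name> time: <value>' lines into a dict of floats."""
    timings = {}
    for line in output.splitlines():
        match = TIMING_LINE.match(line)
        if not match:
            continue
        name, value = match.groups()
        try:
            timings[name] = float(value)
        except ValueError:
            # Skip placeholders such as "xx" that are not numbers.
            continue
    return timings


sample = """\
** Running setup...
setup time: 63.68
** Running degree...
degree time: 2.49
** Running page_rank...
page_rank time: 1.09
"""
print(parse_timings(sample))
```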
These benchmarks were run on Amazon AWS m5ad.4xlarge instances. All the scripts and data were stored on the instance NVME drive.
| | Setup | Degree | Out Neighbours | Page Rank | Connected Components |
|---|---|---|---|---|---|
| Raphtory | 63.68 | 2.49 | 23.04 | 1.09 | 17.89 |
| GraphTool | 194.09 | 0.008 | 43.30 | 4.75 | 3.83 |
| Kuzu | 18.17 | 1.13 | 89.03 | NOT IMPL | NOT IMPL |
| NetworkX | 130.57 | 1.17 | 24.42 | 162.0 | 160.99 |
| Neo4J | 61.09 | 9.40 | 1296.30 | | |
| MemGraph | 498.38 | 73.08 | 75.574 | 131.46 | 142.55 |
| Cozo | 137.82 | 35.36 | 35.17 | 32.83 | N/A SEG FAULT |
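To compare systems on a single step, the table can be turned into ratios against a baseline. The sketch below copies the numbers from the table, reading the first numeric column as Setup, and marks missing or failed results as `None`. The function name `relative_to_baseline` is hypothetical and the output is plain arithmetic on the published figures, not a new measurement.

```python
# Timings copied from the results table; None marks a missing result
# (blank cell, NOT IMPL, or seg fault).
COLUMNS = ["setup", "degree", "out_neighbours", "page_rank", "connected_components"]
ROWS = {
    "Raphtory":  [63.68, 2.49, 23.04, 1.09, 17.89],
    "GraphTool": [194.09, 0.008, 43.30, 4.75, 3.83],
    "Kuzu":      [18.17, 1.13, 89.03, None, None],
    "NetworkX":  [130.57, 1.17, 24.42, 162.0, 160.99],
    "Neo4J":     [61.09, 9.40, 1296.30, None, None],
    "MemGraph":  [498.38, 73.08, 75.574, 131.46, 142.55],
    "Cozo":      [137.82, 35.36, 35.17, 32.83, None],
}
RESULTS = {name: dict(zip(COLUMNS, values)) for name, values in ROWS.items()}


def relative_to_baseline(results, baseline, metric):
    """Return each system's time divided by the baseline's time for one metric.

    Systems without a result for the metric are left out.
    """
    base = results[baseline][metric]
    return {
        name: timings[metric] / base
        for name, timings in results.items()
        if timings[metric] is not None
    }


for name, ratio in relative_to_baseline(RESULTS, "Raphtory", "page_rank").items():
    print(f"{name}: {ratio:.2f}x the Raphtory page_rank time")
```

A ratio above 1 means the system took longer than the baseline on that step; below 1 means it was faster.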
Some key notes:
- NetworkX
  - We compared the PageRank results from Raphtory with those from NetworkX on a directed graph, and they are identical.
- Kuzu
  - Does not support page rank or connected components.
  - Out neighbours was run in batches of 100k users at a time, because requesting results for all users at once crashes after exceeding a buffer size.
- Neo4J
  - Due to the way Neo4J imports batch data, the data import was run in offline mode using the Neo4j admin tool, and this was counted in the setup time. The script then starts a Neo4J instance and waits for it to come online; we give the script 50 seconds to do this. These 50 seconds are removed from the setup time.
  - Because of this offline import, we had to change the format of the data specifically for Neo4J so it would import properly. Neo4J therefore runs a pre-processing step before the benchmark, which alters the data by creating a header and adding labels. This is not counted in the setup time. However, it means the data ingested by Neo4J is slightly larger.
  - The admin ingestion also required that we use both a node list and edge list, which we did not need for some of the other tools.
  - To run the neo4j benchmark, first run the command `python3 benchmark_driver.py -b neo`. This will set up the docker container; once it has finished setup, it runs a `tail -f /dev/null`. Quit the process and the container will still be online. Then run `docker ps` to get the container name, then `docker exec -it <container name> bash` to log in to the container. Then `cd /var/lib/neo4j/import/data2/` and run `python3 benchmark_driver.py -b neo --no-docker`
- Memgraph
  - It was advised to create indexes with the node list prior to relationships, which was done in the setup time. This could be why it took the longest to set up. https://memgraph.com/docs/memgraph/import-data/load-csv-clause#one-type-of-nodes-and-relationships
- Cozo
  - Triggered a segmentation fault when running the connected components algorithm.
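The batching described in the Kuzu note (100k users per request) can be sketched generically as below. This is not the code used in the benchmark: `query_batch` is a hypothetical stand-in for whatever call fetches out neighbours for a list of ids, and no Kuzu API is shown.

```python
from itertools import islice


def batched(ids, batch_size=100_000):
    """Yield successive lists of at most batch_size ids."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def out_neighbours_in_batches(ids, query_batch, batch_size=100_000):
    """Run query_batch (hypothetical callable) on each batch and join the results."""
    results = []
    for batch in batched(ids, batch_size):
        results.extend(query_batch(batch))
    return results


# Example with a dummy query that just echoes the ids it was given.
sizes = [len(batch) for batch in batched(range(250_000))]
print(sizes)
```

Keeping each request at a fixed size bounds how much result data is returned at once, which is the reason given above for batching.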