This set of benchmarks is built around widely used machine-learning datasets fetched from multiple sources.
Click the link below to download the respective benchmark, or fetch the URLs with wget from the command line.
airlines
bank_marketing
cnae-9
connect-4
fashion_mnist
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/t10k-images-idx3-ubyte.gz
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/t10k-labels-idx1-ubyte.gz
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/train-images-idx3-ubyte.gz
- https://github.com/zalandoresearch/fashion-mnist/blob/master/data/fashion/train-labels-idx1-ubyte.gz
nomao
numerai
higgs
census
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test
titanic
creditcard
appetency
nyc_taxi
news_popularity
black_friday
mercedes
diamonds
After you have downloaded a benchmark, run the preprocess.py script, passing the benchmark name:
python3 sql/preprocess.py --benchmark <name>
Launch MySQL Shell:
mysqlsh user@hostname --mysql --sql
At the MySQL Shell prompt, run
> source sql/<benchmark_name>.sql
where <benchmark_name> is a name from the list above. The train and test CSVs generated by the preprocessing step must be present in MySQL Shell's current directory. Each SQL file creates the schema for a benchmark, trains a HeatWave ML model on the training data, and scores the model on the test data. The test score is printed at the end.
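The benchmark SQL files follow roughly this pattern, using HeatWave ML's sys.ML_TRAIN and sys.ML_SCORE routines. The sketch below is illustrative only: the schema, table, and column names are placeholders, not the actual names from the sql files, and routine signatures can vary slightly across HeatWave versions.

```sql
-- Illustrative sketch of a benchmark .sql file; all names are placeholders.
-- 1. Create the schema and load the train/test CSVs (LOAD DATA details omitted).
CREATE DATABASE IF NOT EXISTS bench;

-- 2. Train a HeatWave ML model on the train table; the model handle lands in @model.
CALL sys.ML_TRAIN('bench.train_tbl', 'target_col',
                  JSON_OBJECT('task', 'classification'), @model);

-- 3. Load the model and score it on the test table.
CALL sys.ML_MODEL_LOAD(@model, NULL);
CALL sys.ML_SCORE('bench.test_tbl', 'target_col', @model, 'accuracy', @score, NULL);
SELECT @score;
```

The trailing options argument to ML_SCORE was added in later HeatWave releases; on older versions the call takes five arguments.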
To measure HeatWave ML scalability for the benchmarks above, run the ML_TRAIN commands from the SQL files for each benchmark on 1, 2, 4, 8, and 16 nodes. Measure the end-to-end training time (ML_TRAIN time as seen from the MySQL client) for each configuration (benchmark + node count). Plotting runtime against the number of nodes gives the scalability curve for a benchmark.
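One way to take the client-side timing is to bracket the ML_TRAIN call with timestamps, as sketched below. The table and target column shown are placeholders; substitute the actual ML_TRAIN statement from the benchmark's sql file.

```sql
-- Time one ML_TRAIN run from the client's perspective.
SET @t0 = SYSDATE(6);

-- Placeholder: use the ML_TRAIN statement from the benchmark's sql file here.
CALL sys.ML_TRAIN('bench.train_tbl', 'target_col',
                  JSON_OBJECT('task', 'classification'), @model);

-- Elapsed wall-clock time in seconds.
SELECT TIMESTAMPDIFF(MICROSECOND, @t0, SYSDATE(6)) / 1e6 AS train_seconds;
```

Repeat this at each node count (1, 2, 4, 8, 16) and plot train_seconds against the number of nodes.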