This repository contains all the code and data we use to produce experimental results in the paper "Separability, Collisions, Triangles, and more: Evaluating SQL Query Similarity Metrics".
The structure of the repository is as follows:
- `src`: The Python code that runs the experiments.
- `SQLFeatureExtraction`: The program that calculates distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms.
- `postgres`: The Docker container that hosts the SQL databases we utilize.
- `diagrams`: The visualizations used in the paper, organized by experiment.
- `tables`: The experimental results in `.csv` format.
The `postgres` folder is a Docker container that hosts the Postgres databases we utilize in the experiments. Make sure you have Docker Compose installed, then run `docker compose run -p 5432:5432 db` to start the container, or `docker compose up db` to keep it running as a service. Alternatively, you can manually build and host the databases using the `.sql` files in the `postgres/databases` folder.
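As a quick sanity check that the container is up and accepting connections, here is a minimal stdlib sketch; the host/port pair mirrors the `-p 5432:5432` mapping above, so adjust it if you changed the mapping:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there.
        return False

# After `docker compose run -p 5432:5432 db`, this should report True.
print(port_open("localhost", 5432))
```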
The `src` folder contains the code that runs all experiments and produces all the tables and plots. To run it, we recommend using Python 3.9.7 and setting up a virtual environment. Run `pip install -r requirements.txt` to install all libraries and dependencies.
The available tools are:

- `query_similarity_analysis.py`: Use its command-line arguments to run the analysis on specific datasets and methods, and to run experiments in parallel.
- `query_similarity_experiments.py`: Runs all analyses and produces all tables and plots.
- `query_similarity-llm.py`: Utilizes an OpenAI LLM (the gpt-4o-mini model) to produce the LLM-based datasets and the LLM-based similarity algorithms. As the datasets and results are included in the repository, you do not need to run it for the experiments.
The tool in the `SQLFeatureExtraction` folder is used to calculate the distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms for a given set of SQL queries. The program is originally derived from https://github.com/UBOdin/EttuBench, slightly modified to run the algorithms on a user-defined dataset of queries.
To run it, first install Maven, then go to the `SQLFeatureExtraction` folder and compile it with `mvn clean install`.
To run the experiments, follow these steps:

- Export the queries in a text format.
- Use the `SQLFeatureExtraction` tool to produce the distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms.
- Run the `query_similarity_analysis.py` program to perform all calculations and produce the visualizations (this step can be omitted, but we recommend running it to speed up the calculations).
- Run the `query_similarity_experiments.py` program to produce the aggregate tables and visualizations.
Go to the `src` folder and run `python query_similarity_analysis.py --export`.
This will create the following files:

- `src/similarity_cache/small/exported_queries.txt`
- `src/similarity_cache/medium/exported_queries.txt`
- `src/similarity_cache/large/exported_queries.txt`
- `src/similarity_cache/llm-actors-limited-0/exported_queries.txt`
- `src/similarity_cache/llm-actors-numerous-0/exported_queries.txt`
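If you want to confirm the export succeeded before moving on, a small sketch (dataset names taken from the list above; run it from the repository root):

```python
from pathlib import Path

# Dataset names as used in this repository's similarity_cache layout.
DATASETS = [
    "small",
    "medium",
    "large",
    "llm-actors-limited-0",
    "llm-actors-numerous-0",
]

def export_path(dataset: str) -> Path:
    """Path where `--export` writes the query dump for one dataset."""
    return Path("src/similarity_cache") / dataset / "exported_queries.txt"

# Report any datasets whose export file is missing.
missing = [d for d in DATASETS if not export_path(d).exists()]
print("missing exports:", missing or "none")
```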
The first run will take a considerable amount of time, as it trains the vocabularies we use for the word2vec vector embeddings.
For each dataset, you must separately run the `SQLFeatureExtraction` tool to produce the distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms, and copy the results to their designated folder:
- Copy `src/similarity_cache/<DATASET>/exported_queries.txt` into the `SQLFeatureExtraction/data` folder and rename it to `my_queries.txt`.
- Go to the `SQLFeatureExtraction/target` folder and run `java -jar SQLComparison.jar -input my_queries`. This will populate the `SQLFeatureExtraction/data` folder with the distance matrices for the `aouiche`, `aligon`, and `makiyama` similarity algorithms. It may take minutes to a few hours, depending on the size of the dataset.
- Copy the files `my_queries_aouiche.csv`, `my_queries_aligon.csv`, `my_queries_makiyama.csv`, `my_queries_aouiche_regularization.csv`, `my_queries_aligon_regularization.csv`, and `my_queries_makiyama_regularization.csv` over to `src/similarity_cache/<DATASET>/<EMBEDDING_METHOD>/` for every embedding method for that dataset. The embedding method folders are `bert_vocab(200)`, `bert_vocab(768)` and `bert`; if they don't exist, you should create them.
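To verify the copy step for one dataset, a hedged sketch that lists which of the expected files are still missing (algorithm and folder names as given above; run from the repository root):

```python
from pathlib import Path

# Algorithm and embedding-method names as used in this repository.
ALGORITHMS = ["aouiche", "aligon", "makiyama"]
EMBEDDING_METHODS = ["bert_vocab(200)", "bert_vocab(768)", "bert"]

def expected_matrix_files(dataset: str) -> list[Path]:
    """All distance-matrix CSVs the analysis step expects for one dataset."""
    files = []
    for method in EMBEDDING_METHODS:
        folder = Path("src/similarity_cache") / dataset / method
        for algo in ALGORITHMS:
            files.append(folder / f"my_queries_{algo}.csv")
            files.append(folder / f"my_queries_{algo}_regularization.csv")
    return files

# 3 methods x 3 algorithms x 2 files = 18 CSVs per dataset.
missing = [p for p in expected_matrix_files("small") if not p.exists()]
print(f"{len(missing)} file(s) still missing for 'small'")
```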
Go to the `src` folder and run the `query_similarity_analysis.py` program. You can directly run `python query_similarity_analysis.py --all` to run all experiments and sensitivity analyses, but this is expected to take a long time, on the order of days. Alternatively, we recommend running parts of the experiments in parallel on different processes, which speeds up the whole run considerably:
- `python query_similarity_analysis.py --dataset small --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset small --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset small --embedding bert`
- `python query_similarity_analysis.py --dataset medium --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset medium --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset medium --embedding bert`
- `python query_similarity_analysis.py --dataset large --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset large --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset large --embedding bert`
- `python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert`
- `python query_similarity_analysis.py --dataset llm-actors-numerous-0 --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset llm-actors-numerous-0 --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --sensitivity small`
- `python query_similarity_analysis.py --sensitivity medium`
- `python query_similarity_analysis.py --sensitivity large`
Go to the `src` folder and run `python query_similarity_experiments.py`. If you already ran `query_similarity_analysis.py`, this shouldn't take more than a few minutes; otherwise, it will first run all experiments and take days to finish. The folder `src/similarity_cache/tables` should contain all produced tables and aggregate plots. The `src/similarity_cache/<DATASET>/<EMBEDDING_METHOD>/` folders will also contain visualizations of the cluster and distance distributions for all similarity algorithms.