This repository contains all the code and data we use to produce experimental results in the paper "Separability, Collisions, Triangles, and more: Evaluating SQL Query Similarity Metrics".
The structure of the repository is as follows:
- `src`: The Python code that runs the experiments.
- `SQLFeatureExtraction`: The program that calculates distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms.
- `postgres`: The Docker container that hosts the SQL databases we utilize.
- `diagrams`: The visualizations used in the paper, organized by experiment.
- `tables`: The experimental results in `.csv` format.
The `postgres` folder is a Docker container that hosts the Postgres databases we utilize in the experiments. Make sure you have Docker Compose installed, then run `docker compose run -p 5432:5432 db` to start the container, or `docker compose up db` to keep it running as a service. Alternatively, you can manually build and host the databases using the `.sql` files in the `postgres/databases` folder.
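As a quick sanity check that the container is up and accepting connections, here is a minimal stdlib sketch; the host/port pair mirrors the `-p 5432:5432` mapping above, so adjust it if you changed the mapping:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing is listening there.
        return False

# After `docker compose run -p 5432:5432 db`, this should report True.
print(port_open("localhost", 5432))
```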
The `src` folder contains the code that runs all experiments and produces all the tables and plots. To run it, we recommend using Python 3.9.7 and setting up a virtual environment. Run `pip install -r requirements.txt` to install all libraries and dependencies.
The available tools are:

- `query_similarity_analysis.py`: Use its command-line arguments to run the analysis on specific datasets and methods, and to run experiments in parallel.
- `query_similarity_experiments.py`: Runs all analyses and produces all tables and plots.
- `query_similarity-llm.py`: Utilizes an OpenAI LLM (the gpt-4o-mini model) to produce the LLM-based datasets and the LLM-based similarity algorithms. As the datasets and results are included in the repository, you do not need to run it for the experiments.
The tool in the `SQLFeatureExtraction` folder is used to calculate the distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms for a given set of SQL queries. The program is originally derived from https://github.com/UBOdin/EttuBench, slightly modified to run the algorithms on a user-defined dataset of queries.
To run it, first install Maven, then go to the `SQLFeatureExtraction` folder and compile it with `mvn clean install`.
To run the experiments, follow these steps:

- Export the queries in a text format.
- Use the `SQLFeatureExtraction` tool to produce the distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms.
- Run the `query_similarity_analysis.py` program to perform all calculations and produce the visualizations (this step can be omitted, but we recommend running it to speed up the calculations).
- Run the `query_similarity_experiments.py` program to produce the aggregate tables and visualizations.
Go to the `src` folder and run `python query_similarity_analysis.py --export`.
This will create the following files:

- `src/similarity_cache/small/exported_queries.txt`
- `src/similarity_cache/medium/exported_queries.txt`
- `src/similarity_cache/large/exported_queries.txt`
- `src/similarity_cache/llm-actors-limited-0/exported_queries.txt`
- `src/similarity_cache/llm-actors-numerous-0/exported_queries.txt`
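If you want to confirm the export succeeded before moving on, a small sketch (dataset names taken from the list above; run it from the repository root):

```python
from pathlib import Path

# Dataset names as used in this repository's similarity_cache layout.
DATASETS = [
    "small",
    "medium",
    "large",
    "llm-actors-limited-0",
    "llm-actors-numerous-0",
]

def export_path(dataset: str) -> Path:
    """Path where `--export` writes the query dump for one dataset."""
    return Path("src/similarity_cache") / dataset / "exported_queries.txt"

# Report any datasets whose export file is missing.
missing = [d for d in DATASETS if not export_path(d).exists()]
print("missing exports:", missing or "none")
```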
The first run will take a considerable amount of time, as it trains the vocabularies we use for the word2vec vector embeddings.
For each dataset, you must separately run the `SQLFeatureExtraction` tool to produce the distance matrices for the `aouiche`, `aligon` and `makiyama` similarity algorithms, and copy the results to their designated folder:
- Copy `src/similarity_cache/<DATASET>/exported_queries.txt` into the `SQLFeatureExtraction/data` folder and rename it to `my_queries.txt`.
- Go to the `SQLFeatureExtraction/target` folder and run `java -jar SQLComparison.jar -input my_queries`. This will populate the `SQLFeatureExtraction/data` folder with the distance matrices for the `aouiche`, `aligon`, and `makiyama` similarity algorithms. It may take minutes to a few hours, depending on the size of the dataset.
- Copy the files `my_queries_aouiche.csv`, `my_queries_aligon.csv`, `my_queries_makiyama.csv`, `my_queries_aouiche_regularization.csv`, `my_queries_aligon_regularization.csv`, and `my_queries_makiyama_regularization.csv` over to `src/similarity_cache/<DATASET>/<EMBEDDING_METHOD>/` for every embedding method for that dataset. The embedding method folders are `bert_vocab(200)`, `bert_vocab(768)` and `bert`; if they don't exist, you should create them.
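To verify the copy step for one dataset, a hedged sketch that lists which of the expected files are still missing (algorithm and folder names as given above; run from the repository root):

```python
from pathlib import Path

# Algorithm and embedding-method names as used in this repository.
ALGORITHMS = ["aouiche", "aligon", "makiyama"]
EMBEDDING_METHODS = ["bert_vocab(200)", "bert_vocab(768)", "bert"]

def expected_matrix_files(dataset: str) -> list[Path]:
    """All distance-matrix CSVs the analysis step expects for one dataset."""
    files = []
    for method in EMBEDDING_METHODS:
        folder = Path("src/similarity_cache") / dataset / method
        for algo in ALGORITHMS:
            files.append(folder / f"my_queries_{algo}.csv")
            files.append(folder / f"my_queries_{algo}_regularization.csv")
    return files

# 3 methods x 3 algorithms x 2 files = 18 CSVs per dataset.
missing = [p for p in expected_matrix_files("small") if not p.exists()]
print(f"{len(missing)} file(s) still missing for 'small'")
```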
Go to the `src` folder and run the `query_similarity_analysis.py` program. You can directly run `python query_similarity_analysis.py --all` to run all experiments and sensitivity analyses, but this is expected to take a long time, on the order of days. Alternatively, we recommend running parts of the experiments in parallel on different processes, which speeds up the whole run considerably:
- `python query_similarity_analysis.py --dataset small --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset small --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset small --embedding bert`
- `python query_similarity_analysis.py --dataset medium --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset medium --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset medium --embedding bert`
- `python query_similarity_analysis.py --dataset large --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset large --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset large --embedding bert`
- `python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert`
- `python query_similarity_analysis.py --dataset llm-actors-numerous-0 --embedding bert_vocab(200)`
- `python query_similarity_analysis.py --dataset llm-actors-numerous-0 --embedding bert_vocab(768)`
- `python query_similarity_analysis.py --sensitivity small`
- `python query_similarity_analysis.py --sensitivity medium`
- `python query_similarity_analysis.py --sensitivity large`
Go to the `src` folder and run `python query_similarity_experiments.py`. If you already ran `query_similarity_analysis.py`, this shouldn't take more than a few minutes; otherwise, it will first run all experiments and take days to finish. The folder `src/similarity_cache/tables` should contain all produced tables and aggregate plots. The `src/similarity_cache/<DATASET>/<EMBEDDING_METHOD>/` folders will also contain visualizations of the cluster and distance distributions for all similarity algorithms.