admtlab/tree-embed

Separability, Collisions, Triangles, and more: Evaluating SQL Query Similarity Metrics

This repository contains all the code and data used to produce the experimental results in the paper "Separability, Collisions, Triangles, and more: Evaluating SQL Query Similarity Metrics".

The structure of the repository is as follows:

  • src : The Python code that runs the experiments.
  • SQLFeatureExtraction : The program that calculates distance matrices for the aouiche, aligon, and makiyama similarity algorithms.
  • postgres : The Docker setup that hosts the SQL databases we use.
  • diagrams : The visualizations used in the paper, organized by experiment.
  • tables : The experimental results in .csv format.

The test databases

The postgres folder contains a Docker setup that hosts the Postgres databases used in the experiments. Make sure you have Docker Compose installed, and run docker compose run -p 5432:5432 db to start the container, or docker compose up db to keep it running as a service. Alternatively, you can build and host the databases manually using the .sql files in the postgres/databases folder.
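Container start-up takes a moment, so experiments launched immediately after docker compose may fail to connect. A small helper like the one below (not part of the repository; a sketch assuming the default port mapping of 5432 shown above) can wait until Postgres accepts TCP connections:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP port accepts connections, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means the server is listening.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet; retry shortly
    return False


# Example: wait_for_port("127.0.0.1", 5432, timeout=60.0) before running experiments.
```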

The Python code that runs the experiments

The src folder contains the code that runs all experiments and produces all tables and plots. We recommend using Python 3.9.7 inside a virtual environment. Run pip install -r requirements.txt to install all libraries and dependencies. The available tools are:

  • query_similarity_analysis.py - You can use its command-line arguments to run analysis on specific datasets and methods and run experiments in parallel.
  • query_similarity_experiments.py - Runs all analyses, produces all tables and plots.
  • query_similarity-llm.py - Uses an OpenAI LLM (the gpt-4o-mini model) to produce the LLM-based datasets and the results of the LLM-based similarity algorithms. As these datasets and results are already included in the repository, you do not need to run it to reproduce the experiments.

The SQLFeatureExtraction tool

The tool in the SQLFeatureExtraction folder calculates the distance matrices for the aouiche, aligon, and makiyama similarity algorithms for a given set of SQL queries. The program is derived from https://github.com/UBOdin/EttuBench, slightly modified to run the algorithms on a user-defined dataset of queries.

To run it, first install Maven, then compile the tool by going to the SQLFeatureExtraction folder and running mvn clean install.

Running the experiments

To run the experiments, follow these steps:

  • Export the queries in a text format.
  • Use the SQLFeatureExtraction tool to produce the distance matrices for the aouiche, aligon, and makiyama similarity algorithms.
  • Run the query_similarity_analysis.py program to perform all calculations and produce the visualizations (this step can be omitted, but we recommend running it to speed up the calculations).
  • Run the query_similarity_experiments.py program to produce the aggregate tables and visualizations.

Exporting the queries

Go to the src folder and run python query_similarity_analysis.py --export. This creates the files src/similarity_cache/small/exported_queries.txt, src/similarity_cache/medium/exported_queries.txt, src/similarity_cache/large/exported_queries.txt, src/similarity_cache/llm-actors-limited-0/exported_queries.txt, and src/similarity_cache/llm-actors-numerous-0/exported_queries.txt. The first run takes a considerable amount of time, as it trains the vocabularies we use for the word2vec vector embeddings.
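If you want to sanity-check an export before feeding it to SQLFeatureExtraction, a small sketch like the following can help. The one-query-per-line layout is an assumption on our part; verify it against the actual exported file format:

```python
from pathlib import Path


def count_queries(path: str) -> int:
    """Count non-empty lines; assumes one exported query per line (verify!)."""
    text = Path(path).read_text(encoding="utf-8")
    return sum(1 for line in text.splitlines() if line.strip())


# Example (path produced by the export step above):
# n = count_queries("similarity_cache/small/exported_queries.txt")
```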

Calculating Aouiche, Aligon and Makiyama

For each dataset, separately, you must run the SQLFeatureExtraction tool to produce the distance matrices for the aouiche, aligon, and makiyama similarity algorithms, and copy the results to their designated folder:

  • Copy src/similarity_cache/<DATASET>/exported_queries.txt into the SQLFeatureExtraction/data folder and rename it to my_queries.txt.
  • Go to the SQLFeatureExtraction/target folder, and run java -jar SQLComparison.jar -input my_queries. This will populate the SQLFeatureExtraction/data folder with the distance matrices for aouiche, aligon, and makiyama similarity algorithms. It may take minutes to a few hours, depending on the size of the dataset.
  • Copy the files my_queries_aouiche.csv, my_queries_aligon.csv, my_queries_makiyama.csv, my_queries_aouiche_regularization.csv, my_queries_aligon_regularization.csv, and my_queries_makiyama_regularization.csv over to src/similarity_cache/<DATASET>/<EMBEDDING_METHOD>/ for every embedding method of that dataset. The embedding method folders are bert_vocab(200), bert_vocab(768), and bert; create them if they do not exist.
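The copy step above is repetitive, so it can be scripted. The folder names and file list in this sketch come straight from the instructions; the script itself is not part of the repository:

```python
import shutil
from pathlib import Path

# Distance matrix files produced by SQLComparison.jar (see the step above).
MATRIX_FILES = [
    "my_queries_aouiche.csv",
    "my_queries_aligon.csv",
    "my_queries_makiyama.csv",
    "my_queries_aouiche_regularization.csv",
    "my_queries_aligon_regularization.csv",
    "my_queries_makiyama_regularization.csv",
]
EMBEDDING_METHODS = ["bert_vocab(200)", "bert_vocab(768)", "bert"]


def copy_matrices(data_dir: str, cache_dir: str, dataset: str) -> None:
    """Copy the distance matrices into every embedding-method folder of a dataset."""
    for method in EMBEDDING_METHODS:
        target = Path(cache_dir) / dataset / method
        target.mkdir(parents=True, exist_ok=True)  # create folders if missing
        for name in MATRIX_FILES:
            shutil.copy2(Path(data_dir) / name, target / name)


# Example: copy_matrices("SQLFeatureExtraction/data", "src/similarity_cache", "small")
```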

Running query_similarity_analysis.py

Go to the src folder and run the query_similarity_analysis.py program. You can run python query_similarity_analysis.py --all to run all experiments and sensitivity analyses directly, but this is expected to take a long time, on the order of days. Instead, we recommend running parts of the experiments in parallel as separate processes, which speeds up the whole run considerably:

  • python query_similarity_analysis.py --dataset small --embedding bert_vocab(200)
  • python query_similarity_analysis.py --dataset small --embedding bert_vocab(768)
  • python query_similarity_analysis.py --dataset small --embedding bert
  • python query_similarity_analysis.py --dataset medium --embedding bert_vocab(200)
  • python query_similarity_analysis.py --dataset medium --embedding bert_vocab(768)
  • python query_similarity_analysis.py --dataset medium --embedding bert
  • python query_similarity_analysis.py --dataset large --embedding bert_vocab(200)
  • python query_similarity_analysis.py --dataset large --embedding bert_vocab(768)
  • python query_similarity_analysis.py --dataset large --embedding bert
  • python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert_vocab(200)
  • python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert_vocab(768)
  • python query_similarity_analysis.py --dataset llm-actors-limited-0 --embedding bert
  • python query_similarity_analysis.py --dataset llm-actors-numerous-0 --embedding bert_vocab(200)
  • python query_similarity_analysis.py --dataset llm-actors-numerous-0 --embedding bert_vocab(768)
  • python query_similarity_analysis.py --sensitivity small
  • python query_similarity_analysis.py --sensitivity medium
  • python query_similarity_analysis.py --sensitivity large
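The per-process invocations above follow a regular dataset × embedding pattern, so they can be generated rather than typed out. The sketch below builds the full product (note that the listing above omits the bert embedding for llm-actors-numerous-0; drop that pair if the omission is intentional):

```python
from itertools import product

DATASETS = [
    "small",
    "medium",
    "large",
    "llm-actors-limited-0",
    "llm-actors-numerous-0",
]
EMBEDDINGS = ["bert_vocab(200)", "bert_vocab(768)", "bert"]


def analysis_commands() -> list:
    """Build the argument lists for every dataset/embedding analysis run."""
    cmds = [
        ["python", "query_similarity_analysis.py", "--dataset", d, "--embedding", e]
        for d, e in product(DATASETS, EMBEDDINGS)
    ]
    # Sensitivity analyses run per dataset size only.
    cmds += [
        ["python", "query_similarity_analysis.py", "--sensitivity", d]
        for d in ["small", "medium", "large"]
    ]
    return cmds


# Each entry can be handed to subprocess.Popen to launch its own process.
```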

Producing the aggregate tables and visualizations

Go to the src folder and run python query_similarity_experiments.py. If you have already run query_similarity_analysis.py, this should not take more than a few minutes; otherwise, it will first run all experiments, which takes days to finish. The src/similarity_cache/tables folder should contain all produced tables and aggregate plots. The src/similarity_cache/<DATASET>/<EMBEDDING_METHOD>/ folders will also contain visualizations of the cluster and distance distributions for all similarity algorithms.
