This directory contains an improved SQL parser utility, implemented in Python, for converting SQL queries into a structured dictionary format.
This is an enhanced version of the original process_sql.py from the test-suite-sql-eval repository.
It is designed for academic and benchmarking purposes, supporting a subset of SQL syntax with specific assumptions and features.
-
Assumptions:
- Input SQL queries are syntactically correct.
- Only table names can have aliases.
- Only one
INTERSECT,UNION, orEXCEPTclause per query.
-
SQL Clauses Supported:
SELECT(with optionalDISTINCTand aggregation:MAX,MIN,COUNT,SUM,AVG)FROM(supports table aliases and subqueries as tables)WHERE(logical operators, comparison operators,BETWEEN,IN,LIKE,IS,EXISTS)GROUP BYORDER BY(supports multiple clauses with different modes:ASC,DESC; supports using numbers as select clause values, e.g.,ORDER BY 1 DESC)HAVINGLIMIT- Set operations:
INTERSECT,UNION,EXCEPT(only one per query)
-
Joins:
- Explicit
JOINandONclauses in theFROMsection.
- Explicit
-
Subqueries:
- Subqueries in
SELECT,FROM, andWHEREclauses. - Subqueries as values in conditions, e.g.,
t.cost = (SELECT ...). - Subqueries as tables in the
FROMclause.
- Subqueries in
-
Alias Handling:
- Explicit (
AS) and implicit table aliases.
- Explicit (
-
Schema Mapping:
- Maps table and column names to unique identifiers using the database schema.
-
Additional Supported Syntax:
- Composite column names (e.g.,
table.column). - Numeric tokens as columns (e.g.,
SELECT 1). - Multiple values connected with
AND/ORin conditions. EXISTSas a condition.
- Composite column names (e.g.,
parse_sql_one.py: Main parser implementation.- Example schema and database files should be placed in the appropriate directories (see usage).
- Python 3.7+
nltk(for tokenization)sqlite3(standard library)- Example SQLite database and schema JSON file
Install NLTK if not already installed:
pip install nltkfrom parse_sql_one import Schema, get_schema, get_sql
db_path = "test-suite-sql-eval-master/database/twitter_1/twitter_1.sqlite"
schema = Schema(get_schema(db_path))
sql_query = "SELECT City, Hanzi FROM city AS hosting_city ORDER BY City ASC"
parsed_sql = get_sql(schema, sql_query)
print(parsed_sql)Or run the script directly:
python parse_sql_one.pyThe parser outputs a nested dictionary representing the SQL query structure, including all supported clauses and set operations. Example output:
{
'select': (isDistinct, [(agg_id, val_unit), ...]),
'from': {'table_units': [...], 'conds': [...]},
'where': [...],
'groupBy': [...],
'orderBy': ('asc'/'desc', [...]),
'having': [...],
'limit': value,
'intersect': None/sql,
'except': None/sql,
'union': None/sql
}- Only supports one set operation (
INTERSECT,UNION, orEXCEPT) per query. - Assumes SQL queries are correct and well-formed.
- Only table names can have aliases; column aliases are not supported.
- Designed for academic and benchmarking use, not for production SQL parsing.
- Original version:
process_sql.pyfrom test-suite-sql-eval
For questions or suggestions, please open an issue or contact the repository