Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing
This repository contains a framework for handling ambiguous natural language questions in text-to-SQL tasks, supporting two main datasets: Ambrosia and AmbiQT.
The main idea is to disambiguate the question into natural language interpretations first, then parse each interpretation into a SQL query:
I. Disambiguation:
- Generating an initial set of interpretations (default or preferred interpretations)
- Infilling the set with missing interpretations

II. Text-to-SQL Parsing
We use a modular design to handle the different tasks:
- Generating the initial set: zero-shot prompting with a text-based LLM (Llama 3.1 8B Instruct)
- Infilling the set: a finetuned adapter
- Text-to-SQL parsing: zero-shot prompting with a specialized code LLM (Qwen2.5-Coder 32B Instruct)
We also provide the code for annotating AmbiQT with synthetic gold interpretations that we used to finetune the infilling model.
For fast generation, we use Text Generation Inference (TGI) through the official Docker image `ghcr.io/huggingface/text-generation-inference:latest`.

For finetuning, we use Unsloth through the Docker image `irisaparina/unsloth:0125`.
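If you have not used TGI before, a server can be launched along these lines. This is a sketch following the standard TGI quickstart, not a repo-specific command; adjust GPUs, ports, and volume paths for your machine:

```bash
# Serve Llama 3.1 8B Instruct with TGI; publishing on port 80 matches the
# --tgi_url "http://0.0.0.0/v1" used in the commands below.
# HF_TOKEN is only needed for gated models such as Llama.
docker run --gpus all --shm-size 1g -p 80:80 \
    -v $HOME/.cache/huggingface:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
```

For the Unsloth image, one option is an interactive shell with the repository mounted, e.g. `docker run --gpus all -it -v $PWD:/workspace irisaparina/unsloth:0125`; the mount point is our assumption, adapt it to the image's layout.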
You also need to install the following packages:

```bash
python -m pip install configargparse wandb sqlparse openai numpy datasets
```
Most of the scripts support both Unsloth and TGI through the `--backend` argument, except for the finetuning script.
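For example, the generation script described below can be run locally through Unsloth instead of a TGI server. The `unsloth` flag value is our assumption based on the backend's name; check the script's `--help` for the exact choices:

```bash
# Unsloth backend: loads the model in-process, so no --tgi_url is needed
python src/generate_initial_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type ambiqt \
    --split train \
    --backend unsloth
```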
You can find AmbiQT in the `data/AmbiQT` folder. Unzip `db-content.zip` and create folders for the fixed databases (with unique values):

```bash
cd data/AmbiQT
unzip db-content.zip
mkdir db-content/database_syn db-content/database_syn_eval
```
You also need to download the Ambrosia dataset from the official website and put it in the `data/ambrosia` folder.

To resplit the Ambrosia dataset for training, you can use the `src/utils/resplit_ambrosia.py` script.
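A minimal invocation might look like the following; the `--ambrosia_file` flag is an assumption carried over from `src/finetuning.py` (see below), so check `--help` for the script's actual interface:

```bash
# Hypothetical invocation; the flag is borrowed from finetuning.py's interface
python src/utils/resplit_ambrosia.py \
    --ambrosia_file data/ambrosia/data/ambrosia.csv
```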
Gold interpretations for AmbiQT are stored in the `data/ambiqt_gold_interpretations` folder. To reproduce them, you can use the `src/annotate_ambiqt_with_gold_interpretations.py` script:
```bash
python src/annotate_ambiqt_with_gold_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --split SPLIT \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```

`SPLIT` is `train` or `test`. `--backend` selects TGI or Unsloth; choose TGI for speedier generation.
The repository uses YAML configuration files in `src/configs/` to manage various settings:

- `dataset_configs.yaml`: dataset loading parameters and filtering options
- `train.yaml`: training hyperparameters and model settings
You can also override any config parameter via command line arguments when running the scripts.
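For example, a hyperparameter defined in `train.yaml` can be overridden for a single run; here `num_epoch`, which `src/finetuning.py` accepts (see the finetuning section below). Only the flags relevant to the example are shown:

```bash
# The command-line value takes precedence over the one in src/configs/train.yaml
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type ambrosia \
    --num_epoch 20
```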
- To generate the initial interpretations, use the `src/generate_initial_interpretations.py` script:
```bash
python src/generate_initial_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type DATASET_TYPE \
    --split SPLIT \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```

`DATASET_TYPE` is `ambiqt` or `ambrosia`; `SPLIT` is `train`, `test`, or `validation` (Ambrosia only).
- Then generate SQL queries for the initial interpretations using the `src/generate_sql_from_initial_interpretations.py` script:
```bash
python src/generate_sql_from_initial_interpretations.py \
    --model_name Qwen/Qwen2.5-Coder-32B-Instruct \
    --interpretation_file INTERPRETATION_FILE \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```
`INTERPRETATION_FILE` is the file with the initial interpretations generated by the `src/generate_initial_interpretations.py` script, e.g. `outputs/initial_interpretations/initial_interpretations_meta-llama-3.1-8b-instruct_seed42_ambiqt_train_tgi.json`.
- Finally, filter the initial interpretations using the `src/filter_initial_interpretations.py` script:
```bash
python src/filter_initial_interpretations.py \
    --input_file INTERPRETATION_WITH_SQL_FILE
```

`INTERPRETATION_WITH_SQL_FILE` is the output of the previous step.
To infill the set of interpretations, train the adapter using the `src/finetuning.py` script:
```bash
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_sql_name Qwen/Qwen2.5-Coder-7B-Instruct \
    --interpretation_model_train llama-3.1-8b-instruct \
    --interpretation_model_test llama-3.1-8b-instruct \
    --dataset_type DATASET_TYPE \
    --learn_missing \
    --num_epoch 15 \
    --validation_checkpoints 5
```

`--model_sql_name` sets the model used for SQL generation during validation; `--validation_checkpoints 5` uses the 5 best checkpoints for validation.
By default, `src/finetuning.py` runs training, validation, and inference. If you want to run inference only, use the `--mode test` argument:
```bash
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_sql_name Qwen/Qwen2.5-Coder-7B-Instruct \
    --interpretation_model_train llama-3.1-8b-instruct \
    --interpretation_model_test llama-3.1-8b-instruct \
    --dataset_type DATASET_TYPE \
    --learn_missing \
    --mode test \
    --test_checkpoint CHECKPOINT_PATH
```
The script uses the resplit Ambrosia dataset by default. If you want to use the original Ambrosia dataset, pass the `--ambrosia_file data/ambrosia/data/ambrosia.csv` argument.
Finally, generate SQL queries for the infilled interpretations using the `src/generate_sql_from_final_interpretations.py` script. It works on top of the predictions from the `src/finetuning.py` script:
```bash
python src/generate_sql_from_final_interpretations.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --test_predictions FINETUNING_RESULTS_FILE \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```
The script can either recompute SQL queries for the initial interpretations or reuse the existing ones (pass `--use_existing_sql_prediction`).
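For reference, an end-to-end AmbiQT run chains the steps above along these lines. The uppercase placeholders stand for file paths produced by the preceding steps; each script prints the actual path of its output:

```bash
# 1. Generate the initial interpretation set (Llama 3.1 8B Instruct via TGI)
python src/generate_initial_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type ambiqt --split test \
    --backend tgi --tgi_url "http://0.0.0.0/v1"

# 2. Parse the initial interpretations into SQL (Qwen2.5-Coder 32B Instruct)
python src/generate_sql_from_initial_interpretations.py \
    --model_name Qwen/Qwen2.5-Coder-32B-Instruct \
    --interpretation_file INTERPRETATION_FILE \
    --backend tgi --tgi_url "http://0.0.0.0/v1"

# 3. Filter the initial interpretations using their SQL queries
python src/filter_initial_interpretations.py \
    --input_file INTERPRETATION_WITH_SQL_FILE

# 4. Infill missing interpretations with the finetuned adapter (inference only)
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_sql_name Qwen/Qwen2.5-Coder-7B-Instruct \
    --interpretation_model_train llama-3.1-8b-instruct \
    --interpretation_model_test llama-3.1-8b-instruct \
    --dataset_type ambiqt --learn_missing \
    --mode test --test_checkpoint CHECKPOINT_PATH

# 5. Generate SQL queries for the final, infilled interpretation sets
python src/generate_sql_from_final_interpretations.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --test_predictions FINETUNING_RESULTS_FILE \
    --backend tgi --tgi_url "http://0.0.0.0/v1"
```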