Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing
This repository contains a framework for handling ambiguous natural language questions in text-to-SQL tasks, supporting two main datasets: Ambrosia and AmbiQT.
The main idea is to disambiguate the question into natural language interpretations first, then parse each interpretation into a SQL query:
I. Disambiguation:
- Generating an initial set of interpretations (default or preferred interpretations)
- Infilling the set with missing interpretations

II. Text-to-SQL Parsing
We use a modular design to handle the different tasks:
- Generating the initial set: zero-shot prompting with a text-based LLM (Llama 3.1 8B Instruct)
- Infilling the set: a finetuned adapter
- Text-to-SQL parsing: zero-shot prompting with a specialized code LLM (Qwen2.5-Coder 32B Instruct)
We also provide the code for annotating AmbiQT with synthetic gold interpretations that we used to finetune the infilling model.
For fast generation, we use Text Generation Inference (TGI) through the official Docker image `ghcr.io/huggingface/text-generation-inference:latest`.

For finetuning, we use Unsloth through the Docker image `irisaparina/unsloth:0125`.
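If you have not used TGI before, a server can be launched along these lines. This is a sketch following the standard TGI quickstart, not a repo-specific command; adjust GPUs, ports, and volume paths for your machine:

```bash
# Serve Llama 3.1 8B Instruct with TGI; publishing on port 80 matches the
# --tgi_url "http://0.0.0.0/v1" used in the commands below.
# HF_TOKEN is only needed for gated models such as Llama.
docker run --gpus all --shm-size 1g -p 80:80 \
    -v $HOME/.cache/huggingface:/data \
    -e HF_TOKEN=$HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
```

For the Unsloth image, one option is an interactive shell with the repository mounted, e.g. `docker run --gpus all -it -v $PWD:/workspace irisaparina/unsloth:0125`; the mount point is our assumption, adapt it to the image's layout.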
You also need to install the following packages:

```bash
python -m pip install configargparse wandb sqlparse openai numpy datasets
```
Most of the scripts support both Unsloth and TGI through the `--backend` argument, except for the finetuning script.
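For example, the generation script described below can be run locally through Unsloth instead of a TGI server. The `unsloth` flag value is our assumption based on the backend's name; check the script's `--help` for the exact choices:

```bash
# Unsloth backend: loads the model in-process, so no --tgi_url is needed
python src/generate_initial_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type ambiqt \
    --split train \
    --backend unsloth
```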
You can find AmbiQT in the `data/AmbiQT` folder. Unzip `db-content.zip` and create folders for the fixed databases (with unique values):

```bash
cd data/AmbiQT
unzip db-content.zip
mkdir db-content/database_syn db-content/database_syn_eval
```
You also need to download the Ambrosia dataset from the official website and put it in the `data/ambrosia` folder.

To resplit the Ambrosia dataset for training, you can use the `src/utils/resplit_ambrosia.py` script.
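A minimal invocation might look like the following; the `--ambrosia_file` flag is an assumption carried over from `src/finetuning.py` (see below), so check `--help` for the script's actual interface:

```bash
# Hypothetical invocation; the flag is borrowed from finetuning.py's interface
python src/utils/resplit_ambrosia.py \
    --ambrosia_file data/ambrosia/data/ambrosia.csv
```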
Gold interpretations for AmbiQT are stored in the `data/ambiqt_gold_interpretations` folder. To reproduce them, you can use the `src/annotate_ambiqt_with_gold_interpretations.py` script:
```bash
python src/annotate_ambiqt_with_gold_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --split SPLIT \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```

`SPLIT` is `train` or `test`. `--backend` selects TGI or Unsloth; choose TGI for speedier generation.
The repository uses YAML configuration files in `src/configs/` to manage various settings:

- `dataset_configs.yaml`: dataset loading parameters and filtering options
- `train.yaml`: training hyperparameters and model settings
You can also override any config parameter via command line arguments when running the scripts.
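For example, a hyperparameter defined in `train.yaml` can be overridden for a single run; here `num_epoch`, which `src/finetuning.py` accepts (see the finetuning section below). Only the flags relevant to the example are shown:

```bash
# The command-line value takes precedence over the one in src/configs/train.yaml
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type ambrosia \
    --num_epoch 20
```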
- To generate the initial interpretations, use the `src/generate_initial_interpretations.py` script:
```bash
python src/generate_initial_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type DATASET_TYPE \
    --split SPLIT \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```

`DATASET_TYPE` is `ambiqt` or `ambrosia`; `SPLIT` is `train`, `test`, or `validation` (Ambrosia only).
- Then generate SQL queries for the initial interpretations using the `src/generate_sql_from_initial_interpretations.py` script:
```bash
python src/generate_sql_from_initial_interpretations.py \
    --model_name Qwen/Qwen2.5-Coder-32B-Instruct \
    --interpretation_file INTERPRETATION_FILE \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```
`INTERPRETATION_FILE` is the file with the initial interpretations generated by the `src/generate_initial_interpretations.py` script, e.g. `outputs/initial_interpretations/initial_interpretations_meta-llama-3.1-8b-instruct_seed42_ambiqt_train_tgi.json`.
- Finally, filter the initial interpretations using the `src/filter_initial_interpretations.py` script:
```bash
python src/filter_initial_interpretations.py \
    --input_file INTERPRETATION_WITH_SQL_FILE
```

`INTERPRETATION_WITH_SQL_FILE` is the output of the previous step.
To infill the set of interpretations, train the adapter using the `src/finetuning.py` script:
```bash
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_sql_name Qwen/Qwen2.5-Coder-7B-Instruct \
    --interpretation_model_train llama-3.1-8b-instruct \
    --interpretation_model_test llama-3.1-8b-instruct \
    --dataset_type DATASET_TYPE \
    --learn_missing \
    --num_epoch 15 \
    --validation_checkpoints 5
```

`--model_sql_name` sets the model used for SQL generation during validation; `--validation_checkpoints 5` uses the 5 best checkpoints for validation.
By default, `src/finetuning.py` runs training, validation, and inference. If you want to run inference only, use the `--mode test` argument:
```bash
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_sql_name Qwen/Qwen2.5-Coder-7B-Instruct \
    --interpretation_model_train llama-3.1-8b-instruct \
    --interpretation_model_test llama-3.1-8b-instruct \
    --dataset_type DATASET_TYPE \
    --learn_missing \
    --mode test \
    --test_checkpoint CHECKPOINT_PATH
```
The script uses the resplit Ambrosia dataset by default. If you want to use the original Ambrosia dataset, pass the `--ambrosia_file data/ambrosia/data/ambrosia.csv` argument.
Finally, generate SQL queries for the infilled interpretations using the `src/generate_sql_from_final_interpretations.py` script. It works on top of the predictions from the `src/finetuning.py` script:
```bash
python src/generate_sql_from_final_interpretations.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --test_predictions FINETUNING_RESULTS_FILE \
    --backend tgi \
    --tgi_url "http://0.0.0.0/v1"
```
The script can either recompute SQL queries for the initial interpretations or reuse the existing ones (pass `--use_existing_sql_prediction`).
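For reference, an end-to-end AmbiQT run chains the steps above along these lines. The uppercase placeholders stand for file paths produced by the preceding steps; each script prints the actual path of its output:

```bash
# 1. Generate the initial interpretation set (Llama 3.1 8B Instruct via TGI)
python src/generate_initial_interpretations.py \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --dataset_type ambiqt --split test \
    --backend tgi --tgi_url "http://0.0.0.0/v1"

# 2. Parse the initial interpretations into SQL (Qwen2.5-Coder 32B Instruct)
python src/generate_sql_from_initial_interpretations.py \
    --model_name Qwen/Qwen2.5-Coder-32B-Instruct \
    --interpretation_file INTERPRETATION_FILE \
    --backend tgi --tgi_url "http://0.0.0.0/v1"

# 3. Filter the initial interpretations using their SQL queries
python src/filter_initial_interpretations.py \
    --input_file INTERPRETATION_WITH_SQL_FILE

# 4. Infill missing interpretations with the finetuned adapter (inference only)
python src/finetuning.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --model_sql_name Qwen/Qwen2.5-Coder-7B-Instruct \
    --interpretation_model_train llama-3.1-8b-instruct \
    --interpretation_model_test llama-3.1-8b-instruct \
    --dataset_type ambiqt --learn_missing \
    --mode test --test_checkpoint CHECKPOINT_PATH

# 5. Generate SQL queries for the final, infilled interpretation sets
python src/generate_sql_from_final_interpretations.py \
    --hf_token HF_TOKEN \
    --model_name meta-llama/Meta-Llama-3.1-8B-Instruct \
    --test_predictions FINETUNING_RESULTS_FILE \
    --backend tgi --tgi_url "http://0.0.0.0/v1"
```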