This repository contains source code for the EMNLP'2022 paper "Table-Operations Aware Fact Verification via Sentence-Table Cloze Pre-training". In this paper we introduce PASTA, a novel state-of-the-art framework for table-based fact verification via pre-training with synthesized sentence–table cloze questions. In particular, we design six types of common sentence–table cloze tasks, including Filter, Aggregation, Superlative, Comparative, Ordinal, and Unique, based on which we synthesize a large corpus consisting of 1.2 million sentence–table pairs from WikiTables.
|-- data_cache # save the propossed data cache for tabfact/semtabfacts
|-- datasets # save the official datasets of tabfact/semtabfacts
|-- experimental_results # save the evaluation results (acc|f1_micro) of tabfact/semtabfacts
|-- pretrained_models # the model used to adapted to the downstream tasks
|-- save_checkpoints # save the fine-tuned checkpoints of tabfact/semtabfacts
|-- src # code for fine-tune/pre-train
|-- scripts # configs for fine-tune/pre-train
|-- train_semtabfacts.json # you need to modify the path
|-- train_tabfact.json # you need to modify the path
|-- pretrain.json # you need to modify the path
|-- utils
|-- args.py # Definition of different parameters
|-- dataset.py # the class to load and preprocess the dataset
|-- linearize.py # the class to flatten a table into a linearized form, it is adapted from https://github.com/microsoft/Table-Pretraining/blob/0b87efa253232d4aafa52c1f4725cb4f6e027877/tapex/processor/table_linearize.py.
|-- entitylink.py # the class to identify entity links between the sentence and the table, it is adapted from https://github.com/wenhuchen/Table-Fact-Checking/blob/5ea13b8f6faf11557eec728c5f132534e7a22bf7/code/preprocess_data.py.
|-- pasta_mlm_model.py # the class to define the operation-aware pretraing model
|-- TabFV_model.py # the class to define the table-based fact verification model
|-- run_finetune.py # the class to train the table-based fact verification model
|-- run_pretrain.py # the class to pre-train the pasta model from scratch
Before running the code, please make sure your Python version is above 3.8. Then install the necessary packages by:
pip install -r requirements.txt
Please download the TabFact dataset from the official GitHub repository and put it under the folder PASTA/datasets
.
git clone [email protected]:wenhuchen/Table-Fact-Checking.git
mv Table-Fact-Checking tabfact
mv tabfact PASTA/datasets
Please download the sem-tab-facts dataset from the official website:
Then refer to this repository for standardizing the table header, or you can directly download the dataset we have processed.
Finally, put the processed dataset under the folder PASTA/datasets
, and name it semtabfacts
.
Download the PASTA model and put it under folder pretrained_models/
.
Fine-tune on the Tabfact dataset with the following command.
python src/run_finetune.py src/scripts/train_tabfact.json
Note that you need to modify the paths in the .json
file to your own paths.
Following LKA, we also use the trained model on the TabFact to initialize the training of SEM-TAB-FACTS. Therefore, You need to train on the TabFact dataset to get the checkpoint, or you can directly download the checkpoint we provide and put it under /save_checkpoints
.
Then, fine-tune on the SEM-TAB-FACTS dataset with the following command.
python src/run_finetune.py src/scripts/train_semtabfacts.json
Note that you need to modify the paths in the .json
file to your own paths.
Download the pre-training corpus, which consists mostly of 128K sentence-table cloze samples. Here is an example.
Input: Openration-aware Masked Sentence + Linearized Table
# Mask all of the the tokens corresponding to a span at once.
[MASK] [MASK] [MASK] has the highest attendance of all date [Header] date | visitor | score | home | leading scorer | attendance | record [Row] 1 april 2008 | knicks | 115 - 119 | bucks | quentin richardson (22) | 13579 | 20 - 54 [Row] ……
Output: Answer
8 april 2008
Put the unzip dataset under the folder PASTA/datasets
, and name it pasta
.
Pretrain the PASTA model with the following command.
python src/run_pretrain.py src/scripts/pretrain.json
Note that you need to modify the paths in the .json
file to your own paths.
@article{DBLP:journals/corr/abs-2211-02816,
author = {Zihui Gu and
Ju Fan and
Nan Tang and
Preslav Nakov and
Xiaoman Zhao and
Xiaoyong Du},
title = {{PASTA:} Table-Operations Aware Fact Verification via Sentence-Table
Cloze Pre-training},
journal = {CoRR},
volume = {abs/2211.02816},
year = {2022},
url = {https://doi.org/10.48550/arXiv.2211.02816},
doi = {10.48550/arXiv.2211.02816},
eprinttype = {arXiv},
eprint = {2211.02816},
timestamp = {Wed, 09 Nov 2022 17:33:26 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2211-02816.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}