French grammatical parser

Course assignment (MVA Course "Algorithms for speech and language processing", 2020, Dupoux, Zeghidour & Sagot)

The goal of this project was to develop a basic probabilistic parser for French, based on the CYK algorithm and the PCFG model and that is robust to unknown words.

Parser can be trained on annotated dataset sequoia corpus.

The proposed OOV module uses French word embeddings, which can be given by the Polyglot embedding lexicon for French.

Run the parser (file: `run.sh`):

In a nutshell, arguments allows to:

Build grammar from file (path --train_path): ex: sequoia-corpus+fct.mrg_strict
Use only a fraction of training file going from (--train_strat x nb_sentences) to (--train_end x nb_sentences): ex: -train_start=0, -train_end=0.8
Build oov module from embeddings stored in file given by file path --embedding_path: ex: polyglot-fr.pkl
Apply cyk with chosen -beam search to make predictions: ex: --beam=25
Make predictions from sentences in file (path ) --test_path): ex: --test_path=sequoia-corpus+fct.mrg_strict
Use only a fraction of test file going from (--test_strat x nb_sentences) to (--test_end x nb_sentences): ex: --test_start=0.9, --test_end=1.
Chose to only make predictions (in case you only have raw data) or evaluate the algorithm (in that case, input file must contain parsed data): --mode = 'prediction' or 'evaluation'
Store predictions in file (path outputted_path): ex: --output_path=evaluation_data.parser_output

Details

optional arguments:

  -h, --help            show this help message and exit
  
  --train_path TRAIN_PATH, -t TRAIN_PATH
                        set path where to find training data
                        
  --train_start TRAIN_START
                        start data fraction for train data (default 0)
                        
  --train_end TRAIN_END
                        end data fraction for train data (default 1)
                        
  --test_path TEST_PATH, -T TEST_PATH
                        set path where to find test data
                        
  --test_start TEST_START
                        start data fraction for test data (default 0)
                        
  --test_end TEST_END   end data fraction for test data (default 1)
  
  --embedding_path EMBEDDING_PATH, -e EMBEDDING_PATH
                        set path where to find french word embeddings
                        
  --mode MODE, -m MODE  set mode: - 'prediction' / 'e': predict only -
                        'evaluation' / 'e': predict and evaluate predictions
                        
  --output_path OUTPUT_PATH, -o OUTPUT_PATH
                        set path where to write predictions (if None nothing
                        will be written)
                        
  --beam BEAM, -b BEAM  set beam search size for cyk algorithm (default 10)

Examples

Example of configurations

chmod +x run.sh

# make predictions and parsing evaluations from parsed sentences in the last 10% of 'sequoia-corpus+fct.mrg_strict'
./run.sh  --train_path='sequoia-corpus+fct.mrg_strict' --test_path='sequoia-corpus+fct.mrg_strict' --mode='evaluation' --train_end=0.8 --test_start=0.9 --test_end=1 --beam=25 -e='polyglot-fr.pkl' --output_path='evaluation_data.parser_output'

# make predictions from raw sentences in 'test_file.txt'
./run.sh -t='sequoia-corpus+fct.mrg_strict' -T='test_file.txt' -m='prediction' --train_end=0.8 --test_start=0 -b=100 -e='polyglot-fr.pkl' -o='test_output.txt'

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.idea		.idea
image		image
utils		utils
.gitignore		.gitignore
README.md		README.md
evaluation_data.parser_output		evaluation_data.parser_output
main.ipynb		main.ipynb
main.py		main.py
run.sh		run.sh
test_file.txt		test_file.txt
test_output.txt		test_output.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

French grammatical parser

Course assignment (MVA Course "Algorithms for speech and language processing", 2020, Dupoux, Zeghidour & Sagot)

Run the parser (file: `run.sh`):

Details

Examples

Example of configurations

Example of prediction:

Indication on execution time for test sentences taken from the sequoia corpus:

About

Releases

Packages

Languages

AntGro/Grammatical-Parser

Folders and files

Latest commit

History

Repository files navigation

French grammatical parser

Course assignment (MVA Course "Algorithms for speech and language processing", 2020, Dupoux, Zeghidour & Sagot)

Run the parser (file: run.sh):

Details

Examples

Example of configurations

Example of prediction:

Indication on execution time for test sentences taken from the sequoia corpus:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Run the parser (file: `run.sh`):

Packages