Bring cutting-edge representation and transfer learning models to conversational AI systems
This Python package was developed in the Multi2ConvAI project. The goal of Multi2ConvAI was to examine methods for transferring conversational AI models across domains and languages, even with a limited number of dialogues or, in extreme cases, no dialogues at all from the target domain or target language. With this package we share the components needed to run the intent-classification models that were developed over the course of the project.
Multi2ConvAI was a collaboration between the NLP group of the University of Mannheim, Neohelden and inovex. The project was part of the "KI-Innovationswettbewerb" (an AI innovation challenge) funded by the state of Baden-Württemberg.
Contact: [email protected].
We developed a set of models for several use cases over the course of the project. Our use cases are intent-classification tasks in different domains and languages. The following table gives an overview of the domains and languages covered in the project:
Corona | Logistics | Quality |
---|---|---|
German (de) | German (de) | German (de) |
English (en) | English (en) | English (en) |
French (fr) | Croatian (hr) | French (fr) |
Italian (it) | Polish (pl) | Italian (it) |
| | Turkish (tr) | |
Please check this blogpost for more details about the use cases: en, de
All our models are available on the huggingface model hub: https://huggingface.co/inovex. Search for models following the pattern `multi2convai-xxx`. Our models can be subdivided into three categories:
- logistic regression using static fasttext word embeddings
  - schema: `multi2convai-<domain>-<language>-logreg-ft`
- logistic regression using contextual word embeddings
  - schema: `multi2convai-<domain>-<language>-logreg-<embedding, e.g. bert or xlmr>`
- finetuned transformers
  - schema: `multi2convai-<domain>-<language>-<transformer name, e.g. bert>`
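The naming schemas above can be composed programmatically. The following is a minimal sketch; `build_model_name` is a hypothetical helper for illustration and not part of the `multi2convai` package.

```python
def build_model_name(domain: str, language: str, architecture: str, logreg: bool = False) -> str:
    """Compose a model name following the schemas listed above.

    architecture: "ft" for fasttext, or an embedding/transformer name such as "bert" or "xlmr".
    logreg: True for the logistic regression variants, False for finetuned transformers.
    """
    if architecture == "ft" and not logreg:
        # The schemas only list fasttext embeddings together with logistic regression.
        raise ValueError("fasttext embeddings are only used by logreg models")
    parts = ["multi2convai", domain, language]
    if logreg:
        parts.append("logreg")
    parts.append(architecture)
    return "-".join(parts)


# Names on the hub live under the "inovex" organisation.
print("inovex/" + build_model_name("corona", "de", "ft", logreg=True))
print("inovex/" + build_model_name("logistics", "en", "bert"))
```

Not every combination of domain, language and architecture necessarily exists on the hub, so a composed name should be checked against https://huggingface.co/inovex before use.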
In order to set up the necessary environment:
- Create an environment `multi2convai` with the help of conda: `conda env create -f environment.yml`
- Activate the new environment with: `conda activate multi2convai`

NOTE: The conda environment will have multi2convai installed in editable mode. Some changes, e.g. in `setup.cfg`, might require you to run `pip install -e .` again.
Optional and needed only once after `git clone`:
- Install several pre-commit git hooks with `pre-commit install` (you might also want to run `pre-commit autoupdate`) and check out the configuration under `.pre-commit-config.yaml`. The `-n, --no-verify` flag of `git commit` can be used to deactivate pre-commit hooks temporarily.
- Install nbstripout git hooks to remove the output cells of committed notebooks with `nbstripout --install --attributes notebooks/.gitattributes`. This is useful to avoid large diffs due to plots in your notebooks. A simple `nbstripout --uninstall` will revert these changes.
Before running our models you'll need to download the required files. Which files you need depends on the model type:
- Download model repo from huggingface (all model types)
- Download and serialize fasttext embeddings (only `xxx-logreg-ft` models)
- Download pretrained language models (only `xxx-logreg-<transformer, e.g. bert or xlmr>` models)
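As a rough illustration of the rule above, the required downloads can be derived from the model name alone. This is a sketch under the naming schemas described earlier; `required_downloads` is a hypothetical helper, not a function of the `multi2convai` package.

```python
from typing import List


def required_downloads(model_name: str) -> List[str]:
    """Return the download steps needed for a model, based on its name."""
    steps = ["model repo"]  # needed for all model types
    if model_name.endswith("-logreg-ft"):
        # logistic regression on static fasttext embeddings
        steps.append("fasttext embeddings (download and serialize)")
    elif "-logreg-" in model_name:
        # logistic regression on contextual embeddings, e.g. bert or xlmr
        steps.append("pretrained language model")
    # finetuned transformers need nothing beyond the model repo
    return steps


for name in ("multi2convai-corona-de-logreg-ft", "multi2convai-logistics-en-bert"):
    print(name, "->", required_downloads(name))
```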
# requires git-lfs installed
# see models/README.md for more details
cd models/corona
git clone https://huggingface.co/inovex/multi2convai-corona-de-logreg-ft
ls multi2convai-corona-de-logreg-ft
>>> README.md label_dict.json model.pth
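A cloned model repo can be sanity-checked before use. The sketch below assumes the three files from the listing above are the expected contents of a `logreg-ft` repo; other model types may ship different files. It also detects the case where git-lfs was not installed during cloning, in which case large files are small text pointers that begin with the git-lfs spec line. Both helper names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

# Assumption: file names taken from the `ls` output above.
EXPECTED_FILES = ("README.md", "label_dict.json", "model.pth")
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def missing_files(model_dir: str, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected files that are not present in model_dir."""
    root = Path(model_dir)
    return sorted(name for name in expected if not (root / name).is_file())


def is_lfs_pointer(path: str) -> bool:
    """True if the file is a git-lfs pointer rather than the real payload."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

For example, `missing_files("models/corona/multi2convai-corona-de-logreg-ft")` should return an empty list after a successful clone, and `is_lfs_pointer(".../model.pth")` returning `True` suggests rerunning the clone with git-lfs installed.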
Only required for `multi2convai-<domain>-<language>-logreg-ft` models
# see models/embeddings/README.md for more details
# all commands are run from the repository root
# 1. Download fasttext embeddings
mkdir -p models/embeddings/fasttext/en
curl https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec --output models/embeddings/fasttext/en/wiki.en.vec
ls models/embeddings/fasttext/en
>>> wiki.en.vec
# 2. Serialize fasttext embeddings
python scripts/serialize_fasttext.py --raw-path models/embeddings/fasttext/en/wiki.en.vec --vocab-path models/embeddings/fasttext/en/wiki.200k.en.vocab --embeddings-path models/embeddings/fasttext/en/wiki.200k.en.embed -n 200000
ls models/embeddings/fasttext/en
>>> wiki.200k.en.embed wiki.200k.en.vocab wiki.en.vec
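To give an idea of what the serialization step works with: a fasttext `.vec` file is plain text, with a header line holding the vocabulary size and vector dimension, followed by one word and its vector per line. The sketch below reads the first `n` entries of such a file. It is not the repository's `serialize_fasttext.py`; the function name is hypothetical, the on-disk format of the `.vocab` and `.embed` files is not covered here, and the reading of `-n 200000` as "keep the first 200,000 words" is an assumption.

```python
from typing import Iterable, List, Tuple


def read_top_n_vectors(lines: Iterable[str], n: int) -> Tuple[List[str], List[List[float]]]:
    """Read the first n word vectors from the lines of a fasttext .vec file."""
    iterator = iter(lines)
    # Header line: "<number of words> <vector dimension>"
    _count, dim = (int(value) for value in next(iterator).split())
    vocab: List[str] = []
    vectors: List[List[float]] = []
    for line in iterator:
        if len(vocab) >= n:
            break
        # Split from the right so that a token containing spaces stays intact.
        parts = line.rstrip().rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip malformed rows
        vocab.append(parts[0])
        vectors.append([float(value) for value in parts[1:]])
    return vocab, vectors


sample = ["3 2\n", "the 0.1 0.2 \n", "of 0.3 0.4\n", "and 0.5 0.6\n"]
vocab, vectors = read_top_n_vectors(sample, n=2)
print(vocab, vectors)
```

With a real file, the same function can be fed an open file handle, e.g. `with open(path, encoding="utf-8") as handle: read_top_n_vectors(handle, 200000)`.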
Only required for `multi2convai-<domain>-<language>-logreg-<transformer, e.g. bert or xlmr>` models
# see models/embeddings/README.md for more details
from transformers import AutoTokenizer, AutoModelForMaskedLM
import os
# download tokenizer and language model from the huggingface model hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-dbmdz-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-dbmdz-uncased")
# store both locally (paths relative to the repository root)
tokenizer.save_pretrained("models/embeddings/transformers/bert-base-german-dbmdz-uncased")
model.save_pretrained("models/embeddings/transformers/bert-base-german-dbmdz-uncased")
# the exact file list may vary with the installed transformers version
os.listdir("models/embeddings/transformers/bert-base-german-dbmdz-uncased")
>>> ["config.json", "pytorch_model.bin", "special_tokens_map.json", "tokenizer_config.json", "vocab.txt"]
python scripts/run_inference.py -m multi2convai-corona-de-logreg-ft
Only works for `multi2convai-<domain>-<language>-<transformer, e.g. bert>` models (no `logreg` in name)
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# Loads from locally available files
tokenizer = AutoTokenizer.from_pretrained("models/logistics/multi2convai-logistics-en-bert")
model = AutoModelForSequenceClassification.from_pretrained("models/logistics/multi2convai-logistics-en-bert")
# Alternative: Loads directly from huggingface model hub
# tokenizer = AutoTokenizer.from_pretrained("inovex/multi2convai-logistics-en-bert")
# model = AutoModelForSequenceClassification.from_pretrained("inovex/multi2convai-logistics-en-bert")
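Once tokenizer and model are loaded, a prediction boils down to turning the model's logits into a label and a probability. The sketch below shows only that post-processing step in plain Python. `top_intent` is a hypothetical helper, the example utterance is made up, and whether `model.config.id2label` carries meaningful intent names for a given model is an assumption to verify.

```python
import math
from typing import Dict, Sequence, Tuple


def softmax(logits: Sequence[float]) -> list:
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_intent(logits: Sequence[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# With the tokenizer and model loaded as above (requires torch), the logits
# could be obtained with standard transformers usage:
#   inputs = tokenizer("Where is my parcel?", return_tensors="pt")
#   logits = model(**inputs).logits[0].tolist()
#   label, probability = top_intent(logits, model.config.id2label)
```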
We're still migrating our codebase to this GitHub repo. The following steps are completed:
- Upload all models to huggingface model hub (https://huggingface.co/inovex)
- Migrate functionality to load and run logistic regression models with fasttext embeddings with `multi2convai`
- Migrate functionality to load and run logistic regression models with contextual embeddings with `multi2convai`
- Migrate functionality to load and run transformers with `multi2convai`
- Publish documentation
├── AUTHORS.md              <- List of developers and maintainers.
├── CHANGELOG.md            <- Changelog to keep track of new features and fixes.
├── CONTRIBUTING.md         <- Guidelines for contributing to this project.
├── LICENSE.txt             <- License as chosen on the command-line.
├── README.md               <- The top-level README for developers.
├── docs                    <- Directory for Sphinx documentation in rst or md.
├── environment.yml         <- The conda environment file for reproducibility.
├── models                  <- Directory to which you can download models shared
│                              on the huggingface model hub.
├── notebooks               <- Jupyter notebooks.
├── pyproject.toml          <- Build configuration. Don't change! Use `pip install -e .`
│                              to install for development or to build `tox -e build`.
├── scripts                 <- Scripts to e.g. serialize fasttext embeddings or run models.
├── setup.cfg               <- Declarative configuration of your project.
├── setup.py                <- [DEPRECATED] Use `python setup.py develop` to install for
│                              development or `python setup.py bdist_wheel` to build.
├── src
│   └── multi2convai        <- Actual Python package where the main functionality goes.
├── tests                   <- Unit tests which can be run with `pytest`.
├── .coveragerc             <- Configuration for coverage reports of unit tests.
├── .isort.cfg              <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
This project has been set up using PyScaffold 4.1.2 and the dsproject extension 0.7.1.