Co-authored-by: Björn Deiseroth <[email protected]>
Showing 109 changed files with 540,991 additions and 0 deletions.
# Ignore tests tmp folder
.mypy_cache
.pytest_cache
tmp/*
.git
tests/
*experiments*
*tools*
*lightning_logs*
*ckpts*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# IDE
.vscode/
Todo.json

# Cython debug symbols
cython_debug/

# Temporary test files
tests/artifacts/tmp/*
!tests/artifacts/tmp/.gitkeep
debug_logs/
wandb/
*index_cache*

# environment
envs*
.DS_Store
tmp/
checkpoints/
repos:
  - repo: https://github.com/peterdemin/pip-compile-multi
    rev: v2.4.5
    hooks:
      - id: pip-compile-multi-verify
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v3.4.0
    hooks:
      - id: check-json
      # - id: pretty-format-json
      #   args:
      #     - --autofix
      - id: end-of-file-fixer
        exclude: '.bin|.meta.json'
      - id: trailing-whitespace
        exclude: '.bin|.meta.json'
  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
  # - repo: https://github.com/PyCQA/flake8
  #   rev: '6.1.0'
  #   hooks:
  #     - id: flake8
  #       exclude: src/license_plate_annotator/models/
  #       args: ["--extend-ignore=E203,E501,P103,W503"]
  #       additional_dependencies: [
  #         'flake8-blind-except',
  #         'flake8-docstrings',
  #         'flake8-bugbear',
  #         'flake8-comprehensions',
  #         'flake8-docstrings',
  #         'flake8-implicit-str-concat',
  #         'flake8-fastapi',
  #         'pydocstyle>=5.0.0',
  #       ]
  # - repo: https://github.com/kynan/nbstripout
  #   rev: 0.4.0
  #   hooks:
  #     - id: nbstripout
  #       files: ".ipynb"
The following applies to all files in this repository, unless otherwise noted:

Copyright (c) 2024 IPAI Aleph Alpha Research GmbH

Subject to the terms and conditions of this License, the Licensor grants you a non-exclusive, worldwide,
non-transferable, non-sublicensable, and royalty-free limited right to use, copy, modify, distribute, make
otherwise publicly available, and reproduce the Works and Derivative Works under Licensor’s copyright,
for any Non-Commercial and Non-Administrative purpose.
You may not use, copy, modify, distribute, make otherwise publicly available, reproduce, or sublicense the
Works or Derivative Works except as expressly provided under and in accordance with this License.
Your rights granted under this License will automatically terminate if you fail to comply with any of the
terms of this License.

EXCEPT FOR DAMAGES CAUSED BY INTENT OR FRAUDULENTLY CONCEALED
DEFECTS, AND EXCEPT FOR DAMAGES RESULTING FROM BREACH OF ANY
WARRANTY OR GUARANTEE EXPRESSLY GIVEN BY LICENSOR IN THE OPEN ALEPH LICENSE,
IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY
DAMAGES ARISING OUT OF THE OPEN ALEPH LICENSE OR THE USE OF THE WORK. ANY
MANDATORY STATUTORY LIABILITY UNDER APPLICABLE LAW REMAINS
UNAFFECTED.

EXCEPT AS EXPRESSLY STATED IN THIS LICENSE OR REQUIRED BY APPLICABLE
LAW, THE WORKS ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES
OF ANY KIND INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES REGARDING
THE CONTENTS, ACCURACY, OR FITNESS FOR A PARTICULAR PURPOSE.

For additional information on the license terms, see the complete license at
https://github.com/Aleph-Alpha/.github/blob/main/oal.pdf
# T-Free: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory Efficient Embeddings

Paper link: https://arxiv.org/abs/2406.19223
 | ||
 | ||
 | ||
|
||
|
||
## Checkpoints

- 7B trained on 1 epoch fineweb-edu (coming wk 35)
## Install

```console
conda create --name t-free python=3.11 -y
conda activate t-free

conda install pytorch==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -e .[test]

pre-commit install
```
## Running some tests

```console
pip install determined==0.26.4
pytest tests/test_data
pytest tests/test_tokenizer
pytest tests/test_trainer
pytest tests/test_inference
```
## Training

### Data preprocessing

Training requires data to be in one of two formats: MemoryMap or AlignedMemoryMap. MemoryMaps simply store data in byte format with indices for fast random access.
AlignedMemoryMaps build a second index on top that aggregates a minimum byte count per entry. We use this for pre-training to avoid aggregating a dynamic number of random entries to fill the full sequence length, which becomes difficult when ensuring that each entry is seen only once per epoch.
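
The MemoryMap layout described above can be sketched as follows. `ByteMemoryMap` is a hypothetical, in-memory illustration only (the repository's classes persist the bytes to disk and are more involved):

```python
import numpy as np


class ByteMemoryMap:
    """Toy illustration: documents stored back-to-back as raw bytes,
    with a cumulative-offset index for O(1) random access."""

    def __init__(self, docs: list[str]):
        payload = [d.encode("utf-8") for d in docs]
        self.data = b"".join(payload)
        # offsets[i]..offsets[i+1] delimit document i within the byte blob
        self.offsets = np.cumsum([0] + [len(p) for p in payload])

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> str:
        start, end = self.offsets[i], self.offsets[i + 1]
        return self.data[start:end].decode("utf-8")


mm = ByteMemoryMap(["hello world", "t-free"])
assert len(mm) == 2 and mm[1] == "t-free"
```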
An example of how to download and convert e.g. FineWeb from Hugging Face into a MemoryMap can be found in data/download_data.py.
Note that the data is stored in string format (converted to bytes). Trigramification or classical tokenization is executed on demand during training, depending on the respective training configuration.
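
The on-demand trigramification can be sketched as a sliding three-character window over a boundary-padded word; the boundary marker and exact windowing here are illustrative assumptions, not necessarily the repository's exact scheme:

```python
def trigramify(word: str, boundary: str = "_") -> list[str]:
    # Pad the word with boundary markers so edge characters also
    # appear in three full trigrams, then slide a window of size 3.
    padded = boundary + word + boundary
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


assert trigramify("cat") == ["_ca", "cat", "at_"]
```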
To further convert the data into an AlignedMemoryMap, execute e.g. data/convert_mmap_to_alignedmmap.py.
Right now max_bytes is set to 20k, which we found to correlate well with roughly 4k trigram words -- our configured pre-training sequence length.
Note: if an entry exceeds the sequence length, the overhead is simply cut off. If the sequence length is not reached with the indexed entries, the sequence is filled with a further random item.
||
Right now typical instruction finetunings jsonl's need to casted to MemoryMaps as in data/convert_instruct_jsonl_to_mmap.py. | ||
|
||
Data in MemoryMap format requires config.data.pretraining = False. | ||
AlignedMemoryMaps require config.data.pretraining = True. | ||
|
||
### Launch training

We include example determined configs in determined_configs/ and example training model configs in configs/.
You may launch a training through one of the determined experiments, or adapt the torch distributed launcher command in the following line to your environment:

```
python3 -m determined.launch.torch_distributed python3 src/trigram_tokenizer/trainer/train_determined.py --config configs/7b_fineweb.yaml
```
## Inference

Run one of the prepared inference* scripts.

Note that inference decoding works differently from that of other LLMs.
Currently we build a dictionary of word patterns and compute its product with the model's output logits to select e.g. the next top word. This is shown in Fig. 4 below.

As such, you need to (or may want to):
- InferencePipe.__init__: pass a path to a collections.Counter pickle file through the "top_word_dict" parameter -- these words will be converted to the patterns of the respective loaded checkpoint config. We counted the entire fineweb-edu dataset once; a top-k subset is found in data/en_fineweb_top_1m_counter.pckl.
- InferencePipe.__init__: pass an integer through the "reduce_tokenizer_words_to" parameter -- this cuts the above collection to the top n words.
- InferencePipe.generate: pass a "more_words" string -- a string of whitespace-separated words that will dynamically be added to the dictionary for on-demand sampling.
- InferencePipe.tokenizer.convert_weight_for_word_edge_overweight: down-weights edge trigrams as discussed in the paper, to reduce artifacts (usually a factor < .8).
- Call data/generate_tokenizer.py once with a trained checkpoint. This preprocesses the patterns for the passed dictionary once and stores them next to the checkpoint. Subsequent calls of InferencePipe.__init__ then do not require passing the dict but directly load the stored patterns.
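
The pattern-times-logits decoding idea can be sketched as follows; the function names, the crc32 hashing, and the count normalization are illustrative assumptions, not the repository's actual API:

```python
from zlib import crc32

import numpy as np


def word_pattern(word: str, vocab_size: int) -> np.ndarray:
    """Binary activation pattern of a word: each boundary-padded
    trigram is hashed into one of `vocab_size` logit slots."""
    padded = f"_{word}_"
    pattern = np.zeros(vocab_size)
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        pattern[crc32(trigram.encode("utf-8")) % vocab_size] = 1.0
    return pattern


def top_word(logits: np.ndarray, candidates: list[str]) -> str:
    """Score each candidate as the dot product of the model's output
    logits with its pattern, normalized by the active-slot count."""
    patterns = np.stack([word_pattern(w, logits.shape[0]) for w in candidates])
    scores = patterns @ logits / patterns.sum(axis=1)
    return candidates[int(np.argmax(scores))]
```

Dynamically adding words (cf. the "more_words" parameter above) then amounts to computing patterns for the new words and appending them to the candidate dictionary.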
 | ||
|
||
## Known Issues/ Ongoing Research | ||
|
||
N/ A | ||
|
||
|
||
|
||
## Cite

```
@article{deiseroth2024t,
  title={T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings},
  author={Deiseroth, Bj{\"o}rn and Brack, Manuel and Schramowski, Patrick and Kersting, Kristian and Weinbach, Samuel},
  journal={arXiv preprint arXiv:2406.19223},
  year={2024}
}
```