
Commit

initial commit
Co-authored-by: Björn Deiseroth <[email protected]>
sweinbach and bashFish committed Aug 7, 2024
0 parents commit 2114f45
Showing 109 changed files with 540,991 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .detignore
@@ -0,0 +1,8 @@
# Ignore tests tmp folder
.mypy_cache
.pytest_cache
tmp/*
.git
tests/
*experiments*
*tools*
157 changes: 157 additions & 0 deletions .gitignore
@@ -0,0 +1,157 @@
*lightning_logs*
*ckpts*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# IDE
.vscode/
Todo.json

# Cython debug symbols
cython_debug/

# Temporary test files
tests/artifacts/tmp/*
!tests/artifacts/tmp/.gitkeep
debug_logs/
wandb/
*index_cache*

#environment
envs*
.DS_Store
tmp/
checkpoints/
Empty file added .gitmodules
Empty file.
41 changes: 41 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,41 @@
repos:
- repo: https://github.com/peterdemin/pip-compile-multi
rev: v2.4.5
hooks:
- id: pip-compile-multi-verify
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.4.0
hooks:
- id: check-json
# - id: pretty-format-json
# args:
# - --autofix
- id: end-of-file-fixer
exclude: '.bin|.meta.json'
- id: trailing-whitespace
exclude: '.bin|.meta.json'
- repo: https://github.com/psf/black
rev: 23.7.0
hooks:
- id: black
# - repo: https://github.com/PyCQA/flake8
# rev: '6.1.0'
# hooks:
# - id: flake8
# exclude: src/license_plate_annotator/models/
# args: ["--extend-ignore=E203,E501,P103,W503"]
# additional_dependencies: [
# 'flake8-blind-except',
# 'flake8-docstrings',
# 'flake8-bugbear',
# 'flake8-comprehensions',
# 'flake8-docstrings',
# 'flake8-implicit-str-concat',
# 'flake8-fastapi',
# 'pydocstyle>=5.0.0',
# ]
# - repo: https://github.com/kynan/nbstripout
# rev: 0.4.0
# hooks:
# - id: nbstripout
# files: ".ipynb"
28 changes: 28 additions & 0 deletions LICENSE
@@ -0,0 +1,28 @@
The following applies to all files in this repository, unless otherwise noted:

Copyright (c) 2024 IPAI Aleph Alpha Research GmbH

Subject to the terms and conditions of this License, the Licensor grants you a non-exclusive, worldwide,
non-transferable, non-sublicensable, and royalty-free limited right to use, copy, modify, distribute, make
otherwise publicly available, and reproduce the Works and Derivative Works under Licensor’s copyright,
for any Non-Commercial and Non-Administrative purpose.
You may not use, copy, modify, distribute, make otherwise publicly available, reproduce, or sublicense the
Works or Derivative Works except as expressly provided under and in accordance with this License.
Your rights granted under this License will automatically terminate if you fail to comply with any of the
terms of this License.

EXCEPT FOR DAMAGES CAUSED BY INTENT OR FRAUDULENTLY CONCEALED
DEFECTS, AND EXCEPT FOR DAMAGES RESULTING FROM BREACH OF ANY
WARRANTY OR GUARANTEE EXPRESSLY GIVEN BY LICENSOR IN THE OPEN ALEPH LICENSE,
IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY
DAMAGES ARISING OUT OF THE OPEN ALEPH LICENSE OR THE USE OF THE WORK. ANY
MANDATORY STATUTORY LIABILITY UNDER APPLICABLE LAW REMAINS
UNAFFECTED.

EXCEPT AS EXPRESSLY STATED IN THIS LICENSE OR REQUIRED BY APPLICABLE
LAW, THE WORKS ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES
OF ANY KIND INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES REGARDING
THE CONTENTS, ACCURACY, OR FITNESS FOR A PARTICULAR PURPOSE.

For additional information on the license terms, see the complete license at
https://github.com/Aleph-Alpha/.github/blob/main/oal.pdf
94 changes: 94 additions & 0 deletions README.md
@@ -0,0 +1,94 @@
# T-Free: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Paper link: https://arxiv.org/abs/2406.19223

![Figure 1: Method comparison classical Tokenizer to T-Free](figures/fig_1.png)
![Table 1: Near-Duplicate and fertility metrics for several models](figures/tab_1.png)
![Figure 3: Continued transfer learning evaluations](figures/fig_3.png)


## Checkpoints
- 7B model trained for 1 epoch on fineweb-edu (coming week 35)



## Install

```console
conda create --name t-free python=3.11 -y
conda activate t-free

conda install pytorch==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -e .[test]

pre-commit install
```

## Running some tests
```
pip install determined==0.26.4
pytest tests/test_data
pytest tests/test_tokenizer
pytest tests/test_trainer
pytest tests/test_inference
```

## Training

### Data preprocessing
Training requires data in one of two formats: MemoryMap or AlignedMemoryMap. A MemoryMap simply stores data in byte format with indices for fast random access.
An AlignedMemoryMap builds a second index on top that aggregates a minimum byte count per entry. We use this for pre-training to avoid aggregating a dynamic number of random entries to fill the full sequence length, which becomes hard when trying to ensure that each entry is seen only once per epoch.

An example of how to download and convert e.g. FineWeb from HuggingFace into a MemoryMap can be found in data/download_data.py.
Note that the data is stored as strings (converted to bytes); trigramification or classical tokenization is executed on demand during training, depending on the respective training configuration.
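
The exact on-disk layout is defined by the repository's MemoryMap implementation; below is only a minimal sketch of the idea (concatenated byte payloads plus an offsets index for fast random access). Class and file names are illustrative, not the repo's API.

```python
import numpy as np

class ToyMemoryMap:
    """Sketch of a byte-level memory map: documents stored as concatenated
    UTF-8 bytes, with an offsets index that delimits each entry."""

    def __init__(self, data_path: str, index_path: str):
        # raw bytes of all documents, concatenated back to back
        self.data = np.memmap(data_path, dtype=np.uint8, mode="r")
        # offsets[i] .. offsets[i + 1] delimit document i
        self.offsets = np.load(index_path)

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> str:
        start, end = int(self.offsets[i]), int(self.offsets[i + 1])
        # text stays a plain string here; trigramification happens later, on demand
        return self.data[start:end].tobytes().decode("utf-8")
```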

To further convert the data into an AlignedMemoryMap, execute e.g. data/convert_mmap_to_alignedmmap.py.
Currently max_bytes is set to 20k, which we found to correspond well to roughly 4k trigram words -- our configured pre-training sequence length.
Note: if the sequence length is exceeded, the overhead is simply cut off; if it is not reached with the indexed entries, the sequence is filled with a random additional item.
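
Conceptually, the aligned index just groups consecutive entries until the byte budget is reached. A rough sketch of that grouping (hypothetical helper, assuming an offsets array as in the sketch above):

```python
def build_aligned_index(offsets, max_bytes: int = 20_000):
    """Sketch: group consecutive documents until each group holds >= max_bytes.

    Returns (start_doc, end_doc) pairs; the trainer can read one group per sample
    and cut it to the configured sequence length, or top it up if it falls short.
    """
    groups, start, acc = [], 0, 0
    for i in range(len(offsets) - 1):
        acc += offsets[i + 1] - offsets[i]  # bytes of document i
        if acc >= max_bytes:
            groups.append((start, i + 1))
            start, acc = i + 1, 0
    if acc > 0:
        # trailing partial group; at train time it may be filled with a random extra item
        groups.append((start, len(offsets) - 1))
    return groups
```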

Right now, typical instruction-finetuning jsonl files need to be cast to MemoryMaps as in data/convert_instruct_jsonl_to_mmap.py.

Data in MemoryMap format requires config.data.pretraining = False.
AlignedMemoryMaps require config.data.pretraining = True.

### Launch training
We include example Determined configs in determined_configs/ and example training model configs in configs/.
You may launch a training run through one of the Determined experiments, or adapt the torch distributed launcher command in the following line to your environment:
```
python3 -m determined.launch.torch_distributed python3 src/trigram_tokenizer/trainer/train_determined.py --config configs/7b_fineweb.yaml
```


## Inference
Run one of the prepared inference* scripts.

Note that inference decoding works differently from that of other LLMs.
Currently we build a dictionary of word patterns and compute its product with the model's output logits to select, e.g., the next top word. This is shown in Fig. 4 below.
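
A minimal sketch of that scoring step (names and shapes are assumptions for illustration, not the InferencePipe API): each dictionary word is represented by a sparse pattern over the trigram vocabulary, and the next word is the one whose pattern scores highest against the logits.

```python
import numpy as np

def pick_next_word(logits, word_patterns, words):
    """Sketch: score each dictionary word by the dot product of its trigram
    pattern with the model's output logits; return the best-scoring word.

    logits:        (vocab_dim,) output activations over trigram slots
    word_patterns: (num_words, vocab_dim) sparse 0/1 pattern matrix; edge trigrams
                   may be downweighted (factor < 0.8) when this matrix is built
    words:         list of num_words surface forms
    """
    scores = word_patterns @ logits  # (num_words,)
    # normalize by pattern size so longer words are not trivially favored (assumption)
    scores = scores / np.maximum(word_patterns.sum(axis=1), 1)
    return words[int(np.argmax(scores))]
```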

As such, you need or may want to do the following (see the usage sketch after this list):
- InferencePipe.__init__: pass a path to a collections.Counter pickle file through the "top_word_dict" parameter - its words will be converted to the patterns of the respective loaded checkpoint config. We counted the entire fineweb-edu dataset once; a top-k subset is found in data/en_fineweb_top_1m_counter.pckl.
- InferencePipe.__init__: pass an integer through the "reduce_tokenizer_words_to" parameter - this cuts the above collection down to its top-n words.
- InferencePipe.generate: pass a "more_words" string - a string of words (separated by whitespace) that will dynamically be added to the dictionary for on-demand sampling.
- InferencePipe.tokenizer.convert_weight_for_word_edge_overweight: downweights edge trigrams as discussed in the paper, to reduce artifacts (usually a factor < 0.8).
- call data/generate_tokenizer.py once with a trained checkpoint. This preprocesses the patterns for the passed dictionary once and stores them next to the checkpoint. Subsequent calls to InferencePipe.__init__ then do not need the dict and directly load the stored patterns.
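
Putting the above together, an illustrative call might look like the following; the import path, checkpoint argument, and prompt handling are assumptions -- refer to the prepared inference* scripts for the actual entry points.

```python
from trigram_tokenizer.inference import InferencePipe  # assumed import path

pipe = InferencePipe(
    "checkpoints/7b_fineweb",                             # hypothetical checkpoint path
    top_word_dict="data/en_fineweb_top_1m_counter.pckl",  # word counts -> patterns
    reduce_tokenizer_words_to=100_000,                    # keep only the top-n words
)
# "more_words" dynamically extends the dictionary for on-demand sampling
print(pipe.generate("The capital of France is", more_words="Paris Lyon Marseille"))
```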

![Figure 4: T-Free vs Classical Decode](figures/fig_4.png)

## Known Issues / Ongoing Research

N/A



## Cite

```
@article{deiseroth2024t,
title={T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings},
author={Deiseroth, Bj{\"o}rn and Brack, Manuel and Schramowski, Patrick and Kersting, Kristian and Weinbach, Samuel},
journal={arXiv preprint arXiv:2406.19223},
year={2024}
}
```