
Commit

initial commit
Co-authored-by: Björn Deiseroth <[email protected]>
sweinbach and bashFish committed Aug 7, 2024
0 parents commit 2114f45
Showing 109 changed files with 540,991 additions and 0 deletions.
8 changes: 8 additions & 0 deletions .detignore
@@ -0,0 +1,8 @@
# Ignore tests tmp folder
.mypy_cache
.pytest_cache
tmp/*
.git
tests/
*experiments*
*tools*
157 changes: 157 additions & 0 deletions .gitignore
@@ -0,0 +1,157 @@
*lightning_logs*
*ckpts*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# IDE
.vscode/
Todo.json

# Cython debug symbols
cython_debug/

# Temporary test files
tests/artifacts/tmp/*
!tests/artifacts/tmp/.gitkeep
debug_logs/
wandb/
*index_cache*

#environment
envs*
.DS_Store
tmp/
checkpoints/
Empty file added .gitmodules
Empty file.
41 changes: 41 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,41 @@
repos:
- repo: https://github.com/peterdemin/pip-compile-multi
rev: v2.4.5
hooks:
- id: pip-compile-multi-verify
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.4.0
hooks:
- id: check-json
# - id: pretty-format-json
# args:
# - --autofix
- id: end-of-file-fixer
exclude: '.bin|.meta.json'
- id: trailing-whitespace
exclude: '.bin|.meta.json'
- repo: https://github.com/psf/black
rev: 23.7.0
hooks:
- id: black
# - repo: https://github.com/PyCQA/flake8
# rev: '6.1.0'
# hooks:
# - id: flake8
# exclude: src/license_plate_annotator/models/
# args: ["--extend-ignore=E203,E501,P103,W503"]
# additional_dependencies: [
# 'flake8-blind-except',
# 'flake8-docstrings',
# 'flake8-bugbear',
# 'flake8-comprehensions',
# 'flake8-docstrings',
# 'flake8-implicit-str-concat',
# 'flake8-fastapi',
# 'pydocstyle>=5.0.0',
# ]
# - repo: https://github.com/kynan/nbstripout
# rev: 0.4.0
# hooks:
# - id: nbstripout
# files: ".ipynb"
28 changes: 28 additions & 0 deletions LICENSE
@@ -0,0 +1,28 @@
The following applies to all files in this repository, unless otherwise noted:

Copyright (c) 2024 IPAI Aleph Alpha Research GmbH

Subject to the terms and conditions of this License, the Licensor grants you a non-exclusive, worldwide,
non-transferable, non-sublicensable, and royalty-free limited right to use, copy, modify, distribute, make
otherwise publicly available, and reproduce the Works and Derivative Works under Licensor’s copyright,
for any Non-Commercial and Non-Administrative purpose.
You may not use, copy, modify, distribute, make otherwise publicly available, reproduce, or sublicense the
Works or Derivative Works except as expressly provided under and in accordance with this License.
Your rights granted under this License will automatically terminate if you fail to comply with any of the
terms of this License.

EXCEPT FOR DAMAGES CAUSED BY INTENT OR FRAUDULENTLY CONCEALED
DEFECTS, AND EXCEPT FOR DAMAGES RESULTING FROM BREACH OF ANY
WARRANTY OR GUARANTEE EXPRESSLY GIVEN BY LICENSOR IN THE OPEN ALEPH LICENSE,
IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY
DAMAGES ARISING OUT OF THE OPEN ALEPH LICENSE OR THE USE OF THE WORK. ANY
MANDATORY STATUTORY LIABILITY UNDER APPLICABLE LAW REMAINS
UNAFFECTED.

EXCEPT AS EXPRESSLY STATED IN THIS LICENSE OR REQUIRED BY APPLICABLE
LAW, THE WORKS ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES
OF ANY KIND INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES REGARDING
THE CONTENTS, ACCURACY, OR FITNESS FOR A PARTICULAR PURPOSE.

For additional information on the license terms, see the complete license at
https://github.com/Aleph-Alpha/.github/blob/main/oal.pdf
94 changes: 94 additions & 0 deletions README.md
@@ -0,0 +1,94 @@
# T-Free: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Paper link: https://arxiv.org/abs/2406.19223

![Figure 1: Method comparison classical Tokenizer to T-Free](figures/fig_1.png)
![Table 1: Near-Duplicate and fertility metrics for several models](figures/tab_1.png)
![Figure 3: Continued transfer learning evaluations](figures/fig_3.png)


## Checkpoints
- 7B model trained for 1 epoch on fineweb-edu (coming week 35)



## Install

```console
conda create --name t-free python=3.11 -y
conda activate t-free

conda install pytorch==2.1.2 pytorch-cuda=12.1 -c pytorch -c nvidia

pip install -e .[test]

pre-commit install
```

## Running some tests
```
pip install determined==0.26.4
pytest tests/test_data
pytest tests/test_tokenizer
pytest tests/test_trainer
pytest tests/test_inference
```

## Training

### Data preprocessing
Training requires data in one of two formats: MemoryMap or AlignedMemoryMap. A MemoryMap simply stores data in byte format with indices for fast random access.
An AlignedMemoryMap builds a second index on top that aggregates a minimum byte count per entry. We use this for pre-training to avoid aggregating a dynamic number of random entries to fill the full sequence length, which becomes hard when trying to ensure that each entry is seen only once per epoch.

An example of how to download and convert e.g. FineWeb from HuggingFace into a MemoryMap can be found in data/download_data.py.
Note that the data is stored as strings (converted to bytes); trigramification or classical tokenization is executed on demand during training, depending on the respective training configuration.
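
The exact on-disk layout is defined by the repository's MemoryMap implementation; below is only a minimal sketch of the idea (concatenated byte payloads plus an offsets index for fast random access). Class and file names are illustrative, not the repo's API.

```python
import numpy as np

class ToyMemoryMap:
    """Sketch of a byte-level memory map: documents stored as concatenated
    UTF-8 bytes, with an offsets index that delimits each entry."""

    def __init__(self, data_path: str, index_path: str):
        # raw bytes of all documents, concatenated back to back
        self.data = np.memmap(data_path, dtype=np.uint8, mode="r")
        # offsets[i] .. offsets[i + 1] delimit document i
        self.offsets = np.load(index_path)

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> str:
        start, end = int(self.offsets[i]), int(self.offsets[i + 1])
        # text stays a plain string here; trigramification happens later, on demand
        return self.data[start:end].tobytes().decode("utf-8")
```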

To further convert the data into an AlignedMemoryMap, execute e.g. data/convert_mmap_to_alignedmmap.py.
Currently max_bytes is set to 20k, which we found to correspond well to roughly 4k trigram words -- our configured pre-training sequence length.
Note: if the sequence length is exceeded, the overhead is simply cut off; if it is not reached with the indexed entries, the sequence is filled with a random additional item.
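
Conceptually, the aligned index just groups consecutive entries until the byte budget is reached. A rough sketch of that grouping (hypothetical helper, assuming an offsets array as in the sketch above):

```python
def build_aligned_index(offsets, max_bytes: int = 20_000):
    """Sketch: group consecutive documents until each group holds >= max_bytes.

    Returns (start_doc, end_doc) pairs; the trainer can read one group per sample
    and cut it to the configured sequence length, or top it up if it falls short.
    """
    groups, start, acc = [], 0, 0
    for i in range(len(offsets) - 1):
        acc += offsets[i + 1] - offsets[i]  # bytes of document i
        if acc >= max_bytes:
            groups.append((start, i + 1))
            start, acc = i + 1, 0
    if acc > 0:
        # trailing partial group; at train time it may be filled with a random extra item
        groups.append((start, len(offsets) - 1))
    return groups
```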

Right now, typical instruction-finetuning jsonl files need to be cast to MemoryMaps as in data/convert_instruct_jsonl_to_mmap.py.

Data in MemoryMap format requires config.data.pretraining = False.
AlignedMemoryMaps require config.data.pretraining = True.

### Launch training
We include example Determined configs in determined_configs/ and example training model configs in configs/.
You may launch a training run through one of the Determined experiments, or adapt the torch distributed launcher command in the following line to your environment:
```
python3 -m determined.launch.torch_distributed python3 src/trigram_tokenizer/trainer/train_determined.py --config configs/7b_fineweb.yaml
```


## Inference
Run one of the prepared inference* scripts.

Note that inference decoding works differently from that of other LLMs.
Currently we build a dictionary of word patterns and compute its product with the model's output logits to select, e.g., the next top word. This is shown in Fig. 4 below.
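
A minimal sketch of that scoring step (names and shapes are assumptions for illustration, not the InferencePipe API): each dictionary word is represented by a sparse pattern over the trigram vocabulary, and the next word is the one whose pattern scores highest against the logits.

```python
import numpy as np

def pick_next_word(logits, word_patterns, words):
    """Sketch: score each dictionary word by the dot product of its trigram
    pattern with the model's output logits; return the best-scoring word.

    logits:        (vocab_dim,) output activations over trigram slots
    word_patterns: (num_words, vocab_dim) sparse 0/1 pattern matrix; edge trigrams
                   may be downweighted (factor < 0.8) when this matrix is built
    words:         list of num_words surface forms
    """
    scores = word_patterns @ logits  # (num_words,)
    # normalize by pattern size so longer words are not trivially favored (assumption)
    scores = scores / np.maximum(word_patterns.sum(axis=1), 1)
    return words[int(np.argmax(scores))]
```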

As such, you need or may want to do the following (see the usage sketch after this list):
- InferencePipe.__init__: pass a path to a collections.Counter pickle file through the "top_word_dict" parameter - its words will be converted to the patterns of the respective loaded checkpoint config. We counted the entire fineweb-edu dataset once; a top-k subset is found in data/en_fineweb_top_1m_counter.pckl.
- InferencePipe.__init__: pass an integer through the "reduce_tokenizer_words_to" parameter - this cuts the above collection down to its top-n words.
- InferencePipe.generate: pass a "more_words" string - a string of words (separated by whitespace) that will dynamically be added to the dictionary for on-demand sampling.
- InferencePipe.tokenizer.convert_weight_for_word_edge_overweight: downweights edge trigrams as discussed in the paper, to reduce artifacts (usually a factor < 0.8).
- call data/generate_tokenizer.py once with a trained checkpoint. This preprocesses the patterns for the passed dictionary once and stores them next to the checkpoint. Subsequent calls to InferencePipe.__init__ then do not need the dict and directly load the stored patterns.
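
Putting the above together, an illustrative call might look like the following; the import path, checkpoint argument, and prompt handling are assumptions -- refer to the prepared inference* scripts for the actual entry points.

```python
from trigram_tokenizer.inference import InferencePipe  # assumed import path

pipe = InferencePipe(
    "checkpoints/7b_fineweb",                             # hypothetical checkpoint path
    top_word_dict="data/en_fineweb_top_1m_counter.pckl",  # word counts -> patterns
    reduce_tokenizer_words_to=100_000,                    # keep only the top-n words
)
# "more_words" dynamically extends the dictionary for on-demand sampling
print(pipe.generate("The capital of France is", more_words="Paris Lyon Marseille"))
```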

![Figure 4: T-Free vs Classical Decode](figures/fig_4.png)

## Known Issues / Ongoing Research

N/A



## Cite

```
@article{deiseroth2024t,
title={T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings},
author={Deiseroth, Bj{\"o}rn and Brack, Manuel and Schramowski, Patrick and Kersting, Kristian and Weinbach, Samuel},
journal={arXiv preprint arXiv:2406.19223},
year={2024}
}
```