Use `setup_slurm.sh` to set up the environment. To register the virtual environment as a Jupyter kernel, run:

```bash
pip install ipykernel
python -m ipykernel install --user --name=venv-lola
```
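To confirm the kernel was registered, you can list the installed kernels (assuming Jupyter is available in the environment):

```bash
# The venv-lola kernel should appear in this list
jupyter kernelspec list
```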
To introduce static gating into the MoE layer, we have tweaked the logic in the DeepSpeed library. At the moment, using this logic requires a workaround that replaces the original `layer.py` with `gpt/overriden_classes/layer.py`. The original file should be located in your virtual environment at a path like `venv-lola/lib/<your-python-version>/site-packages/deepspeed/moe/layer.py`. To replace it, run:
```bash
# Backup the original file
mv venv-lola/lib/<your-python-version>/site-packages/deepspeed/moe/layer.py venv-lola/lib/<your-python-version>/site-packages/deepspeed/moe/layer.py_original
# Copy the modified file
cp lola_ws/gpt/overriden_classes/layer.py venv-lola/lib/<your-python-version>/site-packages/deepspeed/moe/
```
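If you are unsure of the exact `site-packages` path, one way to locate the original file is to ask Python directly (a minimal sketch, assuming the `venv-lola` environment is activated and DeepSpeed is installed):

```bash
# Print the on-disk location of the installed DeepSpeed MoE layer module
python -c "import deepspeed.moe.layer as m; print(m.__file__)"
```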
Note: The scripts below are written for the noctua2 cluster and contain hardcoded paths. Please go through them before reusing.
```bash
# This command might fail from time to time; rerunning it resumes the download
huggingface-cli download uonlp/CulturaX --repo-type dataset
```
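Since the download resumes where it left off, a simple retry loop can keep restarting the command until it succeeds (a minimal sketch using the same flags as above):

```bash
# Retry until huggingface-cli exits successfully; interrupted downloads resume
until huggingface-cli download uonlp/CulturaX --repo-type dataset; do
    echo "Download interrupted, retrying in 60s..." >&2
    sleep 60
done
```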
Once the download is finished, create a symlink named `CulturaX` pointing to the snapshot in your Hugging Face cache, e.g.:
```bash
ln -s /scratch/hpc-prf-lola/nikit/.cache/huggingface/datasets--uonlp--CulturaX/snapshots/321a983f3fd2a929cc1f8ef6207834bab0bb9e25 /scratch/hpc-prf-lola/data/raw_datasets/CulturaX
```
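To check that the symlink resolves correctly, list its target through the link (using the example paths from above):

```bash
# Follow the symlink and show a few entries of the snapshot directory
ls -L /scratch/hpc-prf-lola/data/raw_datasets/CulturaX | head
```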
Then run the following command to generate Arrow files for all the languages:
```bash
# Note: This command will spawn 167 jobs on your cluster
bash run_process_culturax.sh
```
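With that many jobs in flight, standard Slurm tooling can be used to monitor progress:

```bash
# List your pending and running jobs
squeue -u "$USER"
```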
We collected the CulturaX stats in this file: `culturax-v1-0-0_data_stats.json`.
We define the percentage of samples to extract per language for preprocessing in `culturax-custom-data-split.json` (the default applies to languages not listed there).
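For illustration, such a split file might look like the following (a hypothetical sketch; the language codes and percentages here are placeholders, not the actual values from `culturax-custom-data-split.json`):

```json
{
  "default": 10.0,
  "en": 5.0,
  "de": 25.0
}
```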
Afterwards, run the following script to submit preprocessing jobs for all languages (one Slurm job per language):
```bash
python3 preprocess_large_data.py
```
The processed datasets will be available at the `DATA_PATH` mentioned in `preprocess_large_data.sh`.
As per the discussion here: NVIDIA/Megatron-LM#452, merging the data into one big file makes sense for some filesystems.
To merge the files, first copy all the `*_text_document` files with `.bin` and `.idx` extensions into a single directory, and then use the merge tool:
```bash
# Copy files for merge
cp -r <path-to-processed-data>/data-*/meg-culturax-*_text_document* <path-to-collected-files-for-merge>
# Merge the dataset
sbatch merge_datasets.sh <path-to-collected-files-for-merge> <path-to-output-dir> meg-culturax
```
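After the merge job finishes, the output directory should contain a single merged `.bin`/`.idx` pair (assuming the last argument, `meg-culturax`, is used as the output file prefix):

```bash
# Sanity check: inspect the merged dataset files
ls -lh <path-to-output-dir>/meg-culturax*
```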