In this repository, we will follow a training course on large language models (LLMs) for genomics. The training comprises a short lecture and several lab sessions.
The lecture notes are in the file:
- CM_llm_genomics.pdf
The data can be found in the file:
- data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz
The file contains 100,000 non-overlapping DNA sequences of 200 bases, corresponding to around 1% of the human genome.
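As a quick sanity check, the gzipped CSV can be inspected with pandas; this is a minimal sketch, and the column name `sequence` is an assumption rather than something specified by the file:

```python
import pandas as pd

# Load the gzipped CSV of 200-base DNA sequences (pandas decompresses .gz automatically).
df = pd.read_csv("data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")

print(df.shape)   # expected to report 100,000 rows
print(df.head())  # column names may differ from the "sequence" assumption used below
```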
We will pretrain an LLM from scratch (a simplified Mistral model; see the folder data/models/Mixtral-8x7B-v0.1/) on the 100,000 DNA sequences from the human genome. The LLM is pretrained with causal language modeling on the 200-base DNA sequences from the human genome hg38 assembly.
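Below is a minimal sketch of such a causal language modeling pretraining loop with the HuggingFace `transformers` Trainer. It assumes the model folder holds a tokenizer and a reduced-size config, and the column name `sequence`, the sequence length, and the hyperparameters are illustrative assumptions, not the exact lab settings:

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoConfig, AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_dir = "data/models/Mixtral-8x7B-v0.1/"  # simplified Mistral-style config (assumed)

# Load the 200-base sequences; "sequence" column name is an assumption.
df = pd.read_csv("data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz")
dataset = Dataset.from_pandas(df)

tokenizer = AutoTokenizer.from_pretrained(model_dir)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batching

config = AutoConfig.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_config(config)  # from scratch: random weights, no pretraining

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# mlm=False gives causal (next-token) language modeling labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="results/pretraining",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         logging_steps=100)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```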
We will use a pretrained LLM from huggingface (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38) and finetune it for DNA sequence classification. The aim is to classify a DNA sequence according to whether it binds a protein (transcription factor), whether a histone mark is present, or whether a promoter is active.
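A sketch of such a finetuning setup with a sequence-classification head is shown below. The toy labeled examples, label meaning, and hyperparameters are hypothetical placeholders; the labs use real benchmark datasets instead:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Binary classification head on top of the pretrained DNA LLM (e.g. binds the TF or not).
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2,
                                                           trust_remote_code=True)
model.config.pad_token_id = tokenizer.pad_token_id

# Toy labeled examples for illustration only.
data = {"sequence": ["ACGT" * 50, "TTGA" * 50], "label": [1, 0]}
dataset = Dataset.from_dict(data).map(
    lambda batch: tokenizer(batch["sequence"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True)

args = TrainingArguments(output_dir="results/finetuning",
                         per_device_train_batch_size=8,
                         num_train_epochs=1)

Trainer(model=model, args=args, train_dataset=dataset).train()
```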
We will use a pretrained LLM from huggingface (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38) to predict the impact of mutations with zero-shot learning (directly using the pretrained model on DNA sequences). Here, we compute the embedding of the wild-type sequence and the embedding of the mutated sequence, and then compute the L2 distance between the two embeddings. We expect that the higher the distance, the larger the mutation effect.
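A sketch of this zero-shot scoring idea follows: embed both sequences with the pretrained model and take the L2 distance. Mean-pooling the last hidden states and the example sequences are assumptions made for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    # Mean-pool the last hidden states over tokens to get one vector per sequence.
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0)

wild_type = "ACGTACGTACGTACGTACGT"  # toy wild-type sequence
mutated   = "ACGTACGTACGAACGTACGT"  # same sequence with a single-base substitution

# Larger L2 distance is interpreted as a larger predicted mutation effect.
distance = torch.linalg.norm(embed(wild_type) - embed(mutated))
print(f"Predicted mutation effect (L2 distance): {distance.item():.4f}")
```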
We will use a pretrained LLM from huggingface (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-138M-yeast) to generate artificial yeast DNA sequences.
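A minimal sketch of sequence generation with this yeast model through the standard `generate` API; the seed prompt and sampling parameters are illustrative choices, not lab-prescribed values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RaphaelMourad/Mistral-DNA-v1-138M-yeast"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Seed the generation with a short DNA prompt (arbitrary example).
inputs = tokenizer("ATG", return_tensors="pt")

outputs = model.generate(**inputs,
                         max_new_tokens=100,
                         do_sample=True,       # sample rather than greedy decode
                         top_k=50,
                         temperature=1.0,
                         pad_token_id=tokenizer.eos_token_id)

# Decode the generated tokens back into a DNA string.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```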