akensert/molcraft

Generative deep learning for molecules using transformers.

Transformers with TensorFlow and Keras, focused on molecule generation and chemistry predictions.

Note

In progress.

Highlights

molcraft aims to implement efficient models, samplers, and [soon] reinforcement learning for SMILES generation and optimization.

  • Models / Layers
    • Implement key-value caching for efficient autoregression
  • Samplers
    • Sample next tokens from Models
    • Can generate a batch of sequences in parallel, non-eagerly
    • Can generate a batch of sequences from initial (seed) sequences of varying lengths (see the seeded example under Code Examples)
  • Tokenizers
    • Tokenize data input for Models
    • Can be adapted to data via tokenizer.adapt(ds) to build the vocabulary
    • Can be added as a layer to a keras.Sequential model
    • Can both tokenize and detokenize data (see the sketch after this list)
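
A minimal sketch of the tokenizer round-trip described above. The adapt and call usage mirrors the Code Examples section below; the detokenize method name is an assumption based on the "tokenize and detokenize" bullet:

from molcraft import tokenizers

smiles = ['CCO', 'c1ccccc1', 'CC(=O)O']

# Build the vocabulary from the data, then map SMILES strings to token IDs
tokenizer = tokenizers.SMILESTokenizer(add_bos=True, add_eos=True)
tokenizer.adapt(smiles)
token_ids = tokenizer(smiles)

# Map token IDs back to SMILES strings ('detokenize' is an assumed name)
smiles_out = tokenizer.detokenize(token_ids)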

Code Examples

import tensorflow as tf
import keras
import random

from molcraft import tokenizers
from molcraft import models
from molcraft import samplers 

filename = './data/zinc250K.txt'  # replace with the actual path to your SMILES file (one SMILES per line)

with open(filename, 'r') as fh:
    smiles = fh.read().splitlines()

random.shuffle(smiles)

# Adapt tokenizer (create vocabulary)
tokenizer = tokenizers.SMILESTokenizer(add_bos=True, add_eos=True)
tokenizer.adapt(smiles)

# Build dataset (input pipeline)
ds = tf.data.Dataset.from_tensor_slices(smiles)
ds = ds.shuffle(8192)
ds = ds.batch(256)
ds = ds.map(tokenizer)
# Shift by one token: inputs are tokens [:-1], targets are tokens [1:]
ds = ds.map(lambda x: (x[:, :-1], x[:, 1:]))
ds = ds.prefetch(tf.data.AUTOTUNE)

# Build, compile, and fit model
model = models.TransformerDecoder(
    num_layers=4,
    num_heads=8,
    embedding_dim=512,
    intermediate_dim=1024,
    vocabulary_size=tokenizer.vocabulary_size,
    sequence_length=tokenizer.sequence_length,
    dropout=0,
)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=3e-4), 
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)
model.fit(ds, epochs=1)

# Generate 32 novel SMILES with sampler
sampler = samplers.TopKSampler(model, tokenizer)
smiles = sampler.sample([''] * 32)
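
The sampler can also be seeded with partial SMILES of varying lengths, as listed under Highlights. A minimal sketch, assuming sample accepts arbitrary seed strings just as it accepts empty ones above:

# Continue generation from partial SMILES of varying lengths;
# each seed is completed into a full sequence
seeds = ['CC(=O)', 'c1ccccc1', 'N']
completions = sampler.sample(seeds)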

Installation

Note

The project is under development, and is therefore incomplete and subject to breaking changes.

For GPU users:

git clone git@github.com:akensert/molcraft.git
cd molcraft
pip install -e .[gpu]

For CPU users:

git clone git@github.com:akensert/molcraft.git
cd molcraft
pip install -e .
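
To verify the installation, a quick import check using the modules from the example above:

python -c "from molcraft import models, samplers, tokenizers"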
