README + MSA example updates
wukevin committed Nov 28, 2024
1 parent 31cebfa commit 4071f94
Showing 3 changed files with 10 additions and 18 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -65,7 +65,7 @@ CHAI_DOWNLOADS_DIR=/tmp/downloads python ./examples/predict_structure.py

Chai-1 supports MSAs provided as an `aligned.pqt` file. This file format is similar to an `a3m` file, but has additional columns that provide metadata like the source database and sequence pairing keys. We provide code to convert `a3m` files to `aligned.pqt` files. For more information on how to provide MSAs to Chai-1, see [this documentation](examples/msas/README.md).

For user convenience, we also support automatic MSA generation via the ColabFold MMseqs2 server via the `--msa-server` flag. As detailed in the ColabFold [repository](https://github.com/sokrypton/ColabFold), please keep in mind that this is a shared resource.
For user convenience, we also support automatic MSA generation through the ColabFold [MMseqs2](https://github.com/soedinglab/MMseqs2) server via the `--msa-server` flag. As detailed in the ColabFold [repository](https://github.com/sokrypton/ColabFold), please keep in mind that this is a shared resource. Note that the results reported in our preprint and on the webserver use a different MSA search strategy than MMseqs2, though we expect results to be broadly similar.

</p>
</details>
11 changes: 8 additions & 3 deletions examples/msas/README.md
@@ -2,8 +2,6 @@

While Chai-1 performs very well in "single-sequence mode," it can also be given additional evolutionary information to further improve performance. As in other folding methods, this evolutionary information is provided as a multiple sequence alignment (MSA). Chai-1 consumes it as an `MSAContext` object (see `chai_lab/data/dataset/msas/msa_context.py`); we provide code for building these `MSAContext` objects from `aligned.pqt` files, though you can also construct an `MSAContext` yourself.

Multiple strategies can be used for generating MSAs. In our [technical report](https://chaiassets.com/chai-1/paper/technical_report_v1.pdf), we generated MSAs using [jackhmmer](https://github.com/EddyRivasLab/hmmer). Other algorithms such as [MMseqs2](https://github.com/soedinglab/MMseqs2) can also be used. We provide an example of how to generate MSAs using [ColabFold](https://github.com/sokrypton/ColabFold) in `examples/msas/predict_with_msas.py`. Performance will vary depending on the input MSA databases and search algorithms used.

## The `.aligned.pqt` file format

The easiest way to provide MSA information to Chai-1 is through the `.aligned.pqt` file format that we have defined. This file can be thought of as an augmented `a3m` file, and is essentially a dataframe saved in parquet format with the following four (required) columns:
@@ -58,4 +56,11 @@ import pandas as pd

aligned_pqt = pd.read_parquet("examples/msas/703adc2c74b8d7e613549b6efcf37126da7963522dc33852ad3c691eef1da06f.aligned.pqt")
aligned_pqt.head()
```
```
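
When an MSA directory holds several of these files, it can help to list them before loading any single one. The following is a minimal sketch using only the standard library; the helper name `find_aligned_pqt` is hypothetical and not part of `chai_lab`, and it assumes only that the files end in `.aligned.pqt`, as in the example above.

```python
from pathlib import Path


def find_aligned_pqt(directory: Path) -> list[Path]:
    """Return the `*.aligned.pqt` files directly inside `directory`, sorted by name."""
    # Hypothetical helper for illustration; not part of chai_lab.
    return sorted(directory.glob("*.aligned.pqt"))
```

Each returned path could then be passed to `pd.read_parquet` in the same way as the single file shown earlier.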


## Additional MSA generation strategies

Multiple strategies can be used for generating MSAs. In our [technical report](https://chaiassets.com/chai-1/paper/technical_report_v1.pdf), we generated MSAs using [jackhmmer](https://github.com/EddyRivasLab/hmmer). Other algorithms such as [MMseqs2](https://github.com/soedinglab/MMseqs2) can also be used. In this vein, we provide support for automatic MSA generation via the [ColabFold](https://github.com/sokrypton/ColabFold) server using `chai fold input.fasta output_directory --msa-server`. Performance will vary depending on the input MSA databases and search algorithms used.
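
The command above can also be launched from a script. The sketch below assembles the same argument list and hands it to `subprocess.run`; the function names `build_fold_command` and `run_fold` are hypothetical, and the only CLI syntax relied on is the invocation quoted above.

```python
import subprocess
from pathlib import Path


def build_fold_command(fasta: Path, output_dir: Path, use_msa_server: bool = True) -> list[str]:
    """Assemble `chai fold <fasta> <output_dir> [--msa-server]` as an argument list."""
    cmd = ["chai", "fold", str(fasta), str(output_dir)]
    if use_msa_server:
        # The ColabFold server is a shared resource, so enable this deliberately.
        cmd.append("--msa-server")
    return cmd


def run_fold(fasta: Path, output_dir: Path) -> None:
    # check=True raises CalledProcessError if the CLI exits with a non-zero status.
    subprocess.run(build_fold_command(fasta, output_dir), check=True)
```

Keeping the command construction separate from its execution makes it easy to log or inspect the exact invocation before anything is sent to the server.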

In addition, people have found that tweaking MSA inputs can be a fruitful path to improving folding results; we encourage such exploration for Chai-1 as well!
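
As one illustration of what tweaking an MSA might look like, the sketch below subsamples the rows of an alignment while always keeping the first row. This is an example under stated assumptions, not a procedure taken from the Chai-1 authors: it treats the MSA as a plain list of aligned sequences, assumes the query sits in the first row (the usual `a3m` convention), and the function name `subsample_msa` is hypothetical.

```python
import random


def subsample_msa(rows: list[str], max_rows: int, seed: int = 0) -> list[str]:
    """Keep the first row (assumed to be the query) plus a random subset of the rest."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    # Sample indices rather than values so the kept rows stay in their original order.
    kept = sorted(rng.sample(range(1, len(rows)), max_rows - 1))
    return [rows[0]] + [rows[i] for i in kept]
```

The same index selection could be applied to the rows of a dataframe loaded from an `aligned.pqt` file before writing it back in parquet format.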
15 changes: 1 addition & 14 deletions examples/msas/predict_with_msas.py
@@ -4,9 +4,6 @@
import numpy as np

from chai_lab.chai1 import run_inference
from chai_lab.data.dataset.inference_dataset import read_inputs
from chai_lab.data.dataset.msas.colabfold import generate_colabfold_msas
from chai_lab.data.parsing.structure.entity_type import EntityType

tmp_dir = Path(tempfile.mkdtemp())

@@ -24,16 +21,6 @@
fasta_path = tmp_dir / "example.fasta"
fasta_path.write_text(example_fasta)

# Generate MSAs
msa_dir = tmp_dir / "msas"
msa_dir.mkdir()
protein_seqs = [
    input.sequence
    for input in read_inputs(fasta_path)
    if input.entity_type == EntityType.PROTEIN.value
]
generate_colabfold_msas(protein_seqs=protein_seqs, msa_dir=msa_dir)


# Generate structure
output_dir = tmp_dir / "outputs"
@@ -46,7 +33,7 @@
    seed=42,
    device="cuda:0",
    use_esm_embeddings=True,
    msa_directory=msa_dir,
    msa_directory=Path(__file__).parent,
)
cif_paths = candidates.cif_paths
scores = [rd.aggregate_score for rd in candidates.ranking_data]