update references to HATCHet to HATCHet2
mmyers1 committed Nov 25, 2024
1 parent 7115c38 commit 9c50da6
Showing 48 changed files with 370 additions and 205 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/main.yml
@@ -50,7 +50,7 @@ jobs:
echo "GRB_LICENSE_FILE=${GUROBI_HOME}/gurobi.lic" >> $GITHUB_ENV
continue-on-error: true

- - name: Install HATCHet with dev dependencies
+ - name: Install HATCHet2 with dev dependencies
run: |
python -m pip install .[dev]
env:
@@ -136,7 +136,7 @@ jobs:
tar zxvf 1000GP_Phase3.tgz --wildcards *chr22* *sample
echo "HATCHET_DOWNLOAD_PANEL_REFPANELDIR=$(pwd)" >> $GITHUB_ENV
- - name: HATCHet Check
+ - name: HATCHet2 Check
run: |
hatchet check
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,6 +1,6 @@
cmake_minimum_required( VERSION 2.8 )

- project( HATCHet )
+ project( HATCHet2 )

set( CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR} ${CMAKE_MODULE_PATH} )

6 changes: 3 additions & 3 deletions README.md
@@ -1,8 +1,8 @@
![CI](https://github.com/raphael-group/hatchet/workflows/CI/badge.svg)
[![codecov](https://codecov.io/gh/raphael-group/hatchet/branch/master/graph/badge.svg)](https://codecov.io/gh/raphael-group/hatchet)

- # HATCHet
+ # HATCHet2

- HATCHet is an algorithm to infer allele and clone-specific copy-number aberrations (CNAs), clone proportions, and whole-genome duplications (WGD) for several tumor clones jointly from multiple bulk-tumor samples of the same patient or from a single bulk-tumor sample.
+ HATCHet2 is an algorithm to infer allele and clone-specific copy-number aberrations (CNAs), clone proportions, and whole-genome duplications (WGD) for several tumor clones jointly from multiple bulk-tumor samples of the same patient or from a single bulk-tumor sample.

- Complete documentation for HATCHet is available at [https://raphael-group.github.io/hatchet/](https://raphael-group.github.io/hatchet/)
+ Complete documentation for HATCHet2 is available at [https://raphael-group.github.io/hatchet/](https://raphael-group.github.io/hatchet/)
34 changes: 17 additions & 17 deletions cloud/README.md
@@ -1,14 +1,14 @@
- # Running HATCHet in the cloud
+ # Running HATCHet2 in the cloud

- HATCHet is a Docerizable application and comes with a Dockerfile for easy deployment. We have also made HATCHet
+ HATCHet2 is a Dockerizable application and comes with a Dockerfile for easy deployment. We have also made HATCHet2
available as a publicly accessible Docker image at the [Google Cloud Container Registry](https://cloud.google.com/container-registry).
- This facilitates running HATCHet in the cloud without worrying about downloading large BAM files, and without having to
- build and install HATCHet locally.
+ This facilitates running HATCHet2 in the cloud without worrying about downloading large BAM files, and without having to
+ build and install HATCHet2 locally.

- This README provides details on how to run HATCHet entirely on the [Google Cloud Platform](https://cloud.google.com) (GCP)
+ This README provides details on how to run HATCHet2 entirely on the [Google Cloud Platform](https://cloud.google.com) (GCP)
on large datasets made available at [ISB-CGC](https://isb-cgc.appspot.com/).

- ## Running HATCHet on ISB-CGC Datasets
+ ## Running HATCHet2 on ISB-CGC Datasets

### Setting up access at ISB-CGC

@@ -20,14 +20,14 @@ section and follow the steps to register your Google project with ISB-CGC.

Note that your PI will most likely have to grant you access to one or more of these controlled datasets using
[dbGap](https://dbgap.ncbi.nlm.nih.gov/). The steps in the walk-throughs and tutorials on the ISB-CGC website will
- verify that you do have the appropriate access you will need to programmatically read these datasets in HATCHet.
+ verify that you have the appropriate access you will need to programmatically read these datasets in HATCHet2.

Also note that access to controlled datasets is typically granted only for 24 hours, so you will have to extend your
access period on the ISB-CGC website if it has expired.

- ### Setting up your environment to run HATCHet on GCP
+ ### Setting up your environment to run HATCHet2 on GCP

- You do not need to build or install HATCHet locally, either as a python package or a Docker image. The only pre-requisite
+ You do not need to build or install HATCHet2 locally, either as a Python package or a Docker image. The only prerequisite
is that you have installed the [Google Cloud SDK](https://cloud.google.com/sdk/docs/quickstart).

This is most cleanly done by installing all required dependencies inside a new Python 3 Conda environment.
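A minimal sketch of that environment setup might look as follows; the environment name and Python version here are illustrative assumptions, and the collapsed lines of the diff below contain the actual commands:

```shell
# Hypothetical setup -- environment name and Python version are illustrative.
conda create -n hatchet2-cloud python=3 -y
conda activate hatchet2-cloud
pip install oauth2client dsub
```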
@@ -43,7 +43,7 @@ pip install oauth2client dsub
### Logging in to your GCP Account

After installing the required dependencies, make sure that you log in to your Google account and set up your default
- project. These are **one time steps** to make sure that HATCHet is able to correctly talk to your project.
+ project. These are **one-time steps** to make sure that HATCHet2 is able to correctly talk to your project.

```
gcloud auth application-default login
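# Illustrative addition, not part of the commit (the rest of this block is
# collapsed in the diff): the default project is typically set with the
# standard gcloud command below.
gcloud config set project PROJECT_ID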
```
@@ -61,7 +61,7 @@ that you linked with your ISB-CGC account.
### Preparing a bucket for output files

In the Google project that you used in the steps above, use the following command to create a new bucket where the results
- of your HATCHet analysis will be saved:
+ of your HATCHet2 analysis will be saved:

```
gsutil mb gs://BUCKET_NAME
```
@@ -70,10 +70,10 @@ gsutil mb gs://BUCKET_NAME
Replace `BUCKET_NAME` with a globally unique bucket name. This step can also be performed by logging in to the
[Google Cloud Console](https://console.cloud.google.com) and navigating to Home -> Storage -> Browser -> Create Bucket.

- ### Fine-tuning the HATCHet script
+ ### Fine-tuning the HATCHet2 script

- The `_run.sh` script provided with HATCHet is an end-end worflow of HATCHet. This will be familiar to you if you have
- run HATCHet locally. You can comment out sections of this script to only run certain parts of HATCHet depending on your
+ The `_run.sh` script provided with HATCHet2 is an end-to-end workflow of HATCHet2. This will be familiar to you if you have
+ run HATCHet2 locally. You can comment out sections of this script to run only certain parts of HATCHet2 depending on your
needs, and specify the values of certain flags of the pipeline.

The part of the script that you will want to pay attention to is the `Reference Genome` section. Depending on the
@@ -83,8 +83,8 @@ or `.fa` file available through `wget`.
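For illustration, one publicly available reference genome retrievable through `wget` is the UCSC hg19 build; this specific URL is an example chosen here, not one named by the commit:

```shell
# Example only: download and decompress a reference genome via wget
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
```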

### Running the scripts

- The `cloud_run.sh` script provided with HATCHet is a single [dsub](https://github.com/DataBiosphere/dsub) command that
- will run HATCHet in the cloud. This command leverages the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest)
+ The `cloud_run.sh` script provided with HATCHet2 is a single [dsub](https://github.com/DataBiosphere/dsub) command that
+ will run HATCHet2 in the cloud. This command leverages the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest)
and internally performs the following series of steps:

<a name="cloud_steps"></a>
@@ -116,7 +116,7 @@ dsub \
```

In the above command, you will want to replace `PROJECT_ID` with your project id, `BUCKET_NAME` with the bucket name that
- you created above, `RUN_NAME` with any unique name (no spaces!) that identifies your HATCHet run. In addition:
+ you created above, `RUN_NAME` with any unique name (no spaces!) that identifies your HATCHet2 run. In addition:

- The `NORMALBAM` parameter should be replaced with the `gs://..` path to the matched-normal sample of the patient.

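To make those substitutions concrete, here is a hypothetical skeleton of such a dsub invocation; every flag value below is an illustrative placeholder, and the actual command lives in `cloud_run.sh`, which is collapsed in the diff above:

```shell
# Hypothetical dsub skeleton -- values are placeholders, not the real cloud_run.sh.
dsub \
  --provider google-cls-v2 \
  --project PROJECT_ID \
  --regions us-central1 \
  --logging "gs://BUCKET_NAME/RUN_NAME/logs" \
  --input NORMALBAM="gs://path/to/normal.bam" \
  --output-recursive OUTPUT="gs://BUCKET_NAME/RUN_NAME/output" \
  --image gcr.io/PROJECT_ID/hatchet2 \
  --script _run.sh
```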
4 changes: 2 additions & 2 deletions custom/GATK4-CNV/custom-gatk4-cnv.sh
@@ -1,8 +1,8 @@
#!/usr/bin/bash

- # This is a custom complete pipeline of HATCHet which considers in input segmented files for one or more samples from the same patient, produced by the GATK4 CNV pipeline.
+ # This is a custom complete pipeline of HATCHet2 which takes as input segmented files for one or more samples from the same patient, produced by the GATK4 CNV pipeline.

- HATCHET_HOME="/path/to/hatchet_home" # Provide the full path to HATCHet's repository
+ HATCHET_HOME="/path/to/hatchet_home" # Provide the full path to HATCHet2's repository

CNVTOBB="${HATCHET_HOME}/custom/GATK4-CNV/gatk4cnsToBB.py"

20 changes: 10 additions & 10 deletions custom/GATK4-CNV/demo-gatk4-cnv.sh
@@ -1,7 +1,7 @@
# Demo of the custom pipeline for GATK4 CNV data
: ex: set ft=markdown ;:<<'```shell' # This line makes this file both a guided and an executable DEMO. The file can be displayed as a Markdown file, in which the instructions and descriptions of the demo and its results can be read, and it can also be executed directly with BASH to run the demo after setting up the requirements below.

- The following HATCHet's demo represents a guided example of the custom pipeline designed to start from the data produced by the [GATK4 CNV pipeline](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11147). This custom pipeline considers one or more tumor samples from the same patient which have been segmented through the GATK4 CNV pipeline, such that for each sample a **segmented file** is available. The expected format of each segmented file is first described in the following section. Next, the requirements for this demo are described and the guided demo is detailed across the different steps.
+ The following HATCHet2 demo represents a guided example of the custom pipeline designed to start from the data produced by the [GATK4 CNV pipeline](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11147). This custom pipeline considers one or more tumor samples from the same patient which have been segmented through the GATK4 CNV pipeline, such that for each sample a **segmented file** is available. The expected format of each segmented file is first described in the following section. Next, the requirements for this demo are described and the guided demo is detailed across the different steps.
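As an aside, the polyglot trick in the header line above can be sketched in isolation: `:` is the shell no-op, and a quoted heredoc delimiter makes bash skip the prose up to the delimiter line, which in the demo is the Markdown fence itself. The sketch below is illustrative and not part of the commit:

```shell
# Minimal sketch of the no-op heredoc trick used by the demo file:
# ':' ignores its stdin, and the quoted delimiter disables expansion,
# so bash skips everything up to the delimiter line.
: <<'END_OF_PROSE'
This text is Markdown prose as far as bash is concerned.
END_OF_PROSE
echo "execution resumes here"
```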

## Input format

@@ -27,14 +27,14 @@ Two example segmented files in this format for two tumor samples from the same p

## Requirements and setup

- The demo requires that HATCHet has been succesfully compiled and all the dependencies are available and functional. As such, the demo requires the user to properly set up the following paths:
+ The demo requires that HATCHet2 has been successfully compiled and all the dependencies are available and functional. As such, the demo requires the user to properly set up the following paths:

```shell
PY="python3" # This is the full path to the version of PYTHON3 which contains the required modules. When this corresponds to the standard version, the user can keep the given value of `python3`
:<<'```shell' # Ignore this line
```
- The following paths are consequently obtained to point to the required components of HATCHet
+ The following paths are consequently obtained to point to the required components of HATCHet2
```shell
CLUSTERBINS="${PY} -m hatchet cluster-bins"
```
@@ -55,7 +55,7 @@ PS4='[\t]'
## Generating input BB file
- The first step of this custom pipeline aims to generate an input BB file for HATCHet starting from the given segmented files; in this case, we consider the two examples included with this demo `sample1.GATK4.CNV.seg` and `sample2.GATK4.CNV.seg`. The corresponding BB file can be easily obtained by using the custom python script [gatk4cnsToBB.py](gatk4cnsToBB.py) included in the custom pipeline. We apply the script by specifiying the two segmented files in a white-sperated list between apices and specifying the names of the samples in the same order with `--samples`. In addition, we consider the default values of the parameters and we run it as follows:
+ The first step of this custom pipeline aims to generate an input BB file for HATCHet2 starting from the given segmented files; in this case, we consider the two examples included with this demo, `sample1.GATK4.CNV.seg` and `sample2.GATK4.CNV.seg`. The corresponding BB file can be easily obtained by using the custom Python script [gatk4cnsToBB.py](gatk4cnsToBB.py) included in the custom pipeline. We apply the script by specifying the two segmented files in a white-space-separated list between quotes and specifying the names of the samples in the same order with `--samples`. In addition, we consider the default values of the parameters and we run it as follows:
```shell
${GATK4CNSTOBB} "sample1.GATK4.CNV.seg sample2.GATK4.CNV.seg" --samples "Sample1 Sample2" > samples.GATK4.CNV.bb
```
@@ -66,14 +66,14 @@ In addition, one could consider different size of the resulting bins by using th
## Global clustering
- Having the input BB file, we can continue by executing the standard HATCHet pipeline and skipping the pre-processing steps (`count-reads`, `count-alleles`, and `combine-counts`). As such, the next main step of the demo performs the global clustering of HATCHet where genomic bins which have the same copy-number state in every tumor clone are clustered correspondingly. To do this, we use `cluster-bins`, i.e. the HATCHet's component designed for this purpose. At first, we attempt to run the clustering using the default values of the parameters as follows:
+ Having the input BB file, we can continue by executing the standard HATCHet2 pipeline and skipping the pre-processing steps (`count-reads`, `count-alleles`, and `combine-counts`). As such, the next main step of the demo performs the global clustering of HATCHet2, where genomic bins which have the same copy-number state in every tumor clone are clustered correspondingly. To do this, we use `cluster-bins`, i.e. the HATCHet2 component designed for this purpose. At first, we attempt to run the clustering using the default values of the parameters as follows:
```shell
${CLUSTERBINS} samples.GATK4.CNV.bb -o samples.GATK4.CNV.seg -O samples.GATK4.CNV.bbc -e 12 -tB 0.03 -tR 0.15 -d 0.08
:<<'```shell' # Ignore this line
```
- To assess the quality of the clustering we generate the cluster plot using the `CBB` command of `plot-bins`, i.e. the HATCHet's component designed for the analysis of the data. For simplicity, we also use the following option `-tS 0.001` which asks to plot only the clusters which cover at least the `0.1%` of the genome. This is useful to clean the figure and focus on the main components.
+ To assess the quality of the clustering we generate the cluster plot using the `CBB` command of `plot-bins`, i.e. the HATCHet2 component designed for the analysis of the data. For simplicity, we also use the option `-tS 0.001`, which asks to plot only the clusters that cover at least `0.1%` of the genome. This is useful to clean the figure and focus on the main components.
```shell
${PLOTBINS} -c CBB samples.GATK4.CNV.bbc -tS 0.001
```
@@ -88,8 +88,8 @@ We can easily notice that the clustering is good and not tuning is needed as eve
## hatchet's step
- Next we apply `hatchet`, i.e. the component of HATCHet which estimates fractional copy numbers, infers allele-and-clone specific copy numbers, and jointly predicts the number of clones (including the normal clone) and the presence of a WGD.
- We apply the last step with default parameters and, for simplicity of this demo, we consider 6 clones, which can be easily considered by HATCHet in this case, and we only consider 100 restarts for the coordinate-descent method; these are the number of attempts to find the best solution. This number is sufficient in this small example but we reccommend to use at least 400 restarts in standard runs.
+ Next we apply `hatchet`, i.e. the component of HATCHet2 which estimates fractional copy numbers, infers allele-and-clone specific copy numbers, and jointly predicts the number of clones (including the normal clone) and the presence of a WGD.
+ We apply the last step with default parameters and, for simplicity of this demo, we consider 6 clones, which can be easily considered by HATCHet2 in this case, and we only consider 100 restarts for the coordinate-descent method; these are the number of attempts to find the best solution. This number is sufficient in this small example but we recommend using at least 400 restarts in standard runs.
```shell
${INFER} -i samples.GATK4.CNV -n2,6 -p 100 -v 2 -u 0.03 -r 12 -eD 6 -eT 12 -l 0.6 |& tee hatchet.log
```
@@ -117,11 +117,11 @@ We obtain the following summary of results:
## The related-tetraploid resulting files are copied to ./chosen.tetraploid.bbc.ucn and ./chosen.tetraploid.seg.ucn
# The chosen solution is diploid with 3 clones and is written in ./best.bbc.ucn and ./best.seg.ucn
- HATCHet predicts the presence of 3 clones in the 2 tumor samples and, especially, predicts that a sample contains two distinct tumor clones, according to the true clonal composition, and one of these clones is shared with the other sample.
+ HATCHet2 predicts the presence of 3 clones in the 2 tumor samples and, in particular, predicts that one sample contains two distinct tumor clones, in agreement with the true clonal composition, with one of these clones shared with the other sample.
## Analyzing inferred results
- Finally, we obtain useful plots to summarize and analyze the inferred results by using `plot-cn`, which is the last component of HATCHet. We run `plot-cn` as follows
+ Finally, we obtain useful plots to summarize and analyze the inferred results by using `plot-cn`, which is the last component of HATCHet2. We run `plot-cn` as follows:
```shell
${PLOTCN} best.bbc.ucn
```
4 changes: 2 additions & 2 deletions custom/GATK4-CNV/gatk4cnsToBB.py
@@ -11,15 +11,15 @@
def parse_args():
description = (
"This method takes in input multiple samples from the same patient, where each sample is a "
"segmented CNV file produced by GATK4 CNV pipeline, and produces a BB input file for HATCHet."
"segmented CNV file produced by GATK4 CNV pipeline, and produces a BB input file for HATCHet2."
)
parser = argparse.ArgumentParser(description=description)
parser.add_argument(
"INPUT",
type=str,
help=(
"A white-space-separated list between apices where each element is a segmented CNV file produced by "
"GATK4 CNV pipeline. The file format is describe in the HATCHet's repository."
"GATK4 CNV pipeline. The file format is describe in the HATCHet2's repository."
),
)
parser.add_argument(
4 changes: 2 additions & 2 deletions docs/buildDocs.sh
@@ -27,9 +27,9 @@ git checkout -b gh-pages

# Add README
cat > README.md <<EOF
- # HATCHet
+ # HATCHet2
- HATCHet documentation is available at [http://compbio.cs.brown.edu/hatchet/](http://compbio.cs.brown.edu/hatchet/)
+ HATCHet2 documentation is available at [http://compbio.cs.brown.edu/hatchet/](http://compbio.cs.brown.edu/hatchet/)
EOF
