update references to HATCHet to HATCHet2
mmyers1 committed Nov 25, 2024
1 parent 7115c38 commit 9c50da6
Showing 48 changed files with 370 additions and 205 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/main.yml
@@ -50,7 +50,7 @@ jobs:
echo "GRB_LICENSE_FILE=${GUROBI_HOME}/gurobi.lic" >> $GITHUB_ENV
continue-on-error: true

- - name: Install HATCHet with dev dependencies
+ - name: Install HATCHet2 with dev dependencies
run: |
python -m pip install .[dev]
env:
@@ -136,7 +136,7 @@ jobs:
tar zxvf 1000GP_Phase3.tgz --wildcards *chr22* *sample
echo "HATCHET_DOWNLOAD_PANEL_REFPANELDIR=$(pwd)" >> $GITHUB_ENV
- - name: HATCHet Check
+ - name: HATCHet2 Check
run: |
hatchet check
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -1,6 +1,6 @@
cmake_minimum_required( VERSION 2.8 )

- project( HATCHet )
+ project( HATCHet2 )

set( CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR} ${CMAKE_MODULE_PATH} )

6 changes: 3 additions & 3 deletions README.md
@@ -1,8 +1,8 @@
![CI](https://github.com/raphael-group/hatchet/workflows/CI/badge.svg)
[![codecov](https://codecov.io/gh/raphael-group/hatchet/branch/master/graph/badge.svg)](https://codecov.io/gh/raphael-group/hatchet)

- # HATCHet
+ # HATCHet2

- HATCHet is an algorithm to infer allele and clone-specific copy-number aberrations (CNAs), clone proportions, and whole-genome duplications (WGD) for several tumor clones jointly from multiple bulk-tumor samples of the same patient or from a single bulk-tumor sample.
+ HATCHet2 is an algorithm to infer allele and clone-specific copy-number aberrations (CNAs), clone proportions, and whole-genome duplications (WGD) for several tumor clones jointly from multiple bulk-tumor samples of the same patient or from a single bulk-tumor sample.

- Complete documentation for HATCHet is available at [https://raphael-group.github.io/hatchet/](https://raphael-group.github.io/hatchet/)
+ Complete documentation for HATCHet2 is available at [https://raphael-group.github.io/hatchet/](https://raphael-group.github.io/hatchet/)
34 changes: 17 additions & 17 deletions cloud/README.md
@@ -1,14 +1,14 @@
- # Running HATCHet in the cloud
+ # Running HATCHet2 in the cloud

- HATCHet is a Docerizable application and comes with a Dockerfile for easy deployment. We have also made HATCHet
+ HATCHet2 is a Dockerizable application and comes with a Dockerfile for easy deployment. We have also made HATCHet2
available as a publicly accessible Docker image at the [Google Cloud Container Registry](https://cloud.google.com/container-registry).
- This facilitates running HATCHet in the cloud without worrying about downloading large BAM files, and without having to
- build and install HATCHet locally.
+ This facilitates running HATCHet2 in the cloud without worrying about downloading large BAM files, and without having to
+ build and install HATCHet2 locally.

- This README provides details on how to run HATCHet entirely on the [Google Cloud Platform](https://cloud.google.com) (GCP)
+ This README provides details on how to run HATCHet2 entirely on the [Google Cloud Platform](https://cloud.google.com) (GCP)
on large datasets made available at [ISB-CGC](https://isb-cgc.appspot.com/).

- ## Running HATCHet on ISB-CGC Datasets
+ ## Running HATCHet2 on ISB-CGC Datasets

### Setting up access at ISB-CGC

@@ -20,14 +20,14 @@ section and follow the steps to register your Google project with ISB-CGC.

Note that your PI will most likely have to grant you access to one or more of these controlled datasets using
[dbGap](https://dbgap.ncbi.nlm.nih.gov/). The steps in the walk-throughs and tutorials on the ISB-CGC website will
- verify that you do have the appropriate access you will need to programmatically read these datasets in HATCHet.
+ verify that you have the appropriate access you will need to programmatically read these datasets in HATCHet2.

Also note that access to controlled datasets is typically granted only for 24 hours, so you will have to extend your
access period on the ISB-CGC website if it has expired.

- ### Setting up your environment to run HATCHet on GCP
+ ### Setting up your environment to run HATCHet2 on GCP

- You do not need to build or install HATCHet locally, either as a python package or a Docker image. The only pre-requisite
+ You do not need to build or install HATCHet2 locally, either as a Python package or a Docker image. The only prerequisite
is that you have installed the [Google Cloud SDK](https://cloud.google.com/sdk/docs/quickstart).

This is most cleanly done by installing all required dependencies inside a new Python 3 Conda environment.
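A minimal sketch of that environment setup might look as follows; the environment name and Python version here are illustrative assumptions, and the collapsed lines of the diff below contain the actual commands:

```shell
# Hypothetical setup -- environment name and Python version are illustrative.
conda create -n hatchet2-cloud python=3 -y
conda activate hatchet2-cloud
pip install oauth2client dsub
```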
@@ -43,7 +43,7 @@ pip install oauth2client dsub
### Logging in to your GCP Account

After installing the required dependencies, make sure that you log in to your Google account and set up your default
- project. These are **one time steps** to make sure that HATCHet is able to correctly talk to your project.
+ project. These are **one-time steps** to make sure that HATCHet2 is able to correctly talk to your project.

```
gcloud auth application-default login
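# Illustrative addition, not part of the commit (the rest of this block is
# collapsed in the diff): the default project is typically set with the
# standard gcloud command below.
gcloud config set project PROJECT_ID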
```
@@ -61,7 +61,7 @@ that you linked with your ISB-CGC account.
### Preparing a bucket for output files

In the Google project that you used in the steps above, use the following command to create a new bucket where the results
- of your HATCHet analysis will be saved:
+ of your HATCHet2 analysis will be saved:

```
gsutil mb gs://BUCKET_NAME
```
@@ -70,10 +70,10 @@ gsutil mb gs://BUCKET_NAME
Replace `BUCKET_NAME` with a globally unique bucket name. This step can also be performed by logging in to the
[Google Cloud Console](https://console.cloud.google.com) and navigating to Home -> Storage -> Browser -> Create Bucket.

- ### Fine-tuning the HATCHet script
+ ### Fine-tuning the HATCHet2 script

- The `_run.sh` script provided with HATCHet is an end-end worflow of HATCHet. This will be familiar to you if you have
- run HATCHet locally. You can comment out sections of this script to only run certain parts of HATCHet depending on your
+ The `_run.sh` script provided with HATCHet2 is an end-to-end workflow of HATCHet2. This will be familiar to you if you have
+ run HATCHet2 locally. You can comment out sections of this script to run only certain parts of HATCHet2 depending on your
needs, and specify the values of certain flags of the pipeline.

The part of the script that you will want to pay attention to is the `Reference Genome` section. Depending on the
@@ -83,8 +83,8 @@ or `.fa` file available through `wget`.
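For illustration, one publicly available reference genome retrievable through `wget` is the UCSC hg19 build; this specific URL is an example chosen here, not one named by the commit:

```shell
# Example only: download and decompress a reference genome via wget
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
```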

### Running the scripts

- The `cloud_run.sh` script provided with HATCHet is a single [dsub](https://github.com/DataBiosphere/dsub) command that
- will run HATCHet in the cloud. This command leverages the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest)
+ The `cloud_run.sh` script provided with HATCHet2 is a single [dsub](https://github.com/DataBiosphere/dsub) command that
+ will run HATCHet2 in the cloud. This command leverages the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest)
and internally performs the following series of steps:

<a name="cloud_steps"></a>
@@ -116,7 +116,7 @@ dsub \
```

In the above command, you will want to replace `PROJECT_ID` with your project id, `BUCKET_NAME` with the bucket name that
- you created above, `RUN_NAME` with any unique name (no spaces!) that identifies your HATCHet run. In addition:
+ you created above, `RUN_NAME` with any unique name (no spaces!) that identifies your HATCHet2 run. In addition:

- The `NORMALBAM` parameter should be replaced with the `gs://..` path to the matched-normal sample of the patient.

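To make those substitutions concrete, here is a hypothetical skeleton of such a dsub invocation; every flag value below is an illustrative placeholder, and the actual command lives in `cloud_run.sh`, which is collapsed in the diff above:

```shell
# Hypothetical dsub skeleton -- values are placeholders, not the real cloud_run.sh.
dsub \
  --provider google-cls-v2 \
  --project PROJECT_ID \
  --regions us-central1 \
  --logging "gs://BUCKET_NAME/RUN_NAME/logs" \
  --input NORMALBAM="gs://path/to/normal.bam" \
  --output-recursive OUTPUT="gs://BUCKET_NAME/RUN_NAME/output" \
  --image gcr.io/PROJECT_ID/hatchet2 \
  --script _run.sh
```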
4 changes: 2 additions & 2 deletions custom/GATK4-CNV/custom-gatk4-cnv.sh
@@ -1,8 +1,8 @@
#!/usr/bin/bash

- # This is a custom complete pipeline of HATCHet which considers in input segmented files for one or more samples from the same patient, produced by the GATK4 CNV pipeline.
+ # This is a custom complete pipeline of HATCHet2 which takes as input segmented files for one or more samples from the same patient, produced by the GATK4 CNV pipeline.

- HATCHET_HOME="/path/to/hatchet_home" # Provide the full path to HATCHet's repository
+ HATCHET_HOME="/path/to/hatchet_home" # Provide the full path to HATCHet2's repository

CNVTOBB="${HATCHET_HOME}/custom/GATK4-CNV/gatk4cnsToBB.py"

20 changes: 10 additions & 10 deletions custom/GATK4-CNV/demo-gatk4-cnv.sh
@@ -1,7 +1,7 @@
# Demo of the custom pipeline for GATK4 CNV data
: ex: set ft=markdown ;:<<'```shell' # This line makes this file both a guided and an executable DEMO. The file can be displayed as a Markdown file, in which the instructions and descriptions of the demo and its results can be read, and it can also be executed directly with BASH to run the demo after setting up the requirements below.

- The following HATCHet's demo represents a guided example of the custom pipeline designed to start from the data produced by the [GATK4 CNV pipeline](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11147). This custom pipeline considers one or more tumor samples from the same patient which have been segmented through the GATK4 CNV pipeline, such that for each sample a **segmented file** is available. The expected format of each segmented file is first described in the following section. Next, the requirements for this demo are described and the guided demo is detailed across the different steps.
+ The following HATCHet2 demo represents a guided example of the custom pipeline designed to start from the data produced by the [GATK4 CNV pipeline](https://software.broadinstitute.org/gatk/best-practices/workflow?id=11147). This custom pipeline considers one or more tumor samples from the same patient which have been segmented through the GATK4 CNV pipeline, such that for each sample a **segmented file** is available. The expected format of each segmented file is first described in the following section. Next, the requirements for this demo are described and the guided demo is detailed across the different steps.
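As an aside, the polyglot trick in the header line above can be sketched in isolation: `:` is the shell no-op, and a quoted heredoc delimiter makes bash skip the prose up to the delimiter line, which in the demo is the Markdown fence itself. The sketch below is illustrative and not part of the commit:

```shell
# Minimal sketch of the no-op heredoc trick used by the demo file:
# ':' ignores its stdin, and the quoted delimiter disables expansion,
# so bash skips everything up to the delimiter line.
: <<'END_OF_PROSE'
This text is Markdown prose as far as bash is concerned.
END_OF_PROSE
echo "execution resumes here"
```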

## Input format

@@ -27,14 +27,14 @@ Two example segmented files in this format for two tumor samples from the same p

## Requirements and setup

- The demo requires that HATCHet has been succesfully compiled and all the dependencies are available and functional. As such, the demo requires the user to properly set up the following paths:
+ The demo requires that HATCHet2 has been successfully compiled and all the dependencies are available and functional. As such, the demo requires the user to properly set up the following paths:

```shell
PY="python3" # This is the full path to the version of PYTHON3 which contains the required modules. When this corresponds to the standard version, the user can keep the given value of `python3`
:<<'```shell' # Ignore this line
```
- The following paths are consequently obtained to point to the required components of HATCHet
+ The following paths are consequently obtained to point to the required components of HATCHet2
```shell
CLUSTERBINS="${PY} -m hatchet cluster-bins"
```
@@ -55,7 +55,7 @@ PS4='[\t]'
## Generating input BB file
- The first step of this custom pipeline aims to generate an input BB file for HATCHet starting from the given segmented files; in this case, we consider the two examples included with this demo `sample1.GATK4.CNV.seg` and `sample2.GATK4.CNV.seg`. The corresponding BB file can be easily obtained by using the custom python script [gatk4cnsToBB.py](gatk4cnsToBB.py) included in the custom pipeline. We apply the script by specifiying the two segmented files in a white-sperated list between apices and specifying the names of the samples in the same order with `--samples`. In addition, we consider the default values of the parameters and we run it as follows:
+ The first step of this custom pipeline aims to generate an input BB file for HATCHet2 starting from the given segmented files; in this case, we consider the two examples included with this demo, `sample1.GATK4.CNV.seg` and `sample2.GATK4.CNV.seg`. The corresponding BB file can be easily obtained by using the custom Python script [gatk4cnsToBB.py](gatk4cnsToBB.py) included in the custom pipeline. We apply the script by specifying the two segmented files in a white-space-separated list between quotes and specifying the names of the samples in the same order with `--samples`. In addition, we consider the default values of the parameters and we run it as follows:
```shell
${GATK4CNSTOBB} "sample1.GATK4.CNV.seg sample2.GATK4.CNV.seg" --samples "Sample1 Sample2" > samples.GATK4.CNV.bb
```
@@ -66,14 +66,14 @@ In addition, one could consider different size of the resulting bins by using th
## Global clustering
- Having the input BB file, we can continue by executing the standard HATCHet pipeline and skipping the pre-processing steps (`count-reads`, `count-alleles`, and `combine-counts`). As such, the next main step of the demo performs the global clustering of HATCHet where genomic bins which have the same copy-number state in every tumor clone are clustered correspondingly. To do this, we use `cluster-bins`, i.e. the HATCHet's component designed for this purpose. At first, we attempt to run the clustering using the default values of the parameters as follows:
+ Having the input BB file, we can continue by executing the standard HATCHet2 pipeline and skipping the pre-processing steps (`count-reads`, `count-alleles`, and `combine-counts`). As such, the next main step of the demo performs the global clustering of HATCHet2, where genomic bins which have the same copy-number state in every tumor clone are clustered correspondingly. To do this, we use `cluster-bins`, i.e. the HATCHet2 component designed for this purpose. At first, we attempt to run the clustering using the default values of the parameters as follows:
```shell
${CLUSTERBINS} samples.GATK4.CNV.bb -o samples.GATK4.CNV.seg -O samples.GATK4.CNV.bbc -e 12 -tB 0.03 -tR 0.15 -d 0.08
:<<'```shell' # Ignore this line
```
- To assess the quality of the clustering we generate the cluster plot using the `CBB` command of `plot-bins`, i.e. the HATCHet's component designed for the analysis of the data. For simplicity, we also use the following option `-tS 0.001` which asks to plot only the clusters which cover at least the `0.1%` of the genome. This is useful to clean the figure and focus on the main components.
+ To assess the quality of the clustering we generate the cluster plot using the `CBB` command of `plot-bins`, i.e. the HATCHet2 component designed for the analysis of the data. For simplicity, we also use the option `-tS 0.001`, which asks to plot only the clusters that cover at least `0.1%` of the genome. This is useful to clean the figure and focus on the main components.
```shell
${PLOTBINS} -c CBB samples.GATK4.CNV.bbc -tS 0.001
```
@@ -88,8 +88,8 @@ We can easily notice that the clustering is good and not tuning is needed as eve
## hatchet's step
- Next we apply `hatchet`, i.e. the component of HATCHet which estimates fractional copy numbers, infers allele-and-clone specific copy numbers, and jointly predicts the number of clones (including the normal clone) and the presence of a WGD.
- We apply the last step with default parameters and, for simplicity of this demo, we consider 6 clones, which can be easily considered by HATCHet in this case, and we only consider 100 restarts for the coordinate-descent method; these are the number of attempts to find the best solution. This number is sufficient in this small example but we reccommend to use at least 400 restarts in standard runs.
+ Next we apply `hatchet`, i.e. the component of HATCHet2 which estimates fractional copy numbers, infers allele-and-clone specific copy numbers, and jointly predicts the number of clones (including the normal clone) and the presence of a WGD.
+ We apply the last step with default parameters and, for simplicity of this demo, we consider 6 clones, which can be easily considered by HATCHet2 in this case, and we only consider 100 restarts for the coordinate-descent method; these are the number of attempts to find the best solution. This number is sufficient in this small example but we recommend using at least 400 restarts in standard runs.
```shell
${INFER} -i samples.GATK4.CNV -n2,6 -p 100 -v 2 -u 0.03 -r 12 -eD 6 -eT 12 -l 0.6 |& tee hatchet.log
```
@@ -117,11 +117,11 @@ We obtain the following summary of results:
## The related-tetraploid resulting files are copied to ./chosen.tetraploid.bbc.ucn and ./chosen.tetraploid.seg.ucn
# The chosen solution is diploid with 3 clones and is written in ./best.bbc.ucn and ./best.seg.ucn
- HATCHet predicts the presence of 3 clones in the 2 tumor samples and, especially, predicts that a sample contains two distinct tumor clones, according to the true clonal composition, and one of these clones is shared with the other sample.
+ HATCHet2 predicts the presence of 3 clones in the 2 tumor samples and, in particular, predicts that one sample contains two distinct tumor clones, in agreement with the true clonal composition, with one of these clones shared with the other sample.
## Analyzing inferred results
- Finally, we obtain useful plots to summarize and analyze the inferred results by using `plot-cn`, which is the last component of HATCHet. We run `plot-cn` as follows
+ Finally, we obtain useful plots to summarize and analyze the inferred results by using `plot-cn`, which is the last component of HATCHet2. We run `plot-cn` as follows:
```shell
${PLOTCN} best.bbc.ucn
```
4 changes: 2 additions & 2 deletions custom/GATK4-CNV/gatk4cnsToBB.py
@@ -11,15 +11,15 @@
def parse_args():
description = (
"This method takes in input multiple samples from the same patient, where each sample is a "
"segmented CNV file produced by GATK4 CNV pipeline, and produces a BB input file for HATCHet."
"segmented CNV file produced by GATK4 CNV pipeline, and produces a BB input file for HATCHet2."
)
parser = argparse.ArgumentParser(description=description)
parser.add_argument(
"INPUT",
type=str,
help=(
"A white-space-separated list between apices where each element is a segmented CNV file produced by "
"GATK4 CNV pipeline. The file format is describe in the HATCHet's repository."
"GATK4 CNV pipeline. The file format is describe in the HATCHet2's repository."
),
)
parser.add_argument(
4 changes: 2 additions & 2 deletions docs/buildDocs.sh
@@ -27,9 +27,9 @@ git checkout -b gh-pages

# Add README
cat > README.md <<EOF
- # HATCHet
+ # HATCHet2
- HATCHet documentation is available at [http://compbio.cs.brown.edu/hatchet/](http://compbio.cs.brown.edu/hatchet/)
+ HATCHet2 documentation is available at [http://compbio.cs.brown.edu/hatchet/](http://compbio.cs.brown.edu/hatchet/)
EOF
