Building Custom Genome Instructions #449

imk1 · 2024-09-16T20:05:27Z

I have been using the pipeline with singularity for data coming from human and mouse, and I would now like to use it for data coming from a custom species. How do I build a custom genome with the latest version of the pipeline? Thanks so much!

JWJ13164328557 · 2025-02-18T00:49:24Z

How to build genome database
Install Conda.

Install the pipeline's Conda environment.

$ bash scripts/uninstall_conda_env.sh # to remove any existing pipeline env
$ bash scripts/install_conda_env.sh
Choose GENOME from hg19, hg38, mm9 and mm10 and specify a destination directory. This will take several hours. We recommend not running this installer on a login node of your cluster. It needs >8GB of memory and >2h of run time.

$ conda activate encd-atac
$ bash scripts/build_genome_data.sh [GENOME] [DESTINATION_DIR]
Find a TSV file in the destination directory and use it for "atac.genome_tsv" in your input JSON.
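A minimal sketch of picking up the generated TSV and printing the line it would occupy in an input JSON. The helper names find_genome_tsv and print_genome_tsv_entry are hypothetical, and the assumption that the TSV sits at the top level of the destination directory is mine, not something stated by the pipeline; adjust the search depth if your layout differs.

```shell
# Hypothetical helper: print the first *.tsv found at the top level of the
# destination directory (assumed layout, not confirmed by the pipeline docs).
find_genome_tsv() {
  find "$1" -maxdepth 1 -type f -name '*.tsv' | sort | head -n 1
}

# Print the "atac.genome_tsv" entry for the input JSON.
# Other keys that the input JSON needs are intentionally omitted.
print_genome_tsv_entry() {
  genome_tsv_path="$(find_genome_tsv "$1")"
  if [ -z "$genome_tsv_path" ]; then
    echo "no TSV found in $1" >&2
    return 1
  fi
  printf '"atac.genome_tsv": "%s"\n' "$genome_tsv_path"
}
```

Usage would look like: print_genome_tsv_entry [DESTINATION_DIR]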

How to build genome database for your own genome
You can build your own genome database if your reference genome has one of the following file types.

.fasta.gz
.fa.gz
.fasta.bz2
.fa.bz2
.2bit
Get a URL for your reference genome. You may need to upload it to somewhere on the internet.

Get a URL for a gzipped blacklist BED file for your genome. If you don't have one then skip this step. An example blacklist for hg38 is here.

Find the following lines in scripts/build_genome_data.sh and modify them as follows. Give a good name, [YOUR_OWN_GENOME], to your genome. For MITO_CHR_NAME, use the correct mitochondrial chromosome name of your genome (e.g. chrM or MT). For REGEX_BFILT_PEAK_CHR_NAME, a Perl-style regular expression must be used to keep only regular chromosome names in the blacklist-filtered (.bfilt.) peak files. These .bfilt. peak files are considered the final peak output of the pipeline, and the peak BED files for genome browser tracks (.bigBed and .hammock.gz) are converted from these .bfilt. peak files. Chromosome name filtering with REGEX_BFILT_PEAK_CHR_NAME will be done even without the blacklist itself.

...

elif [[ $GENOME == "YOUR_OWN_GENOME" ]]; then
  # Perl-style regular expression to keep regular chromosomes only.
  # This regex is applied to peaks after blacklist filtering (b-filt) with "grep -P",
  # so that the b-filt peak file (.bfilt.*Peak.gz) only has chromosomes matching this pattern.
  # This regex works even without a blacklist;
  # you will still be able to find a .bfilt. peak file.
  REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+"
  # Mitochondrial chromosome name (e.g. chrM, MT)
  MITO_CHR_NAME="chrM"
  # URL for your reference FASTA (fasta, fasta.gz, fa, fa.gz, 2bit)
  REF_FA="https://some.where.com/your.genome.fa.gz"
  # 3-col blacklist BED file to filter out overlapping peaks from the b-filt peak file (.bfilt.*Peak.gz file).
  # Leave it empty if you don't have one.
  BLACKLIST=
...
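To get a feel for what the chromosome-name pattern keeps, here is a small sketch that runs the same pattern through grep -P on a few made-up peak lines. The peak lines and the function name filter_regular_chr are hypothetical, and anchoring the pattern to the whole first column is an assumption of this sketch; the pipeline's exact grep invocation may differ.

```shell
# Pattern copied from the snippet above.
REGEX_BFILT_PEAK_CHR_NAME="chr[\dXY]+"

# Keep only lines whose first tab-separated column fully matches the pattern.
# The ^(...)\t anchoring is an assumption of this sketch, not necessarily
# what the pipeline does. Requires a grep built with -P (PCRE) support.
filter_regular_chr() {
  grep -P "^(${REGEX_BFILT_PEAK_CHR_NAME})\t"
}

# Hypothetical 3-column peak lines, for illustration only.
kept="$(printf 'chr1\t100\t200\nchrX\t5\t50\nchrM\t1\t9\nchrUn_gl000220\t10\t20\n' | filter_regular_chr)"
printf '%s\n' "$kept"
```

With this pattern the chr1 and chrX lines pass, while chrM and chrUn_gl000220 are dropped. A custom genome whose chromosomes are named differently (for example without the chr prefix) would need its own pattern.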
Specify a destination directory for your genome database and run the installer. This will take several hours.

$ bash scripts/build_genome_data.sh [YOUR_OWN_GENOME] [DESTINATION_DIR]
Find a TSV file in the destination directory and use it for "atac.genome_tsv" in your input JSON.
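For a custom species, one thing that may be worth checking is whether the MITO_CHR_NAME you set actually appears among the sequence names of your reference FASTA. The sketch below is not part of the pipeline; list_fasta_names and has_sequence are hypothetical helpers, and they assume a local, plain or gzip-compressed FASTA (download REF_FA first, and convert .2bit or .bz2 inputs by other means).

```shell
# List sequence names (text after '>' up to the first whitespace) from a
# FASTA file that is either plain text or gzip-compressed.
list_fasta_names() {
  gzip -cdf "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

# Succeed if the given name exactly matches one of the FASTA sequence names.
has_sequence() {
  list_fasta_names "$1" | grep -Fx -- "$2" >/dev/null
}
```

Usage would look like: has_sequence your.genome.fa.gz "$MITO_CHR_NAME" || echo "MITO_CHR_NAME not found in FASTA" >&2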
