Skip to content

take transcript GTF as input and return transcripts fasta sequence extracted from MAF file

Notifications You must be signed in to change notification settings

bioinfocao/pymaf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Take known or novel transcripts GTF as input,
Return transcripts' fasta sequence extracted from multiple species whole genome alignments MAF file (such as UCSC multiz100way)

Requirements: ucsc-mafsinregion. Download and install as:
http://hgdownload.cse.ucsc.edu/admin/exe/
https://anaconda.org/bioconda/ucsc-mafsinregion

    Usage: python GTF2Fasta.py
    -i inout known/novel GTF file
    -m maf files directory # multiz MAF files dir (chr1.maf.gz ... chrY.maf.gz)
    -n yes/no filterChrNNN # limit to chrN,chrNN,chrNNN. Any chromosome name longer than 6 letters are skipped. default is yes.
    -z yes/no # the MAF file in zipped format (as *.maf.gz this is default) file or in unzipped foramt .maf (no)
    -o output_fasta_dir # the dir for transcripts fasta output, if omitted create extracted_fasta_files folder under input GTF file.
    -t transcript_catMAF_dir # if not specified will put in output_fasta_dir folder
    -f yes/no # make fresh transcript_catMAF file? if no, will skip maf file that already exist (size>1). default is no.
    -s mm10,.../all/multiz60way29mammal/multiz100way58mammal # selected species names, default is keep all species.
       multiz60way29mammal means 29 mammals, multiz100way58mammal means 58 mammals(see below), all means keep all species, which is default.
       Please note that not all species' data will exist for every maf file
    -c transcripts_gtf_to_fasta_cmds.sh # output .sh cmd file, without this argument cmds will be excuate automatically.
    -g yes/no # only keep MAF files that is at least have another two species' fasta sequence longer than 1/10 of the frist(reference) species, default is yes.
    -d yes/no # delete temporary nonoverlapExons_bedFiles_ files, default is yes
    -v 0/1/2 # verbose level default is 1, 0 means nothing, 2 is detail.
    Example:
    python GTF2Fasta.py -i hg38_RefSeq.gtf -m ~/data/refgenomes/multiz100way_hg38 -o ~/data/hg38_transcript_catMAF_fasta/ -c hg38_RefSeq_gtf_to_fasta_cmds.sh

About

take transcript GTF as input and return transcripts fasta sequence extracted from MAF file

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages