Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about readLength parameter during reference creation and rMATS analysis #445

Open
Motoufiq opened this issue Oct 27, 2024 · 2 comments

Comments

@Motoufiq
Copy link

Motoufiq commented Oct 27, 2024

Hi,

I’m currently working on the RNA-Seq splicing analysis and have a question about parameter used to build reference STAR (https://github.com/alexdobin/STAR/tree/master) using GENCODE genome (https://www.gencodegenes.org/human/) which then would be related to input parameter for rMATS analysis. My paired end fastq files are 151 bp read length. As this is short read RNA-Seq splicing analysis with 151 bp, I was wondering if I should input --sjdboverhang 150 parameter was to build this STAR reference. NOTE: --sjdboverhang in the STAR specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database (in my case the PE read length is 151 bp), then my --sjdboverhang 150 (ReadLength-1: 151-1 = 150)

--runMode genomeGenerate
--genomeDir /path/to/genomeDir
--genomeFastaFiles /path/to/genome/fasta
--sjdbGTFfile /path/to/annotations.gtf
--sjdbOverhang ReadLength-1

I want to set the below paramater in rMATS:

--readLength READLENGTH
The length of each read. Required parameter, with the value set according to the RNA-seq read length

It seems It’s quite common for Illumina sequencers to produce paired-end reads with a read length of 151 bp instead of the expected 150 bp. Is it OK to build STAR index --sjdboverhang 150, and does rMATS take input of read length of 151 base pair? or to build STAR index --sjdboverhang 149 instead and rMATS take input of read length of 150 bp ?

Additionally, if I don't input --sjdboverhang paramter, while building STAR INDEX, can I run RNA-Seq splicing analysis, and splicing analysis makes sense?

Thank you,
Toufiq

@EricKutschera
Copy link
Contributor

If you have 151bp reads then ideally you would create the STAR reference with --sjdbOverhang 150 and then run rMATS with --readLength 151. From the STAR manual, the exact value of --sjdbOverhang doesn't seem to matter too much: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value

For rMATS, --readLength needs to be the actual read length or the reads will get filtered out. However if you use --variable-read-length then the exact value of --readLength doesn't matter too much. This post has details about --readLength in rMATS: #83 (comment)

@Motoufiq
Copy link
Author

@EricKutschera thank you for the response.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants