Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameters for featureCounts #35

Open
Dries-B opened this issue Jan 11, 2024 · 4 comments
Open

Parameters for featureCounts #35

Dries-B opened this issue Jan 11, 2024 · 4 comments

Comments

@Dries-B
Copy link

Dries-B commented Jan 11, 2024

Currently the rule featurecount looks as follows:

    threads: 4
    params:
        setting=lambda wildcards: "-Q 10 -B -p" if wildcards.seq_type == "pe" else ""
    shell:
        """
        mkdir -p {params.tmpdir}
        featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id -M  \
            {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
        """

I have some questions about this:

  • -M counts multi-mapping reads, but wouldn't it make more sense to use --primary instead, which only counts primary alignments?
  • -B only counts read pairs that have both ends aligned, but won't this risk losing the reads at the ends of contigs?
  • Wouldn't it make more sense to allow fragments to overlap multiple features by setting -O?


  • (I assume using -f for using feature-level instead of meta-feature level does not make a difference when analyzing Prodigal output.)
  • (I agree with the use of -p for counting fragments instead of reads, because otherwise single-read fragments would count less than paired-read fragments.)
  • (Less importantly: perhaps threads: 1 could be used, because this rule took very little time for me when running.)
@johnne
Copy link
Collaborator

johnne commented Jan 11, 2024

Hi Dries,
I haven't looked at this rule specifically in a long time. I think that when I first wrote it I had been working with metagenomic samples where a lot of reads mapped to more than one location and I didn't want to disregard these reads as it was likely that they were 'real' due to highly similar genomes in the samples.

You're right about the -B flag, it may have been added to increase the stringency somewhat.

Regarding -O: yes but also I think for a metagenomic setting we may in fact want to count reads mapping to entire contigs instead of the features defined in the prodigal GFF file. Then give features on contigs the abundances of reads mapped to entire contigs. This should be handled differently for metatranscriptomics though.

I agree that the settings for this rule could be made more configurable by the user. Easiest may be to simply add a featureCounts: section to the config file where all flags (except perhaps -p can be configured):

featureCounts:
  settings: "--primary -O"
rule featurecount:
    input:
        gff=results+"/annotation/{assembly}/final_contigs.features.gff",
        bam=results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}"+POSTPROCESS+".bam"
    output:
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.tsv",
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.tsv.summary"
    log:
        results+"/assembly/{assembly}/mapping/{sample}_{unit}_{seq_type}.fc.log"
    threads: 4
    params:
        tmpdir=config["paths"]["temp"],
        p=lambda wildcards: "-p" if wildcards.seq_type == "pe" else "",
        setting=config["featureCounts"]["settings"],
    resources:
        runtime=lambda wildcards, attempt: attempt**2*30
    conda:
        "../envs/quantify.yml"
    shell:
        """
        mkdir -p {params.tmpdir}
        featureCounts -a {input.gff} -o {output[0]} -t CDS -g gene_id {params.p} \
            {params.setting} -T {threads} --tmpDir {params.tmpdir} {input.bam} > {log} 2>&1
        """

@Dries-B
Copy link
Author

Dries-B commented Jan 15, 2024

Hi John,

Thanks for your reply! I should add that my questions are not necessarily proposals for improvements but rather 'assessments' of your opinion regarding changes in my repository, as well as sharing my understanding of the featureCounts documentation.

[Regarding -M] I think that when I first wrote it I had been working with metagenomic samples where a lot of reads mapped to more than one location and I didn't want to disregard these reads as it was likely that they were 'real' due to highly similar genomes in the samples.

Okay, I understand your reasoning there, although I think -M could risk duplicating counts compared to --primary.

I agree that the settings for this rule could be made more configurable by the user. Easiest may be to simply add a featureCounts: section to the config file where all flags (except perhaps -p can be configured)

Indeed this might improve the easy of use. This is not really necessary for my own use however, as I can modify the source code while looking at it.

@Dries-B
Copy link
Author

Dries-B commented Jan 15, 2024

Regarding -O: yes but also I think for a metagenomic setting we may in fact want to count reads mapping to entire contigs instead of the features defined in the prodigal GFF file. Then give features on contigs the abundances of reads mapped to entire contigs. This should be handled differently for metatranscriptomics though.

Okay, but for such a contig analyses, don't other simpler tools exist? (I myself am specifically interested in the Prodigal output.)

@johnne
Copy link
Collaborator

johnne commented Jan 15, 2024

Okay, but for such a contig analyses, don't other simpler tools exist? (I myself am specifically interested in the Prodigal output.)

Sure, something like pileup.sh for instance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants