Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set bwa-mem2 batch size for reproducibility #21

Merged
merged 1 commit into from
May 31, 2024
Merged

Conversation

scwatts
Copy link
Collaborator

@scwatts scwatts commented May 3, 2024

  • improve reproducibility for bwa-mem2 alignments by setting -K
  • this ensures that the number of input bases in each batch is consistent
  • the parameter value was selected to match configuration used in Sarek

Parameter value set to match Sarek default configuration.
@scwatts scwatts requested a review from charlesshale May 3, 2024 06:28
Copy link

github-actions bot commented May 3, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 010c5cd

+| ✅ 155 tests passed       |+
#| ❔   6 tests were ignored |#
!| ❗  57 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • nextflow_config - Config manifest.version should end in dev: 0.3.1
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in output.md: Write this documentation describing your workflow's output
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • schema_params - Schema param panel not found from nextflow config
  • schema_params - Schema param genome_version not found from nextflow config
  • schema_params - Schema param genome_type not found from nextflow config
  • schema_params - Schema param ref_data_hmf_data_path not found from nextflow config
  • schema_params - Schema param ref_data_panel_data_path not found from nextflow config
  • schema_params - Schema param ref_data_virusbreakenddb_path not found from nextflow config
  • schema_params - Schema param ref_data_hla_slice_bed not found from nextflow config
  • system_exit - System.exit in main.nf: System.exit(1) [line 44]
  • system_exit - System.exit in main.nf: System.exit(1) [line 45]
  • system_exit - System.exit in Processes.groovy: System.exit(1) [line 33]
  • system_exit - System.exit in Processes.groovy: System.exit(1) [line 49]
  • system_exit - System.exit in WorkflowOncoanalyser.groovy: System.exit(1) [line 62]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 29]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 39]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 47]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 55]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 63]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 68]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 84]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 89]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 108]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 113]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 121]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 182]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 264]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 276]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 284]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 291]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 299]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 314]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 335]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 346]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 355]
  • system_exit - System.exit in Utils.groovy: System.exit(1) [line 385]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 118]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 125]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 132]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 146]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 156]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 161]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 174]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 190]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 199]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 210]
  • system_exit - System.exit in WorkflowMain.groovy: System.exit(1) [line 220]

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.13.1
  • Run at 2024-05-03 06:29:58

@scwatts
Copy link
Collaborator Author

scwatts commented May 7, 2024

I've undertaken a brief investigation of the impact that batch size has on output BAMs to better understand the need to set a consistent batch size (i.e. applying the -K option).

For further background, by default bwa-mem2 loads some number of reads into memory (known as a batch) so that the input data can be processed in manageable chunks. The number of reads loaded into memory is determined by a batch size parameter, which by default scales with the number of threads used. It has been reported that running bwa-mem2 on the same input data with different thread counts (and hence batch size) can introduce discrepancies in the output BAMs. However, using the -K option to set a constant batch size is said to ameliorate these discrepancies when re-running bwa-mem2 on the same data even when setting different thread counts.

To investigate and explore this at a high-level I have done the following:

  • simulated 500,000 reads from the GRCh38_hmf genome
  • aligned with bwa-mem2 as we do in oncoanalyser but varying thread count and running with or without -K
  • compared at a high-level with checksums of resulting BAMs

I found BAMs with different checksums were generated when varying thread count without setting a constant batch size (Table 1). When applying a constant batch size, checksums of all BAMs were identical regardless of thread count. The BAMs did not differ within replicates, ruling out any other confounding effect.

Table 1. MD5 checksums of BAMs generated from the same input data with varying thread counts and -K

batch_size threads iteration md5_chksm
default 4 1 b93eaa49ade96ef6f04daace9d383f96
default 4 2 b93eaa49ade96ef6f04daace9d383f96
default 8 1 e103d1007a18af7deebf7d50effacafe
default 8 2 e103d1007a18af7deebf7d50effacafe
100000000 4 1 29f237e841d4adb51d42bce163941acf
100000000 4 2 29f237e841d4adb51d42bce163941acf
100000000 8 1 29f237e841d4adb51d42bce163941acf
100000000 8 2 29f237e841d4adb51d42bce163941acf

Given that -K reduces variability in output BAMs and improves reproducibility, it is clear that this option is beneficial in that regard.

One important consideration is that if a batch size is set, it must be reasonably done as not to be so small that inefficiencies are introduced nor too great that relatively large amounts of memory are used for low thread counts. From observations made during the above investigation, a batch size of 100,000,000 appears to fit well for our use case.

Commands used to generate data (click to expand)

Download required data

mkdir -p reference_data/

curl https://pub-cf6ba01919994c3cbd354659947f74d8.r2.dev/genomes/GRCh38_hmf/24.1/bwa_index/2.2.1.tar.gz | tar --strip-components 1 -xzvf - -C reference_data/
wget -P reference_data/ https://pub-cf6ba01919994c3cbd354659947f74d8.r2.dev/genomes/GRCh38_hmf/24.0/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

Init work directory

mkdir -p 1_reads/ 2_alignments/

Simulate reads

wgsim -S 0 -1 151 -2 151 -N 500000 reference_data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna 1_reads/sim.500_000.R1.fastq 1_reads/sim.500_000.R2.fastq 1>/dev/null

Generate read alignments

align_reads() {
  reads_fwd=${1};
  reads_rev=${2};
  threads=${3};
  batch_size=${4};
  replicate=${5};

  reads_fwd_fn=${reads_fwd##*/};
  name=${reads_fwd_fn/.R1.fastq/};

  k_arg='';
  if [[ ${batch_size} != none ]]; then k_arg="-K ${batch_size}"; fi;

  output_fn=${name}.${threads}.${batch_size}.${replicate}.bam

  bwa-mem2 mem 2>2_alignments/${output_fn}.log \
    -Y \
    -t ${threads} \
    ${k_arg} \
    reference_data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    ${reads_fwd} \
    ${reads_rev} | \
    \
    sambamba view \
      --sam-input \
      --format bam \
      --compression-level 0 \
      --nthreads 1 \
      /dev/stdin | \
    \
    sambamba sort \
      --nthreads 1 \
      --out 2_alignments/${output_fn} \
      /dev/stdin
}

align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 none 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 none 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 none 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 none 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 100000000 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 100000000 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 100000000 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 100000000 2

Compare output BAMs

parallel 'echo {} $(samtools view {} | md5sum)' ::: *500_000*bam | sort -nk2,2 | column -t

# sim.500_000.4.none.1.bam       b93eaa49ade96ef6f04daace9d383f96  -
# sim.500_000.4.none.2.bam       b93eaa49ade96ef6f04daace9d383f96  -
# sim.500_000.8.none.1.bam       e103d1007a18af7deebf7d50effacafe  -
# sim.500_000.8.none.2.bam       e103d1007a18af7deebf7d50effacafe  -
# sim.500_000.4.100000000.1.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.4.100000000.2.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.8.100000000.1.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.8.100000000.2.bam  29f237e841d4adb51d42bce163941acf  -

/cc @charlesshale

@scwatts scwatts merged commit ce2bf00 into dev May 31, 2024
4 checks passed
@maxulysse
Copy link
Member

@scwatts I would recommend using the nf-core modules for it and setting up the -k with args

@scwatts scwatts deleted the bwamem2-input-batch-size branch May 31, 2024 10:08
@scwatts
Copy link
Collaborator Author

scwatts commented May 31, 2024

Thanks @maxulysse, I'll take a look at doing this next week.

We're using a fairly different approach to sorting/compressing and read group assignment, and the nf-core bwa-mem2/mem module would need to be heavily patched to maintain that functionality - I have been a bit hesitant to apply that much patching to any nf-core module since I felt that it defeated their purpose, opting to use local modules in those cases instead. Did you have any thoughts around that?

I totally agree with implementing this via ext.args

@maxulysse
Copy link
Member

oh, I see you have added sambamba within the same process, haven't looked at the full module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants