Set bwa-mem2 batch size for reproducibility #21
Conversation
Parameter value set to match Sarek default configuration.
I've undertaken a brief investigation of the impact that batch size has on output BAMs to better understand the need to set a consistent batch size (i.e. applying the `-K` argument).

For further background: by default, bwa-mem2 loads some number of reads into memory (known as a batch) so that the input data can be processed in manageable chunks. The number of reads loaded into memory is determined by a batch size parameter, which by default scales with the number of threads used. It has been reported that running bwa-mem2 on the same input data with different thread counts (and hence different batch sizes) can introduce discrepancies in the output BAMs; setting a fixed batch size with `-K` is reported to remove these discrepancies.

To investigate and explore this at a high level, I:
* simulated reads from the GRCh38 reference,
* aligned them with bwa-mem2 at different thread counts (4 and 8), with and without a fixed batch size, running each configuration in duplicate, and
* compared MD5 checksums of the resulting BAM records.
I found that BAMs with different checksums were generated when varying thread count without setting a constant batch size (Table 1). When applying a constant batch size, checksums of all BAMs were identical regardless of thread count. The BAMs did not differ within replicates, ruling out other confounding effects.

Table 1. MD5 checksums of BAMs generated from the same input data with varying thread counts and batch sizes.

| Threads | Batch size | Replicate | MD5 checksum |
|---|---|---|---|
| 4 | default | 1 | b93eaa49ade96ef6f04daace9d383f96 |
| 4 | default | 2 | b93eaa49ade96ef6f04daace9d383f96 |
| 8 | default | 1 | e103d1007a18af7deebf7d50effacafe |
| 8 | default | 2 | e103d1007a18af7deebf7d50effacafe |
| 4 | 100,000,000 | 1 | 29f237e841d4adb51d42bce163941acf |
| 4 | 100,000,000 | 2 | 29f237e841d4adb51d42bce163941acf |
| 8 | 100,000,000 | 1 | 29f237e841d4adb51d42bce163941acf |
| 8 | 100,000,000 | 2 | 29f237e841d4adb51d42bce163941acf |
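For intuition on why the default batching is thread-dependent, here is a rough sketch of the arithmetic. Note the 10,000,000 bases-per-thread figure is an assumption carried over from bwa mem's default chunk size, not a value measured from bwa-mem2:

```shell
# Sketch: without -K, the effective batch is roughly
# per-thread chunk size * thread count, so reads are grouped
# differently at -t 4 vs -t 8 (which can change alignment decisions).
# 10000000 bases/thread is an assumption (bwa mem's default chunk size);
# -K replaces this product with a fixed total.
bases_per_thread=10000000
for threads in 4 8; do
  echo "threads=${threads} default_batch=$((bases_per_thread * threads))"
done
# → threads=4 default_batch=40000000
# → threads=8 default_batch=80000000
```

With `-K 100000000` the batch is pinned to 100,000,000 bases regardless of `-t`, which is what makes the output deterministic across thread counts.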
Given that a fixed batch size yields identical output across thread counts, setting one seems clearly worthwhile. One important consideration is that the value must be chosen so that it is neither so small that inefficiencies are introduced nor so large that relatively large amounts of memory are used at low thread counts. From observations made during the above investigation, a batch size of 100,000,000 appears to fit our use case well.

Commands used to generate data (click to expand)

Download required data:

```shell
mkdir -p reference_data/
curl https://pub-cf6ba01919994c3cbd354659947f74d8.r2.dev/genomes/GRCh38_hmf/24.1/bwa_index/2.2.1.tar.gz | tar --strip-components 1 -xzvf - -C reference_data/
wget -P reference_data/ https://pub-cf6ba01919994c3cbd354659947f74d8.r2.dev/genomes/GRCh38_hmf/24.0/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```

Init work directory:

```shell
mkdir -p 1_reads/ 2_alignments/
```

Simulate reads:

```shell
wgsim -S 0 -1 151 -2 151 -N 500000 reference_data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna 1_reads/sim.500_000.R1.fastq 1_reads/sim.500_000.R2.fastq 1>/dev/null
```

Generate read alignments:

```shell
align_reads() {
  reads_fwd=${1};
  reads_rev=${2};
  threads=${3};
  batch_size=${4};
  replicate=${5};
  reads_fwd_fn=${reads_fwd##*/};
  name=${reads_fwd_fn/.R1.fastq/};
  # Only pass -K when an explicit batch size is given
  k_arg='';
  if [[ ${batch_size} != none ]]; then k_arg="-K ${batch_size}"; fi;
  output_fn=${name}.${threads}.${batch_size}.${replicate}.bam
  bwa-mem2 mem 2>2_alignments/${output_fn}.log \
    -Y \
    -t ${threads} \
    ${k_arg} \
    reference_data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    ${reads_fwd} \
    ${reads_rev} | \
    \
  sambamba view \
    --sam-input \
    --format bam \
    --compression-level 0 \
    --nthreads 1 \
    /dev/stdin | \
    \
  sambamba sort \
    --nthreads 1 \
    --out 2_alignments/${output_fn} \
    /dev/stdin
}

align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 none 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 none 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 none 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 none 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 100000000 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 100000000 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 100000000 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 100000000 2
```

Compare output BAMs:

```shell
parallel 'echo {} $(samtools view {} | md5sum)' ::: *500_000*bam | sort -nk2,2 | column -t
# sim.500_000.4.none.1.bam       b93eaa49ade96ef6f04daace9d383f96  -
# sim.500_000.4.none.2.bam       b93eaa49ade96ef6f04daace9d383f96  -
# sim.500_000.8.none.1.bam       e103d1007a18af7deebf7d50effacafe  -
# sim.500_000.8.none.2.bam       e103d1007a18af7deebf7d50effacafe  -
# sim.500_000.4.100000000.1.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.4.100000000.2.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.8.100000000.1.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.8.100000000.2.bam  29f237e841d4adb51d42bce163941acf  -
```

/cc @charlesshale
@scwatts I would recommend using the nf-core module for it and setting `-K` via args
Thanks @maxulysse, I'll take a look at doing this next week. We're using a fairly different approach to sorting/compression and read group assignment, and the nf-core bwa-mem2/mem module would need to be heavily patched to maintain that functionality. I have been a bit hesitant to apply that much patching to any nf-core module since I felt it defeated their purpose, opting to use local modules in those cases instead. Did you have any thoughts around that? I totally agree with implementing this via args.
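For reference, a sketch of how the `-K` argument could be supplied through module args in the Nextflow config rather than by patching the module itself. The process selector name `BWAMEM2_MEM` is illustrative and would need to match the actual workflow; `-Y` is included to match the alignment command used in the investigation above:

```
process {
    withName: 'BWAMEM2_MEM' {
        // Fix the batch size so output is reproducible across thread counts
        ext.args = '-K 100000000 -Y'
    }
}
```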
oh, I see you have added sambamba within the same process, haven't looked at the full module.