Set bwa-mem2 batch size for reproducibility #21
Conversation
Parameter value set to match Sarek default configuration.
I've undertaken a brief investigation of the impact that batch size has on output BAMs to better understand the need to set a consistent batch size (i.e. applying the `-K` argument).

For further background: by default, bwa-mem2 loads some number of reads into memory (known as a batch) so that the input data can be processed in manageable chunks. The number of reads loaded into memory is determined by a batch size parameter, which by default scales with the number of threads used. It has been reported that running bwa-mem2 on the same input data with different thread counts (and hence different batch sizes) can introduce discrepancies in the output BAMs; setting a fixed batch size with `-K` is reported to remove these discrepancies.

To investigate and explore this at a high level, I:
* simulated reads from the GRCh38 reference,
* aligned them with bwa-mem2 at different thread counts (4 and 8), with and without a fixed batch size, running each configuration in duplicate, and
* compared MD5 checksums of the resulting BAM records.
I found that BAMs with different checksums were generated when varying thread count without setting a constant batch size (Table 1). When applying a constant batch size, checksums of all BAMs were identical regardless of thread count. The BAMs did not differ within replicates, ruling out other confounding effects.

Table 1. MD5 checksums of BAMs generated from the same input data with varying thread counts and batch sizes.

| Threads | Batch size | Replicate | MD5 checksum |
|---|---|---|---|
| 4 | default | 1 | b93eaa49ade96ef6f04daace9d383f96 |
| 4 | default | 2 | b93eaa49ade96ef6f04daace9d383f96 |
| 8 | default | 1 | e103d1007a18af7deebf7d50effacafe |
| 8 | default | 2 | e103d1007a18af7deebf7d50effacafe |
| 4 | 100,000,000 | 1 | 29f237e841d4adb51d42bce163941acf |
| 4 | 100,000,000 | 2 | 29f237e841d4adb51d42bce163941acf |
| 8 | 100,000,000 | 1 | 29f237e841d4adb51d42bce163941acf |
| 8 | 100,000,000 | 2 | 29f237e841d4adb51d42bce163941acf |
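For intuition on why the default batching is thread-dependent, here is a rough sketch of the arithmetic. Note the 10,000,000 bases-per-thread figure is an assumption carried over from bwa mem's default chunk size, not a value measured from bwa-mem2:

```shell
# Sketch: without -K, the effective batch is roughly
# per-thread chunk size * thread count, so reads are grouped
# differently at -t 4 vs -t 8 (which can change alignment decisions).
# 10000000 bases/thread is an assumption (bwa mem's default chunk size);
# -K replaces this product with a fixed total.
bases_per_thread=10000000
for threads in 4 8; do
  echo "threads=${threads} default_batch=$((bases_per_thread * threads))"
done
# → threads=4 default_batch=40000000
# → threads=8 default_batch=80000000
```

With `-K 100000000` the batch is pinned to 100,000,000 bases regardless of `-t`, which is what makes the output deterministic across thread counts.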
Given that a fixed batch size yields identical output across thread counts, setting one seems clearly worthwhile. One important consideration is that the value must be chosen so that it is neither so small that inefficiencies are introduced nor so large that relatively large amounts of memory are used at low thread counts. From observations made during the above investigation, a batch size of 100,000,000 appears to fit our use case well.

Commands used to generate data (click to expand)

Download required data:

```shell
mkdir -p reference_data/
curl https://pub-cf6ba01919994c3cbd354659947f74d8.r2.dev/genomes/GRCh38_hmf/24.1/bwa_index/2.2.1.tar.gz | tar --strip-components 1 -xzvf - -C reference_data/
wget -P reference_data/ https://pub-cf6ba01919994c3cbd354659947f74d8.r2.dev/genomes/GRCh38_hmf/24.0/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
```

Init work directory:

```shell
mkdir -p 1_reads/ 2_alignments/
```

Simulate reads:

```shell
wgsim -S 0 -1 151 -2 151 -N 500000 reference_data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna 1_reads/sim.500_000.R1.fastq 1_reads/sim.500_000.R2.fastq 1>/dev/null
```

Generate read alignments:

```shell
align_reads() {
  reads_fwd=${1};
  reads_rev=${2};
  threads=${3};
  batch_size=${4};
  replicate=${5};
  reads_fwd_fn=${reads_fwd##*/};
  name=${reads_fwd_fn/.R1.fastq/};
  # Only pass -K when an explicit batch size is given
  k_arg='';
  if [[ ${batch_size} != none ]]; then k_arg="-K ${batch_size}"; fi;
  output_fn=${name}.${threads}.${batch_size}.${replicate}.bam
  bwa-mem2 mem 2>2_alignments/${output_fn}.log \
    -Y \
    -t ${threads} \
    ${k_arg} \
    reference_data/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
    ${reads_fwd} \
    ${reads_rev} | \
    \
  sambamba view \
    --sam-input \
    --format bam \
    --compression-level 0 \
    --nthreads 1 \
    /dev/stdin | \
    \
  sambamba sort \
    --nthreads 1 \
    --out 2_alignments/${output_fn} \
    /dev/stdin
}

align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 none 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 none 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 none 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 none 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 100000000 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 4 100000000 2
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 100000000 1
align_reads 1_reads/sim.500_000.R{1,2}.fastq 8 100000000 2
```

Compare output BAMs:

```shell
parallel 'echo {} $(samtools view {} | md5sum)' ::: *500_000*bam | sort -nk2,2 | column -t
# sim.500_000.4.none.1.bam       b93eaa49ade96ef6f04daace9d383f96  -
# sim.500_000.4.none.2.bam       b93eaa49ade96ef6f04daace9d383f96  -
# sim.500_000.8.none.1.bam       e103d1007a18af7deebf7d50effacafe  -
# sim.500_000.8.none.2.bam       e103d1007a18af7deebf7d50effacafe  -
# sim.500_000.4.100000000.1.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.4.100000000.2.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.8.100000000.1.bam  29f237e841d4adb51d42bce163941acf  -
# sim.500_000.8.100000000.2.bam  29f237e841d4adb51d42bce163941acf  -
```

/cc @charlesshale
@scwatts I would recommend using the nf-core module for it and setting `-K` via args
Thanks @maxulysse, I'll take a look at doing this next week. We're using a fairly different approach to sorting/compression and read group assignment, and the nf-core bwa-mem2/mem module would need to be heavily patched to maintain that functionality. I have been a bit hesitant to apply that much patching to any nf-core module since I felt it defeated their purpose, opting to use local modules in those cases instead. Did you have any thoughts around that? I totally agree with implementing this via args.
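For reference, a sketch of how the `-K` argument could be supplied through module args in the Nextflow config rather than by patching the module itself. The process selector name `BWAMEM2_MEM` is illustrative and would need to match the actual workflow; `-Y` is included to match the alignment command used in the investigation above:

```
process {
    withName: 'BWAMEM2_MEM' {
        // Fix the batch size so output is reproducible across thread counts
        ext.args = '-K 100000000 -Y'
    }
}
```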
oh, I see you have added sambamba within the same process, haven't looked at the full module.