TASK-6722 - Variant Walker to enable user defined variant analysis #2522

Merged: 68 commits, Dec 20, 2024
Commits
4b8dad2
storage: Add variant-walker tool #TASK-6722
j-coll Oct 8, 2024
9ea00eb
storage: Add STDERR to exception thrown. Fix max_bytes_per_map. #TASK…
j-coll Oct 10, 2024
7558a26
storage: Add status details when throwing exceptions. #TASK-6722
j-coll Oct 10, 2024
bc7c6ae
storage: Fix walker output file name #TASK-6722
j-coll Oct 11, 2024
ab4dff5
storage: Properly configure task java heap #TASK-6722
j-coll Oct 15, 2024
7af8020
storage: Run docker image prune on cleanup. #TASK-6722
j-coll Oct 16, 2024
c5375ea
storage: Ensure walker output is sorted. #TASK-6722
j-coll Oct 24, 2024
663c03a
storage: Extract walker STDERR file from MR execution. #TASK-6722
j-coll Oct 25, 2024
154befa
storage: Do not write multiple headers. #TASK-6722
j-coll Oct 25, 2024
85aac6d
storage: Fix NoSuchMethodError creating StopWatch. #TASK-6722
j-coll Oct 25, 2024
697b08b
storage: Ensure stderr file is moved from scratch dir. #TASK-6722
j-coll Oct 25, 2024
356567e
storage: Fix stderr sorting. #TASK-6722
j-coll Oct 25, 2024
6253da3
storage: Write `\n` after the json header #TASK-6722
j-coll Oct 25, 2024
5789628
storage: Do not interrupt header with empty records. #TASK-6722
j-coll Oct 29, 2024
4ff0655
storage: Add a custom Partitioner to ensure sorted data with multiple…
j-coll Oct 29, 2024
8268266
storage: Fix partitioner. #TASK-6722
j-coll Oct 29, 2024
4147d01
storage: Restart process when changing chromosome to ensure correct s…
j-coll Oct 29, 2024
7fd439a
storage: Fix GenomeHelper generateBootPreSplits. #TASK-6722
j-coll Oct 29, 2024
e6128b0
storage: Do not interrupt header with empty lines while concat. #TASK…
j-coll Oct 30, 2024
100fecf
storage: Replace ImmutableBytesWritable with VariantLocusKey as map o…
j-coll Oct 31, 2024
0df69dc
storage: Use VariantLocusKey and VariantLocusPartitioner in VariantEx…
j-coll Oct 31, 2024
f6fd3d4
storage: Fix VariantLocusKey serialization. #TASK-6722
j-coll Nov 1, 2024
fa3c9f2
storage: Fix "Request body is too large" #TASK-6722
j-coll Nov 4, 2024
b528c03
analysis: Do not try to close twice the same ERM. #TASK-6722
j-coll Nov 4, 2024
96e5679
storage: Do not use flush on outputstream. HADOOP-16548 #TASK-6722
j-coll Nov 7, 2024
bcd8185
storage: Add VariantExporterDirectMultipleOutputsMapper to ensure sor…
j-coll Nov 7, 2024
c4c3d3b
storage: Do not use reduce step on variant-walker. #TASK-6722
j-coll Nov 7, 2024
0100097
storage: Fix VariantRecordWriter bytes_written counter. #TASK-6722
j-coll Nov 7, 2024
b52ca27
storage: Reduce number of intermediate mapper files. #TASK-6722
j-coll Nov 8, 2024
ad3521e
storage: Use SNAPPY as intermediate compression algorithm. #TASK-6722
j-coll Nov 8, 2024
ab50d6e
storage: Disable flush on AbfsOutputStream. HADOOP-16548 #TASK-6722
j-coll Nov 11, 2024
212f8ce
storage: Centralize variantMapperJob initialization. #TASK-6722
j-coll Nov 11, 2024
2a39303
storage: Fix NoClassDefFoundError tephra. #TASK-7194 #TASK-6722
j-coll Nov 12, 2024
ae26598
storage: Fix NPE exporting from sampleindex. #TASK-6722
j-coll Nov 12, 2024
b000ec7
storage: Ensure variant-exports are sorted even from Phoenix. #TASK-6722
j-coll Nov 18, 2024
0a741d5
storage: Use HDFS to store intermediate MapReduce files. Concat local…
j-coll Nov 25, 2024
cd50a3c
storage: Improve MapReduceOutputFile concatMrOutputToLocal. #TASK-6722
j-coll Nov 25, 2024
d430391
storage: Increase mapreduce.task.timeout to 30min #TASK-6722
j-coll Nov 25, 2024
e35ee83
storage: Fix temporary mapreduce outdir. #TASK-6722
j-coll Nov 25, 2024
0c48603
storage: Do not double copy hdfs files #TASK-6722
j-coll Nov 26, 2024
ccf7438
storage: Use reducer to concat binary files #TASK-6722
j-coll Nov 26, 2024
f87686e
storage: Do not fail variant-walker if no output is produced. #TASK-6722
j-coll Nov 27, 2024
a389e10
storage: Split PhoenixInputSplits into smaller splits. #TASK-6722
j-coll Nov 27, 2024
f453090
storage: Improve log message. #TASK-6722
j-coll Nov 27, 2024
47535c1
storage: Add HadoopVariantWalkerTest. #TASK-6722
j-coll Nov 28, 2024
003e467
storage: Rename some variant-walker params. Add descriptions #TASK-6722
j-coll Nov 28, 2024
48e1592
storage: Fix NPE running SampleVariantStats #TASK-6722
j-coll Nov 28, 2024
1d86756
storage: Fix CustomPhoenixInputFormat generateSplit for first and las…
j-coll Nov 29, 2024
5141031
analysis: Fix NPE at relatedness tool. #TASK-6722
j-coll Nov 29, 2024
c48ce0a
Merge branch 'release-3.x.x' into TASK-6722
j-coll Nov 29, 2024
f2bc782
cicd: Upload tests logs as artifacts. Reduce action log size. #TASK-6722
j-coll Nov 29, 2024
dd684aa
storage: Fix NPE at CohortVariantStatsDriver. #TASK-6722
j-coll Nov 29, 2024
9795c6a
cicd: Fix NPE. #TASK-6722
j-coll Nov 29, 2024
923651c
storage: Fix AIOOBE SampleVariantStatsDriver #TASK-6722
j-coll Nov 29, 2024
90010ac
storage: Do not produce a .crc checksum file copying from hdfs. #TASK…
j-coll Nov 29, 2024
9f326d9
storage: Improve docker process failure. Do not close the stdin twice…
j-coll Nov 29, 2024
627e56a
storage: Fix AIOOBE SampleVariantStatsDriver #TASK-6722
j-coll Nov 29, 2024
98ce6f8
storage: Do not produce a .crc checksum file copying from hdfs. #TASK…
j-coll Nov 29, 2024
14c07d9
analysis: Do not use the scratchDir as intermediate folder for export…
j-coll Nov 29, 2024
050c1ee
storage: Improve collections usage in SampleVariantStatsDriver. #TASK…
j-coll Nov 29, 2024
a0c2a5f
analysis: Fix VariantAnalysisTest. #TASK-6722
j-coll Dec 2, 2024
3853c63
app: Regenerate cli. #TASK-6722
j-coll Dec 3, 2024
eb61609
storage: Fix junit tests. #TASK-6722
j-coll Dec 3, 2024
54acc28
cicd: Increase "Publish Test Report on GitHub" memory #TASK-6722
j-coll Dec 4, 2024
4e96492
core: Fix NumberFormatException from IOUtils. #TASK-6722
j-coll Dec 4, 2024
0851ea9
Merge branch 'release-3.x.x' into TASK-6722
j-coll Dec 19, 2024
852ffca
core: Remove unused method. #TASK-6722
j-coll Dec 19, 2024
005855f
storage: Do not add new abstract methods to VariantStorageEngine. #TA…
j-coll Dec 19, 2024
storage: Do not interrupt header with empty records. #TASK-6722
j-coll committed Oct 29, 2024
commit 5789628871a47ebadc502a48e5137aa18d8d283e
@@ -1,5 +1,6 @@
 package org.opencb.opencga.storage.hadoop.variant.mr;

+import org.apache.commons.lang.StringUtils;
 import org.apache.commons.logging.Log;
 import org.apache.commons.logging.LogFactory;
 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
@@ -44,8 +45,13 @@ protected void reduce(ImmutableBytesWritable key, Iterable<Text> values,
                     context.getCounter(VariantsTableMapReduceHelper.COUNTER_GROUP_NAME, "header_records").increment(1);
                 }
             } else {
-                // No more header, assume all header is written
-                headerWritten = true;
+                if (value.getLength() < 3 && StringUtils.isBlank(value.toString())) {
+                    context.getCounter(VariantsTableMapReduceHelper.COUNTER_GROUP_NAME, "stdout_records_empty").increment(1);
+                    // Do not interrupt header with empty records
+                } else {
+                    // No more header, assume all header is written
+                    headerWritten = true;
+                }
                 mos.write("stdout", key, value);
                 context.getCounter(VariantsTableMapReduceHelper.COUNTER_GROUP_NAME, "body_records").increment(1);
             }
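
The behaviour this hunk introduces can be illustrated outside Hadoop with a minimal sketch. Everything in it is hypothetical: the class name `HeaderAwareSplitter`, the in-memory lists standing in for the `mos` outputs, and in particular the `#` prefix test for header lines, since the condition guarding the header branch is not visible in the diff. `trim().isEmpty()` stands in for `StringUtils.isBlank`, and `String.length()` for `Text.getLength()` (which counts bytes), so edge cases may differ slightly from the real reducer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the reducer logic; not the OpenCGA class.
class HeaderAwareSplitter {
    boolean headerWritten = false;
    int emptyRecords = 0;
    final List<String> header = new ArrayList<>();
    final List<String> body = new ArrayList<>();

    // Assumption: header lines start with '#'. The real check is not shown in the diff.
    static boolean isHeaderLine(String value) {
        return value.startsWith("#");
    }

    void accept(String value) {
        if (!headerWritten && isHeaderLine(value)) {
            header.add(value);
        } else {
            if (value.length() < 3 && value.trim().isEmpty()) {
                // Blank record: count it, but do not mark the header as finished.
                emptyRecords++;
            } else {
                // First real body record: no more header is expected after this.
                headerWritten = true;
            }
            // As in the diff, the record is written to the body either way.
            body.add(value);
        }
    }
}
```

Feeding the records `"#h1"`, `""`, `"#h2"`, `"rec1"` in that order leaves both `#` lines in `header`: the blank record is counted and passed through to `body`, but it no longer closes the header as it would have before this commit.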