Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create tools for SV VCF cleaning #8996

Draft
wants to merge 62 commits into
base: master
Choose a base branch
from
Draft
Changes from 1 commit
Commits
Show all changes
62 commits
Select commit Hold shift + click to select a range
d8fb88c
Initial commit
kjaisingh Oct 9, 2024
0c72bd2
Silenced SV type processing
kjaisingh Oct 11, 2024
05501ce
Created initial commit for 1b
kjaisingh Oct 16, 2024
40295e1
Reformatting per GATK style guide
kjaisingh Oct 16, 2024
5cd8ea3
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
75f476b
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
8d31a61
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
12cc930
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
4ecd525
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/sv/S…
kjaisingh Oct 16, 2024
df375cc
PR feedback
kjaisingh Oct 17, 2024
6f34672
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Oct 17, 2024
2e28e13
WIP commit - prior to deprecating BED file input in 1b
kjaisingh Oct 17, 2024
f0c0e0f
Updated to no longer ingest BED file
kjaisingh Oct 18, 2024
50c4eaf
Cleaned up scripts...
kjaisingh Oct 18, 2024
74de1b0
SVCleanPt2 WIP
kjaisingh Oct 18, 2024
354857f
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Oct 18, 2024
74f0d73
Working version of SVCleanPt2
kjaisingh Oct 22, 2024
54f5f96
Code cleanup
kjaisingh Oct 22, 2024
18da350
Added sorting and better formatting of outputs
kjaisingh Oct 22, 2024
f7e14c6
Initial commit for CleanPt4
kjaisingh Oct 23, 2024
5f5a6cd
WIP
kjaisingh Oct 24, 2024
52a9049
WIP - up till CleanVcf5 (first task complete)
kjaisingh Oct 25, 2024
31f7032
Reformatting & restructuring
kjaisingh Oct 26, 2024
2b97b7b
Completed CleanVcf4 / implemented skeleton walker for CleanVcf5
kjaisingh Oct 28, 2024
45443f9
Revert SVCleanPt2 to use overlap buffer
kjaisingh Oct 30, 2024
7fbc3a2
Working implementation of SVCleanPt5
kjaisingh Oct 30, 2024
181b352
Modified param name for chrX/chrY
kjaisingh Oct 31, 2024
3eb5c3d
Changes to test
kjaisingh Oct 31, 2024
52daa21
Changes to test
kjaisingh Oct 31, 2024
d438bb9
CleanVcf4 added exit if not in range
kjaisingh Oct 31, 2024
9aaf2f1
Updated type of EV format field
kjaisingh Oct 31, 2024
1de3e2f
Minor changes - replaced EV type
kjaisingh Oct 31, 2024
4e353a0
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Nov 1, 2024
a57601f
Skip no-call genotypes
kjaisingh Nov 1, 2024
2f17eec
Changes post-debugging: modify .getEnd() to use SVLEN
kjaisingh Nov 4, 2024
4c60608
Undo use of getEnd()
kjaisingh Nov 5, 2024
56ecaeb
Furhter debugging: modified SVTYPE update & corresponding genotype as…
kjaisingh Nov 5, 2024
82ecea8
Handled no-call rewriting bug
kjaisingh Nov 5, 2024
d96d810
Modifications to large del/dup event logic
kjaisingh Nov 6, 2024
0fa7738
Modified conditions for biallelic filtering
kjaisingh Nov 6, 2024
a281d65
Modified filter to only use distinct pairs
kjaisingh Nov 6, 2024
f7215ee
Moved svtype modification to cleanpt4
kjaisingh Nov 7, 2024
383a01c
Reverted svtype changes
kjaisingh Nov 7, 2024
c9d1020
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Nov 12, 2024
ee16037
Overlap logic change to align with GATK-SV outputs
kjaisingh Nov 12, 2024
3f901f4
Modified overlap logic to only be unidirectional
kjaisingh Nov 12, 2024
e8ce4c5
Standardized variant overlap code
kjaisingh Nov 12, 2024
29dc96f
Modified overlap logic - minor bug
kjaisingh Nov 12, 2024
a68dcc6
Modified CleanPt4 imports
kjaisingh Nov 14, 2024
8ba660e
Minor change to remove redundant size/svtype check
kjaisingh Nov 18, 2024
657e601
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Nov 22, 2024
707ada6
Minor file creation
kjaisingh Nov 25, 2024
1e30b4b
New tools - modular implementations
kjaisingh Dec 11, 2024
726144d
Updated to include necessary imports only
kjaisingh Dec 11, 2024
0ab56eb
Updated header writing
kjaisingh Dec 13, 2024
8fbd27e
Added sex revisions for male GT
kjaisingh Jan 13, 2025
a40b193
Concatenated overlapping cnv tasks into one
kjaisingh Jan 17, 2025
63838fe
Merge branch 'master' into kj_sv_cleanvcf
kjaisingh Jan 17, 2025
e60fc9c
Bug-fixed to match results from 5-pass version
kjaisingh Jan 21, 2025
a4db400
Standardized overlap logic across OverlappingCnv task methods
kjaisingh Jan 21, 2025
7f9b22d
Used caching to improve runtime
kjaisingh Jan 29, 2025
f46e501
Minor structural changes to overlap code
kjaisingh Feb 6, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
WIP commit - prior to deprecating BED file input in 1b
kjaisingh committed Oct 17, 2024
commit 2e28e1355c621afe2f8e619c540c3376f0d1226e
Original file line number Diff line number Diff line change
@@ -155,18 +155,16 @@ public enum ComplexVariantSubtype {
public static final List<String> evValues = Arrays.asList(
null, "RD", "PE", "RD,PE", "SR", "RD,SR", "PE,SR", "RD,PE,SR"
);
public static final String ME = ":ME";
public static final String ME = "ME";
public static final String VAR_GQ = "varGQ";
public static final String MULTIALLELIC = "MULTIALLELIC";
public static final String UNRESOLVED = "UNRESOLVED";
public static final String HIGH_SR_BACKGROUND = "HIGH_SR_BACKGROUND";
public static final String BOTHSIDES_SUPPORT = "BOTHSIDES_SUPPORT";
public static final String END = "END";
public static final String REVISED_EVENT = "REVISED_EVENT";
public static final String RD_CN = "RD_CN";

// CleanPt1b
public static final String GT = "GT";
public static final String GQ = "GQ";
public static final String RD_GQ = "RD_GQ";
public static final String CNVS_DEFAULT_FILE = "multi.cnvs.txt";
public static final String BLANK_SAMPLES = "blanksample";
Original file line number Diff line number Diff line change
@@ -20,13 +20,16 @@
import org.broadinstitute.hellbender.cmdline.programgroups.StructuralVariantDiscoveryProgramGroup;
import org.broadinstitute.hellbender.engine.*;
import org.broadinstitute.hellbender.tools.spark.sv.utils.GATKSVVCFConstants;
import org.broadinstitute.hellbender.utils.MathUtils;
import org.broadinstitute.hellbender.utils.tsv.TableUtils;
import org.broadinstitute.hellbender.utils.tsv.TableReader;
import org.broadinstitute.hellbender.utils.variant.GATKSVVariantContextUtils;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Files;
import java.nio.file.Paths;

@@ -38,6 +41,7 @@
import java.util.HashSet;
import java.util.HashMap;
import java.util.stream.Collectors;
import java.util.function.Function;

/**
* Completes an initial series of cleaning steps for a VCF produced by the GATK-SV pipeline.
@@ -96,26 +100,21 @@ public final class SVCleanPt1a extends VariantWalker {
public static final String CHRY_LONG_NAME = "chrY";
public static final String FAIL_LIST_LONG_NAME = "fail-list";
public static final String PASS_LIST_LONG_NAME = "pass-list";
public static final String OUTPUT_SAMPLES_LIST_LONG_NAME = "sample-list";
public static final String OUTPUT_REVISED_EVENTS_LIST_LONG_NAME = "revised-list";

@Argument(
fullName = PED_FILE_LONG_NAME,
doc = "Sample PED file"
)
private GATKPath pedFile;

@Argument(
fullName = CHRX_LONG_NAME,
doc = "chrX column name"
doc = "chrX column name",
optional = true
)
private String chrX;
private String chrX = "chrX";

@Argument(
fullName = CHRY_LONG_NAME,
doc = "chrY column name"
doc = "chrY column name",
optional = true
)
private String chrY;
private String chrY = "chrY";

@Argument(
fullName = FAIL_LIST_LONG_NAME,
@@ -136,27 +135,18 @@ public final class SVCleanPt1a extends VariantWalker {
)
private GATKPath outputVcf;

@Argument(
fullName = OUTPUT_REVISED_EVENTS_LIST_LONG_NAME,
doc="Output list of revised genotyped events"
)
private GATKPath outputRevisedEventsList;

private VariantContextWriter vcfWriter;
private BufferedWriter revisedEventsWriter;

private Map<String, Integer> sampleSexMap;
private Set<String> failSet;
private Set<String> passSet;
private final Set<String> writtenRevisedEvents = new HashSet<>();
private final Set<String> revisedSet = new HashSet<>();

private static final int MIN_ALLOSOME_EVENT_SIZE = 5000;


@Override
public void onTraversalStart() {
// Read supporting files
sampleSexMap = readPedFile(pedFile);
failSet = readLastColumn(failList);
passSet = readLastColumn(passList);

@@ -174,30 +164,17 @@ public void onTraversalStart() {
newHeader.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.HIGH_SR_BACKGROUND, 0, VCFHeaderLineType.Flag, "High number of SR splits in background samples indicating messy region"));
newHeader.addMetaDataLine(new VCFFilterHeaderLine(GATKSVVCFConstants.UNRESOLVED, "Variant is unresolved"));
newHeader.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.BOTHSIDES_SUPPORT, 0, VCFHeaderLineType.Flag, "Variant has read-level support for both sides of breakpoint"));
newHeader.addMetaDataLine(new VCFInfoHeaderLine(GATKSVVCFConstants.REVISED_EVENT, 0, VCFHeaderLineType.Flag, "Variant has been revised due to a copy number mismatch"));

// Write header
vcfWriter = createVCFWriter(outputVcf);
vcfWriter.writeHeader(newHeader);

// Create output writers
try {
revisedEventsWriter = new BufferedWriter(new FileWriter(outputRevisedEventsList.toPath().toFile()));
} catch (IOException e) {
throw new RuntimeException("Error creating output file", e);
}
}

@Override
public void closeTool() {
try {
if (vcfWriter != null) {
vcfWriter.close();
}
if (revisedEventsWriter != null) {
revisedEventsWriter.close();
}
} catch (IOException e) {
throw new RuntimeException("Error closing output file", e);
if (vcfWriter != null) {
vcfWriter.close();
}
}

@@ -229,6 +206,7 @@ private void processVariant(final VariantContext variant, final VariantContextBu
processUnresolved(variant, builder);
processNoisyEvents(variant, builder);
processBothsidesSupportEvents(variant, builder);
processAllosomes(variant, builder);
}

private void processEVGenotype(final Genotype genotype, final GenotypeBuilder genotypeBuilder) {
@@ -243,7 +221,11 @@ private void processEVGenotype(final Genotype genotype, final GenotypeBuilder ge

private void processSVTypeGenotype(final VariantContext variant, final GenotypeBuilder genotypeBuilder) {
final String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, null);
if (svType != null && variant.getAlleles().stream().noneMatch(allele -> allele.getDisplayString().contains(GATKSVVCFConstants.ME))) {
boolean hasMobileElement = variant.getAlleles().stream()
.map(allele -> GATKSVVariantContextUtils.getSymbolicAlleleSymbols(allele))
.flatMap(Arrays::stream)
.anyMatch(symbol -> symbol.equals(GATKSVVCFConstants.ME));
if (svType != null && !hasMobileElement) {
List<Allele> newGenotypeAlleles = Arrays.asList(
variant.getReference(),
Allele.create("<" + svType + ">", false)
@@ -252,17 +234,18 @@ private void processSVTypeGenotype(final VariantContext variant, final GenotypeB
}
}

private void processAllosomesGenotype(VariantContext variant, Genotype genotype, GenotypeBuilder genotypeBuilder) {
private void processAllosomesGenotype(final VariantContext variant, final Genotype genotype, final GenotypeBuilder genotypeBuilder) {
final String chromosome = variant.getContig();
if (chromosome.equals(chrX) || chromosome.equals(chrY)) {
final boolean isY = chromosome.equals(chrY);
final String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, "");
if ((svType.equals(GATKSVVCFConstants.SYMB_ALT_STRING_DEL) || svType.equals(GATKSVVCFConstants.SYMB_ALT_STRING_DUP)) &&
(variant.getEnd() - variant.getStart() >= MIN_ALLOSOME_EVENT_SIZE)) {
final String sampleName = genotype.getSampleName();
final int sex = sampleSexMap.get(sampleName);
if (sex == 1 && isRevisableEvent(variant, isY)) { // Male
writeRevisedEvents(variant);
final boolean isY = chromosome.equals(chrY);
final int chrCN = (int) genotype.getExtendedAttribute(GATKSVVCFConstants.EXPECTED_COPY_NUMBER_FORMAT);
final int sex = chrCN == 1 ? 1 : 2;

if (sex == 1 && isRevisableEvent(variant, isY, sex)) { // Male
revisedSet.add(variant.getID());
adjustMaleGenotype(genotype, genotypeBuilder, svType);
} else if (sex == 2 && isY) { // Female
genotypeBuilder.alleles(Arrays.asList(Allele.NO_CALL, Allele.NO_CALL));
@@ -273,7 +256,7 @@ private void processAllosomesGenotype(VariantContext variant, Genotype genotype,
}
}

private void adjustMaleGenotype(Genotype genotype, GenotypeBuilder genotypeBuilder, String svType) {
private void adjustMaleGenotype(final Genotype genotype, final GenotypeBuilder genotypeBuilder, final String svType) {
if (genotype.hasExtendedAttribute(GATKSVVCFConstants.RD_CN)) {
final int rdCN = Integer.parseInt(genotype.getExtendedAttribute(GATKSVVCFConstants.RD_CN).toString());
genotypeBuilder.attribute(GATKSVVCFConstants.RD_CN, rdCN + 1);
@@ -291,15 +274,11 @@ private void adjustMaleGenotype(Genotype genotype, GenotypeBuilder genotypeBuild
}
}

private boolean isRevisableEvent(VariantContext variant, boolean isY) {
private boolean isRevisableEvent(final VariantContext variant, final boolean isY, final int sex) {
final List<Genotype> genotypes = variant.getGenotypes();
final int[] maleCounts = new int[4];
final int[] femaleCounts = new int[4];
for (final Genotype genotype : genotypes) {
final String sampleName = genotype.getSampleName();
final Integer sex = sampleSexMap.get(sampleName);
if (sex == null) continue;

final int rdCN = (int) genotype.getExtendedAttribute(GATKSVVCFConstants.RD_CN, -1);
final int rdCNVal = Math.min(rdCN, 3);
if (rdCNVal == -1) continue;
@@ -316,7 +295,7 @@ private boolean isRevisableEvent(VariantContext variant, boolean isY) {
return maleMedian == 2 && (isY ? femaleMedian == 0 : femaleMedian == 4);
}

private int calcMedianDistribution(int[] counts) {
private int calcMedianDistribution(final int[] counts) {
final int total = Arrays.stream(counts).sum();
if (total == 0) return -1;

@@ -333,7 +312,7 @@ private int calcMedianDistribution(int[] counts) {
throw new RuntimeException("Error calculating median");
}

private void processSVType(VariantContext variant, VariantContextBuilder builder) {
private void processSVType(final VariantContext variant, final VariantContextBuilder builder) {
final String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, null);
if (svType != null && variant.getAlleles().stream().noneMatch(allele -> allele.getDisplayString().contains(GATKSVVCFConstants.ME))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to parse the alleles with GATKSVVariantContextUtils.getSymbolicAlleleSymbols() and check for ME. Sometimes the alt is just <INS:ME>.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated accordingly - thanks for the pointer.

final Allele refAllele = variant.getReference();
@@ -343,80 +322,57 @@ private void processSVType(VariantContext variant, VariantContextBuilder builder
}
}

private void processVarGQ(VariantContext variant, VariantContextBuilder builder) {
private void processVarGQ(final VariantContext variant, final VariantContextBuilder builder) {
if (variant.hasAttribute(GATKSVVCFConstants.VAR_GQ)) {
final double varGQ = variant.getAttributeAsDouble(GATKSVVCFConstants.VAR_GQ, 0);
builder.rmAttribute(GATKSVVCFConstants.VAR_GQ);
builder.log10PError(varGQ / -10.0);
}
}

private void processMultiallelic(VariantContextBuilder builder) {
private void processMultiallelic(final VariantContextBuilder builder) {
builder.rmAttribute(GATKSVVCFConstants.MULTIALLELIC);
}

private void processUnresolved(VariantContext variant, VariantContextBuilder builder) {
private void processUnresolved(final VariantContext variant, final VariantContextBuilder builder) {
if (variant.hasAttribute(GATKSVVCFConstants.UNRESOLVED)) {
builder.rmAttribute(GATKSVVCFConstants.UNRESOLVED);
builder.filter(GATKSVVCFConstants.UNRESOLVED);
}
}

private void processNoisyEvents(VariantContext variant, VariantContextBuilder builder) {
private void processNoisyEvents(final VariantContext variant, final VariantContextBuilder builder) {
if (failSet.contains(variant.getID())) {
builder.attribute(GATKSVVCFConstants.HIGH_SR_BACKGROUND, true);
}
}

private void processBothsidesSupportEvents(VariantContext variant, VariantContextBuilder builder) {
private void processBothsidesSupportEvents(final VariantContext variant, final VariantContextBuilder builder) {
if (passSet.contains(variant.getID())) {
builder.attribute(GATKSVVCFConstants.BOTHSIDES_SUPPORT, true);
}
}

private Set<String> readLastColumn(GATKPath filePath) {
try {
return Files.lines(Paths.get(filePath.toString()))
.filter(line -> !line.trim().isEmpty() && !line.startsWith("#"))
.map(line -> {
int lastTabIndex = line.lastIndexOf('\t');
return lastTabIndex != -1 ? line.substring(lastTabIndex + 1).trim() : line.trim();
})
.collect(Collectors.toSet());
} catch (IOException e) {
throw new RuntimeException("Error reading variant list file: " + filePath, e);
private void processAllosomes(final VariantContext variant, final VariantContextBuilder builder) {
if (revisedSet.contains(variant.getID())) {
builder.attribute(GATKSVVCFConstants.REVISED_EVENT, true);
}
}

private Map<String, Integer> readPedFile(GATKPath pedFile) {
Map<String, Integer> sampleSexMap = new HashMap<>();
try (BufferedReader reader = new BufferedReader(new FileReader(pedFile.toPath().toFile()))) {
String line;
while ((line = reader.readLine()) != null) {
if (line.startsWith("#")) continue;
final String[] fields = line.split("\t");
if (fields.length >= 5) {
final String sampleName = fields[1];
final int sex = Integer.parseInt(fields[4]);
sampleSexMap.put(sampleName, sex);
}
}
} catch (IOException e) {
throw new RuntimeException("Error reading PED file", e);
}
return sampleSexMap;
}
private Set<String> readLastColumn(final GATKPath filePath) {
try {
final Path path = filePath.toPath();
final TableReader<String> reader = TableUtils.reader(path, (columns, exceptionFactory) ->
(dataline) -> {
return dataline.get(columns.columnCount() - 1);
}
);

private void writeRevisedEvents(VariantContext variant) {
final String variantId = variant.getID();
if (!writtenRevisedEvents.contains(variantId)) {
try {
revisedEventsWriter.write(variantId);
revisedEventsWriter.newLine();
writtenRevisedEvents.add(variantId);
} catch (IOException e) {
throw new RuntimeException("Error writing to revised events output file", e);
}
Set<String> result = reader.stream().collect(Collectors.toSet());
reader.close();
return result;
} catch (IOException e) {
throw new RuntimeException("Error reading variant list file: " + filePath, e);
}
}
}
Loading