Add tabix indexing of gzipped VCFs in VariantBuilder #214

Merged
tfenne merged 3 commits into main from tf_vcf_builder_add_indexing on Jan 21, 2025

Conversation

tfenne
Member

@tfenne tfenne commented Jan 17, 2025

No description provided.

@tfenne tfenne requested review from nh13 and clintval as code owners January 17, 2025 20:06

codecov bot commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.96%. Comparing base (951ec0d) to head (808bafb).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
fgpyo/vcf/builder.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #214      +/-   ##
==========================================
+ Coverage   90.86%   90.96%   +0.09%     
==========================================
  Files          18       18              
  Lines        2289     2292       +3     
  Branches      339      340       +1     
==========================================
+ Hits         2080     2085       +5     
+ Misses        137      136       -1     
+ Partials       72       71       -1     

Contributor

coderabbitai bot commented Jan 17, 2025

Walkthrough

The pull request introduces modifications to the VCF file handling in the VariantBuilder class within the fgpyo/vcf/builder.py module. Key changes include adding support for gzipped VCF files, which involves creating temporary files with a .vcf.gz suffix and implementing indexing functionality using pysam.tabix_index. Additionally, a new test case, test_indexing_gzipped_vcf, has been added to tests/fgpyo/vcf/test_builder.py to validate the indexing of gzipped VCF files across various genomic ranges.
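
For illustration, here is a minimal sketch of reserving a temporary output path with a .vcf.gz suffix, using only the standard library. The helper name `temp_vcf_path` and its `gzipped` parameter are hypothetical and are not taken from fgpyo; the actual code in fgpyo/vcf/builder.py may differ.

```python
from pathlib import Path
from tempfile import NamedTemporaryFile


def temp_vcf_path(gzipped: bool = True) -> Path:
    """Hypothetical helper: reserve a temporary file named *.vcf or *.vcf.gz."""
    suffix = ".vcf.gz" if gzipped else ".vcf"
    # delete=False keeps the (empty) file on disk after the handle closes,
    # so a VCF writer can reopen it by path; the caller must clean it up.
    with NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        return Path(handle.name)
```

Because the name ends in ".gz", a suffix check such as `path.suffix == ".gz"` would treat the temporary file like any user-supplied gzipped path.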

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
fgpyo/vcf/builder.py (1)

283-285: Add error handling for tabix indexing.

Add try-except to handle potential indexing failures gracefully.

 if str(path.suffix) == ".gz":
+    try:
         pysam.tabix_index(str(path), preset="vcf")
+    except Exception as e:
+        raise RuntimeError(f"Failed to create tabix index: {e}")
tests/fgpyo/vcf/test_builder.py (1)

255-272: Enhance test coverage with edge cases.

Add tests for:

  • Empty VCF file
  • Invalid coordinates (e.g., negative positions)
  • Regions spanning chromosome boundaries
+    # Test empty VCF
+    empty_vcf = temp_path / "empty.vcf.gz"
+    empty_builder = VariantBuilder()
+    empty_builder.to_path(empty_vcf)
+    with vcf_reader(empty_vcf) as reader:
+        assert len(list(reader.fetch(contig="chr1", start=1, end=1000))) == 0
+
+    # Test invalid coordinates
+    with pytest.raises(ValueError):
+        with vcf_reader(vcf) as reader:
+            list(reader.fetch(contig="chr1", start=-1, end=1000))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0b4227 and 24122f4.

📒 Files selected for processing (2)
  • fgpyo/vcf/builder.py (3 hunks)
  • tests/fgpyo/vcf/test_builder.py (2 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/vcf/builder.py

[warning] 298-298: fgpyo/vcf/builder.py#L298
Added line #L298 was not covered by tests

🔇 Additional comments (2)
fgpyo/vcf/builder.py (2)

17-19: LGTM!

Required imports for VCF handling.


298-298: Add test coverage for temporary file creation.

Line not covered by tests.

✅ Verification successful

Temporary file creation is covered by integration tests.

Tests test_zero_sample_vcf_round_trip and test_indexing_gzipped_vcf verify both compressed and uncompressed VCF handling.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify test coverage for temporary file creation
rg -A 5 "NamedTemporaryFile" tests/

Length of output: 4339


Script:

#!/bin/bash
# Find and examine VCF builder tests
fd test_builder.py tests/fgpyo/vcf/ --exec cat {}

Length of output: 14018

Member

@nh13 nh13 left a comment

I had to find one comment, right? Thank you for adding this!

@@ -279,6 +280,9 @@ def to_path(self, path: Optional[Path] = None) -> Path:
            for variant in self.to_sorted_list():
                writer.write(variant)

        if str(path.suffix) == ".gz":
            pysam.tabix_index(str(path), preset="vcf")
Member

Nit because this is on the builder class and we control the file :P

Will this fail if the file is already indexed (idk)? If it isn't block-compressed (probably)?

Can you add to the docstring that indexing will occur if the VCF is gz?

Member Author

I added to the docstring.

We only partially control the path - a user can request it be written to anywhere, and may specifically ask for a non-gzipped path, so I think the test has to remain.

I added force=True which will cause it to override any existing index (e.g. if the user repeatedly writes different VCFs to the same path). It will fail if the file isn't block-compressed ... but samtools' VCF writing writes bgzip for .gz extensions, so yay?
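
The block-compression caveat could in principle be checked before indexing. Below is a minimal sketch, assuming the BGZF layout described in the SAM/BAM specification: a gzip member with the FEXTRA flag set and a "BC" extra subfield of length 2. The function name `looks_block_compressed` is hypothetical and is not part of pysam or fgpyo, and it inspects only the first block of the data.

```python
import gzip
import struct

# The 28-byte empty BGZF block that bgzip-style writers append as an EOF marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def looks_block_compressed(data: bytes) -> bool:
    """Return True if `data` starts with a gzip member carrying the BGZF 'BC' subfield."""
    # Fixed gzip header: magic (2), method (1), flags (1), mtime (4), xfl (1), os (1).
    if len(data) < 12 or data[:2] != b"\x1f\x8b" or data[2] != 8:
        return False
    if not data[3] & 0x04:  # FEXTRA must be set for BGZF
        return False
    (xlen,) = struct.unpack("<H", data[10:12])
    extra = data[12 : 12 + xlen]
    offset = 0
    # Walk the extra subfields: 2-byte identifier, 2-byte length, then payload.
    while offset + 4 <= len(extra):
        subfield_id = extra[offset : offset + 2]
        (slen,) = struct.unpack("<H", extra[offset + 2 : offset + 4])
        if subfield_id == b"BC" and slen == 2:
            return True
        offset += 4 + slen
    return False


# Plain gzip output has no extra field, so it is not recognised as BGZF.
plain = gzip.compress(b"##fileformat=VCFv4.2\n")
print(looks_block_compressed(plain), looks_block_compressed(BGZF_EOF))
```

A check like this only reads a header; whether it is worth adding depends on how often callers hand the builder a path it did not write itself.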

@tfenne tfenne force-pushed the tf_vcf_builder_add_indexing branch from 24122f4 to 75523b7 Compare January 21, 2025 20:40
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
fgpyo/vcf/builder.py (1)

271-276: Add index file location to docstring.

Add that the index file is created at <path>.tbi.

 If the path given ends in ".gz" then the generated file will be bgzipped and
-a tabix index generated for the file with the suffix ".gz.tbi".
+a tabix index will be generated at <path>.tbi.
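
To make the two wordings concrete, here is a small sketch of how the suffix test and the index location relate, using only pathlib. The helper `tabix_index_path` is hypothetical; it encodes the default tabix convention of appending ".tbi" to the full file name.

```python
from pathlib import Path


def tabix_index_path(path: Path) -> Path:
    """Hypothetical helper: default tabix index location, i.e. <path>.tbi."""
    return path.with_name(path.name + ".tbi")


vcf = Path("out/calls.vcf.gz")

# Path.suffix returns only the final suffix, so the ".gz" check matches "calls.vcf.gz".
print(vcf.suffix)    # .gz
print(vcf.suffixes)  # ['.vcf', '.gz']

# Appending to the name (not with_suffix) keeps ".vcf.gz" intact.
print(tabix_index_path(vcf).name)  # calls.vcf.gz.tbi
```

Note that `vcf.with_suffix(".tbi")` would instead yield "calls.vcf.tbi", which is not where tabix readers look by default.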
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24122f4 and 75523b7.

📒 Files selected for processing (2)
  • fgpyo/vcf/builder.py (4 hunks)
  • tests/fgpyo/vcf/test_builder.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/fgpyo/vcf/test_builder.py
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/vcf/builder.py

[warning] 303-303: fgpyo/vcf/builder.py#L303
Added line #L303 was not covered by tests

🔇 Additional comments (2)
fgpyo/vcf/builder.py (2)

17-19: LGTM! Clean import structure.


303-303: Add test coverage for temporary file creation.

Line creating temporary gzipped VCF isn't tested.

Comment on lines 288 to 290
if str(path.suffix) == ".gz":
    pysam.tabix_index(str(path), preset="vcf")

Contributor

⚠️ Potential issue

Add error handling for tabix indexing.

Indexing will fail if file isn't block-compressed. Add try-except.

 if str(path.suffix) == ".gz":
-    pysam.tabix_index(str(path), preset="vcf")
+    try:
+        pysam.tabix_index(str(path), preset="vcf")
+    except Exception as e:
+        raise ValueError(f"Failed to create tabix index: {e}")
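
If wrapping were adopted, one variant worth considering is explicit exception chaining, so the underlying error stays attached as the cause. The sketch below is illustrative only: `index_if_gzipped` and its `indexer` parameter are hypothetical, with `indexer` standing in for a call such as `pysam.tabix_index(path, preset="vcf", force=True)` so that the example runs without pysam installed.

```python
from pathlib import Path
from typing import Callable


def index_if_gzipped(path: Path, indexer: Callable[[str], object]) -> bool:
    """Hypothetical wrapper: index `path` only when it ends in .gz.

    Returns True if indexing was attempted and succeeded, False if skipped.
    """
    if path.suffix != ".gz":
        return False
    try:
        indexer(str(path))
    except Exception as e:
        # "from e" records the original error as __cause__ on the new exception.
        raise ValueError(f"Failed to create tabix index for {path}: {e}") from e
    return True
```

Catching bare `Exception` is broad; narrowing it to the errors the indexer is known to raise would avoid masking unrelated bugs.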

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
fgpyo/vcf/builder.py (1)

288-290: ⚠️ Potential issue

Add error handling for tabix indexing.

Add try-except to handle non-block-compressed files.

 if str(path.suffix) == ".gz":
-    pysam.tabix_index(str(path), preset="vcf", force=True)
+    try:
+        pysam.tabix_index(str(path), preset="vcf", force=True)
+    except Exception as e:
+        raise ValueError(f"Failed to create tabix index: {e}")
🧹 Nitpick comments (1)
fgpyo/vcf/builder.py (1)

271-276: Add .tbi extension to docstring.

Mention that the tabix index will have .tbi extension.

-        If the path given ends in ".gz" then the generated file will be bgzipped and
-        a tabix index generated for the file with the suffix ".gz.tbi".
+        If the path given ends in ".gz" then the generated file will be bgzipped and
+        a tabix index will be generated as a companion file with the suffix ".tbi".
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75523b7 and 808bafb.

📒 Files selected for processing (1)
  • fgpyo/vcf/builder.py (4 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/vcf/builder.py

[warning] 303-303: fgpyo/vcf/builder.py#L303
Added line #L303 was not covered by tests

🔇 Additional comments (2)
fgpyo/vcf/builder.py (2)

17-19: LGTM! Clean imports for new functionality.


303-303: Add test coverage for temporary file creation.

Line creating temporary .vcf.gz file needs test coverage.

Run this to verify current coverage:

@tfenne tfenne merged commit c5eb469 into main Jan 21, 2025
7 of 8 checks passed
@tfenne tfenne deleted the tf_vcf_builder_add_indexing branch January 21, 2025 20:45