Processing VCF files #17

Open
Peter-J-Freeman opened this issue Jan 28, 2020 · 7 comments

Comments

@Peter-J-Freeman
Collaborator

We need to think about how VCF files are processed. Currently we truncate the job after 50 variants.

An alternative approach, which Teri will describe, would allow us to process large VCFs in smaller chunks, which would be re-assembled and sent to the user once completed (as a single email this time, DOH!).

Pros - We can support larger VCFs, which may be useful to our customers.

Cons - Since we don't annotate VCFs as such, why offer bulk processing? Is this really what the tool is for? We also don't have the capacity of EBI. What are our limitations?
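
For discussion, the chunked approach described above could be sketched roughly as follows. This is a minimal illustration only, not the project's code: `chunk_vcf`, `process_in_chunks`, and the `process_chunk` callback are hypothetical names, and the sketch assumes a plain-text VCF where all header lines (starting with `#`) come before the records.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_vcf(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield self-contained chunks: all header lines plus up to chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    header: List[str] = []
    records: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)  # meta-information and column header lines
            continue
        records.append(line)
        if len(records) == chunk_size:
            yield header + records
            records = []
    if records:
        yield header + records  # final, possibly smaller, chunk


def process_in_chunks(
    lines: Iterable[str],
    chunk_size: int,
    process_chunk: Callable[[List[str]], List[str]],
) -> List[str]:
    """Process each chunk in order and re-assemble the results into one list.

    process_chunk is a hypothetical stand-in for whatever does the real work.
    """
    results: List[str] = []
    for chunk in chunk_vcf(lines, chunk_size):
        results.extend(process_chunk(chunk))
    return results
```

Because every chunk carries its own copy of the header, each one is a valid small VCF that could be scheduled as a separate job, and the combined result list could then go out in a single email.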

@leicray

leicray commented Jan 28, 2020

What was the reasoning behind truncating after 50 variants? Is it a gut feeling that this limit will not inhibit users' ability to analyse their VCF data? Is it a necessary limit to avoid overloading the server? Do we have any data regarding the number of variants in submitted VCF files?

In a way, we need to answer each of these questions to allow us to have an opinion on how to proceed. If server capacity is not an issue, we can chunk the processing of large VCF files and better support our customers.

@Peter-J-Freeman
Collaborator Author

50,000 sorry. Typo

@Peter-J-Freeman
Collaborator Author

Yes, we don't want to overload the system, and we have to consider other customers. We have asynchronous scheduling, but huge jobs take up a lot of capacity. Also, how big a job can we realistically handle before we can no longer email the results back?

We used to cut off at 25,000 output variants. Now we cut off at 50,000 input variants, which for some genes can produce a huge output.

I have had VCFs submitted containing anything from 1 variant up to several million.
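
To make the input cut-off concrete, here is a rough sketch of truncating at a fixed number of input variants while still counting how many were submitted. It is an assumption-laden illustration rather than the current implementation: `truncate_vcf` and `MAX_INPUT_VARIANTS` are hypothetical names, and only the 50,000 figure comes from this thread.

```python
from typing import Iterable, List, Tuple

MAX_INPUT_VARIANTS = 50_000  # input limit quoted in this thread


def truncate_vcf(
    lines: Iterable[str], max_variants: int = MAX_INPUT_VARIANTS
) -> Tuple[List[str], int]:
    """Keep header lines and the first max_variants records.

    Returns the kept lines and the total number of records seen, so the
    submitted size can be logged even when the job is truncated.
    """
    kept: List[str] = []
    total = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            kept.append(line)  # header lines never count towards the limit
            continue
        total += 1
        if total <= max_variants:
            kept.append(line)
    return kept, total
```

Logging the returned total for each submission would also give us real data on how many variants users actually send.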

@leicray

leicray commented Jan 28, 2020

When users submit VCFs containing several million variants, you have to wonder whether they understand what they are doing. It sounds like no filtering has been applied.

@Peter-J-Freeman
Collaborator Author

That's exactly what I was thinking. The tool is not for annotating huge VCFs the way VEP does. It is for extracting HGVS descriptions. I'm kind of thinking 50,000 variants is overly generous, but that doesn't mean we shouldn't consider upping it. I'm torn(ish).

@leicray

leicray commented Jan 28, 2020

Perhaps the vcf2hgvs entry page needs to spell out its purpose more clearly, to steer away those users who should perhaps be using VEP instead.

@Peter-J-Freeman
Collaborator Author

Also a valid statement.

2 participants