Harden vg file readers and memory-mapping systems against maliciously-crafted files #4530
I've thought about this occasionally. As long as we care about performance, some file formats are fundamentally unsafe.

Some file formats store the data explicitly and are easy enough to validate. That includes at least FASTA, FASTQ, GFA, SAM, and GAF, and maybe also BAM and VCF, but I don't really understand those. At the other extreme, there are file formats such as GBWT, GBZ, and GCSA that store information in the combinatorial structure and compress it well. Loading the file into memory may take seconds and the actual task may finish in minutes, but full validation may take hours. This includes pretty much everything that uses the BWT. Then there are files whose validation is at least as expensive as rebuilding them. This category includes at least the minimizer index and the haplotype index used in haplotype sampling.

The Rust versions of GBWT and GBZ are potentially unsafe. Like the C++ versions, they do some simple checks when loading the file, but it's not feasible to check all logical invariants. It could be that structures that break logical invariants are still safe, and the only consequences are crashes due to failed bounds checks or excessive memory allocations, incorrect results, infinite loops, and things like that. But they could also violate some safety invariants; I don't understand the data structures well enough to be sure. We could check the relevant logical invariants with each operation, but that would lead to a noticeable loss of performance.
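As a concrete illustration of that last point, here is a minimal hypothetical sketch (not vg or GBWT code; `OffsetIndex`, `rank`, and `validate` are made-up names for this example) of a query that relies on a logical invariant it cannot afford to re-check, next to the one-time validation that would be far too slow to run per operation:

```cpp
// Hypothetical sketch: a structure whose queries rely on a logical invariant
// (here, a sorted offset array) that is cheap to exploit but costly to verify.
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct OffsetIndex {
    std::vector<std::uint64_t> offsets; // invariant: sorted in ascending order

    // O(log n) query; silently returns garbage (or worse) if the invariant
    // was broken by a malformed file.
    std::size_t rank(std::uint64_t value) const {
        return static_cast<std::size_t>(
            std::upper_bound(offsets.begin(), offsets.end(), value) - offsets.begin());
    }

    // O(n) validation: acceptable once at load time, far too slow per query.
    // For BWT-based structures the analogous check is much more expensive.
    void validate() const {
        if (!std::is_sorted(offsets.begin(), offsets.end())) {
            throw std::runtime_error("offset array is not sorted");
        }
    }
};
```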
For the normal use case, where people get all their graphs straight from the HPRC, or from people at least as trustworthy as the vg team, trusting the input data is fine. For the Tube Map demo server, we want people to come by and upload random files, but we don't expect them to be large, so some kind of full-validation slow mode might be worth doing there.

The thing that got me thinking about this is that we had a user report an issue on their own graph, and send in what I think were their own indexes, which we then need to run vg against to try and replicate their issues. If we think the index files are full of holes, we might need to make sure we always rebuild absolutely everything from "safe" (text/Protobuf?) formats, and always pay the cost in compute and response time. Or we as the developers need to trust users a lot before we try and replicate their problems.

I don't think we necessarily need full semantic validation of files, if we shake all the undefined behavior out of the graph and index code. Like, if we know there's no way for a crafted index to do anything worse than crash or return wrong results, then replicating a user's problem with their own files becomes much less risky.
We do a lot of file reading in vg, sometimes through SDSL or Simple-SDS, sometimes through Protobuf, and sometimes (like for the distance index) through direct memory mapping.
The memory-mapped distance index code especially is not designed to let you safely work with an untrusted index file; we let `this` point into mapped memory, and although we do some bounds checking for dealing with multiple mappings when following pointers in the mapped file, we don't mark it `PROT_EXEC`, and we try to keep vtable pointers out of it, I would not be surprised to learn that it's possible to craft a distance index that can execute arbitrary code. The SDSL and Simple-SDS readers, and especially the Protobuf reader, should be more robust by default, but it's definitely possible to cause crashes with internally-inconsistent files, and there might be some way to exploit them.

We should try sprinkling some security on vg, and go through our reader implementations to make sure they contemplate dealing with untrusted inputs. We might also consider swapping out some of the reader code with Rust versions, or abandoning the memory-mapped IO approach, since we see not-much-worse results from the GBZ approach of just reading the file into vector objects.
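For the memory-mapping side, a hedged sketch (hypothetical code, not the actual distance index reader; `map_readonly` and `read_record` are invented names) of the kind of defensive reading this would imply: map the file read-only with no `PROT_EXEC`, and bounds-check every stored offset before using it, rather than pointing live objects at untrusted bytes:

```cpp
// Hypothetical sketch of reading records out of a memory-mapped file without
// trusting its internal offsets. Cleanup (munmap) is omitted for brevity.
#include <cstdint>
#include <cstring>
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/stat.h>
#include <type_traits>
#include <unistd.h>

struct Mapping {
    const std::uint8_t* base = nullptr;
    std::size_t length = 0;
};

// Map the file read-only and private: no PROT_WRITE, and never PROT_EXEC.
Mapping map_readonly(const char* path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) throw std::runtime_error("open failed");
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
    void* addr = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                      PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (addr == MAP_FAILED) throw std::runtime_error("mmap failed");
    return {static_cast<const std::uint8_t*>(addr), static_cast<std::size_t>(st.st_size)};
}

// Copy a plain-old-data record out of the mapping instead of pointing a live
// object's `this` at untrusted bytes. Rejects offsets that run past the end
// of the file, without overflowing the offset + size arithmetic.
template <typename T>
T read_record(const Mapping& m, std::uint64_t offset) {
    static_assert(std::is_trivially_copyable<T>::value, "POD records only");
    if (offset > m.length || sizeof(T) > m.length - offset) {
        throw std::runtime_error("record offset out of range");
    }
    T out;
    std::memcpy(&out, m.base + offset, sizeof(T));
    return out;
}
```

The copy-out-and-check style trades a little speed for not having to trust the file; it is closer in spirit to the GBZ approach of reading everything into vectors than to following raw pointers inside the mapping.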