
Harden vg file readers and memory-mapping systems against maliciously-crafted files #4530

Open · adamnovak opened this issue Feb 26, 2025 · 2 comments

@adamnovak (Member)

We do a lot of file reading in vg, sometimes through SDSL or Simple-SDS, sometimes through Protobuf, and sometimes (like for the distance index) through direct memory mapping.

The memory-mapped distance index code especially is not designed to let you safely work with an untrusted index file; we let pointers point directly into mapped memory, and although we do some bounds checking to deal with multiple mappings when following pointers within the mapped file, don't mark the mapping PROT_EXEC, and try to keep vtable pointers out of it, I would not be surprised to learn that it's possible to craft a distance index that can execute arbitrary code. The SDSL and Simple-SDS readers, and especially the Protobuf reader, should be more robust by default, but it's definitely possible to cause crashes with internally-inconsistent files, and there might be some way to exploit them.
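To make that concrete, here is a minimal sketch of the kind of bounds check involved when following a file-relative offset in a mapped region. This is illustrative only, not vg's actual distance index code: `Mapping` and `follow_offset` are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: a memory-mapped region, and a helper that resolves
// a file-relative offset to a pointer only if the whole record fits
// inside the mapping. vg's real distance index code is more involved.
struct Mapping {
    const char* base;     // start of the mmap'd region
    std::size_t length;   // size of the mmap'd region in bytes
};

inline const char* follow_offset(const Mapping& m, std::uint64_t offset,
                                 std::uint64_t record_size) {
    // Check offset and offset + record_size without risking overflow.
    if (offset > m.length || record_size > m.length - offset) {
        throw std::runtime_error("offset points outside the mapped file");
    }
    return m.base + offset;
}
```

Keeping file-relative offsets rather than raw pointers is what makes a check like this possible at all; once raw pointers into the mapping are handed around, there is nothing left to validate.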

We should try sprinkling some security on vg, and go through our reader implementations to make sure they actually contemplate untrusted inputs. We might also consider swapping out some of the reader code for Rust versions, or abandoning the memory-mapped IO approach entirely, since we see not-much-worse results from the GBZ approach of just reading the file into vector objects.

@jltsiren (Contributor)

I've thought about this occasionally. As long as we care about performance, some file formats are fundamentally unsafe.

Some file formats store the data explicitly and are easy enough to validate. That includes at least FASTA, FASTQ, GFA, SAM, and GAF. Maybe also BAM and VCF, but I don't really understand them.

At the other extreme, there are file formats such as GBWT, GBZ, and GCSA that store information in the combinatorial structure and compress it well. Loading the file into memory may take seconds and we may complete the actual task in minutes, but full validation may take hours. This includes pretty much everything that uses the BWT.

Then there are some files where the validation is at least as expensive as rebuilding them. This category includes at least the minimizer index and the haplotype index used in haplotype sampling.

The Rust versions of GBWT and GBZ are potentially unsafe. Like the C++ versions, they do some simple checks when loading the file. But it's not feasible to check all logical invariants. It could be that structures that break logical invariants are safe and the only consequences are crashes due to failed bounds checks or excessive memory allocations, incorrect results, infinite loops, and things like that. But they could also violate some safety invariants. I don't understand the data structures well enough to be sure. We could check the relevant logical invariants with each operation, but that would lead to a noticeable loss of performance.
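To illustrate the gap between those simple load-time checks and full logical validation, here is a hypothetical loader (in C++ for concreteness; not the actual GBWT/GBZ code) that refuses a length field inconsistent with the bytes actually present, so a hostile header can't trigger a huge allocation or an out-of-bounds read, while making no attempt to verify deeper invariants like BWT well-formedness:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of a cheap structural check at load time. Assumes
// little-endian data for simplicity; the deeper logical invariants of the
// structure are deliberately not checked here.
std::vector<std::uint64_t> load_u64_vector(const char* data,
                                           std::size_t bytes) {
    if (bytes < sizeof(std::uint64_t)) {
        throw std::runtime_error("truncated header");
    }
    std::uint64_t count;
    std::memcpy(&count, data, sizeof(count));
    std::size_t payload = bytes - sizeof(std::uint64_t);
    // Reject a length field that promises more elements than the file holds.
    if (count > payload / sizeof(std::uint64_t)) {
        throw std::runtime_error("length field inconsistent with file size");
    }
    std::vector<std::uint64_t> result(count);
    if (count > 0) {
        std::memcpy(result.data(), data + sizeof(std::uint64_t),
                    count * sizeof(std::uint64_t));
    }
    return result;
}
```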

@adamnovak (Member, Author)

For the normal use case where people get all their graphs straight from the HPRC, or from people at least as trustworthy as the vg team, trusting the input data is fine.

For the Tube Map demo server, we want people to come by and upload random files, but we don't expect them to be large. So some kind of full-validation slow mode might be worth doing there.

The thing that got me thinking about this is that we had a user report an issue on their own graph and send in what I think were their own indexes, which we then needed to run vg against to try and replicate their problem. If we think the index files are full of holes, we might need to make sure we always rebuild absolutely everything from "safe" (text/Protobuf?) formats, and always pay the cost in compute and response time. Or we as the developers need to trust users a lot before we try and replicate their problems.

I don't think we necessarily need full semantic validation of files, if we shake all the undefined behavior out of the graph and index code. If get_sequence() and similar can be guaranteed to return garbage satisfactory to Valgrind, or crash in a controlled way, when presented with a handle for a node that doesn't really exist, instead of doing an out-of-bounds read (or write!), or if we have some non-unsafe Rust between the algorithms and the bits on the wire, we can dramatically reduce the attack surface.

Like, if we know there's no way for PackedGraph to do an out-of-bounds read or write because all the vector accesses are bounds-checked and all the string-key-not-there cases are handled, then we can safely load any bits we want as a graph, and someone trying to break in needs to find a way to corrupt the heap inside an actual algorithm, which is likely to be harder.
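As a sketch of that pattern (a toy class, not PackedGraph's real accessors), every lookup either stays inside the object's own vectors or fails in a controlled, catchable way:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of a hardened accessor: a handle fabricated from a
// corrupt file can at worst cause an exception, never an out-of-bounds
// read or write.
class ToyGraph {
    std::vector<std::string> sequences;  // node id -> sequence
public:
    explicit ToyGraph(std::vector<std::string> seqs)
        : sequences(std::move(seqs)) {}

    // .at() bounds-checks the index and throws std::out_of_range for a
    // node that doesn't really exist, instead of reading past the buffer.
    const std::string& get_sequence(std::uint64_t node_id) const {
        return sequences.at(node_id);
    }
};
```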
