
Harden vg file readers and memory-mapping systems against maliciously-crafted files #4530

Open · adamnovak opened this issue Feb 26, 2025 · 2 comments

@adamnovak (Member)

We do a lot of file reading in vg, sometimes through SDSL or Simple-SDS, sometimes through Protobuf, and sometimes (like for the distance index) through direct memory mapping.

The memory-mapped distance index code especially is not designed to let you safely work with an untrusted index file; we let pointers point directly into mapped memory, and although we do some bounds checking to deal with multiple mappings when following pointers within the mapped file, don't mark the mapping PROT_EXEC, and try to keep vtable pointers out of it, I would not be surprised to learn that it's possible to craft a distance index that can execute arbitrary code. The SDSL and Simple-SDS readers, and especially the Protobuf reader, should be more robust by default, but it's definitely possible to cause crashes with internally-inconsistent files, and there might be some way to exploit them.
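To make that concrete, here is a minimal sketch of the kind of bounds check involved when following a file-relative offset in a mapped region. This is illustrative only, not vg's actual distance index code: `Mapping` and `follow_offset` are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: a memory-mapped region, and a helper that resolves
// a file-relative offset to a pointer only if the whole record fits
// inside the mapping. vg's real distance index code is more involved.
struct Mapping {
    const char* base;     // start of the mmap'd region
    std::size_t length;   // size of the mmap'd region in bytes
};

inline const char* follow_offset(const Mapping& m, std::uint64_t offset,
                                 std::uint64_t record_size) {
    // Check offset and offset + record_size without risking overflow.
    if (offset > m.length || record_size > m.length - offset) {
        throw std::runtime_error("offset points outside the mapped file");
    }
    return m.base + offset;
}
```

Keeping file-relative offsets rather than raw pointers is what makes a check like this possible at all; once raw pointers into the mapping are handed around, there is nothing left to validate.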

We should try sprinkling some security on vg, and go through our reader implementations to make sure they actually contemplate untrusted inputs. We might also consider swapping out some of the reader code for Rust versions, or abandoning the memory-mapped IO approach entirely, since we see not-much-worse results from the GBZ approach of just reading the file into vector objects.

@jltsiren (Contributor)

I've thought about this occasionally. As long as we care about performance, some file formats are fundamentally unsafe.

Some file formats store the data explicitly and are easy enough to validate. That includes at least FASTA, FASTQ, GFA, SAM, and GAF. Maybe also BAM and VCF, but I don't really understand them.

At the other extreme, there are file formats such as GBWT, GBZ, and GCSA that store information in the combinatorial structure and compress it well. Loading the file into memory may take seconds and we may complete the actual task in minutes, but full validation may take hours. This includes pretty much everything that uses the BWT.

Then there are some files where the validation is at least as expensive as rebuilding them. This category includes at least the minimizer index and the haplotype index used in haplotype sampling.

The Rust versions of GBWT and GBZ are potentially unsafe. Like the C++ versions, they do some simple checks when loading the file. But it's not feasible to check all logical invariants. It could be that structures that break logical invariants are safe and the only consequences are crashes due to failed bounds checks or excessive memory allocations, incorrect results, infinite loops, and things like that. But they could also violate some safety invariants. I don't understand the data structures well enough to be sure. We could check the relevant logical invariants with each operation, but that would lead to a noticeable loss of performance.
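To illustrate the gap between those simple load-time checks and full logical validation, here is a hypothetical loader (in C++ for concreteness; not the actual GBWT/GBZ code) that refuses a length field inconsistent with the bytes actually present, so a hostile header can't trigger a huge allocation or an out-of-bounds read, while making no attempt to verify deeper invariants like BWT well-formedness:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of a cheap structural check at load time. Assumes
// little-endian data for simplicity; the deeper logical invariants of the
// structure are deliberately not checked here.
std::vector<std::uint64_t> load_u64_vector(const char* data,
                                           std::size_t bytes) {
    if (bytes < sizeof(std::uint64_t)) {
        throw std::runtime_error("truncated header");
    }
    std::uint64_t count;
    std::memcpy(&count, data, sizeof(count));
    std::size_t payload = bytes - sizeof(std::uint64_t);
    // Reject a length field that promises more elements than the file holds.
    if (count > payload / sizeof(std::uint64_t)) {
        throw std::runtime_error("length field inconsistent with file size");
    }
    std::vector<std::uint64_t> result(count);
    if (count > 0) {
        std::memcpy(result.data(), data + sizeof(std::uint64_t),
                    count * sizeof(std::uint64_t));
    }
    return result;
}
```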

@adamnovak (Member, Author)

For the normal use case where people get all their graphs straight from the HPRC, or from people at least as trustworthy as the vg team, trusting the input data is fine.

For the Tube Map demo server, we want people to come by and upload random files, but we don't expect them to be large. So some kind of full-validation slow mode might be worth doing there.

The thing that got me thinking about this is that we had a user report an issue on their own graph and send in what I think were their own indexes, which we then needed to run vg against to try and replicate their problem. If we think the index files are full of holes, we might need to make sure we always rebuild absolutely everything from "safe" (text/Protobuf?) formats, and always pay the cost in compute and response time. Or we as the developers need to trust users a lot before we try and replicate their problems.

I don't think we necessarily need full semantic validation of files, if we shake all the undefined behavior out of the graph and index code. If get_sequence() and similar can be guaranteed to return garbage satisfactory to Valgrind, or crash in a controlled way, when presented with a handle for a node that doesn't really exist, instead of doing an out-of-bounds read (or write!), or if we have some non-unsafe Rust between the algorithms and the bits on the wire, we can dramatically reduce the attack surface.

Like, if we know there's no way for PackedGraph to do an out-of-bounds read or write because all the vector accesses are bounds-checked and all the string-key-not-there cases are handled, then we can safely load any bits we want as a graph, and someone trying to break in needs to find a way to corrupt the heap inside an actual algorithm, which is likely to be harder.
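As a sketch of that pattern (a toy class, not PackedGraph's real accessors), every lookup either stays inside the object's own vectors or fails in a controlled, catchable way:

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sketch of a hardened accessor: a handle fabricated from a
// corrupt file can at worst cause an exception, never an out-of-bounds
// read or write.
class ToyGraph {
    std::vector<std::string> sequences;  // node id -> sequence
public:
    explicit ToyGraph(std::vector<std::string> seqs)
        : sequences(std::move(seqs)) {}

    // .at() bounds-checks the index and throws std::out_of_range for a
    // node that doesn't really exist, instead of reading past the buffer.
    const std::string& get_sequence(std::uint64_t node_id) const {
        return sequences.at(node_id);
    }
};
```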
