Reduce the number of pieces downloaded by the object fetcher in subspace-gateway #3318
We could add an optional “hole” to each mapping. A mapping at the end of the segment would have one hole covering the end-of-segment padding, the next segment’s “parent segment header”, and the block continuation header. All other mappings would be the same as they currently are, without any holes. This would reduce the downloaded data for end-of-segment objects from 256 pieces to 2 pieces, and substantially simplify the object fetching code. The archiver object mapping generation code would only need a few extra lines, because it already has the information needed to calculate the size of the hole.

The most efficient encoding for a hole is a pair of u32s: the length of the data before the hole, and the length of the hole. (The length of the object is encoded in the first few bytes of its data.) Currently, the archiver doesn’t make any assumptions about object size, because some domains might have blocks larger than a segment. So to handle that case, the mapping format should have a list of holes (even if objects in our current domains only ever have one hole).
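To make the proposal concrete, here is a minimal sketch of what a mapping with holes and the corresponding skip logic in the fetcher could look like. The type and function names are assumptions for illustration, not the actual subspace types:

```rust
/// A gap in the downloaded byte stream that the fetcher should skip
/// (segment padding, parent segment header, block continuation header).
/// Hypothetical type, encoded as the pair of u32s described above.
#[derive(Clone, Copy, Debug)]
struct Hole {
    /// Length of object data between the previous hole (or the object start)
    /// and this hole.
    data_before: u32,
    /// Length of the hole itself.
    hole_len: u32,
}

/// Hypothetical next mapping version: piece location plus optional holes.
/// `holes` is empty for most objects; end-of-segment objects get one entry,
/// and objects larger than a segment may get several.
#[derive(Clone, Debug)]
struct GlobalObjectV1 {
    piece_index: u64,
    offset: u32,
    holes: Vec<Hole>,
}

/// Extract an object of `total_len` bytes from a contiguous stream of
/// source-piece bytes, skipping any holes.
fn extract_object(stream: &[u8], total_len: usize, holes: &[Hole]) -> Vec<u8> {
    let mut out = Vec::with_capacity(total_len);
    let mut pos = 0usize;
    for hole in holes {
        let take = hole.data_before as usize;
        out.extend_from_slice(&stream[pos..pos + take]);
        // Skip the hole itself.
        pos += take + hole.hole_len as usize;
    }
    // Remaining object data after the last hole (or all of it, if none).
    let remaining = total_len - out.len();
    out.extend_from_slice(&stream[pos..pos + remaining]);
    out
}
```

With this shape, the retrieval loop no longer needs to understand segment structure at all; it just consumes bytes and skips the ranges the archiver already knew about.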
This can be handled in the gateway RPC server by adding a new version of the mappings here:

subspace/crates/subspace-core-primitives/src/objects.rs Lines 135 to 142 in 3d5483e
@shamil-gadelshin @clostao the gateway HTTP server doesn’t have versioned object mappings, maybe we could introduce a version to make future format upgrades easier?

subspace/crates/subspace-gateway/src/commands/http/server.rs Lines 24 to 33 in 3d5483e
The second segment is unnecessary; it is 1 segment + 1 piece. This should also not happen very frequently for actually small objects. The reason for downloading the first segment is to know the layout of the segment. In retrospect, we could have created records in such a way that they all start with a content length, meaning we’d fragment data not only across segments, but also within a segment. Then it’d be possible to decode any object starting from any piece, but at this point it’d be a breaking change and/or require support for both the old and new ways of doing this.
This would work, but doesn’t feel like an elegant solution. I don’t know the scope of the gateway, but would it be possible to simply cache (or even persist) the end of data for each segment once it is discovered for the first time? Yes, the first time you’ll download the whole segment, but after that you’ll know where the data ends in that segment. We might potentially cache some other small bits of information about segments too.
In the current object retrieval implementation, we download the first segment to find the size of the padding, and the second segment to find the size of the parent segment header at the start of that segment.
I do not understand this part. Why would you need the whole segment for this, if one piece is sufficient? You only really need to download the second one if it contains the data of the object; you will rarely need to download more than that, let alone the whole segment.
This doesn’t seem like an elegant solution either, because it complicates the already complicated object retrieval code, making it stateful. I’d rather persist exactly the same information in the mapping itself, because we already have that information in the archiver. And then the object retrieval code becomes much simpler, we just download source pieces and skip any holes. Mappings have a versioning scheme, so we could support both versions for a while, then remove support for the older version in a few months. (We can easily re-generate all the mappings in the new version, and so can anyone else if they happen to be using that code.)
I find the introduction of “holes” really not elegant; it feels like a hack instead of a natural extension of the existing implementation. The mappings were supposed to be compact and possible to store efficiently. Adding a dynamic (meaning likely heap-allocated) list is something I’d rather avoid at the archiver level. This is also the reason why the mapping as such doesn’t have the size of the object, and even the hash is technically only there for verification purposes; strictly speaking, it is not needed. All that is needed is a piece index and an offset in it. Ideally there would be a better way. The perfect solution would be to chunk in the way I described above, but we can’t change it for existing segments.
To find the size of the parent segment header in the next segment, which is always the first item in a segment:

subspace/crates/subspace-archiving/src/archiver.rs Lines 858 to 861 in 3d5483e
We currently download and deserialise the entire second segment, but as you say, we wouldn’t need to if we calculated the size of the data to skip. We would need to calculate the size of the segment prefix (version enum, item count), the entire parent segment header item, and the block continuation item prefix (item enum variant):

subspace/shared/subspace-data-retrieval/src/object_fetcher.rs Lines 497 to 523 in 3d5483e
But the archiver already knows that information, so we could also just add it to the mapping, and the code would be simpler.
It is really more complex the way you describe it than it should be. Each segment starts with one or more headers, followed by the data you actually care about. You don’t need “holes” to process that: just download the first piece, decode the headers and throw them away, then consume the data you actually need. There is no need to calculate the data to skip; you simply collect the object from a stream of pieces one by one.
That would work for the second segment, but we’d need to manually deserialize its version and item count. (The standard deserializer keeps going until it has read all the segment items, so if we only have the first piece, then it will error with an unexpected end of input.) Or we could lazily fetch pieces as needed by the deserializer. But for the first segment’s padding, we’re still stuck with either:
Correct, that is what we are already doing
Yes, and the padding is what I would have assumed if we were to do it, not a sequence of "holes". But even then it does give me a vibe of something bolted on top rather than well designed from the very beginning. Not saying we'll not end up having to do it though.
Not quite sure what you mean here. As part of the optimisations in this ticket, we could stream the pieces into the deserializer instead, and stop fetching pieces when we’ve fetched the whole object. (As an optimisation, we could guess how many pieces we need for the remainder of the object and the segment header, and start fetching them all immediately.)
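The streaming idea above can be sketched roughly as follows. This is a simplified model, not the real object fetcher API: the piece source is modelled as a closure, and piece reconstruction details are ignored:

```rust
/// Sketch of streaming pieces into object reconstruction, stopping as soon
/// as we have the whole object. `next_piece` stands in for whatever fetches
/// the next source piece's raw record bytes (a hypothetical interface).
fn fetch_object_streaming(
    mut next_piece: impl FnMut() -> Vec<u8>,
    object_len: usize,
    start_offset: usize,
) -> Vec<u8> {
    let mut buf = Vec::new();
    // Fetch pieces lazily; stop as soon as the buffer covers the object.
    while buf.len() < start_offset + object_len {
        buf.extend(next_piece());
    }
    buf[start_offset..start_offset + object_len].to_vec()
}
```

The eager variant mentioned in the comment would estimate the piece count up front from `object_len` and issue all those fetches concurrently instead of one by one.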
I don’t think there is ever any padding when an object is split between segments:

subspace/crates/subspace-archiving/src/archiver.rs Lines 503 to 506 in 3d5483e
If that’s the case, then the padding length would be zero in every mapping. (And the code we have to skip padding is useless, and can be removed as part of this ticket.) The only time we have padding is when an item finishes near the end of the segment, and the next item won’t fit in the segment once it is serialized. This code only runs when working out how many items to put in the segment:

subspace/crates/subspace-archiving/src/archiver.rs Lines 401 to 403 in 3d5483e
It’s the code above that actually runs when splitting objects, and it splits on the exact segment size, which will never create any padding.
I see. Then indeed we have optimizations to do here. We can quite accurately estimate how many pieces need to be downloaded (and precisely at some point) and will only need the whole segment in case of reconstruction.
Lengths are encoded as variable-length integers, meaning sometimes adding 1 byte of data (to fill the last byte in the segment) will increase the size of the length encoding by one byte too, so the actual encoding increase will be 2 bytes, which no longer fits into the segment. I believe there should even be a test for this; I was trying to cover all edge cases, including assertions on both sides of the “threshold”, to make sure we catch possible regressions. It should be possible to run a code mutation fuzzer, and hypothetically there should be no code changes that do not lead to test failures (but likely there are some).
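For reference, the variable-length integers here are SCALE compact integers. A minimal pure-Rust sketch of the prefix sizes (not the parity-scale-codec API itself) shows the edge case: a 63-byte item encodes as 1 + 63 = 64 bytes, while a 64-byte item encodes as 2 + 64 = 66 bytes, so one extra data byte grows the encoding by two.

```rust
/// Number of bytes SCALE's compact encoding uses for a length `n`:
/// single-byte mode up to 63, two-byte mode up to 2^14 - 1,
/// four-byte mode up to 2^30 - 1, then big-integer mode
/// (one prefix byte plus the minimal little-endian bytes of the value).
fn compact_len_bytes(n: u64) -> usize {
    match n {
        0..=63 => 1,
        64..=16_383 => 2,
        16_384..=1_073_741_823 => 4,
        _ => 1 + ((64 - n.leading_zeros() as usize) + 7) / 8,
    }
}
```

The jumps at 64 and 2^14 are exactly the thresholds the archiver's "both sides of the threshold" tests need to cover.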
Yes, I understand. But in that case we move the entire item (and all of its objects) to the next segment (leaving a large amount of padding), rather than splitting before those last 2 bytes (and leaving one byte of padding):

subspace/crates/subspace-archiving/src/archiver.rs Lines 458 to 462 in 3d5483e
In the code that splits items, we always split at exactly the segment length (without any padding):

subspace/crates/subspace-archiving/src/archiver.rs Lines 503 to 606 in 3d5483e
Ah right, I see what you’re saying. When we split an object at the exact length of the segment, we could end up reducing the length of its encoded size, which would leave padding. There is no custom logic to handle this padding, it just happens. So even if the encoded size goes down 2 bytes, we don’t try to fit another byte in there. By the way, this means there’s a bug in the object retrieval code, because it assumes the padding is 2 bytes or less. But the archiver code can generate 3-byte padding if it splits a block that’s 2^14 bytes (or larger), and the length in the first segment is 63 bytes or less. https://wentelteefje.github.io/parity-scale-codec/encode/#243-four-byte-mode
Indeed
Exactly. I actually thought about this some more, and there is a hacky way to work around this for small objects. Here is the thinking: we have exactly 4 cases when crossing the boundary:
If we have the hash of the object we’re supposed to get and know the hash function, we can simply hash it 3 times with all 3 cases and pick the one that matches the expected hash output, without downloading other pieces of the first segment. For objects that span more than 2 segments this can be skipped, since the cost of downloading an extra segment is less noticeable there. This is a hack and I am not excited about it, but it will get the job done with no changes anywhere else except the retrieval code.
I am both mildly horrified and impressed 🙂 Did you mean all 4 cases? (Not 3.) But we know that the full block (or remaining block continuation) has to be larger than the object, and the part of the block in the first segment has to be larger than the part of the object in the first segment (excluding padding). So it should be possible to eliminate some of these cases, then check the rest in order of likelihood. (Smaller to larger padding.) For example, if the object has more than 63 bytes in this segment, then the only possible paddings are 0 and 2: moving from a 4 byte length (greater than or equal to 2^14 but less than 2^30) to a 2 byte length (greater than 63 but less than 2^14). And if the object has more than 2^14 bytes in this segment, we know there’s no padding. And if the object length is greater than or equal to 2^14, then the block must have a 4 byte length, so the only possible paddings are 0, 2, and 3.
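The hash-trial idea could look roughly like this sketch. Everything here is an assumption for illustration: the real object hash would be the chain's hash function (`DefaultHasher` is just a deterministic stand-in), and candidate paddings are tried smallest first, as suggested:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the real object hash function; assumption for this sketch.
fn object_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Try each candidate padding length (0 to 3, smallest first) by trimming
/// that many trailing bytes from the first segment's part of the object,
/// and keep the candidate whose reassembled object matches the expected hash.
fn resolve_padding(
    first_part: &[u8],  // object bytes from the first segment, incl. possible padding
    second_part: &[u8], // object bytes from the second segment
    expected: u64,
) -> Option<usize> {
    for pad in 0..=3 {
        if pad > first_part.len() {
            break;
        }
        let candidate: Vec<u8> = first_part[..first_part.len() - pad]
            .iter()
            .chain(second_part.iter())
            .copied()
            .collect();
        if object_hash(&candidate) == expected {
            return Some(pad);
        }
    }
    None
}
```

A real implementation would first apply the eliminations described above (e.g. skip padding candidates that are impossible given the object's length in the segment) before hashing anything.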
🤣
I originally wrote the message before you commented about 3-byte padding, but ultimately it can still be 3 cases, because if 3 fail then the 4th is the one we probably want 🙈 (that is, unless we have a bug, the probability of which is never zero)
If I understand correctly, you plan to add a new version of object mappings. We can support the versioned mappings; however, we need to know how we plan to add a new mappings format. Do we plan to support the old one, or require re-indexing the objects from scratch?
We haven’t made any specific plans yet. But in general, we would support the old format for a while, then start returning an error if the old format was used. But we don’t need specific plans to add a version enum to the JSON format used by the HTTP gateway and indexer. We could use the same design as the object mappings in the node. Maybe you could do it when you change the block number from a string to an integer?
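A versioned shape for the HTTP gateway's mapping responses could be as simple as the sketch below. The type and field names are assumptions, not the actual subspace-gateway API; the point is that a version tag lets old clients reject unknown versions cleanly:

```rust
/// Hypothetical versioned object-mapping response for the HTTP gateway,
/// mirroring the node's versioned mappings.
#[derive(Debug, PartialEq)]
enum ObjectMappingResponse {
    V0 {
        hash: String,     // hex-encoded object hash
        piece_index: u64, // piece containing the start of the object
        offset: u32,      // byte offset within that piece
    },
    // Future versions (e.g. carrying padding or hole information) are
    // added as new variants without breaking V0 clients.
}

/// Consumers match on the version explicitly, so an unsupported version
/// becomes a visible error instead of silent misparsing.
fn piece_index(mapping: &ObjectMappingResponse) -> u64 {
    match mapping {
        ObjectMappingResponse::V0 { piece_index, .. } => *piece_index,
    }
}
```

In the JSON encoding, the variant name would typically become an explicit version field, the same pattern the node's mappings use.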
Draft Algorithm

Here’s an algorithm for this implementation:

Simple object reconstruction
Object reconstruction with potential segment padding

If the object is near the end of a segment, there might be 0-3 bytes of zero-filled segment padding. So we use a slightly more complex process from step 2 onwards.
Using this algorithm, for valid requests, we only ever download pieces that contain actual object data. There are no pieces which are only downloaded to determine segment structure and then thrown away.

Security

It is possible for the object length to vary by up to 256 times. (Since the padding is always zero-valued, any non-zero lengths have to start outside the padding, so those lengths will always have the same number of length bytes.) This makes the maximum object size a security parameter, because an attacker can:
Using a maximum object size equal to the maximum block size in any domain (5MB) stops the second attack after it has downloaded 6 pieces, which is the standard download for a 4MB object (when it starts at the end of a piece or segment). I don’t think there’s anything we can do about downloading 4MB (6 pieces) instead of 16 kB (2 pieces). But that’s only an amplification factor of 3x, rather than 256x. Piece caching on a local node might help. But as far as I know, our current infrastructure doesn’t allow arbitrary mapping requests. So as part of this ticket, I’ll add a security section to the gateway docs, so if anyone else deploys it, they can manage those risks.
I think that'll work
Some objects will be split between segments, and the current implementation of subspace-gateway downloads both segments to serve even small objects. We need to optimize this case.

cc @teor2345
Tasks