
Reduce the number of pieces downloaded by the object fetcher in subspace-gateway #3318

Open · 1 of 7 tasks
shamil-gadelshin opened this issue Dec 17, 2024 · 22 comments

Labels: improvement (it is already working, but can be better), networking (Subspace networking (DSN))


shamil-gadelshin commented Dec 17, 2024

Some objects will be split between segments and the current implementation of subspace-gateway downloads both segments to serve even small objects. We need to optimize this case.

cc @teor2345

Tasks

  • Instead of downloading the first segment, try hashing the data with all possible padding lengths (0-3), and return the data whose hash matches
    • Move hash verification from RPC/HTTP to the fetcher
  • Instead of downloading the second segment, stream the pieces into the segment deserializer, discard the parent segment header, then read the block continuation until the entire object has been read
    • To partially read the block continuation (rather than the entire block), we might need custom deserialization code
  • Add tests with 0-3 bytes of segment padding
    • Add tests with 0-3 bytes of zero-valued non-padding at the end of a segment
  • Document the maximum object size as a security parameter, and the risk of amplification attacks via arbitrary mapping requests
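The first task could look roughly like this. This is a dependency-free sketch, not an existing API: `hash_data` substitutes std's `DefaultHasher` for the real BLAKE3 hash, and `reconstruct_with_padding` is a hypothetical helper name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the real hash function: the fetcher uses BLAKE3, but std's
// SipHash keeps this sketch free of external dependencies.
fn hash_data(data: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    data.hash(&mut hasher);
    hasher.finish()
}

/// Try every possible segment padding length (0-3 trailing zero bytes in the
/// first segment) and return the reconstruction whose hash matches.
fn reconstruct_with_padding(
    first_segment_bytes: &[u8],
    second_segment_bytes: &[u8],
    expected_hash: u64,
) -> Option<Vec<u8>> {
    for padding in 0..=3usize {
        if padding > first_segment_bytes.len() {
            break;
        }
        let split = first_segment_bytes.len() - padding;
        // Padding is always zero-valued, so candidates that would strip
        // non-zero bytes can be rejected without hashing.
        if first_segment_bytes[split..].iter().any(|&b| b != 0) {
            continue;
        }
        let mut candidate = first_segment_bytes[..split].to_vec();
        candidate.extend_from_slice(second_segment_bytes);
        if hash_data(&candidate) == expected_hash {
            return Some(candidate);
        }
    }
    None
}

fn main() {
    // An object split across a segment boundary, with 2 bytes of zero padding
    // at the end of the first segment.
    let object = vec![1u8, 2, 3, 4, 5, 6];
    let expected = hash_data(&object);
    let first = [1u8, 2, 3, 4, 0, 0];
    let second = [5u8, 6];
    assert_eq!(
        reconstruct_with_padding(&first, &second, expected),
        Some(object)
    );
}
```

Candidates are checked from smallest to largest padding, so the common no-padding case costs a single hash.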
@teor2345 added the improvement and networking labels on Dec 18, 2024
@teor2345 (Member):

We could add an optional “hole” to each mapping. A mapping at the end of the segment would have one hole covering the end of segment padding, the next segment’s “parent segment header”, and the block continuation header.

All other mappings would be the same as they currently are, without any holes.

This would reduce the downloaded data for end of segment objects from 256 pieces to 2 pieces, and substantially simplify the object fetching code. The archiver object mapping generation code would only need a few extra lines, because it already has the information needed to calculate the size of the hole.

The most efficient encoding for a hole is a pair of u32s: the length of the data before the hole, and the length of the hole. (The length of the object is encoded in the first few bytes of its data.)

Currently, the archiver doesn’t make any assumptions about object size, because some domains might have blocks larger than a segment. So to handle that case, the mapping format should have a list of holes (even if objects in our current domains only ever have one hole).
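A minimal sketch of what that mapping extension could look like. The names here (`ObjectHole`, `GlobalObjectV1`) are hypothetical, not part of the current codebase:

```rust
/// A run of non-object bytes inside an object's piece range: end-of-segment
/// padding, the next segment's parent segment header, and the block
/// continuation header. (Hypothetical name, for illustration only.)
struct ObjectHole {
    /// Length of object data before this hole, in bytes.
    data_before: u32,
    /// Length of the hole itself, in bytes.
    hole_len: u32,
}

/// A mapping extended with holes. An object larger than a segment can cross
/// several segment boundaries, hence a list rather than a single optional
/// hole. (Hypothetical name, for illustration only.)
struct GlobalObjectV1 {
    piece_index: u64,
    offset: u32,
    /// Empty for the vast majority of objects.
    holes: Vec<ObjectHole>,
}

fn main() {
    // An end-of-segment object with one hole covering padding plus the next
    // segment's headers.
    let mapping = GlobalObjectV1 {
        piece_index: 255,
        offset: 30_000,
        holes: vec![ObjectHole { data_before: 1_024, hole_len: 70 }],
    };
    assert_eq!(mapping.holes.len(), 1);
    assert_eq!(mapping.holes[0].hole_len, 70);
}
```

The archiver would emit at most one hole per segment boundary an object crosses, so `holes` stays empty for almost every mapping.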

@teor2345 (Member):

This can be handled in the gateway RPC server by adding a V1 to:

pub enum GlobalObjectMapping {
    /// V0 of object mapping data structure.
    #[codec(index = 0)]
    V0 {
        /// Objects stored in the history of the blockchain.
        objects: Vec<GlobalObject>,
    },
}

@shamil-gadelshin @clostao the gateway HTTP server doesn’t have versioned object mappings, maybe we could introduce a version to make future format upgrades easier?

/// Object mapping format from the indexer service.
#[derive(Serialize, Deserialize, Debug, Default)]
#[serde(rename_all = "camelCase")]
struct ObjectMapping {
    hash: Blake3Hash,
    piece_index: PieceIndex,
    piece_offset: u32,
    #[serde(deserialize_with = "string_to_u32")]
    block_number: BlockNumber,
}
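Purely as an illustration of the shape (the `v0` wrapper shown here is hypothetical, not an agreed format), the current flat JSON object could be wrapped in a version tag, mirroring the node's versioned mappings:

```json
{
  "v0": {
    "objects": [
      {
        "hash": "…",
        "pieceIndex": 123,
        "pieceOffset": 456,
        "blockNumber": 789
      }
    ]
  }
}
```

A reader that sees an unknown top-level version key can then reject it explicitly, instead of misparsing a future format.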

@nazar-pc (Member):

Some objects will be split between segments and the current implementation of subspace-gateway downloads both segments to serve even small objects.

The second segment is unnecessary; it is 1 segment + 1 piece. This should also not happen very frequently for genuinely small objects.

The reason for downloading the first segment is to know the layout of the segment. In retrospect, we could have created records in such a way that they all start with a content length, meaning we'd fragment data not only across segments, but also within a segment. Then it'd be possible to decode any object starting from any piece, but at this point it'd be a breaking change and/or require support for both the old and new way of doing this.

We could add an optional “hole” to each mapping. A mapping at the end of the segment would have one hole covering the end of segment padding, the next segment’s “parent segment header”, and the block continuation header.

This would work, but doesn't feel like an elegant solution. I don't know the scope of the gateway, but would it be possible to simply cache (or even persist) the end of the data for each segment once it is discovered for the first time? Yes, the first time you'll download the whole segment, but after that you'll know where the data ends in that segment. We might potentially cache some other small bits of information about segments too.

@teor2345 (Member):

The reason for downloading the first segment is to know the layout of the segment.

In the current object retrieval implementation, we download the first segment to find the size of the padding, and the second segment to find the size of the parent segment header at the start of that segment.

@nazar-pc (Member):

and the second segment to find the size of the parent segment header at the start of that segment

I do not understand this part. Why would you need the segment for this, if one piece is sufficient? You only really need to download the second one if it contains the data of the object; you will rarely need to download more than that, let alone the whole segment.

@teor2345 (Member):

We could add an optional “hole” to each mapping. A mapping at the end of the segment would have one hole covering the end of segment padding, the next segment’s “parent segment header”, and the block continuation header.

This would work, but doesn't feel like an elegant solution. I don't know the scope of the gateway, but would it be possible to simply cache (or even persist) the end of the data for each segment once it is discovered for the first time?

This doesn’t seem like an elegant solution either, because it complicates the already complicated object retrieval code, making it stateful.

I’d rather persist exactly the same information in the mapping itself, because we already have that information in the archiver. And then the object retrieval code becomes much simpler, we just download source pieces and skip any holes.

Mappings have a versioning scheme, so we could support both versions for a while, then remove support for the older version in a few months. (We can easily re-generate all the mappings in the new version, and so can anyone else if they happen to be using that code.)

@nazar-pc (Member):

I find the introduction of "holes" really not elegant; it feels like a hack rather than a natural extension of the existing implementation.

The mappings were supposed to be compact and possible to store efficiently. Adding a dynamic (meaning likely heap-allocated) list is something I'd rather avoid at the archiver level. This is also the reason why the mapping as such doesn't have the size of the object, and even the hash is technically only there for verification purposes; strictly speaking, it is not needed. All that is needed is a piece index and an offset within it.

Ideally there would be a better way. The perfect solution would be to chunk in the way I described above, but we can't change it for existing segments.

@teor2345 (Member):

and the second segment to find the size of the parent segment header at the start of that segment

I do not understand this part. Why would you need the segment for this, if one piece is sufficient? You only really need to download the second one if it contains the data of the object; you will rarely need to download more than that, let alone the whole segment.

To find the size of the parent segment header in the next segment, which is always the first item in a segment:

// Add segment header to the beginning of the buffer to be the first thing included in the
// next segment
self.buffer
    .push_front(SegmentItem::ParentSegmentHeader(segment_header));

We currently download and deserialise the entire second segment, but as you say, we wouldn’t need to if we calculated the size of the data to skip. We would need to calculate the size of the segment prefix (version enum, item count), the entire parent segment header item, and the block continuation item prefix (item enum variant):

// We need to read extra segments to get the object length, or the full object.
// We don't attempt to calculate the number of segments needed, because it involves
// headers and optional padding.
loop {
    segment_index += SegmentIndex::ONE;
    let items = self.read_segment(segment_index).await?.into_items();
    for segment_item in items {
        match segment_item {
            SegmentItem::BlockContinuation { bytes, .. } => {
                data.extend_from_slice(&bytes);
                if let Some(data_length) =
                    self.decode_data_length(data.as_slice(), piece_index, piece_offset)?
                {
                    if data.len() >= data_length {
                        return Ok(Vec::<u8>::decode(&mut data.as_slice())?);
                    }
                }
            }
            // Padding at the end of segments can be skipped, it's not part of the object data
            SegmentItem::Padding => {}
            // We should not see these items while collecting data for a single object
            SegmentItem::Block { .. }
            | SegmentItem::BlockStart { .. }
            | SegmentItem::ParentSegmentHeader(_) => {
But the archiver already knows that information, so we could also just add it to the mapping, and the code would be simpler.

@nazar-pc (Member):

It is really more complex the way you describe it than it should be. Each segment starts with one or more headers, followed by the data you actually care about. You don't need "holes" to process that: just download the first piece, decode the headers and throw them away, then consume the data you actually need. There is no need to calculate the data to skip; you're simply collecting the object from a stream of pieces, one by one.

@teor2345 (Member):

It is really more complex the way you describe it than it should be. Each segment starts with one or more headers, followed by the data you actually care about. You don't need "holes" to process that: just download the first piece, decode the headers and throw them away, then consume the data you actually need. There is no need to calculate the data to skip; you're simply collecting the object from a stream of pieces, one by one.

That would work for the second segment, but we'd need to manually deserialize its version and item count. (The standard deserializer keeps going until it has read all the segment items, so if we only have the first piece, then it will error with an unexpected end of input.) Or we could lazily fetch pieces as needed by the deserializer.

But for the first segment’s padding, we’re still stuck with either:

  • fetching the entire segment to calculate the size of the padding, or
  • adding the size of the padding to the mapping (it would be one small integer field, which seems to meet all your design criteria).

@nazar-pc (Member):

That would work for the second segment, but we'd need to manually deserialize its version and item count. (The standard deserializer keeps going until it has read all the segment items, so if we only have the first piece, then it will error with an unexpected end of input.)

Correct, that is what we are already doing

But for the first segment’s padding, we’re still stuck with either:

Yes, and the padding is what I would have assumed we'd record if we were to do it, not a sequence of "holes". But even then it gives me a vibe of something bolted on top rather than well designed from the very beginning. Not saying we'll not end up having to do it, though.

@teor2345 (Member):

That would work for the second segment, but we'd need to manually deserialize its version and item count. (The standard deserializer keeps going until it has read all the segment items, so if we only have the first piece, then it will error with an unexpected end of input.)

Correct, that is what we are already doing

Not quite sure what you mean here, but in subspace-data-retrieval, we always get the full first and second segment when using the slow path.

As part of the optimisations in this ticket, we could stream the pieces into the deserializer instead, and stop fetching pieces when we've fetched the whole object. (As an optimisation, we could guess how many pieces we need for the remainder of the object and the segment header, and start fetching them all immediately.)
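The streaming idea could be sketched as a reader that pulls pieces lazily, so a deserializer driving it only causes downloads for pieces it actually consumes. This is a stdlib-only sketch built on `std::io::Read` rather than the codec's own input trait, and `fetch_piece` is a hypothetical stand-in for the real piece getter:

```rust
use std::io::{self, Read};

/// Lazily concatenates pieces, fetching each one only when the consumer
/// reads past the end of the previous piece.
struct PieceStreamReader<F: FnMut(usize) -> io::Result<Vec<u8>>> {
    fetch_piece: F, // hypothetical stand-in for the real piece getter
    next_index: usize,
    current: Vec<u8>,
    offset: usize,
}

impl<F: FnMut(usize) -> io::Result<Vec<u8>>> PieceStreamReader<F> {
    fn new(fetch_piece: F) -> Self {
        Self { fetch_piece, next_index: 0, current: Vec::new(), offset: 0 }
    }
}

impl<F: FnMut(usize) -> io::Result<Vec<u8>>> Read for PieceStreamReader<F> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.offset == self.current.len() {
            // Fetch the next piece only when the current one is exhausted.
            self.current = (self.fetch_piece)(self.next_index)?;
            self.next_index += 1;
            self.offset = 0;
        }
        let n = buf.len().min(self.current.len() - self.offset);
        buf[..n].copy_from_slice(&self.current[self.offset..self.offset + n]);
        self.offset += n;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    // Two fake 3-byte "pieces"; a 5-byte read spans the boundary, but a
    // hypothetical third piece is never requested.
    let pieces = vec![vec![1u8, 2, 3], vec![4u8, 5, 6]];
    let mut reader = PieceStreamReader::new(move |i| Ok(pieces[i].clone()));
    let mut buf = [0u8; 5];
    reader.read_exact(&mut buf)?;
    assert_eq!(buf, [1, 2, 3, 4, 5]);
    Ok(())
}
```

Wiring something like this into the actual deserializer would still need the manual handling of the segment version and item count discussed above.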

But for the first segment’s padding, we’re still stuck with either:

Yes, and the padding is what I would have assumed we'd record if we were to do it, not a sequence of "holes". But even then it gives me a vibe of something bolted on top rather than well designed from the very beginning. Not saying we'll not end up having to do it, though.

I don’t think there is ever any padding when an object is split between segments:

// Check if there is an excess of data that should be spilled over into the next segment
let spill_over = segment_size
    .checked_sub(RecordedHistorySegment::SIZE)
    .unwrap_or_default();

If that’s the case, then the padding length would be zero in every mapping. (And the code we have to skip padding is useless, and can be removed as part of this ticket.)

The only time we have padding is when an item finishes near the end of the segment, and the next item won’t fit in the segment once it is serialized. This code only runs when working out how many items to put in the segment:

// `-2` because even the smallest segment item will take 2 bytes to encode, so it makes
// sense to stop earlier here
while segment_size < (RecordedHistorySegment::SIZE - 2) {

It’s the code above that actually runs when splitting objects, and it splits on the exact segment size, which will never create any padding.

@nazar-pc (Member):

Not quite sure what you mean here, but in subspace-data-retrieval, we always get the full first and second segment when using the slow path.

I see. Then indeed we have optimizations to do here. We can quite accurately estimate how many pieces need to be downloaded (and, at some point, precisely), and will only need the whole segment in case of reconstruction.

I don’t think there is ever any padding when an object is split between segments:

Lengths are encoded as variable-length integers, meaning sometimes adding 1 byte of data (to fill the last byte in the segment) will increase the size of the length encoding by one byte too, so the actual encoding increase will be 2 bytes, which no longer fits into the segment.

I believe there should even be a test for this; I was trying to cover all edge cases, including assertions on both sides of the "threshold", to make sure we catch possible regressions. It should be possible to run a code-mutation fuzzer, and hypothetically there should be no code changes that do not lead to test failures (though likely there are some).


teor2345 commented Dec 18, 2024

I don’t think there is ever any padding when an object is split between segments:

Lengths are encoded as variable-length integers, meaning sometimes adding 1 byte of data (to fill the last byte in the segment) will increase the size of the length encoding by one byte too, so the actual encoding increase will be 2 bytes, which no longer fits into the segment.

Yes, I understand. But in that case we move the entire item (and all of its objects) to the next segment (leaving a large amount of padding), rather than splitting before those last 2 bytes (and leaving one byte of padding):

if spill_over > inner_bytes_size {
    self.buffer.push_front(segment_item);
    segment_size -= segment_item_encoded_size;
    break;
}

In the code that splits items, we always split at exactly the segment length (without any padding):

// Check if there is an excess of data that should be spilled over into the next segment
let spill_over = segment_size
    .checked_sub(RecordedHistorySegment::SIZE)
    .unwrap_or_default();
if spill_over > 0 {
    let items = segment.items_mut();
    let segment_item = items
        .pop()
        .expect("Segment over segment size always has at least one item; qed");
    let segment_item = match segment_item {
        SegmentItem::Padding => {
            unreachable!("Buffer never contains SegmentItem::Padding; qed");
        }
        SegmentItem::Block {
            mut bytes,
            mut object_mapping,
        } => {
            let split_point = bytes.len() - spill_over;
            let continuation_bytes = bytes[split_point..].to_vec();
            bytes.truncate(split_point);
            let continuation_object_mapping = BlockObjectMapping::V0 {
                objects: object_mapping
                    .objects_mut()
                    .extract_if(|block_object: &mut BlockObject| {
                        if block_object.offset >= split_point as u32 {
                            block_object.offset -= split_point as u32;
                            true
                        } else {
                            false
                        }
                    })
                    .collect(),
            };
            // Update last archived block to include partial archiving info
            last_archived_block.set_partial_archived(
                u32::try_from(bytes.len())
                    .expect("Blocks length is never bigger than u32; qed"),
            );
            // Push continuation element back into the buffer where removed segment item was
            self.buffer.push_front(SegmentItem::BlockContinuation {
                bytes: continuation_bytes,
                object_mapping: continuation_object_mapping,
            });
            SegmentItem::BlockStart {
                bytes,
                object_mapping,
            }
        }
        SegmentItem::BlockStart { .. } => {
            unreachable!("Buffer never contains SegmentItem::BlockStart; qed");
        }
        SegmentItem::BlockContinuation {
            mut bytes,
            mut object_mapping,
        } => {
            let split_point = bytes.len() - spill_over;
            let continuation_bytes = bytes[split_point..].to_vec();
            bytes.truncate(split_point);
            let continuation_object_mapping = BlockObjectMapping::V0 {
                objects: object_mapping
                    .objects_mut()
                    .extract_if(|block_object: &mut BlockObject| {
                        if block_object.offset >= split_point as u32 {
                            block_object.offset -= split_point as u32;
                            true
                        } else {
                            false
                        }
                    })
                    .collect(),
            };
            // Above code assumed that block was archived fully, now remove spilled-over
            // bytes from the size
            let archived_bytes = last_archived_block.partial_archived().expect(
                "Block continuation implies that there are some bytes archived \
                already; qed",
            );
            last_archived_block.set_partial_archived(
                archived_bytes
                    - u32::try_from(spill_over)
                        .expect("Blocks length is never bigger than u32; qed"),
            );
            // Push continuation element back into the buffer where removed segment item was
            self.buffer.push_front(SegmentItem::BlockContinuation {
                bytes: continuation_bytes,
                object_mapping: continuation_object_mapping,
            });
            SegmentItem::BlockContinuation {
                bytes,
                object_mapping,
            }
        }

@teor2345 (Member):

Ah right, I see what you’re saying. When we split an object at the exact length of the segment, we could end up reducing the length of its encoded size. Which would leave padding.

There is no custom logic to handle this padding, it just happens. So even if the encoded size goes down 2 bytes, we don’t try to fit another byte in there.

By the way, this means there's a bug in the object retrieval code, because it assumes the padding is 2 bytes or less. But the archiver code can generate 3-byte padding if it splits a block that's 2^14 bytes (or larger) and the length in the first segment is 63 bytes or less.

https://wentelteefje.github.io/parity-scale-codec/encode/#243-four-byte-mode
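As a simplified model of where those padding bytes come from (this is not the archiver's actual code, just the SCALE compact-length arithmetic; both function names are hypothetical):

```rust
/// Number of bytes SCALE compact encoding uses for a u32 length:
/// single-byte mode up to 63, two-byte mode up to 2^14 - 1, four-byte mode
/// up to 2^30 - 1 (big-integer mode above that is ignored here).
fn compact_len_bytes(n: u32) -> u32 {
    match n {
        0..=0x3F => 1,
        0x40..=0x3FFF => 2,
        _ => 4,
    }
}

/// Padding left behind when an item of `full_len` bytes is split so that only
/// `split_len` bytes remain in the first segment: the length prefix shrinks,
/// and the archiver does not try to fit more data into the gap.
fn padding_after_split(full_len: u32, split_len: u32) -> u32 {
    compact_len_bytes(full_len) - compact_len_bytes(split_len)
}

fn main() {
    // Splitting a block of 2^14 bytes so that 63 bytes or less stay in the
    // first segment drops the prefix from 4 bytes to 1: 3 bytes of padding.
    assert_eq!(padding_after_split(1 << 14, 63), 3);
    assert_eq!(padding_after_split(1 << 20, 100), 2); // 4-byte -> 2-byte prefix
    assert_eq!(padding_after_split(100, 50), 1);      // 2-byte -> 1-byte prefix
    assert_eq!(padding_after_split(50, 10), 0);       // prefix size unchanged
}
```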

@nazar-pc (Member):

But the archiver code can generate 3-byte padding if it splits a block that's 2^14 bytes (or larger) and the length in the first segment is 63 bytes or less.

Indeed

Ah right, I see what you’re saying. When we split an object at the exact length of the segment, we could end up reducing the length of its encoded size. Which would leave padding.

Exactly. I actually thought about this some more, and there is a hacky way to work around this for small objects.

Here is the thinking: we have exactly 4 cases when crossing the boundary:

  • no padding
  • 1 byte padding
  • 2 bytes padding
  • 3 bytes padding

If we have the hash of the object we're supposed to get, and know the hash function, we can simply hash it 3 times with all 3 cases and pick the one that matches the expected hash output, without downloading other pieces of the first segment. For objects that span more than 2 segments this can be skipped, since the cost of downloading an extra segment is less noticeable.

This is a hack and I am not excited about it, but it will get the job done with no changes anywhere else except the retrieval code.

@teor2345 (Member):

Ah right, I see what you’re saying. When we split an object at the exact length of the segment, we could end up reducing the length of its encoded size. Which would leave padding.

Exactly. I actually thought about this some more, and there is a hacky way to work around this for small objects.

Here is the thinking: we have exactly 4 cases when crossing the boundary:

  • no padding
  • 1 byte padding
  • 2 bytes padding
  • 3 bytes padding

If we have the hash of the object we're supposed to get, and know the hash function, we can simply hash it 3 times with all 3 cases and pick the one that matches the expected hash output, without downloading other pieces of the first segment. For objects that span more than 2 segments this can be skipped, since the cost of downloading an extra segment is less noticeable.

This is a hack and I am not excited about it, but it will get the job done with no changes anywhere else except the retrieval code.

I am both mildly horrified and impressed 🙂

Did you mean all 4 cases? (Not 3.)

But we know that the full block (or remaining block continuation) has to be larger than the object, and the part of the block in the first segment has to be larger than the part of the object in the first segment (excluding padding).

So it should be possible to eliminate some of these cases, then check the rest in order of likelihood. (Smaller to larger padding.)

For example, if the object has more than 63 bytes in this segment, then the only possible paddings are 0 and 2: moving from a 4 byte length (greater than or equal to 2^14 but less than 2^30) to a 2 byte length (greater than 63 but less than 2^14). And if the object has more than 2^14 bytes in this segment, we know there’s no padding.

And if the object length is greater than or equal to 2^14, then the block must have a 4 byte length, so the only possible paddings are 0, 2, and 3.
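This elimination can be written down directly: under the model where the padding equals the shrinkage of the compact length prefix, knowing how many bytes of the item sit in the first segment already constrains the candidates. A stdlib-only sketch with hypothetical helper names:

```rust
/// Number of bytes SCALE compact encoding uses for a u32 length
/// (values below 2^30; big-integer mode is ignored here).
fn compact_len_bytes(n: u32) -> u32 {
    match n {
        0..=0x3F => 1,
        0x40..=0x3FFF => 2,
        _ => 4,
    }
}

/// Possible padding lengths, given how many bytes of the item landed in the
/// first segment. The full item is at least as long as that part, so its
/// length prefix is at least as wide; the padding is the width difference.
fn possible_paddings(part_len_in_first_segment: u32) -> Vec<u32> {
    let part_width = compact_len_bytes(part_len_in_first_segment);
    [1u32, 2, 4]
        .iter()
        .copied()
        .filter(|&full_width| full_width >= part_width)
        .map(|full_width| full_width - part_width)
        .collect()
}

fn main() {
    assert_eq!(possible_paddings(10), vec![0, 1, 3]); // up to 63 bytes
    assert_eq!(possible_paddings(100), vec![0, 2]);   // more than 63 bytes
    assert_eq!(possible_paddings(20_000), vec![0]);   // 2^14 bytes or more
}
```

Constraints on the full item length (like a block known to be 2^14 bytes or larger) would narrow these sets further, as described above.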

@nazar-pc (Member):

I am both mildly horrified and impressed 🙂

🤣

Did you mean all 4 cases? (Not 3.)

I originally wrote the message before you commented about 3-byte padding, but ultimately it can still be 3 cases, because if 3 fail then the 4th is the one we probably want 🙈 (that is, unless we have a bug, the probability of which is never zero)

@shamil-gadelshin (Contributor, Author):

@shamil-gadelshin @clostao the gateway HTTP server doesn’t have versioned object mappings, maybe we could introduce a version to make future format upgrades easier?

If I understand correctly, you plan to add a new version of object mappings. We can support versioned mappings; however, we need to know how we plan to roll out a new mapping format. Do we plan to support the old one, or demand that the objects be reindexed from scratch?

@teor2345 (Member):

@shamil-gadelshin @clostao the gateway HTTP server doesn’t have versioned object mappings, maybe we could introduce a version to make future format upgrades easier?

If I understand correctly, you plan to add a new version of object mappings. We can support versioned mappings; however, we need to know how we plan to roll out a new mapping format. Do we plan to support the old one, or demand that the objects be reindexed from scratch?

We haven’t made any specific plans yet. But in general, we would support the old format for a while, then start returning an error if the old format was used.

But we don’t need specific plans to add a version enum to the JSON format used by the HTTP gateway and indexer. We could use the same design as the object mappings in the node.

Maybe you could do it when you change the block number from a string to an integer?

@teor2345 changed the title from "Optimize object mappings and object fetcher for subspace-gateway" to "Reduce the number of pieces downloaded by the object fetcher in subspace-gateway" on Jan 6, 2025

teor2345 commented Jan 8, 2025

Draft Algorithm

Here’s an algorithm for this implementation:

Simple object reconstruction

  1. Download the piece the object starts in
  2. Read the object length from the first 1-4 bytes of the object, downloading another piece if needed, rejecting the length if it is too large
  3. Download enough pieces for the whole object, in parallel
  4. Read the bytes of the object from those pieces in order
  5. Check the hash, returning an error if it is invalid

Object reconstruction with potential segment padding

If the object is near the end of a segment, there might be 0-3 bytes of zero-filled segment padding. So we use a slightly more complex process from step 2 onwards.

  2. Read 1-4 possible lengths of the object, skipping 0-3 bytes of zero padding at the end of the segment, and the parent segment header and block continuation header at the start of the next segment, ignoring invalid lengths, …
  3. Download enough pieces for the smallest length…
  4. Read the bytes of the object…
  5. Check the hash; if it is invalid, go to step 3 with the next highest length
  6. If the hashes for all possible lengths are invalid, return an error

Using this algorithm, for valid requests, we only ever download pieces that contain actual object data. There are no pieces which are only downloaded to determine segment structure, then thrown away.

Security

It is possible for the candidate object lengths to vary by a factor of up to 256. (Since the padding is always zero-valued, any non-zero lengths have to start outside the padding, so those lengths will always have the same number of length bytes.)

This makes the maximum object size into a security parameter, because an attacker can:

  • spam the network with small objects with specifically crafted lengths (2^14 or 2^22), at low cost, until one of those objects' lengths lands in the segment padding
  • request that object with an incorrect hash
  • make the gateway request up to 4 MB or 1 GB before failing with an invalid hash

Using a maximum object size equal to the maximum block size in any domain (5MB) stops the second attack after it has downloaded 6 pieces, which is the standard download for a 4MB object (when it starts at the end of a piece or segment).

I don’t think there’s anything we can do about downloading 4MB (6 pieces) instead of 16 kB (2 pieces). But that’s only an amplification factor of 3x, rather than 256x. Piece caching on a local node might help.

But as far as I know, our current infrastructure doesn’t allow arbitrary mapping requests. So as part of this ticket, I’ll add a security section to the gateway docs, so if anyone else deploys it, they can manage those risks.


nazar-pc commented Jan 8, 2025

I think that'll work

@teor2345 self-assigned this on Jan 10, 2025