Fix second segment offset in object retrieval, other minor cleanups #3367

Open
teor2345 wants to merge 15 commits into main from evm-obj-minor-cleanups
Conversation

@teor2345 (Member) commented Feb 4, 2025

This PR fixes a bug in the offset of the second segment downloaded during object reconstruction (commit 30b8167). This bug has been in the code since we added incremental segment creation, and removed the segment item count.

It also fixes a test hang in test_domain_tx_propagate.

This PR also contains multiple small cleanups based on these reviews:

And the reviewed cleanups from PR #3362, with changes from these reviews:

Closes #3370.
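For context, here is a minimal sketch of the prefix calculation behind the offset fix (illustrative names, not the PR's actual code; the real diff appears further down in the review thread):

use parity_scale_codec::{Compact, CompactLen};

/// Bytes that precede the first item's data in an encoded segment.
fn segment_prefix_len(item_count: usize) -> usize {
    // Pre-fix assumption: enum variant byte plus a compact-encoded item count.
    let _stale = 1 + Compact::<u64>::compact_len(&(item_count as u64));
    // Since incremental segment creation, segments carry no item count, so the
    // prefix is just the single Segment::V0 variant byte. Using the stale value
    // shifted the offset of the second segment during object reconstruction.
    1
}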

Code contributor checklist:

@teor2345 added the bug ("Something isn't working") and refactor labels Feb 4, 2025
@teor2345 self-assigned this Feb 4, 2025
@teor2345 enabled auto-merge February 4, 2025 23:56
NingLin-P previously approved these changes Feb 5, 2025

@NingLin-P (Member) left a comment:

Makes sense to me, thanks!

vedhavyas previously approved these changes Feb 5, 2025

@vedhavyas (Member) left a comment:

Makes sense. Left a question I'm not sure about.

-let mut progress = 1 + Compact::compact_len(&(items.len() as u64));
+// Unconditional progress is enum variant, always 1 byte in SCALE encoding.
+// (Segments do not have an item count, to make incremental writing easier.)
+let mut progress = 1;
Member:

Are we sure about the item count not being part of the count? I might have mostly missed the PR.

Member:

The archiver (in order to be able to encode incrementally) purposefully doesn't encode the number of elements:

fn encode_to<O: Output + ?Sized>(&self, dest: &mut O) {
    match self {
        Segment::V0 { items } => {
            dest.push_byte(0);
            for item in items {
                item.encode_to(dest);
            }
        }
    }
}
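As a side note, here is a hedged sketch of the decoder-side consequence: since there is no item count, a decoder reads items until the input is exhausted, which is what makes incremental writing possible. This is illustrative code, not the crate's actual Decode impl.

use parity_scale_codec::{Decode, Error, Input};

// Illustrative only: read SCALE-encoded values until the input is exhausted,
// instead of reading a length prefix first.
fn decode_until_end<T: Decode, I: Input>(input: &mut I) -> Result<Vec<T>, Error> {
    let mut items = Vec::new();
    // remaining_len() can be None for streaming inputs; a real decoder would
    // need another end-of-input signal in that case.
    while input.remaining_len()?.unwrap_or(0) > 0 {
        items.push(T::decode(input)?);
    }
    Ok(items)
}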

Member Author (teor2345):

Yep, that was commit 96ed749 in PR #1200.

Here’s a potential test case, which I'll look into today:

2025-02-05T13:46:19.238562Z ERROR subspace_gateway::commands::http::server: Failed to fetch objects hashes=[ca34db057f1305579611948965ce644d2b686b4034f5cf110707df8efb95f28a] err=InvalidDataHash { data_hash: af1349b9f5f9a1a6a0404dea36dcc9499bcb25c9adc112b7cc9a93cae41f3262, data_size: 0, mapping: GlobalObject { hash: ca34db057f1305579611948965ce644d2b686b4034f5cf110707df8efb95f28a, piece_index: PieceIndex(113406), offset: 963327 } }

It's on mainnet, pieceIndex=113406 and offset=963327

nazar-pc previously approved these changes Feb 5, 2025

@nazar-pc (Member) left a comment:

No blockers, makes sense to me

     mapping: GlobalObject,
 ) -> Result<Vec<Piece>, Error> {
-    download_pieces(piece_indexes, &self.piece_getter)
+    download_pieces(piece_indexes.clone(), &self.piece_getter)
Member:

It is unfortunate that we have to clone it just to log it in the error case. I was using Arc<[PieceIndex]> in some places for this reason.

Member Author (teor2345):

I've made this change just in piece_fetcher.rs, and pushed it to this PR.

I tried doing it for all the PieceGetter trait impls, but that made it less ergonomic (commit 0f4bd8d). Do you think we should merge it?
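For illustration, a minimal sketch of the Arc<[PieceIndex]> idea mentioned above (the types and the error variant here are stand-ins, not the actual subspace API): cloning an Arc is a cheap reference-count bump, so the indexes stay available for error reporting without copying the whole list.

use std::sync::Arc;

#[derive(Debug, Clone, Copy)]
struct PieceIndex(u64);
struct Piece;

#[derive(Debug)]
enum Error {
    // The shared slice can be embedded in the error without a deep copy.
    PieceFetch { piece_indexes: Arc<[PieceIndex]> },
}

async fn download_pieces(piece_indexes: Arc<[PieceIndex]>) -> Result<Vec<Piece>, Error> {
    // ... fetch pieces here; on failure, move the Arc into the error ...
    Err(Error::PieceFetch { piece_indexes })
}

async fn fetch(piece_indexes: Vec<PieceIndex>) -> Result<Vec<Piece>, Error> {
    let piece_indexes: Arc<[PieceIndex]> = piece_indexes.into();
    // Arc::clone here copies a pointer, not the index list itself.
    download_pieces(Arc::clone(&piece_indexes)).await
}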

@teor2345 dismissed stale reviews from nazar-pc, vedhavyas, and NingLin-P via c0aa066 February 6, 2025 03:09
@teor2345 force-pushed the evm-obj-minor-cleanups branch from f70a549 to c0aa066 February 6, 2025 03:09
nazar-pc previously approved these changes Feb 6, 2025

@teor2345 (Member Author) left a comment:

I've made the suggested changes.

This PR ran into an intermittent test failure, so I opened ticket #3370, and added timeouts to that test to diagnose why it's failing (commit c0aa066).

@teor2345 requested a review from vedhavyas February 6, 2025 03:14
@teor2345 (Member Author) left a comment:

Windows failed again in the same test, so I increased the test timeout just in case that's the cause. (I don't think it is, but I just want to be extra sure.)

@nazar-pc (Member) commented Feb 6, 2025

Tests are running under nextest, meaning a lot of them run in parallel, and some tests (notably archiving) are heavily multi-threaded, so it is possible for multiple tests to overlap and take a few minutes to run even though they wouldn't take that long individually. Because domain tests essentially run multiple nodes, they are even heavier, and it is not surprising if they take way more than 5 minutes in CI.

Check the logs of previous successful runs of that test to get a rough estimate of how long it actually takes.

@teor2345 (Member Author) commented Feb 6, 2025

> Tests are running under nextest, meaning a lot of them run in parallel, and some tests (notably archiving) are heavily multi-threaded, so it is possible for multiple tests to overlap and take a few minutes to run even though they wouldn't take that long individually. Because domain tests essentially run multiple nodes, they are even heavier, and it is not surprising if they take way more than 5 minutes in CI.
>
> Check the logs of previous successful runs of that test to get a rough estimate of how long it actually takes.

Yes, I know 🙂

That test usually takes 30-180 seconds, so 300 seconds should be plenty. (And if it doubles in time maybe it’s worth investigating.)
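For illustration, a per-test timeout along these lines could look like the following sketch (placeholder future and test body, not the actual test code; the 300-second figure mirrors the estimate above):

use std::time::Duration;
use tokio::time::timeout;

// Placeholder for the real wait inside test_domain_tx_propagate.
async fn wait_for_tx_propagation() {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

#[tokio::test]
async fn test_domain_tx_propagate_with_timeout() {
    // A hang now fails fast with a clear message instead of stalling the CI job.
    timeout(Duration::from_secs(300), wait_for_tx_propagation())
        .await
        .expect("transaction propagation timed out");
}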

@teor2345 (Member Author) commented Feb 6, 2025

Looks like there’s also a real bug here:

thread 'tokio-runtime-worker' panicked at C:\actions-runner\_work\subspace\subspace\domains\client\domain-operator\src\domain_worker.rs:133:13:
all branches are disabled and there is no else branch

https://github.com/autonomys/subspace/actions/runs/13171813519/job/36763423877?pr=3367#step:11:5588

Seems like it only happens on shutdown though. I’ll have a look tomorrow.
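For reference, a minimal sketch of what this panic means and one way to avoid it (the real domain_worker.rs select loop is more involved; the stream names here are illustrative): tokio::select! panics with this message when every branch is disabled, for example when all the streams it polls have ended during shutdown, unless an else branch is provided.

use futures::{Stream, StreamExt};

async fn worker_loop(
    mut imported_blocks: impl Stream<Item = u64> + Unpin,
    mut finalized_blocks: impl Stream<Item = u64> + Unpin,
) {
    loop {
        tokio::select! {
            Some(number) = imported_blocks.next() => {
                // Handle an imported block notification.
                let _ = number;
            }
            Some(number) = finalized_blocks.next() => {
                // Handle a finalized block notification.
                let _ = number;
            }
            // Once both streams return None, every branch above is disabled;
            // without this else branch, select! panics instead of exiting.
            else => break,
        }
    }
}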

@teor2345 (Member Author) left a comment:

This is ready for another review.

I added commits to:

  • fix the shutdown panic in the test
  • un-ban the peer to make the flaky test_domain_tx_propagate test pass (in every test failure I've seen, the peer is banned)

I also tested that the object retrieval fix works. It takes ages because of #3318, but it works.

Labels
bug ("Something isn't working"), refactor
Development

Successfully merging this pull request may close these issues.

test_domain_tx_propagate sometimes hangs because EVM full node is banned
4 participants