Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Sessions] Block hash store does not update correctly #366

Closed
3 tasks
red-0ne opened this issue Feb 6, 2024 · 9 comments
Closed
3 tasks

[Sessions] Block hash store does not update correctly #366

red-0ne opened this issue Feb 6, 2024 · 9 comments
Assignees
Labels
bug Something isn't working session Changes related to Session management

Comments

@red-0ne
Copy link
Contributor

red-0ne commented Feb 6, 2024

Objective

Fix session mismatch errors occurring when RelayRequests are processed at some specific block heights.

Have a flawless relaying process at any time in a session or a rolled-over session and at any RelayRequest volume.

Origin Document

Resulting RelayResponse
image

Sequencer's logs
image

Goals

  • Ensure that no unwanted session mismatch error is occurring during RelayRequest verification that may be caused by GetSession being called at sensitive points in time.
  • Ensure deterministic GetSession queries.

Deliverables

  • A single PR addressing the fact that GetSession may not have access to the stored blockHash used in sessionId construction.

Non-goals / Non-deliverables

  • Implement a new solution for getting past block hashes.
  • Change the Pocket protocol to address the issue.

General deliverables

  • Comments: Add/update TODOs and comments alongside the source code so it is easier to follow.
  • Testing: Add new tests (unit and/or E2E) to the test suite.

Creator: @red-0ne
Co-Owners: @Olshansk

@red-0ne red-0ne added bug Something isn't working session Changes related to Session management labels Feb 6, 2024
@red-0ne red-0ne added this to the Shannon TestNet milestone Feb 6, 2024
@red-0ne red-0ne self-assigned this Feb 6, 2024
@red-0ne red-0ne added this to Shannon Feb 6, 2024
@red-0ne red-0ne moved this to 🔖 Ready in Shannon Feb 6, 2024
@Olshansk Olshansk self-assigned this Feb 6, 2024
@red-0ne
Copy link
Contributor Author

red-0ne commented Feb 6, 2024

It seems that Tendermint is broadcasting new blocks to the off-chain actors (through websocket) and to Cosmos at the same time.

Making the off-chain actors aware of the latest block height N while Cosmos is currently processing block height N.

This results in having the session querier trying to hydrate Sessions for blocks that does not have their hash stored yet (blockHeight == ctx.BlockHeight()).

ctx.BlockHeight() being the block that is currently being processed by Cosmos.

@Olshansk
Copy link
Member

Olshansk commented Feb 6, 2024

It seems that Tendermint is broadcasting new blocks to the off-chain actors (through websocket) and to Cosmos at the same time.

Is this not expected behaviour?

Making the off-chain actors aware of the latest block height N while Cosmos is currently processing block height N.

I looked at the list of events and found this:

Screenshot 2024-02-06 at 12 25 10 PM

This results in having the session querier trying to hydrate Sessions for blocks that does not have their hash stored yet (blockHeight == ctx.BlockHeight()).

Maybe it's as simple as simply storing the hash on EndBlocker instead of BeginBlocker?

@red-0ne
Copy link
Contributor Author

red-0ne commented Feb 6, 2024

@Olshansk,

Is this not expected behaviour?

I assumed that when the off-chain actors receive a committed block, Cosmos (not Tendermint) would have already processed and committed it on its side. Which seems not to be the case.

I looked at the list of events and found this:

Yes these are the Tendermint events, which are sent to off-chain actors via websocket but they are also sent to Cosmos for indexing (apparently this is done concurrently)

Maybe it's as simple as simply storing the hash on EndBlocker instead of BeginBlocker?

Tried it, it yields the same results.


The crafted log below shows us what's happening. It outputs on the sequencer when an off-chain actor queries Cosmos (in this case for session)

The code below is part of the keeper.GetSession callstack.
height is the height that the off-chain actor is aware of and used in its QueryGetSessionRequest
ctx.BlockHeight() is the current (not the last committed) block being processed by Cosmos
image

Sequencer output
image

  1. Cosmos is made aware of block 12.
  2. Off-chain actor is requesting session info (in 3 occurrences):
  • Off-chain actor is already aware of block 12 (from Tendermint) and is requesting it.
  • Cosmos' ctx.BlockHeight() is still 12 currently processing it and GetBlockHash(ctx, 12) returns the empty byte slice.
  • Finished hydrating session ID: f03f1a3..., Cosmos replied with f03f1a3... as sessionID for block 12, which is constructed out of the empty byte slice since it doesn't have block 12 hash yet. (BeginBlocker() has not triggered yet)
  1. Cosmos is now aware of block 13, it gets and stores the hash of block 12.
  2. Off-chain actors re-request the session info for block 12 (The RelayMiner is doing verification)
  • The currently processed block is 13 (ctx.BlockHeight == 13).
  • Finished hydrating session ID: 54f747, Cosmos is now replying with the reight sessionID for block 12.

The blow diagrams shows the assumption of how off-chain and on-chain actors get notified in-parallel about a committed block.

graph TB
  TM[Tendermint]
  COS[Cosmos 'SessionKeeper']
  OFF[Off-chain actor]

  TM -. BroadcastBlock: N .-> COS
  TM -. BroadcastBlock: N .-> OFF

  OFF <-- GetSession: N --> COS
  COS -- StorePrevHash: N-1 --> COS
Loading

@Olshansk
Copy link
Member

Olshansk commented Feb 6, 2024

Some notes regarding context:

  1. We should use the terms CometBFT, not Tendermint.

  2. There is confusing terminology here between "made aware", "indexing", etc...

  • CometBFT is a consensus engine
  • CosmoSDK is a framework communicating w/ CometBFT using ABCI
  • There are only two states: CommitedBlock, ProposedBlock
  1. When EndBlocker is called, the block is committed. When BeginBlocker is called, it is being proposed.

Asks:

  1. I'm still confused w.r.t what you're suggesting as next steps? To me it just seems like an off-by-one error on a wrong handler being called.

  2. In the logs, there are some assumptions (e.g. session generate using nil slice), but I'd ask to update the logs so that's not an assumption.

@red-0ne
Copy link
Contributor Author

red-0ne commented Feb 7, 2024

  1. Fair enough

  2. I still think CosmosSDK (just like the off-chain actors) is being notified by CometBFT about Committed blocks. And the fact that this is happening in parallel is what seems to be causing the issue. This can be seen in the logs by comparing the latest block that the off-chain actor is sending and the block height returned by ctx.BlockHeight().

  3. Yes, in the sense that when EndBlocker is called the block is then committed, so it doesn't change that much to the problem.

  4. I haven't elaborated about the solution here, Will do in a separate comment so we can evaluate its viability.

  5. Will comeback with more logs 👍

@red-0ne
Copy link
Contributor Author

red-0ne commented Feb 7, 2024

Added more logs and put the hash storing procedure in the EndBlock method. This is the result

image

In these logs we see:

  1. When ctx.BlockHeight() == 8, hash of block 7 is stored.

  2. Off-chain requesting hash of block 8:

  • Hash of block 8 is not stored yet, as it will be on block 9.
  • Nonetheless the off-chain actor knows about it (it even knows the hash from CometBFT, although it does not show-up here).
  • The hash returned is empty.
  • hydrateSessionID uses that empty slice to build sessionId 3a5ba9... for SessionStartBlockHeight: 8.
  1. Block 9 is being built and stores block 8 hash.

  2. Off-chain requesting hash of block 8 again (for session verification):

  • Now, the correct hash of block 8 is returned.
  • The hydrated session is constructed with the right block hash and returns a different sessionId b1449b... for SessionStartBlockHeight: 8.

I will challenge our assumption that the block hash we get from ctx.BlockHeader.LastBlockId.Hash corresponds to ctx.BlockHeight() - 1

@red-0ne
Copy link
Contributor Author

red-0ne commented Feb 7, 2024

Happy ending and assumption broken.

ctx.BlockHeader.LastBlockId.Hash is actually the hash of the block height returned by ctx.BlockHeight() and not ctx.BlockHeight() - 1.

Will open a PR fixing the session mismatch issue and adjusting comments. I'll also make sure ctx.BlockHeight() is used appropriately over the codebase.

@red-0ne red-0ne moved this from 🔖 Ready to 👀 In review in Shannon Feb 7, 2024
@Olshansk
Copy link
Member

Olshansk commented Feb 7, 2024

@red-0ne Left a few notes on #368. Let's prioritize figuring it out and merging it in.

@red-0ne
Copy link
Contributor Author

red-0ne commented Feb 12, 2024

Closed as per #368

@red-0ne red-0ne closed this as completed Feb 12, 2024
@github-project-automation github-project-automation bot moved this from 👀 In review to ✅ Done in Shannon Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working session Changes related to Session management
Projects
Status: ✅ Done
Development

No branches or pull requests

2 participants