Nodes can receive a wake up violation when they are actually shutting down #29

Open
scottyeager opened this issue Apr 26, 2024 · 3 comments

Comments

@scottyeager

scottyeager commented Apr 26, 2024

I've observed a rare possibility that a node can receive a wake up violation for failing to boot within 30 minutes when the node is in fact shutting down.

Here's the sequence of events:

  1. Node boots due to farmerbot. Upon boot it sends an uptime report resulting in both power_managed and power_managed_boot set to None
  2. But, in the same block as that uptime event, there is also a power target change for Up for this node. Maybe this shouldn't happen in normal circumstances, but it can and actually has. Since the power state for this node is still Down at this point, power_managed_boot will be set
  3. Typically, the node only sets its power state to Up in the block after its first uptime report
  4. There is a power target change to Down for this node more than 30 minutes after the target change to Up
  5. When the node shuts down, it first sets its power state to Down and thus both power_managed and power_managed_boot are not None
  6. Next, the node sends a final uptime report before shutting down (usually in the next block after the power state change). At this point, minting interprets this uptime report as a wake up event and assigns the node a violation
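
The sequence above can be replayed with a small model. This is only a sketch under stated assumptions: `NodeTracker`, its method names, and the timestamps are hypothetical, and the rules are a simplified reading of the steps in this issue, not the actual minting code. It also assumes no other uptime report arrives between steps 3 and 5.

```python
from dataclasses import dataclass
from typing import Optional

BOOT_DEADLINE = 30 * 60  # seconds allowed between a wake-up request and the boot


@dataclass
class NodeTracker:
    """Hypothetical, simplified model of the bookkeeping described in this issue."""
    power_state_up: bool = False
    power_managed: Optional[int] = None       # time the node last set its state to Down
    power_managed_boot: Optional[int] = None  # time of a wake-up request seen while Down
    violations: int = 0

    def uptime_report(self, now: int) -> None:
        # With both markers set, the report is read as the end of a managed boot.
        if self.power_managed is not None and self.power_managed_boot is not None:
            if now - self.power_managed_boot > BOOT_DEADLINE:
                self.violations += 1
        # Assumption: every uptime report clears both markers (step 1).
        self.power_managed = None
        self.power_managed_boot = None

    def power_target_changed(self, now: int, up: bool) -> None:
        # A wake-up request is only recorded while the node still reports Down.
        if up and not self.power_state_up:
            self.power_managed_boot = now

    def power_state_changed(self, now: int, up: bool) -> None:
        self.power_state_up = up
        if not up:
            self.power_managed = now


def replay() -> NodeTracker:
    node = NodeTracker()
    node.uptime_report(0)                    # 1. first report after boot clears both markers
    node.power_target_changed(0, True)       # 2. same block: target Up while state is Down
    node.power_state_changed(6, True)        # 3. next block: state becomes Up
    node.power_target_changed(3600, False)   # 4. target Down, over 30 minutes later
    node.power_state_changed(3606, False)    # 5. state Down: both markers are now set
    node.uptime_report(3612)                 # 6. final report is read as a late wake-up
    return node
```

In this model, `replay().violations` ends up at 1 even though the node booted promptly, because the stale `power_managed_boot` from step 2 is still set when the final uptime report arrives.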

If we accept that it's legitimate to send multiple power target changes until a node wakes up, then this definitely shouldn't result in a violation.

Perhaps the solution would be to reorder the sequence of operations in Zos, but I guess that it was implemented this way for a reason, and of course rolling out changes to Zos is slow.

@LeeSmet
Contributor

LeeSmet commented May 7, 2024

So if I understand correctly: the farmer bot requests a boot by switching the target from down to up. While the node is apparently booting, the bot switches the target back to down, then back to up to request a second boot.

The behavior is correct, since the node did not finish its expected boot sequence for the first request (it must both send an uptime report and switch its power state, the latter only happens if its target is up). When the farmer bot was initially implemented, it was agreed that for verification purposes, a node MUST answer every power on request by fully booting. This is also what underpins the random wakeups.

As a side note, there is no specific ordering of calls in zos atm, and calls from multiple tasks which happen at the same time are inherently racy.

@scottyeager
Author

> So if I understand correctly: the farmer bot requests a boot by switching the target from down to up. While the node is apparently booting, the bot switches the target back to down, then back to up to request a second boot.

The power target isn't switched back to down until after the node has fully booted and set its power state to up. tfchain allows creating additional events that set the power target to up, even when the target is already up. So in this case the bot is repeatedly attempting to wake the node before the node finishes booting.

> As a side note, there is no specific ordering of calls in zos atm, and calls from multiple tasks which happen at the same time are inherently racy.

That's good to know. In my observations, nodes tend to set their power state to up in the block immediately following their first uptime report after waking up.

@LeeSmet
Contributor

LeeSmet commented Jun 5, 2024

In that case this is a bug in tfchain. An event should only be emitted to notify of a change in state (and the state isn't changed here); moreover, the event is explicitly called PowerTargetChanged. I guess it should be easy to update the tfchain code to prevent events from being emitted if there is no actual change. Aside from that, if the intent is that something observes the current state, that something should just query the latest chain state.
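
As a rough illustration of that idea, here is a minimal sketch. The `PowerTargets` class, its methods, and the default target are hypothetical stand-ins for illustration, not tfchain code. It only emits an event when the stored target actually changes, and exposes the current target for anything that wants to observe it.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, str]  # (event name, node id, new target)


class PowerTargets:
    """Hypothetical stand-in for the chain's power target storage; not tfchain code."""

    def __init__(self) -> None:
        self.targets: Dict[int, str] = {}
        self.events: List[Event] = []

    def set_target(self, node_id: int, target: str) -> bool:
        if target not in ("Up", "Down"):
            raise ValueError(f"unknown power target: {target}")
        # Assumption: a node with no stored target counts as Up.
        if self.targets.get(node_id, "Up") == target:
            return False  # no change, so nothing to announce
        self.targets[node_id] = target
        self.events.append(("PowerTargetChanged", node_id, target))
        return True

    def current_target(self, node_id: int) -> str:
        # Observers read the latest state instead of waiting for another event.
        return self.targets.get(node_id, "Up")
```

With this shape, a repeated request to set an already-Up target returns `False` and adds nothing to `events`, so a consumer would never see a second PowerTargetChanged for the same wake-up.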
