Runs are lost if step:complete fails #2943
Comments
@theroinaochieng there will be some discussion around this one and it'll probably spin out into more issues. I think it's very very important but probably not super urgent
Issue raised by @mtuchi. He has a workaround which should be OK.
From @stuartc: separate sockets per run (rather than a channel within a shared socket) would reduce the congestion effect of data upload steps. One big dataclip won't delay messages from concurrent runs (which is likely happening now).
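The congestion effect can be shown with a toy model (illustrative only, not OpenFn code, and the throughput number is an arbitrary assumption): on a shared socket, frames drain sequentially, so a big dataclip holds up every other run's messages, while a per-run socket drains independently.

```typescript
// Toy model of head-of-line blocking on a shared websocket.
type Frame = { run: string; bytes: number };

// Time (ms) at which each run's last frame finishes sending,
// assuming frames drain in order at a fixed throughput.
function finishTimes(queue: Frame[], bytesPerMs: number): Map<string, number> {
  const done = new Map<string, number>();
  let t = 0;
  for (const f of queue) {
    t += f.bytes / bytesPerMs;
    done.set(f.run, t); // last frame per run wins
  }
  return done;
}

// Shared socket: run B's tiny log message sits behind run A's 10MB dataclip.
const shared = finishTimes(
  [{ run: "A", bytes: 10_000_000 }, { run: "B", bytes: 1_000 }],
  1_000 // 1KB/ms, purely illustrative
);
// Separate sockets: run B drains on its own connection.
const separateB = finishTimes([{ run: "B", bytes: 1_000 }], 1_000);

console.log(shared.get("B"));    // 10001 — stuck behind the big upload
console.log(separateB.get("B")); // 1
```

The model is crude, but it captures the asymmetry: the delay on run B is dominated entirely by run A's payload size, which is exactly the "congestion effect" a socket per run would remove.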
I believe we should do all the things above (maybe not urgently, but soon). But the immediate error appears to be a timeout in Postgres:
Postgres times out on
this may be fixed by #2682
I'm glad the underlying issue is fixed, but I think that everything in the bullet list in the OP should still be done, like, in the next month. Should we split out issues?
@theroinaochieng, @stuartc cc @taylordowns2000 @josephjclark as requested, I have investigated what else we could do for this issue on Lightning. I ran an experiment setting this on the RuntimeManager: and posted a webhook request with a dataclip bigger than 1MB: the result was that the dataclip was received by Lightning and saved. A few additional considerations for action points:
On the
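For illustration, a minimal sketch of what a byte-based dataclip limit check could look like. `withinDataclipLimit` and its numbers are invented here; this is not Lightning's actual implementation of MAX_DATACLIP_SIZE_MB.

```typescript
// Hypothetical byte-based size gate for a serialized dataclip.
// Measures UTF-8 bytes, not string length, so multi-byte characters count fully.
function withinDataclipLimit(json: string, maxMb: number): boolean {
  return Buffer.byteLength(json, "utf8") <= maxMb * 1024 * 1024;
}

// Roughly the experiment above: a payload a bit over 1MB.
const oneAndAHalfMb = JSON.stringify({ data: "x".repeat(1_500_000) });

console.log(withinDataclipLimit(oneAndAHalfMb, 1));  // false — over a 1MB cap
console.log(withinDataclipLimit(oneAndAHalfMb, 10)); // true — well under 10MB
```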
@jyeshe payload sizes in the worker are approximated (and probably not well). I wouldn't be at all surprised if marginal differences produce unexpected behaviour, and I'm not too worried about this either. I suppose we could mitigate this by enforcing
The worker won't do anything about incoming dataclips. It's only when returning dataclips in
Sure, that's why I was referring to when the 'dataclip grows' (only on the Runtime). A clear concern for the output dataclip. One detail that cannot be seen from what I shared is that there was no non-latin character, which I believe is the most common string usage by far until we start with the Nordics (: This means that the dataclip size becomes much more predictable and MAX_DATACLIP_SIZE_MB reliable without workarounds. Especially because the measurement is about size and not length (length <= size).
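The size-vs-length point can be shown concretely in Node: `String.prototype.length` counts UTF-16 code units while `Buffer.byteLength` counts UTF-8 bytes, so non-latin characters make byte size grow faster than length (hence length <= size).

```typescript
// Latin-only strings: one byte per character, so length === size.
const latin = "hello";
// Nordic characters (å, æ, ø) each take 2 bytes in UTF-8.
const nordic = "blåbærgrød";

console.log(latin.length, Buffer.byteLength(latin, "utf8"));   // 5 5
console.log(nordic.length, Buffer.byteLength(nordic, "utf8")); // 10 13
```

This is why a limit enforced on bytes behaves differently from one enforced on string length as soon as non-latin data shows up.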
A last bit on how I think
Instead of enforcing
@jyeshe Not really sure what to do about this. The issue here isn't really that the worker emitted a payload that was too large (the payloads in question are around 4-6MB with a limit of 10MB). Would you like me to raise an issue because a
This issue needs breaking up into smaller ones that can be addressed and prioritised. I'm much more concerned about how easily a run is lost because of any error on a key event, and I'd be keen to focus on building out robustness and comms on both ends.
This is a cross-repo issue, but it starts closer to the user, in Lightning.

We caught a Lost run today (GCP) which was caused because:

- `step:complete` timed out

The result is that all logs and events are rejected and the run is eventually lost. The user is given no feedback.

The worker should recover its capacity though - the engine WILL finish and report back. It's just the Lightning eventing that failed. This is actually a bit bad because the run has probably actually done everything it meant to do - the workflow actually completed, it just didn't tell anyone at openfn.

This isn't actually new - we've seen this before and it's a known issue. I wonder if there's something open. We may have thought we'd seen the last of it, having dealt with a bunch of error cases directly, but alas.

Some suggested fixes (we may even want to implement ALL of these):

- On `step:start`, when the previous one didn't complete, we should write off the previous step as an error and continue accepting logs. I don't quite know what "write off a step as error but finish the workflow" means. Tell you what, let's Crash the step as "Lost" (will we know why it was lost? can we log "timeout" instead?)
- What if we could detect that the `step:complete` event timed-out? If it did, it could mark the step as Complete with status Crashed: Timeout there and then. Now the rest of the run should be processed happily, and the user gets pretty good feedback.
- `step:complete` returns a dataclip id (a uuid generated by the worker), and then returns that instantly. Then it starts some kind of `upload-dataclip` event, probably with a long timeout, which can go off and upload the dataclip for that id. Ideally this is non-blocking, so maybe we just POST it rather than use the websocket. Lightning will need to understand that dataclips can be created in some kind of "pending" state while data is uploaded.
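The third suggestion could be sketched roughly as follows. All names here (`onStepComplete`, `uploadDataclip`, the in-memory store standing in for Lightning) are hypothetical; in a real implementation the upload would be an HTTP POST with a long timeout rather than a map mutation.

```typescript
import { randomUUID } from "node:crypto";

type DataclipState = "pending" | "uploaded";
// Stand-in for Lightning's dataclip storage.
const store = new Map<string, { state: DataclipState; body?: string }>();

// step:complete handler: generate an id, record a "pending" dataclip,
// and ack immediately without waiting for the (possibly huge) body.
function onStepComplete(): string {
  const id = randomUUID();
  store.set(id, { state: "pending" });
  return id;
}

// Separate, non-blocking upload keyed by the id from step:complete.
// In the proposal this would be a plain POST, off the shared websocket.
async function uploadDataclip(id: string, body: string): Promise<void> {
  store.set(id, { state: "uploaded", body });
}

// Usage: the ack returns before the body is uploaded.
const id = onStepComplete();
console.log(store.get(id)?.state); // "pending"
uploadDataclip(id, JSON.stringify({ big: "payload" })).then(() => {
  console.log(store.get(id)?.state); // "uploaded"
});
```

The key property is that `step:complete` never blocks on the payload, so a slow or failed upload can no longer time out the event and take the whole run down with it.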