
Runs are lost if step:complete fails #2943

Open · josephjclark opened this issue Feb 17, 2025 · 11 comments
Labels: bug (Newly identified bug), For Consideration

@josephjclark (Contributor)

This is a cross-repo issue, but I'm starting closest to the user, in Lightning.

We caught a Lost run today (GCP), caused by the following chain of events:

  • step:complete timed out
  • Probably because the step output was too large, at 6MB
  • The subsequent worker events were all rejected because the step hadn't been completed in Lightning
  • So the next step was rejected
  • And any subsequent events have invalid step ids

The result is that all logs and events are rejected and the run is eventually lost. The user is given no feedback.

The worker should recover its capacity though - the engine WILL finish and report back. It's just the Lightning eventing that failed. This is actually a bit bad because the run has probably done everything it meant to do - the workflow actually completed, it just didn't tell anyone at OpenFn.

This isn't actually new - we've seen this before and it's a known issue. I wonder if there's an open issue for it already. We may have thought we'd seen the last of it, having dealt with a bunch of error cases directly, but alas.

Some suggested fixes (we may even want to implement ALL of these):

  • In Lightning, instead of erroring on step:start when the previous step didn't complete, we should write off the previous step as an error and continue accepting logs. I don't quite know what "write off a step as an error but finish the workflow" means. Tell you what, let's Crash the step as "Lost" (will we know why it was lost? can we log "timeout" instead?)
  • Actually: does Lightning know the step:complete event timed out? If it did, it could mark the step as Complete with status Crashed: Timeout there and then. Now the rest of the run should be processed happily, and the user gets pretty good feedback.
  • In the Worker, we should probably Kill the run if a key workflow event times out. Timed-out logs are fine. Note that by the time the event times out and we kill the process, the workflow might have actually finished 🤷 Nothing we can do about that.
  • Maybe we should split step:complete into two events: step:complete sends a dataclip id (a uuid generated by the worker) and returns instantly. Then the worker starts some kind of upload-dataclip event, probably with a long timeout, which goes off and uploads the dataclip for that id. Ideally this is non-blocking, so maybe we just POST it rather than use the websocket. Lightning will need to understand that dataclips can be created in some kind of "pending" state while data is uploaded (see the sketch after this list).
  • I don't think the websocket timeout length is configurable for deployments. It should be. Maybe I want a local deployment with a 1-hour timeout on events - that's my business, right? Do we want this per event? Or do we have a regular timeout and an extended timeout, with two values, where regular lifecycle events must use the regular timeout, but anything with a data payload (logs, complete) uses the extended timeout?
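
To make the "pending dataclip" idea a little more concrete, here's a minimal sketch of what the Lightning side of that split could look like. It is illustrative only: the upload:dataclip event name and the Runs/Dataclips context functions are hypothetical, not the actual Lightning API (the real handler lives in LightningWeb.RunChannel).

    # Sketch only - hypothetical module, event names and context functions.
    defmodule LightningWeb.RunChannelSketch do
      use Phoenix.Channel

      # step:complete no longer carries the output dataclip body, only the id
      # the worker generated, so the reply can go back immediately.
      def handle_in("step:complete", %{"output_dataclip_id" => dataclip_id} = payload, socket) do
        # Hypothetical context calls: finalise the step and create the dataclip
        # row in a "pending" state with no body yet.
        {:ok, _step} = Runs.complete_step(socket.assigns.run, payload)
        {:ok, _dataclip} = Dataclips.create_pending(dataclip_id)
        {:reply, {:ok, %{step_id: payload["step_id"]}}, socket}
      end

      # The body arrives later - over the socket here, but a plain HTTP POST
      # with a long timeout would do - and flips the dataclip to "ready".
      def handle_in("upload:dataclip", %{"id" => id, "body" => body}, socket) do
        {:ok, _dataclip} = Dataclips.attach_body(id, body)
        {:reply, :ok, socket}
      end
    end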
josephjclark added the bug (Newly identified bug) label on Feb 17, 2025
github-project-automation bot moved this to New Issues in v2 on Feb 17, 2025
@josephjclark (Contributor, Author)

@theroinaochieng there will be some discussion around this one and it'll probably spin out into more issues. I think it's very very important but probably not super urgent

@josephjclark (Contributor, Author)

Issue raised by @mtuchi . He has a workaround which should be OK.

@josephjclark (Contributor, Author)

From @stuartc: separate sockets per run (rather than a channel within a shared socket) would reduce the congestion effect of data-upload steps. One big dataclip won't delay messages from concurrent runs (which is likely happening now).

@josephjclark (Contributor, Author)

I believe we should do all the things above (maybe not urgently but soon)

But the immediate error appears to be a timeout in Postgres:

[error] Postgrex.Protocol (#PID<0.756.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2025.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.2025.0> was at location:

    (erts 14.2.5) :prim_inet.recv0/3
    (postgrex 0.19.3) lib/postgrex/protocol.ex:3302: Postgrex.Protocol.msg_recv/4
    (postgrex 0.19.3) lib/postgrex/protocol.ex:2304: Postgrex.Protocol.recv_bind/3
    (postgrex 0.19.3) lib/postgrex/protocol.ex:2196: Postgrex.Protocol.bind_execute/4
    (db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
    (db_connection 2.7.0) lib/db_connection.ex:772: DBConnection.parsed_prepare_execute/5
    (db_connection 2.7.0) lib/db_connection.ex:764: DBConnection.prepare_execute/4
    (postgrex 0.19.3) lib/postgrex.ex:306: Postgrex.query/4
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1150: Ecto.Adapters.SQL.struct/10
    (ecto 3.12.5) lib/ecto/repo/schema.ex:837: Ecto.Repo.Schema.apply/4
    (ecto 3.12.5) lib/ecto/repo/schema.ex:416: anonymous fn/15 in Ecto.Repo.Schema.do_insert/4
    (lightning 2.10.15) lib/lightning/runs/handlers.ex:326: anonymous fn/2 in Lightning.Runs.Handlers.CompleteStep.update_step/2
    (lightning 2.10.15) lib/lightning/repo.ex:31: anonymous fn/1 in Lightning.Repo.transact/2
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1400: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.7.0) lib/db_connection.ex:1756: DBConnection.run_transaction/4
    (lightning 2.10.15) lib/lightning/runs/handlers.ex:314: Lightning.Runs.Handlers.CompleteStep.call/2
    (lightning 2.10.15) lib/lightning_web/channels/run_channel.ex:185: LightningWeb.RunChannel.handle_in/3
    (phoenix 1.7.18) lib/phoenix/channel/server.ex:322: Phoenix.Channel.Server.handle_info/2
    (stdlib 5.2.3) gen_server.erl:1095: :gen_server.try_handle_info/3

Postgres times out on step:complete for large JSON payloads, and that's the error that throws everything else into chaos.
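
For reference, the 15000ms in that error is DBConnection's default query timeout (the :timeout option on the Repo), so one blunt mitigation - assuming it isn't already overridden in Lightning's config, and leaving aside whether a longer timeout is the right fix versus not inserting huge JSON at all - would be something like:

    # config/runtime.exs (sketch): raise the per-query timeout so that inserting
    # a large output_dataclip isn't cut off at the 15_000ms default
    import Config

    config :lightning, Lightning.Repo,
      timeout: 60_000

    # ...or only for the one expensive write, at the call site:
    # Repo.insert(changeset, timeout: 60_000)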

@taylordowns2000 (Member)

this may be fixed by #2682

theroinaochieng moved this from New Issues to In progress in v2 on Feb 19, 2025
@josephjclark (Contributor, Author)

I'm glad the underlying issue is fixed, but I think that everything in the bullet list in the OP should still be done, like, in the next month. Should we split out issues?

@jyeshe (Member) commented Feb 20, 2025

@theroinaochieng, @stuartc cc @taylordowns2000 @josephjclark as requested, I have investigated what else we could do for this issue on Lightning.

I have done an experiment setting this on the RuntimeManager:
[screenshot of the RuntimeManager setting]

and posted a webhook request with a dataclip bigger than 1MB:
curl -H "content-type: application/json" -d @./dataclip.json http://localhost:4000/i/cae544ab-03dc-4ccc-a09c-fb4edb255d7a

the result was that the dataclip was received by Lightning and saved:
[screenshot: dataclip received and saved by Lightning]

A few additional considerations for action points:

  • The dataclip was originally bigger than the limit before the Worker fetched the Run, but on staging and prod the MAX_DATACLIP_SIZE_MB for the Worker and Lightning is the same, so we would be fine at the entry door (on the webhook, before the Run is claimed).
  • If the dataclip grows during the Run execution, I believe the Worker should not send the dataclip that exceeds the limit to Lightning.

On step:complete, Lightning received an output_dataclip of 1000291 bytes with MAX_DATACLIP_SIZE_MB="1":

step_complete_payload: %{
  "duration" => 1162,
  "error_message" => nil,
  "error_type" => nil,
  "job_id" => "c2d15b05-6e42-4f2f-bd95-eb8e8b7d680e",
  "mem" => %{"job" => 29252024, "system" => 164675584},
  # truncated
  "output_dataclip" => "{\"references\":[],\"data\":{\"a\":\"aaaaaaaaaaaaaaaaaa" <> ...,
  "output_dataclip_id" => "5f6c421a-5bd3-4ec2-8e17-802c01dc69ed",
  "reason" => "success",
  "step_id" => "510f8de3-bfe6-4905-85ba-984b02fd6151",
  "thread_id" => 2,
  "timestamp" => "1740055089438224"
}
step_complete_output_dataclip_size: 1000291

@josephjclark (Contributor, Author)

@jyeshe payload sizes in the worker are approximated (and probably not well). I wouldn't be at all surprised if marginal differences produce unexpected behaviour - and I'm not too worried about this either.

I suppose we could mitigate this by enforcing MAX_DATACLIP_SIZE_MB * 0.95 or something to give a margin of error. I forget if we calculate in bytes or bits, but I think it's bytes 🤔 But again I'm not too worried about even a 10% tolerance.

The worker won't do anything about incoming dataclips. It's only when returning dataclips in workflow:complete or log that it'll run the check.

@jyeshe (Member) commented Feb 20, 2025

> @jyeshe payload sizes in the worker are approximated (and probably not well). I wouldn't be at all surprised if marginal differences produce unexpected behaviour - and I'm not too worried about this either.
>
> I suppose we could mitigate this by enforcing MAX_DATACLIP_SIZE_MB * 0.95 or something to give a margin of error. I forget if we calculate in bytes or bits, but I think it's bytes 🤔 But again I'm not too worried about even a 10% tolerance.
>
> The worker won't do anything about incoming dataclips. It's only when returning dataclips in workflow:complete or log that it'll run the check.

Sure, that's why I was referring to the case when the dataclip grows (only on the Runtime) - a clear concern for the output dataclip.

One detail that cannot be seen from what I shared is that there were no non-Latin characters, which I believe is by far the most common string usage until we start with the Nordics (:

This means that the dataclip size becomes much more predictable and MAX_DATACLIP_SIZE_MB reliable without workarounds. Especially because the measurement is about size and not length (length <= size).

@jyeshe (Member) commented Feb 21, 2025

A last bit on how I think Lightning could help here. If we want to enforce MAX_DATACLIP_SIZE_MB with 100% precision - whether or not we apply MAX_DATACLIP_SIZE_MB * 0.95 on the Worker side - in Elixir we can check the byte size in constant time, beautiful huh? Some of the dataclips would already be filtered by the Worker, and those in the ]0.95, inf[ interval (due to characters with accents or symbols) could be discarded or truncated on Lightning.

Instead of enforcing MAX_DATACLIP_SIZE_MB * 0.95, I would only check the length on the Worker and let Lightning filter the exceeding ones, to avoid false positives.
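
To illustrate the length-vs-size point with a throwaway example: for ASCII-only JSON the character count and the byte count coincide, but accented characters push the byte size above the length, and byte_size/1 is constant time on a binary while String.length/1 has to walk the whole string:

    ascii    = ~s({"a":"aaaa"})
    accented = ~s({"a":"áááá"})

    byte_size(ascii)        #=> 12
    String.length(ascii)    #=> 12

    byte_size(accented)     #=> 16  ("á" is 2 bytes in UTF-8)
    String.length(accented) #=> 12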

@josephjclark (Contributor, Author)

@jyeshe Not really sure what to do about this. The issue here isn't really that the worker emitted a payload that was too large (the payloads in question are around 4-6MB, with a limit of 10MB).

Would you like me to raise an issue because a 1000291 byte payload is sent out of a worker with a 10mb limit? Happy to do so. It would be useful to have the payload in question so that we can try to reproduce. I agree that it shouldn't happen but it's gonna be very low priority for me.

This issue needs breaking up into smaller ones that can be addressed and prioritised. I'm much more concerned about how easily a run is lost because of any error on a key event, and I'd be keen to focus on building out robustness and comms on both ends.
