
Runs are lost if step:complete fails #2943

Open · josephjclark opened this issue Feb 17, 2025 · 11 comments
Labels: bug (Newly identified bug), For Consideration

@josephjclark (Contributor)

This is a cross-repo issue, but I'm starting closest to the user, in Lightning.

We caught a Lost run today (GCP), caused by the following chain of events:

  • step:complete timed out
  • Probably because the step output was too large, at 6MB
  • The subsequent worker events were all rejected because the step hadn't been completed in Lightning
  • So the next step was rejected
  • And any subsequent events have invalid step ids

The result is that all logs and events are rejected and the run is eventually lost. The user is given no feedback.

The worker should recover its capacity though - the engine WILL finish and report back. It's just the Lightning eventing that failed. This is actually a bit bad because the run has probably done everything it meant to do - the workflow actually completed, it just didn't tell anyone at OpenFn.

This isn't actually new - we've seen this before and it's a known issue. I wonder if there's an open issue for it already. We may have thought we'd seen the last of it, having dealt with a bunch of error cases directly, but alas.

Some suggested fixes (we may even want to implement ALL of these):

  • In Lightning, instead of erroring on step:start when the previous step didn't complete, we should write off the previous step as an error and continue accepting logs. I don't quite know what "write off a step as an error but finish the workflow" means. Tell you what, let's Crash the step as "Lost" (will we know why it was lost? can we log "timeout" instead?)
  • Actually: does Lightning know the step:complete event timed out? If it did, it could mark the step as Complete with status Crashed: Timeout there and then. Now the rest of the run should be processed happily, and the user gets pretty good feedback.
  • In the Worker, we should probably Kill the run if a key workflow event times out. Timed-out logs are fine. Note that by the time the event times out and we kill the process, the workflow might have actually finished 🤷 Nothing we can do about that.
  • Maybe we should split step:complete into two events: step:complete sends a dataclip id (a uuid generated by the worker) and returns instantly. Then the worker starts some kind of upload-dataclip event, probably with a long timeout, which goes off and uploads the dataclip for that id. Ideally this is non-blocking, so maybe we just POST it rather than use the websocket. Lightning will need to understand that dataclips can be created in some kind of "pending" state while data is uploaded (see the sketch after this list).
  • I don't think the websocket timeout length is configurable for deployments. It should be. Maybe I want a local deployment with a 1-hour timeout on events - that's my business, right? Do we want this per event? Or do we have a regular timeout and an extended timeout, with two values, where regular lifecycle events must use the regular timeout, but anything with a data payload (logs, complete) uses the extended timeout?
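
To make the "pending dataclip" idea a little more concrete, here's a minimal sketch of what the Lightning side of that split could look like. It is illustrative only: the upload:dataclip event name and the Runs/Dataclips context functions are hypothetical, not the actual Lightning API (the real handler lives in LightningWeb.RunChannel).

    # Sketch only - hypothetical module, event names and context functions.
    defmodule LightningWeb.RunChannelSketch do
      use Phoenix.Channel

      # step:complete no longer carries the output dataclip body, only the id
      # the worker generated, so the reply can go back immediately.
      def handle_in("step:complete", %{"output_dataclip_id" => dataclip_id} = payload, socket) do
        # Hypothetical context calls: finalise the step and create the dataclip
        # row in a "pending" state with no body yet.
        {:ok, _step} = Runs.complete_step(socket.assigns.run, payload)
        {:ok, _dataclip} = Dataclips.create_pending(dataclip_id)
        {:reply, {:ok, %{step_id: payload["step_id"]}}, socket}
      end

      # The body arrives later - over the socket here, but a plain HTTP POST
      # with a long timeout would do - and flips the dataclip to "ready".
      def handle_in("upload:dataclip", %{"id" => id, "body" => body}, socket) do
        {:ok, _dataclip} = Dataclips.attach_body(id, body)
        {:reply, :ok, socket}
      end
    end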
josephjclark added the bug (Newly identified bug) label on Feb 17, 2025
github-project-automation bot moved this to New Issues in v2 on Feb 17, 2025
@josephjclark (Contributor, Author)

@theroinaochieng there will be some discussion around this one and it'll probably spin out into more issues. I think it's very very important but probably not super urgent

@josephjclark (Contributor, Author)

Issue raised by @mtuchi . He has a workaround which should be OK.

@josephjclark (Contributor, Author)

From @stuartc: separate sockets per run (rather than a channel within a shared socket) would reduce the congestion effect of data-upload steps. One big dataclip won't delay messages from concurrent runs (which is likely happening now).

@josephjclark (Contributor, Author)

I believe we should do all the things above (maybe not urgently but soon)

But the immediate error appears to be a timeout in Postgres:

[error] Postgrex.Protocol (#PID<0.756.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.2025.0> timed out because it queued and checked out the connection for longer than 15000ms

#PID<0.2025.0> was at location:

    (erts 14.2.5) :prim_inet.recv0/3
    (postgrex 0.19.3) lib/postgrex/protocol.ex:3302: Postgrex.Protocol.msg_recv/4
    (postgrex 0.19.3) lib/postgrex/protocol.ex:2304: Postgrex.Protocol.recv_bind/3
    (postgrex 0.19.3) lib/postgrex/protocol.ex:2196: Postgrex.Protocol.bind_execute/4
    (db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
    (db_connection 2.7.0) lib/db_connection.ex:772: DBConnection.parsed_prepare_execute/5
    (db_connection 2.7.0) lib/db_connection.ex:764: DBConnection.prepare_execute/4
    (postgrex 0.19.3) lib/postgrex.ex:306: Postgrex.query/4
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1150: Ecto.Adapters.SQL.struct/10
    (ecto 3.12.5) lib/ecto/repo/schema.ex:837: Ecto.Repo.Schema.apply/4
    (ecto 3.12.5) lib/ecto/repo/schema.ex:416: anonymous fn/15 in Ecto.Repo.Schema.do_insert/4
    (lightning 2.10.15) lib/lightning/runs/handlers.ex:326: anonymous fn/2 in Lightning.Runs.Handlers.CompleteStep.update_step/2
    (lightning 2.10.15) lib/lightning/repo.ex:31: anonymous fn/1 in Lightning.Repo.transact/2
    (ecto_sql 3.12.1) lib/ecto/adapters/sql.ex:1400: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.7.0) lib/db_connection.ex:1756: DBConnection.run_transaction/4
    (lightning 2.10.15) lib/lightning/runs/handlers.ex:314: Lightning.Runs.Handlers.CompleteStep.call/2
    (lightning 2.10.15) lib/lightning_web/channels/run_channel.ex:185: LightningWeb.RunChannel.handle_in/3
    (phoenix 1.7.18) lib/phoenix/channel/server.ex:322: Phoenix.Channel.Server.handle_info/2
    (stdlib 5.2.3) gen_server.erl:1095: :gen_server.try_handle_info/3

Postgres times out on step:complete for large JSON payloads, and that's the error that throws everything else into chaos.
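
For reference, the 15000ms in that error is DBConnection's default query timeout (the :timeout option on the Repo), so one blunt mitigation - assuming it isn't already overridden in Lightning's config, and leaving aside whether a longer timeout is the right fix versus not inserting huge JSON at all - would be something like:

    # config/runtime.exs (sketch): raise the per-query timeout so that inserting
    # a large output_dataclip isn't cut off at the 15_000ms default
    import Config

    config :lightning, Lightning.Repo,
      timeout: 60_000

    # ...or only for the one expensive write, at the call site:
    # Repo.insert(changeset, timeout: 60_000)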

@taylordowns2000 (Member)

this may be fixed by #2682

theroinaochieng moved this from New Issues to In progress in v2 on Feb 19, 2025
@josephjclark (Contributor, Author)

I'm glad the underlying issue is fixed, but I think that everything in the bullet list in the OP should still be done, like, in the next month. Should we split out issues?

@jyeshe (Member) commented Feb 20, 2025

@theroinaochieng, @stuartc cc @taylordowns2000 @josephjclark as requested, I have investigated what else we could do for this issue on Lightning.

I have done an experiment setting this on the RuntimeManager:
[screenshot of the RuntimeManager setting]

and posted a webhook request with a dataclip bigger than 1MB:
curl -H "content-type: application/json" -d @./dataclip.json http://localhost:4000/i/cae544ab-03dc-4ccc-a09c-fb4edb255d7a

the result was that the dataclip was received by Lightning and saved:
[screenshot: dataclip received and saved by Lightning]

A few additional considerations for action points:

  • The dataclip was originally bigger than the limit before the Worker fetched the Run, but on staging and prod the MAX_DATACLIP_SIZE_MB for the Worker and Lightning is the same, so we would be fine at the entry door (on the webhook, before the Run is claimed).
  • If the dataclip grows during the Run execution, I believe the Worker should not send the dataclip that exceeds the limit to Lightning.

On step:complete, Lightning received an output_dataclip of 1000291 bytes with MAX_DATACLIP_SIZE_MB="1":

step_complete_payload: %{
  "duration" => 1162,
  "error_message" => nil,
  "error_type" => nil,
  "job_id" => "c2d15b05-6e42-4f2f-bd95-eb8e8b7d680e",
  "mem" => %{"job" => 29252024, "system" => 164675584},
  # truncated
  "output_dataclip" => "{\"references\":[],\"data\":{\"a\":\"aaaaaaaaaaaaaaaaaa" <> ...,
  "output_dataclip_id" => "5f6c421a-5bd3-4ec2-8e17-802c01dc69ed",
  "reason" => "success",
  "step_id" => "510f8de3-bfe6-4905-85ba-984b02fd6151",
  "thread_id" => 2,
  "timestamp" => "1740055089438224"
}
step_complete_output_dataclip_size: 1000291

@josephjclark (Contributor, Author)

@jyeshe payload sizes in the worker are approximated (and probably not well). I wouldn't be at all surprised if marginal differences produce unexpected behaviour - and I'm not too worried about this either.

I suppose we could mitigate this by enforcing MAX_DATACLIP_SIZE_MB * 0.95 or something to give a margin of error. I forget if we calculate in bytes or bits, but I think it's bytes 🤔 But again I'm not too worried about even a 10% tolerance.

The worker won't do anything about incoming dataclips. It's only when returning dataclips in workflow:complete or log that it'll run the check.

@jyeshe (Member) commented Feb 20, 2025

> @jyeshe payload sizes in the worker are approximated (and probably not well). I wouldn't be at all surprised if marginal differences produce unexpected behaviour - and I'm not too worried about this either.
>
> I suppose we could mitigate this by enforcing MAX_DATACLIP_SIZE_MB * 0.95 or something to give a margin of error. I forget if we calculate in bytes or bits, but I think it's bytes 🤔 But again I'm not too worried about even a 10% tolerance.
>
> The worker won't do anything about incoming dataclips. It's only when returning dataclips in workflow:complete or log that it'll run the check.

Sure, that's why I was referring to the case when the dataclip grows (only on the Runtime) - a clear concern for the output dataclip.

One detail that cannot be seen from what I shared is that there were no non-Latin characters, which I believe is by far the most common string usage until we start with the Nordics (:

This means that the dataclip size becomes much more predictable and MAX_DATACLIP_SIZE_MB reliable without workarounds. Especially because the measurement is about size and not length (length <= size).

@jyeshe (Member) commented Feb 21, 2025

A last bit on how I think Lightning could help here. If we want to enforce MAX_DATACLIP_SIZE_MB with 100% precision - whether or not we apply MAX_DATACLIP_SIZE_MB * 0.95 on the Worker side - in Elixir we can check the byte size in constant time, beautiful huh? Some of the dataclips would already be filtered by the Worker, and those in the ]0.95, inf[ interval (due to characters with accents or symbols) could be discarded or truncated on Lightning.

Instead of enforcing MAX_DATACLIP_SIZE_MB * 0.95, I would only check the length on the Worker and let Lightning filter the exceeding ones, to avoid false positives.
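
To illustrate the length-vs-size point with a throwaway example: for ASCII-only JSON the character count and the byte count coincide, but accented characters push the byte size above the length, and byte_size/1 is constant time on a binary while String.length/1 has to walk the whole string:

    ascii    = ~s({"a":"aaaa"})
    accented = ~s({"a":"áááá"})

    byte_size(ascii)        #=> 12
    String.length(ascii)    #=> 12

    byte_size(accented)     #=> 16  ("á" is 2 bytes in UTF-8)
    String.length(accented) #=> 12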

@josephjclark (Contributor, Author)

@jyeshe Not really sure what to do about this. The issue here isn't really that the worker emitted a payload that was too large (the payloads in question are around 4-6MB, with a limit of 10MB).

Would you like me to raise an issue because a 1000291 byte payload is sent out of a worker with a 10mb limit? Happy to do so. It would be useful to have the payload in question so that we can try to reproduce. I agree that it shouldn't happen but it's gonna be very low priority for me.

This issue needs breaking up into smaller ones that can be addressed and prioritised. I'm much more concerned about how easily a run is lost because of any error on a key event, and I'd be keen to focus on building out robustness and comms on both ends.
