Hi, thanks for all your hard work on plrust. This is an interesting one: I've been hitting an "abnormal server process exit" and tracked it down to a statement timeout setting.
I had recently configured a default statement_timeout of 45000 (ms), and when I recompiled one of my plrust functions, it started downloading and recompiling external crates, which took longer than that. Exactly after 45 seconds the build failed -- presumably because it received some signal from Postgres, which is expected of course -- but shouldn't it exit more cleanly, rather than taking down all worker processes?
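For reference, the session-level workaround I used while testing looks roughly like this (a sketch; the function itself is a placeholder, only the timeout handling matters):

```sql
-- Sketch: lift the 45000 ms default only for the session that triggers
-- the (re)compile, so a cold `cargo build` isn't killed mid-build.
BEGIN;
SET LOCAL statement_timeout = 0;  -- 0 disables the timeout for this transaction
-- CREATE OR REPLACE FUNCTION ... LANGUAGE plrust ...  (placeholder)
COMMIT;
```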
Logs
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
[...]
PQconsumeInput() ERROR:
0: `cargo build` failed
Location:
/rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/convert/mod.rs:716
`cargo build` stderr:
Updating crates.io index
Downloading crates ...
Downloaded thiserror-impl v1.0.62
Downloaded thiserror v1.0.62
Downloaded syn v2.0.71
Compiling thiserror v1.0.62
Compiling syn v2.0.71
Compiling getrandom v0.2.15
Compiling uuid v1.10.0
Compiling bindgen v0.68.1
Compiling serde_derive v1.0.204
Compiling thiserror-impl v1.0.62
Compiling enum-map-derive v0.17.0
Compiling enum-map v2.7.3
Compiling pest v2.7.11
Compiling semver-parser v0.10.2
Compiling serde v1.0.204
Compiling semver v0.11.0
Compiling rustc_version v0.3.3
Compiling atomic-traits v0.3.0
Compiling toml_datetime v0.6.6
Compiling serde_spanned v0.6.6
Compiling serde_json v1.0.120
Compiling toml_edit v0.22.15
Compiling serde_cbor v0.11.2
Compiling toml v0.8.14
Compiling cargo_toml v0.16.3
Compiling pgrx-pg-config v0.11.0
Compiling pgrx-pg-sys v0.11.0
Compiling pgrx v0.11.0
Compiling plrust-trusted-pgrx v1.2.7
Source Code:
// SNIP
// (I have the full error message here and can make it available if needed,
// but I don't think it matters. The function doesn't declare any additional dependencies.)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⋮ 2 frames hidden ⋮
3: plrust::user_crate::build::FnBuild::build_internal::hcdeaf7e52dfecfae
at <unknown source file>:<unknown line>
4: plrust::plrust::compile_function::h90a18482c5d32d1b
at <unknown source file>:<unknown line>
5: pgrx_pg_sys::submodules::panic::run_guarded::hbd55bed297e7b964
at <unknown source file>:<unknown line>
6: plrust_validator_wrapper<unknown>
at <unknown source file>:<unknown line>
7: FunctionCall1Coll<unknown>
at <unknown source file>:<unknown line>
8: OidFunctionCall1Coll<unknown>
at <unknown source file>:<unknown line>
9: ProcedureCreate<unknown>
at <unknown source file>:<unknown line>
10: CreateFunction<unknown>
at <unknown source file>:<unknown line>
11: standard_ProcessUtility<unknown>
at <unknown source file>:<unknown line>
12: pgrx_pg_sys::include::pg14::standard_ProcessUtility::he0425f28cedd16c0
at <unknown source file>:<unknown line>
13: pgrx_pg_sys::submodules::panic::run_guarded::h2575082d6cd79498
at <unknown source file>:<unknown line>
14: plrust::hooks::plrust_process_utility_hook::h95d150ca0faf3908
at <unknown source file>:<unknown line>
15: PortalRun<unknown>
at <unknown source file>:<unknown line>
16: PostgresMain<unknown>
at <unknown source file>:<unknown line>
17: PostmasterMain<unknown>
at <unknown source file>:<unknown line>
18: main<unknown>
at <unknown source file>:<unknown line>
19: __libc_start_main<unknown>
at <unknown source file>:<unknown line>
Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
Run with RUST_BACKTRACE=full to include source snippets.
2024-07-16 12:40:58.612 GMT [217] DETAIL:
0: pgrx_pg_sys::submodules::panic::register_pg_guard_panic_hook::{{closure}}
1: std::panicking::rust_panic_with_hook
2: std::panicking::begin_panic_handler::{{closure}}
3: std::sys_common::backtrace::__rust_end_short_backtrace
4: rust_begin_unwind
5: core::panicking::panic_fmt
6: pgrx_pg_sys::submodules::panic::run_guarded
7: plrust_validator_wrapper
8: FunctionCall1Coll
9: OidFunctionCall1Coll
10: ProcedureCreate
11: CreateFunction
12: <unknown>
13: standard_ProcessUtility
14: pgrx_pg_sys::include::pg14::standard_ProcessUtility
15: pgrx_pg_sys::submodules::panic::run_guarded
16: plrust::hooks::plrust_process_utility_hook
17: <unknown>
18: <unknown>
19: PortalRun
20: <unknown>
21: PostgresMain
22: <unknown>
23: PostmasterMain
24: main
25: __libc_start_main
26: <unknown>
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql# SELECT version();
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ version │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 14.12 (Debian 14.12-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Also, I'm not sure I fully understand the error message: is it just the one worker process exiting, or is this the equivalent of a full postmaster crash? The mention of possibly corrupted shared memory makes me think it's the latter. Unfortunately I can't test this as reliably anymore because the dependencies are now already downloaded and built (I tested successfully with the statement timeout removed).
(I realized that I can test this easily by removing our prewarming code, and I can confirm that all workers are crashing.)
Now, the most pressing question I'm facing is how worried I should be about the unclean exit. Could you shed some light on this? I can't think of a foolproof way to prevent this scenario from ever occurring in production, so it would be good to know what the impact might be in that case.
Relatedly: is this down to the specific way plrust is set up in our dev environment, or could the same happen in our RDS production environment? I'll be able to test this more later today, but perhaps you can shed some light on that too.
This incident also raised a separate question: you'll notice in the logs that plrust is downloading and building external crates, which is of course why I ran into the timeout in the first place. However, we're priming the workdir (along these lines), so this wouldn't reflect initial setup; rather, it must be an implicit (automatic) upgrade of existing dependencies, unless I'm missing something. (The plrust function the logs are for doesn't specify any additional crates, by the way.)
The fact that it's downloading a brand-new version of thiserror, released only a few days ago (long after I built the Docker image in question, and even longer after plrust 1.2.7 was released), also supports this theory.
If this is caused by automatic upgrades, and not by a misconfiguration (of the plrust installation) in our dev env, I can see a range of potential problems: supply-chain security, server instability due to bugs in newer crate versions, fluctuating migration durations, and potentially subtle differences in plrust function behavior depending on when exactly a function was compiled.
Would this (automatic upgrading of crate versions) have something to do with the way our Dockerfile installs plrust in the dev env, or would it also be expected behavior on RDS, say? Is there any configuration we could apply to pin crate versions so that they can only be upgraded manually?
jscheid changed the title from "Statement timeout leads to crash" to "Statement timeout during cargo build leads to crash" on Jul 16, 2024.
(There's a lot going on in this issue -- in the future, please consider making one issue per topic/idea.)
> Crashing at a statement_timeout event
This is definitely not by design. :/ We'll have to look into this. It doesn't sound like it'll be hard to recreate.
> plrust is downloading and building external crates
This is what cargo does. You could adjust the plrust.path_override GUC to put your own "cargo" wrapper script on the PATH that adds --offline or any other arguments you may want: https://plrust.io/config-pg.html#plrustpath_override-string
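A minimal sketch of such a wrapper (the directory, the real cargo path, and the exact flags are assumptions -- adapt them to your layout):

```shell
# Sketch: create a "cargo" wrapper in a dedicated directory, then point the
# plrust.path_override GUC at that directory. All paths here are illustrative.
mkdir -p /tmp/plrust-bin

cat > /tmp/plrust-bin/cargo <<'EOF'
#!/bin/sh
# Forward everything to the real cargo (path is an assumption), but forbid
# network access (--offline) and refuse silent lockfile changes (--locked),
# so dependencies can only be upgraded by an explicit, manual step.
exec /usr/local/bin/cargo --offline --locked "$@"
EOF

chmod +x /tmp/plrust-bin/cargo
```

With that in place, something along the lines of `plrust.path_override = '/tmp/plrust-bin:/usr/bin:/bin'` in postgresql.conf should make builds fail fast instead of quietly downloading newer crate versions (again, the path list is illustrative, not a recommendation).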
I don't have a "best practices" suggestion for you here; all of this really depends on what your requirements are. I do think, however, that pl/rust provides enough knobs for you to direct it the way you need.