Statement timeout during cargo build leads to crash #413

Open
jscheid opened this issue Jul 16, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jscheid

jscheid commented Jul 16, 2024

Hi, thanks for all your hard work on plrust. So this is an interesting one. I've been hitting an "abnormal server process exit" and tracked it down to a statement timeout setting:

I had recently configured a default statement_timeout of 45000 (ms), and when I recompiled one of my plrust functions, it started downloading and recompiling external crates, which took longer than that. Exactly 45 seconds in, the build failed -- presumably because it received some signal from Postgres, which is expected of course -- but shouldn't it exit more cleanly, rather than taking down all worker processes?

Logs
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[...]
PQconsumeInput() ERROR:  
   0: `cargo build` failed

Location:
   /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/convert/mod.rs:716

`cargo build` stderr:
       Updating crates.io index
    Downloading crates ...
     Downloaded thiserror-impl v1.0.62
     Downloaded thiserror v1.0.62
     Downloaded syn v2.0.71
      Compiling thiserror v1.0.62
      Compiling syn v2.0.71
      Compiling getrandom v0.2.15
      Compiling uuid v1.10.0
      Compiling bindgen v0.68.1
      Compiling serde_derive v1.0.204
      Compiling thiserror-impl v1.0.62
      Compiling enum-map-derive v0.17.0
      Compiling enum-map v2.7.3
      Compiling pest v2.7.11
      Compiling semver-parser v0.10.2
      Compiling serde v1.0.204
      Compiling semver v0.11.0
      Compiling rustc_version v0.3.3
      Compiling atomic-traits v0.3.0
      Compiling toml_datetime v0.6.6
      Compiling serde_spanned v0.6.6
      Compiling serde_json v1.0.120
      Compiling toml_edit v0.22.15
      Compiling serde_cbor v0.11.2
      Compiling toml v0.8.14
      Compiling cargo_toml v0.16.3
      Compiling pgrx-pg-config v0.11.0
      Compiling pgrx-pg-sys v0.11.0
      Compiling pgrx v0.11.0
      Compiling plrust-trusted-pgrx v1.2.7


Source Code:
   // SNIP
   // (I have the full error message here and can make it available if needed,
   // but I don't think it matters. The function doesn't declare any additional dependencies.)

	  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
	                                ⋮ 2 frames hidden ⋮
	   3: plrust::user_crate::build::FnBuild::build_internal::hcdeaf7e52dfecfae
	      at <unknown source file>:<unknown line>
	   4: plrust::plrust::compile_function::h90a18482c5d32d1b
	      at <unknown source file>:<unknown line>
	   5: pgrx_pg_sys::submodules::panic::run_guarded::hbd55bed297e7b964
	      at <unknown source file>:<unknown line>
	   6: plrust_validator_wrapper<unknown>
	      at <unknown source file>:<unknown line>
	   7: FunctionCall1Coll<unknown>
	      at <unknown source file>:<unknown line>
	   8: OidFunctionCall1Coll<unknown>
	      at <unknown source file>:<unknown line>
	   9: ProcedureCreate<unknown>
	      at <unknown source file>:<unknown line>
	  10: CreateFunction<unknown>
	      at <unknown source file>:<unknown line>
	  11: standard_ProcessUtility<unknown>
	      at <unknown source file>:<unknown line>
	  12: pgrx_pg_sys::include::pg14::standard_ProcessUtility::he0425f28cedd16c0
	      at <unknown source file>:<unknown line>
	  13: pgrx_pg_sys::submodules::panic::run_guarded::h2575082d6cd79498
	      at <unknown source file>:<unknown line>
	  14: plrust::hooks::plrust_process_utility_hook::h95d150ca0faf3908
	      at <unknown source file>:<unknown line>
	  15: PortalRun<unknown>
	      at <unknown source file>:<unknown line>
	  16: PostgresMain<unknown>
	      at <unknown source file>:<unknown line>
	  17: PostmasterMain<unknown>
	      at <unknown source file>:<unknown line>
	  18: main<unknown>
	      at <unknown source file>:<unknown line>
	  19: __libc_start_main<unknown>
	      at <unknown source file>:<unknown line>
	
	Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
	Run with RUST_BACKTRACE=full to include source snippets.
2024-07-16 12:40:58.612 GMT [217] DETAIL:  
	   0: pgrx_pg_sys::submodules::panic::register_pg_guard_panic_hook::{{closure}}
	   1: std::panicking::rust_panic_with_hook
	   2: std::panicking::begin_panic_handler::{{closure}}
	   3: std::sys_common::backtrace::__rust_end_short_backtrace
	   4: rust_begin_unwind
	   5: core::panicking::panic_fmt
	   6: pgrx_pg_sys::submodules::panic::run_guarded
	   7: plrust_validator_wrapper
	   8: FunctionCall1Coll
	   9: OidFunctionCall1Coll
	  10: ProcedureCreate
	  11: CreateFunction
	  12: <unknown>
	  13: standard_ProcessUtility
	  14: pgrx_pg_sys::include::pg14::standard_ProcessUtility
	  15: pgrx_pg_sys::submodules::panic::run_guarded
	  16: plrust::hooks::plrust_process_utility_hook
	  17: <unknown>
	  18: <unknown>
	  19: PortalRun
	  20: <unknown>
	  21: PostgresMain
	  22: <unknown>
	  23: PostmasterMain
	  24: main
	  25: __libc_start_main
	  26: <unknown>

server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

psql# SELECT version();
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                               version                                                               │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PostgreSQL 14.12 (Debian 14.12-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Also, I'm not sure I fully understand the error message: is it just the one worker process exiting, or is this the equivalent of a full postmaster crash? The mention of possibly corrupted shared memory makes me think it's the latter. Unfortunately I can't test this as reliably anymore because the dependencies are now already downloaded and built (I tested successfully with the statement timeout removed).

(I realized that I can test this easily by removing our prewarming code, and I can confirm that all workers are crashing.)

Now, the most pressing question I'm facing is how worried I should be about the unclean exit. Could you shed some light on this? I can't think of a foolproof way to prevent this scenario from ever occurring in production so it would be good to know what the impact might be in that case.

I'd also like to know whether this has to do with the specific way plrust is set up in our dev env, or whether the same could happen in our RDS production environment. I'll be able to test this more later today, but perhaps you can shed some light on it?


This incident also raised a separate question. You'll notice in the logs that plrust is downloading and building external crates, which is of course why I ran into the timeout in the first place. However, we prime the workdir (along these lines), so this shouldn't reflect initial setup; unless I'm missing something, it must be an implicit (automatic) upgrade of existing dependencies. (The plrust function that the logs are for doesn't specify any additional crates, by the way.)

The fact that it's downloading a brand-new version of thiserror released only a few days ago (long after I built the Docker image in question, and even longer after plrust 1.2.7 was released) also supports this theory.

If this is caused by automatic upgrades, and not due to a misconfiguration (of the plrust installation) in our dev env, I can see a range of potential problems, from supply chain security, to server instability due to bugs in newer crate versions, to fluctuating duration of migrations, and potentially subtle differences in plrust function behavior depending on when exactly they were compiled.

Would this (automatic upgrading of crate versions) have something to do with the way our Dockerfile is installing plrust in the dev env, or would this also be expected behavior on RDS, say? Is there any configuration we could apply to pin crate versions so that they can only be upgraded manually?

jscheid changed the title from "Statement timeout leads to crash" to "Statement timeout during cargo build leads to crash" on Jul 16, 2024
@eeeebbbbrrrr
Contributor

(There's a lot going on in this issue -- in the future, please consider making one issue per topic/idea)

Crashing during a statement_timeout event

This is definitely not by design. :/ We'll have to look into this. It doesn't sound like it'll be hard to recreate.
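For illustration only, here is a minimal Rust sketch of what a cleaner outcome could look like in general: a long-running child process is given a deadline, and hitting that deadline is reported as an ordinary value instead of a panic. This is not plrust's actual build code and it says nothing about the real cause of the crash. `BuildOutcome` and `wait_with_deadline` are hypothetical names, and `sleep 5` merely stands in for a slow `cargo build` on a Unix-like system.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Outcome of waiting on a child process with a deadline (hypothetical type).
#[derive(Debug)]
enum BuildOutcome {
    Finished(ExitStatus),
    TimedOut,
}

/// Poll `child` until it exits or `timeout` elapses. On timeout the child is
/// killed and reaped, and `TimedOut` is returned as a normal value.
fn wait_with_deadline(child: &mut Child, timeout: Duration) -> io::Result<BuildOutcome> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(BuildOutcome::Finished(status));
        }
        if Instant::now() >= deadline {
            // Ignore the kill result: the child may have exited in the meantime.
            let _ = child.kill();
            // Reap the child so it does not linger as a zombie.
            child.wait()?;
            return Ok(BuildOutcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(50));
    }
}

fn main() -> io::Result<()> {
    // `sleep 5` is a stand-in for a slow build (assumes a Unix-like system).
    let mut child = Command::new("sleep").arg("5").spawn()?;
    match wait_with_deadline(&mut child, Duration::from_millis(200))? {
        BuildOutcome::Finished(status) => println!("build finished: {}", status),
        BuildOutcome::TimedOut => eprintln!("build cancelled: deadline exceeded"),
    }
    Ok(())
}
```

The point of the sketch is only that a timeout can be turned into a regular error path that the caller handles; how that would map onto Postgres' own cancellation machinery is a separate question.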

plrust is downloading and building external crates

This is what cargo does. You could adjust the plrust.path_override GUC to get your own "cargo" wrapper script on the PATH that maybe adds --offline or any other arguments you may want. https://plrust.io/config-pg.html#plrustpath_override-string
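As a rough sketch of such a wrapper (written in Rust here, though a shell script would do the same job): a small binary named `cargo`, placed in a directory that plrust.path_override lists first, forwards all arguments unchanged to the real cargo and switches on Cargo's offline mode through the `CARGO_NET_OFFLINE` environment variable instead of adding `--offline` to the argument list. The path in `REAL_CARGO` and the helper name `offline_cargo` are assumptions for illustration, not part of plrust.

```rust
use std::env;
use std::ffi::OsString;
use std::process::{exit, Command};

// Hypothetical absolute path to the real cargo binary; adjust for your image.
// It must be absolute so the wrapper (itself named `cargo`) does not call itself.
const REAL_CARGO: &str = "/usr/local/cargo/bin/cargo";

/// Build a command that runs `real_cargo` with the caller's arguments
/// unchanged, but with Cargo's offline mode enabled via its environment.
fn offline_cargo<I>(real_cargo: &str, args: I) -> Command
where
    I: IntoIterator<Item = OsString>,
{
    let mut cmd = Command::new(real_cargo);
    cmd.args(args);
    // CARGO_NET_OFFLINE corresponds to Cargo's `net.offline` config setting.
    cmd.env("CARGO_NET_OFFLINE", "true");
    cmd
}

fn main() {
    // Forward everything after argv[0] to the real cargo.
    let status = offline_cargo(REAL_CARGO, env::args_os().skip(1)).status();
    match status {
        // Propagate cargo's exit code; treat termination by signal as failure.
        Ok(s) => exit(s.code().unwrap_or(1)),
        Err(e) => {
            eprintln!("cargo wrapper: could not run {}: {}", REAL_CARGO, e);
            exit(127);
        }
    }
}
```

With offline mode on, cargo can only use crates that are already in the local cache, so a build will fail outright if the workdir was not primed with everything it needs.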

I don't have a "best practices" suggestion for you here. All of this really depends on what your requirements are. I do think, however, that pl/rust provides enough knobs for you to steer it the way you need.

@workingjubilee might have some ideas here...

eeeebbbbrrrr added the bug label on Jul 16, 2024
@jscheid
Author

jscheid commented Jul 16, 2024

in the future, please consider making one issue per topic/idea

You're right, my apologies. Would you like me to open a separate issue for discussing the auto-upgrading question?

You could adjust the plrust.path_override GUC

Something tells me that's not possible on RDS, but I'll investigate this later today with our ops team, thanks for the pointers so far.
