wasmtime: Check stack limits only on exit from wasm #9093

jameysharp · 2024-08-08T19:40:00Z

Currently, we do an explicit check for stack overflow on entry to every WebAssembly function. But that costs some time, and is a significant performance hit for very short functions.

This commit instead switches Wasmtime to relying on guard pages at the end of the stack to catch stack overflow, so the MMU does this check for "free". This means we may allow deeper recursion in guest code than we did before.

To make this work, we need Wasmtime's signal handlers to recognize when a guest memory fault is in a stack guard page and report the appropriate stack-overflow trap code.

Note that we can't turn host-code signals into guest traps, so the signal handlers have to verify that the signal occurred in guest code.

When the guest calls host code (explicitly due to calling an imported host function, or implicitly due to a libcall inserted by Wasmtime or Cranelift), we also need to ensure that there is enough stack space available for the host code to not hit the guard pages. We do that by checking the stack limit that the embedder provided in the trampolines where we exit wasm.

I've been laid off from Fastly so I won't be pursuing this further, but I feel good about the work I put into it so far and want to leave it here in case anyone else wants to pick it up.

sunfishcode · 2024-12-13T20:40:40Z

I've now rebased this and ported it to main. There some test failures at the moment.

thread 'stack_overflow::host_always_has_some_stack' has overflowed its stack
fatal runtime error: stack overflow

which I'm looking into.

github-actions · 2024-12-13T21:44:35Z

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "pulley", "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

fitzgen: pulley

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

sunfishcode · 2024-12-18T22:43:43Z

The stack overflow in the host appears to be because the stack limit checks are in the trampolines, however host libcalls don't use trampolines. The testcase disables sse4_1 and does f32.ceil, which causes it to get a host libcall, so it calls from guest to host without a trampoline, and therefore without checking the stack limit.

cfallin · 2024-12-18T23:04:54Z

I remember talking about the libcall problem at the time with @jameysharp but have only vague recollections of the various answers we came up with; but IIRC we did decide that recognizing the overflow by fault address in guard page was more reliable than trying to deduce fault reason from PC (because we don't know the first instruction that might touch the guard page in the <1 page frame size case); I see this PR does that with some logic checking if the fault address is within a certain distance of SP; I remember us realizing that this might also (should also?) work for libcalls because natively compiled code is also compatible with a guard-page scheme. So I guess I'm curious: what's the reason that this condition isn't triggering in this case?

(Apologies if I've lost too much context -- it's been a long time and I skimmed this PR again but there are many subtleties!)

fitzgen · 2024-12-18T23:08:33Z

Also it seems like libcalls should probably do stack checks as well, since I would be hesitant to try and recover from runtime code hitting the guard page, so it might make sense to make the locals use trampolines like imports do.

alexcrichton · 2024-12-18T23:34:04Z

If it helps I've got an old wasmtime branch which removed the stack limit checks from functions through generating trampolines for all libcalls. I never pushed that over the finish line though in terms of polish and testing

jameysharp · 2024-12-19T18:26:52Z

I've lost the context I had on this.

I believe Wasmtime libcalls work correctly in this PR, because they use Wasmtime's trampolines, which this PR makes do stack limit checks. It's only the Cranelift libcalls that still don't work.

There was some reason why we can only safely use the guard page if we see that the faulting instruction is in Cranelift-compiled code. I think because our only recourse in the signal handler is to fully unwind the stack, and we can't trigger Rust panic unwinding let alone any libc cleanup, so there's no telling what invariants are broken if we just rewind the stack in native code. But the specific way we use Cranelift is always safe to just drop everything at any time.

I remember at one point intending to add support to Cranelift for inserting a stack limit check immediately before any libcall, or insert it in the prologue like today but only if some libcalls appear somewhere in the function. I don't remember if that was the last plan I had in mind or if something had superceded it.

I definitely carefully considered Alex's old branch. I only found out about it after landing on the idea of checking the stack pointer in the signal handler, and it was reassuring to see we'd both landed in the same place there.

I was not comfortable with the specific mechanism in that branch for getting stack limit checks into Cranelift libcalls.

The idea there was that Cranelift makes us supply the function for it to call, so let's supply a function that does a stack limit check.
But we need the vmctx to find the stack limit, and Cranelift doesn't pass that to libcalls. (I discussed changing that with Chris, IIRC; I think we decided not to.)
So Alex needed to look up the vmctx again, from thread-locals. Nobody wants to do that from Cranelift, so we need to call into Rust to do the thread-local lookup.
But we need to do a stack limit check before calling into Rust!
We can assume some upper bound on how much stack we'll need for the thread-local lookup. So Alex did something I think involving synthesizing a Cranelift function that would touch the stack some distance down; to hit the guard page and trigger the signal handler, maybe? And it was recursive for reasons I didn't understand, doing this four times or something.
If that function returns then it's safe to call the Rust function to get the vmctx, then do the stack limit check by hand, then call the real libcall. These steps can be represented in CLIF, so the signal handler protects this function too.

End result was calling two extra Cranelift functions plus a Rust function before the real libcall could run, and that felt like a lot of trouble to me.

I hope this helps!

sunfishcode · 2024-12-20T01:57:44Z

@jameysharp Thanks, that's useful context!

@alexcrichton Thanks for that branch. It's worth considering that approach, however if possible, I'd like to try taking an approach where we just pass vmctx to the libcalls and use trampolines, so that they work the same as other Wasmtime builtins, rather than having the libcall code do special things.

I've now added patches to experiment with making the calls to ceil/floor/etc. work like normal Wasmtime builtins, meaning they get a vmctx and a trampoline in the normal way. This involves handling them in the Wasm-to-Cranelift translator, which is a little annoying, but has the nice side effect of eliminating a whole separate relocation mechanism from Wasmtime, so I think it's an overall simplification. See the changes in crates/cranelift/src/translate/code_translator.rs for the place where this hooks in.

There's a test failure in host_sefault.rs. The test attempts to trigger a crash by deliberately misconfiguring the host, and with this patch, we interpret the crash as a Wasm trap, which causes the test to fail. I need to investigate more.

There are lots of Pulley failures still. I haven't yet looked at how all this interacts with Pulley.

And the patch is a bit rough still. I need to tidy up all the loose ends around x86_pshufb and __m128i and things, and some bits of unnecessary relocation code left behind, and probably other things too.

alexcrichton · 2024-12-20T16:46:38Z

This involves handling them in the Wasm-to-Cranelift translator, which is a little annoying, but has the nice side effect of eliminating a whole separate relocation mechanism from Wasmtime, so I think it's an overall simplification

Oh I really like that strategy, much better than the hack I had 👍

There are lots of Pulley failures still. I haven't yet looked at how all this interacts with Pulley.

Feel free to ignore any pulley failures where possible, but if that's causing issues I can help take a deeper look too!

sunfishcode · 2024-12-21T01:00:03Z

The patch is now tidied up more. I still need to fix the host_segfault.rs test, and there's a todo to figure out the stack pointer on s390x, but I think it's otherwise in pretty good shape.

@alexcrichton The CI shows a lot of pulley failures; any help you could provide with those would be appreciated!

alexcrichton · 2025-01-06T15:42:13Z

Ah I think what's happening there is that the pulley stack is different from the native stack so the limit recorded when we enter wasm isn't actually the limit that needs to be checked by wasm itself. I think an easy fix though would be to skip stack limit checks on Pulley entirely since that already has to check all decrements manually anyway. (basically just generate different CLIF for pulley targets)

Nothing uses this yet but I want to get collection of this register right on our many supported platforms before introducing the changes that need it.

Currently, we do an explicit check for stack overflow on entry to every WebAssembly function. But that costs some time, and is a significant performance hit for very short functions. This commit instead switches Wasmtime to relying on guard pages at the end of the stack to catch stack overflow, so the MMU does this check for "free". This means we may allow deeper recursion in guest code than we did before. To make this work, we need Wasmtime's signal handlers to recognize when a guest memory fault is in a stack guard page and report the appropriate stack-overflow trap code. Note that we can't turn host-code signals into guest traps, so the signal handlers have to verify that the signal occurred in guest code. When the guest calls host code (explicitly due to calling an imported host function, or implicitly due to a libcall inserted by Wasmtime or Cranelift), we also need to ensure that there is enough stack space available for the host code to not hit the guard pages. We do that by checking the stack limit that the embedder provided in the trampolines where we exit wasm.

And reinstate the old stack limit checks for when signal handling is disabled.

sunfishcode · 2025-01-17T00:39:42Z

I've now figured got all the pulley interactions untangled, and fixed the host_segfault.rs test in a defensible way, and that fixes all the remaining test failures, and I believe this is now ready for review.

As a recap of what's here:

This PR eliminates all Cranelift-generated builtins from Wasmtime, and deletes the associated relocation handling. For things like floor/ceil/etc. without sse4 and vector swizzles on x86 without pshufb, Wasmtime inserts libcalls in the Wasm-to-cranelift translator instead, so that they work like all other libcalls in Wasmtime. That's a nice simplification, and it helps this PR by ensuring that all libcalls go through the same trampoline mechanism, which can ensure that we handle stack overflow for them consistently.
This PR switches Wasmtime to rely on the stack guard pages for catching most stack overflows, when signal-handling is available, and in this mode it only does explicit checks on exits from Wasm code, such as trampoline calls into host functions. This reduces the overhead of entering Wasm because we no longer need explicit stack overflow checks on entrypoints.

prtest:full

alexcrichton

Thanks again for continuing to push on this! I've left a number of comments below, although nothing too major. I've additionally pushed a commit with prtest:full (feel free to remove that and push your own) to run the full test suite before getting to the merge queue.

Some high-level questions though:

I may have missed it during this review, but IIRC we currently have inline stack probes turned off and this PR will require us to turn them back on, right?
Could you review the stack overflow tests we currently have and ensure they're testing all the various bits necessary to stress this new scheme?

For testing and example is that the existing tests don't stress this PR. At each level of recursion they call out to the host meaning that we'll never get close to the guard page. IIRC I rewrote that test at some point to instead perform N recursive calls and then call the host, after which the test would do a bit of a search to figure out what N causes stack overflow and it would test N-10..N to ensure that stack overflow wouldn't happen (or something like that). I never felt happy enough with the test coverage to land something, but regardless I feel that our current testing of stack overflow is insufficient to land this since I think that test doesn't ever hit the guard page.

alexcrichton · 2025-01-17T15:39:35Z

crates/cranelift/src/compiler.rs

+            // abort for the whole program since the runtime limits configured by
+            // the embedder should cause wasm to trap before it reaches that
+            // (ensuring the host has enough space as well for its functionality).
+            if !isa.triple().is_pulley() {


Could the is_pulley check get lifted to the outer conditional to avoid two levels of nesting?

alexcrichton · 2025-01-17T15:41:41Z

crates/cranelift/src/compiler.rs

+        // If we're not catching traps with signals, insert checks at the
+        // beginnings of functions.
+        if !self.tunables.signals_based_traps {
+            // The `stack_limit` global value below is the implementation of stack


If you're up for it, and it's ok to defer this to later, our stack overflow story is complicated enough it might be worth having a dedicated page in docs/*.md to talk about. Just for internal documentation, not necessarily user facting, but I feel like we've got a lot of scattered bits about stack overflow in various places as our strategy has evolved over time. For example the "The Wasm spec..." bit here is duplicated with save_last_wasm_exit_fp_and_pc below and neither gives a complete picture of how stack overflow is handled. In reality to understand how stack overflow works it requires understanding lots of locations, which is where I think a docs/*.md would be good that every place could point back to.

I realize though that reorganizing docs is mostly orthogonal to this PR, so I think it's ok to defer this to later if you'd prefer to not tackle this now.

alexcrichton · 2025-01-17T15:42:51Z

crates/wasmtime/src/runtime/trap.rs

@@ -104,8 +104,10 @@ pub(crate) fn from_runtime_box(
            // then simultaneously assert that it's within a known linear memory
            // and additionally translate it to a wasm-local address to be added
            // as context to the error.
-            if let Some(fault) = faulting_addr.and_then(|addr| store.wasm_fault(pc, addr)) {
-                err = err.context(fault);
+            if trap != Trap::StackOverflow {


Could you add some comments as to why stack overflow traps are ignored here?

alexcrichton · 2025-01-17T15:44:08Z

crates/wasmtime/src/runtime/vm.rs

+#[cfg(not(target_arch = "x86_64"))]
+#[allow(non_camel_case_types)]
+#[derive(Copy, Clone)]
+pub(crate) struct i8x16(());


Would it be possible to drop this type definition entirely? Ideally we do want this to be __m128i on x86 because that is guaranteed to be an xmm register in the ABI. I think this might be possible by putting #[cfg(target_arch = "x86_64")] on the definitions of the intrinsics themselves? (IIRC the macro already supports feature-gated intrinsics so I think it should support platform-specific intrinsics as well)

alexcrichton · 2025-01-17T15:45:52Z

crates/wasmtime/src/runtime/vm/interpreter.rs

@@ -299,6 +303,9 @@ impl InterpreterRef<'_> {
            (@ty i32) => (i32);
            (@ty u64) => (u64);
            (@ty i64) => (i64);
+            (@ty f32) => (f32);
+            (@ty f64) => (f64);
+            (@ty i8x16) => (i8x16);


How does this end up working? I would have thought that i8x16 being empty or __m128i are both incompatible with Pulley's [i8; 16] definition...

Oh I see the pulley changes down below...

Could this call macro be updated on Pulley to just skip the shuffle/swizzle intrinsics entirely? That way it'd fall into a panic at the bottom of this function. That would also avoid the need to update pulley with methods that can't be supported and are just fixing the build for these bits too

alexcrichton · 2025-01-17T15:48:12Z

crates/wasmtime/src/runtime/vm/libcalls.rs

+#[cfg(not(target_arch = "x86_64"))]
+fn i8x16_swizzle(
+    _store: &mut dyn VMStore,
+    _instance: &mut Instance,
+    _a: i8x16,
+    _b: i8x16,
+) -> i8x16 {
+    unreachable!()
+}


For this my hope is that with a #[cfg] on the intrinsic definition we could avoid needing to define this entirely

alexcrichton · 2025-01-17T15:49:45Z

crates/wasmtime/src/runtime/vm/sys/unix/signals.rs

@@ -302,48 +305,56 @@ unsafe fn get_trap_registers(cx: *mut libc::c_void, _signum: libc::c_int) -> Tra
            TrapRegisters {
                pc: (cx.uc_mcontext.psw.addr - trap_offset) as usize,
                fp: *(cx.uc_mcontext.gregs[15] as *const usize),
+                sp: todo!("what's sp on s390x?"),


I suspect this won't be able to get past CI, so I pushed a commit to this PR with prtest:full to run the full test suite to help catch this earlier

alexcrichton · 2025-01-17T15:50:43Z

crates/wasmtime/src/runtime/vm/traphandlers.rs

+        let stack_overflow_trap = if let Some(faulting_address) = faulting_addr {
+            (regs.sp - 128 <= faulting_address && faulting_address <= regs.sp + 4096)
+                .then_some(wasmtime_environ::Trap::StackOverflow)
+        } else {
+            None
+        };


This'll definitely want some comments as to what's going on here. For example why the -128 and why the +4096? (e.g. why 128, why 4096 instead of page_size, etc)

alexcrichton · 2025-01-17T15:51:28Z

pulley/src/interp.rs

@@ -701,6 +748,7 @@ pub struct MachineState {
    f_regs: [FRegVal; FReg::RANGE.end as usize],
    v_regs: [VRegVal; VReg::RANGE.end as usize],
    fp: *mut u8,
+    sp: *mut u8,


This'll want to be removed because the sp for Pulley is tracked in x_regs. In generally ideally this PR wouldn't have any changes to Pulley here

alexcrichton · 2025-01-17T15:59:48Z

tests/host_segfault.rs

+    // If the test exited with a special exit code, that indicates that it had
+    // a stack overflow but caught it as a Wasm trap.
+    if status.code() == Some(STACK_OVERFLOW_CODE) {
+        return true;
+    }


Could this be split out into a new expectation of "WasmStackOverflow" which requires that the STACK_OVERFLOW_CODE is exited with? I'd ideally prefer to be as strict as possible when matching on this.

github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Aug 8, 2024

sunfishcode force-pushed the wasmtime-probe-stack branch 2 times, most recently from babdab0 to cf2a975 Compare December 13, 2024 20:40

github-actions bot added the pulley Issues related to the Pulley interpreter label Dec 13, 2024

sunfishcode force-pushed the wasmtime-probe-stack branch from cf2a975 to f160de6 Compare December 20, 2024 01:36

github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen labels Dec 20, 2024

sunfishcode force-pushed the wasmtime-probe-stack branch from e024896 to 177a8a6 Compare January 16, 2025 00:57

jameysharp and others added 10 commits January 16, 2025 14:40

wasmtime: Collect stack pointer during traps

cea4703

Nothing uses this yet but I want to get collection of this register right on our many supported platforms before introducing the changes that need it.

Update for Wasmtime changes.

c951e57

Port to main.

c7a70e3

Switch to using Wasmtime-style builtins for ceil, floor, etc.

99139c4

And reinstate the old stack limit checks for when signal handling is disabled.

Remove unneeded libcall codes.

cffd050

Remove more relocation handling code.

5e60d5f

Remove unneeded FMA code.

62389a4

Remove more unneeded code.

12486e7

Add comments.

67d2e49

sunfishcode added 11 commits January 16, 2025 14:40

Remove more unneeded code.

99b6237

Add swizzle and shuffle as Wasmtime builtins.

b64a22f

Fix warnings.

53184ff

More comments.

8aa1bb8

Fix compilation on more platforms.

25498ec

Don't insert probe checks on pulley.

fa8f3a9

Update test expectations for pulley.

0f1420b

rustfmt

bd483c0

Fix has_round for pulley.

965d19f

Don't use x86_pshufb on pulley.

dfbe0c2

Fix tests/host_segfault.rs for the case where the trap is caught.

e4e55e0

sunfishcode force-pushed the wasmtime-probe-stack branch from 177a8a6 to e4e55e0 Compare January 17, 2025 00:04

sunfishcode marked this pull request as ready for review January 17, 2025 00:31

sunfishcode requested review from a team as code owners January 17, 2025 00:31

sunfishcode requested review from cfallin and alexcrichton and removed request for a team January 17, 2025 00:31

Run full test suite on this PR

d91ccaa

prtest:full

alexcrichton reviewed Jan 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wasmtime: Check stack limits only on exit from wasm #9093

wasmtime: Check stack limits only on exit from wasm #9093

jameysharp commented Aug 8, 2024

sunfishcode commented Dec 13, 2024

github-actions bot commented Dec 13, 2024

sunfishcode commented Dec 18, 2024

cfallin commented Dec 18, 2024

fitzgen commented Dec 18, 2024

alexcrichton commented Dec 18, 2024

jameysharp commented Dec 19, 2024

sunfishcode commented Dec 20, 2024

alexcrichton commented Dec 20, 2024

sunfishcode commented Dec 21, 2024

alexcrichton commented Jan 6, 2025

sunfishcode commented Jan 17, 2025

alexcrichton left a comment

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

alexcrichton Jan 17, 2025

wasmtime: Check stack limits only on exit from wasm #9093

Are you sure you want to change the base?

wasmtime: Check stack limits only on exit from wasm #9093

Conversation

jameysharp commented Aug 8, 2024

sunfishcode commented Dec 13, 2024

github-actions bot commented Dec 13, 2024

Subscribe to Label Action

sunfishcode commented Dec 18, 2024

cfallin commented Dec 18, 2024

fitzgen commented Dec 18, 2024

alexcrichton commented Dec 18, 2024

jameysharp commented Dec 19, 2024

sunfishcode commented Dec 20, 2024

alexcrichton commented Dec 20, 2024

sunfishcode commented Dec 21, 2024

alexcrichton commented Jan 6, 2025

sunfishcode commented Jan 17, 2025

alexcrichton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment