Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IR benchmark failure #1733

Open
lpawelcz opened this issue Nov 20, 2024 · 3 comments
Open

IR benchmark failure #1733

lpawelcz opened this issue Nov 20, 2024 · 3 comments

Comments

@lpawelcz
Copy link
Contributor

lpawelcz commented Nov 20, 2024

Describe the bug
The IR benchmark for DecoderMux proc is prone to a random failure after increasing pipeline_stages parameter for the benchmark from 2 up to 10. The error is indicated by the following output:

/home/xlsuser/.cache/bazel/_bazel_xlsuser/976f9331675cc2c7ada78a437e0227fe/execroot/com_google_xls/bazel-out/k8-opt/bin/xls/modules/zstd/dec_mux_opt_ir_benchmark.sh: line 3: 3086416 Segmentation fault      (core dumped) xls/dev_tools/benchmark_main xls/modules/zstd/dec_mux_verilog.opt.ir --delay_model=asap7 --generator=pipeline --pipeline_stages=10 --reset=rst --reset_data_path=false --use_system_verilog=false --module_name=dec_mux_opt_ir_benchmark_default $@ $*

This failure does not occur for every benchmark run. It might be required to run the benchmark multiple times (could even take 50 runs) to reproduce the error.

To Reproduce
Steps to reproduce the behavior:

  1. Checkout on current main branch (cd90112 at the time of writing this issue)
  2. Modify pipeline_stages parameter for DecoderMux codegen and IR benchmark:
diff --git a/xls/modules/zstd/BUILD b/xls/modules/zstd/BUILD
index 871749792..94b6b3ecd 100644
--- a/xls/modules/zstd/BUILD
+++ b/xls/modules/zstd/BUILD
@@ -506,7 +506,7 @@ xls_dslx_verilog(
     codegen_args = {
         "module_name": "DecoderMux",
         "delay_model": "asap7",
-        "pipeline_stages": "2",
+        "pipeline_stages": "3",
         "reset": "rst",
         "use_system_verilog": "false",
     },
@@ -520,7 +520,7 @@ xls_benchmark_ir(
     name = "dec_mux_opt_ir_benchmark",
     src = ":dec_mux_verilog.opt.ir",
     benchmark_ir_args = {
-        "pipeline_stages": "2",
+        "pipeline_stages": "10",
         "delay_model": "asap7",
     },
     tags = ["manual"],
  1. Run bazel run -c opt -- //xls/modules/zstd:dec_mux_opt_ir_benchmark repeatedly up to the segfault occurrence, e.g.:
count=0; while bazel run -c opt -s -- //xls/modules/zstd:dec_mux_opt_ir_benchmark --logtostderr; do (( count++ )); echo "Run $count"; done; echo "successfull runs: $count"
  1. Observe segfault of the benchmark_main

Expected behavior
benchmark_main should fail gracefully or not at all

Additional context
I noticed that proc DecoderMux can be divided by the toolchain into more than 2 pipeline stages as it is currently on the main branch. I increased this parameter for the IR benchmark to 10 to see the number of stages that have some IR nodes assigned to those (happens to be 3 pipeline stages). Then I set the pipeline stages for the codegen rule to 3 stages indicated by the benchmark. Then I noticed that sometimes the IR benchmark would fail.

The segfault seems to be happening somewhere in the evaluation of the Block IR with JIT.

This was also happening in the CI for the #1616: https://github.com/google/xls/actions/runs/11494086318/job/32320996820?pr=1616#step:10:463

CC @proppy

@proppy
Copy link
Member

proppy commented Nov 21, 2024

Run bazel run -c opt -- //xls/modules/zstd:dec_mux_opt_ir_benchmark repeatedly up to the segfault occurrence, e.g.:

Looks like we don't set the see in benchmark_main when generating random input:

std::minstd_rand rng_engine;

Maybe setting the seed from a commandline flag could help our narrow down this issue (w/o having to run in a loop).

@ericastor
Copy link
Collaborator

I was able to reproduce this with enough runs. Unfortunately, it looks like controlling the seed doesn't actually make this deterministic. The crash does not happen consistently, even with the same seed! Building with either ASAN or MSAN doesn't fix this either.

This seems likely to be some subtler issue with the JIT.

@ericastor
Copy link
Collaborator

I've also checked - the issue reproduces even without the change to the BUILD file, so whatever it is, it's pre-existing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants