Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation Fault Caused by mhpmcounter10 Access Not Implemented in DUT and Spike #129

Open
Bill94l opened this issue Dec 9, 2024 · 5 comments

Comments

@Bill94l
Copy link

Bill94l commented Dec 9, 2024

Hi,

The attached riscv-dv generated code attempts to access mhpmcounter10, which is not implemented in either the DUT or Spike. This results in a segmentation fault and the code fails. Below is the specific line of code causing the issue:

    800001b6:	b0aa15f3          	csrrw	a1,mhpmcounter10,s4

After this issue #121 (comment) I modified the number of HPM counters in Spike to ensure consistency between the DUT and Spike but that doesn't change anything:
https://github.com/SpinalHDL/riscv-isa-sim/blob/5b37ab699f6176ea7f2d2b3187f75625eca7e511/riscv/processor.h#L22

#define N_HPMCOUNTERS 7 //29

Results:

  1. With the regression simulator:
./obj_dir/VNaxRiscv --name dv --output-dir failed --load-elf basic_test_1.o --start-symbol _start --pass-symbol pass --fail-symbol fail --no-putc-flush --seed 96102624 ${ARGS} --trace-ref

*** MISSMATCH PC DUT=80103000 REF=8010301c ***

TIME=574
LAST PC COMMIT=80103000
INCOMING SPIKE PC=80101bf8
ROB_ID=x1c
FAILURE dv
  1. With RVLS using SocSim
[info] **************************************************
[info] DUT-PC = 00000000800001b6, REF-PC=800001b6, INST = b0aa15f3
[info] Debug: CSR_MHPMCOUNTER3 = 0xb03, CSR_MHPMCOUNTER31 = 0xb1f, csrAddress = 0xb0a, csrReadData = 0x0
[info] Warning: csrmap[0xb0a] is null. Skipping write.
[info] # A fatal error has been detected by the Java Runtime Environment:
[info] #
[info] #  SIGSEGV (0xb) at pc=0x00007ffa1f6e01e8, pid=2445164, tid=0x00007ffa68e47640
[info] #
[info] # JRE version: OpenJDK Runtime Environment (8.0_432) (build 1.8.0_432-8u432-ga~us1-0ubuntu2~22.04-ga)
[info] # Java VM: OpenJDK 64-Bit Server VM (25.432-bga mixed mode linux-amd64 compressed oops)
[info] # Problematic frame:
[info] # C  [rvls.so+0x3a91e8]  processor_t::get_csr(int, insn_t, bool, bool)+0x58
[info] #
[info] # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
[info] #
[info] # An error report file with more information is saved as:
[info] # /media/big_dd/nanotrust/bi262934/naxriscv_tuleap/naxriscv_dv_vf/NaxRiscv/hs_err_pid2445164.log
[info] #
[info] # If you would like to submit a bug report, please visit:
[info] #   http://bugreport.java.com/bugreport/crash.jsp
[error] Nonzero exit code returned from runner: 134
[error] (Compile / runMain) Nonzero exit code returned from runner: 134
[error] Total time: 12 s, completed 10 déc. 2024 00:09:08

failed_test.zip

Could you please provide guidance on how to handle this mismatch or segmentation fault when accessing unimplemented HPM counters?

Thank you!

@Bill94l Bill94l changed the title Segmentation Fault Caused by mhpmcounter10 Access Not Implemented in DUT or Spike Segmentation Fault Caused by mhpmcounter10 Access Not Implemented in DUT and Spike Dec 9, 2024
@Dolu1990
Copy link
Member

Hi,

With the regression simulator:

I would say, better focus on getting things ok with RVLS, and omit this one.

With RVLS using SocSim

So, when i have crashed happening in RVLS (when called from SpinalSim), then what i do, is to ask the testbench to generate the trace.log file (--trace argument should do it), which contains all the execution trace in the format RVLS is able to read.
Then with that trace.log file, i run RVLS in debug mode (in a standalone mode), with the -f trace.log argument.
There i can use regular C/C++ debug tools and figure out what is going wrong.

Just be sure to compile riscv-isa-sim in debug mode by adding :
CFLAGS='-g -O0' CXXFLAGS='-g -O0'
as arguments of the configure command

So in short, figuring out why RVLS crash is the first thing to do :)
It maybe due to that :
https://github.com/SpinalHDL/rvls/blob/b5c15e129f8168c2317b2c129aef91adf935bfeb/src/hart.cpp#L234
RVLS will syncronise the pmucounter on access, as they are hardware specific. Thing is your changes in riscv-isa-sim seems to deleted the extra counters ? (rvls assume their CSR is defined)

The implement intends of NaxRiscv is to match the RISC-V spec, but there may be a few missmatches.

@Bill94l
Copy link
Author

Bill94l commented Jan 15, 2025

Problem Description

Currently, NaxRiscv allows access to all decodable CSRs, including unimplemented HPM counters. This behavior causes inconsistencies with Spike when the number of HPM counters (N_HPMCOUNTERS) is set to its default value of 29. Here is the list of CSRs implemented by NaxRiscv:

[info] CSR FILTER List(b00, b02, b03, b04, b05, b06, c00, c02, c03, c04, c05, c06)

The number of implemented HPM counters is defined by the additionalCounterCount parameter, which is set to 4:

additionalCounterCount = 4,

val csrList = ArrayBuffer[Int]()
csrList ++= List(CSR.MCYCLE, CSR.MINSTRET)
csrList ++= CSR.MHPMCOUNTER3 until CSR.MHPMCOUNTER3 + additionalCounterCount
if(withHigh) csrList ++= csrList.map(_ + 0x80)
if(priv.implementUser) csrList ++= csrList.map(_ + CSR.UCYCLE - CSR.MCYCLE)
val csrFilter = CsrListFilter(csrList.toList)

For example, when accessing mhpmcounter10 (which is not implemented in this case), Spike correctly triggers a trap_illegal_instruction, as seen in spike.log:

core   0: 0x00000000800001b6 (0xb0aa15f3) csrrw   a1, mhpmcounter10, s4
core   0: exception trap_illegal_instruction, epc 0x00000000800001b6
core   0:           tval 0x00000000b0aa15f3

However, NaxRiscv still attempts to commit the instruction, as shown in tracer.log:

rv commit 0 00000000800001b2
rv trap 0 0 2 00000000b0aa15f3

This discrepancy occurs because the current implementation does not restrict CSR access to the actual list of implemented HPM counters.

Root Cause

The issue lies in the CSR filtering logic, which currently grants access to a broader range of CSRs than those explicitly implemented. The relevant code is located in

val csrFilter = CsrListFilter(csrList.toList)
if(priv.implementSupervisor) csr.allowCsr(CSR.SCOUNTEREN)
csr.allowCsr(CSR.MCOUNTEREN)
csr.allowCsr(CsrListFilter((3 to 31).flatMap(e => List(e + 0xB00, e + 0xB80, e + 0x320))))

val decodeCsrFilter = CsrListFilter((3 to 31).flatMap(e => List(e + 0xB00, e + 0xB80, e + 0x320)))
csr.allowCsr(decodeCsrFilter)

This code allows access to all decodable counters, leading to NaxRiscv committing instructions for unimplemented HPM counters, which results in invalid behavior.

Proposed Solution

To resolve this issue, the filtering logic for CSR access should be updated to restrict access only to implemented counters. Replace the above logic with:

    val csrFilter = CsrListFilter(csrList.toList)

    if(priv.implementSupervisor) csr.allowCsr(CSR.SCOUNTEREN)
    csr.allowCsr(CSR.MCOUNTEREN)
    csr.allowCsr(csrFilter)

This modification :

    csr.allowCsr(csrFilter)

Ensures that NaxRiscv behaves consistently with Spike by only allowing access to the CSRs explicitly declared in csrFilter.

Expected Behavior

After the proposed fix:

1. Spike and NaxRiscv should behave consistently.
2. Accessing an unimplemented HPM counter should result in a `trap_illegal_instruction` in both `Spike` and `NaxRiscv`.
3. No commit should occur for instructions accessing unimplemented CSRs.

This fix improves compatibility with Spike and ensures correctness when handling unimplemented HPM counters.

@Dolu1990
Copy link
Member

Hi ^^

Hmm i'm not sur, thing is, the RISC-V spec says :
"""
All counters should be implemented, but a legal implementation is to make both the counter and its corresponding event selector be read-only 0.
"""

Are you sure the spike N_HPMCOUNTERS is meant to be modified ? and isn't just to reflect how many counter they could be maximum ?

@Bill94l
Copy link
Author

Bill94l commented Jan 21, 2025

Hi Dolu1990,

Thank you for bringing this up!

Based on my understanding of the RISC-V specification, the behavior for unimplemented CSRs is clearly defined:

  • Read: Should return zero.
  • Write: Should trigger an illegal instruction exception.

If I keep the initial behavior, the VexRiscv commits on a write access to an HPM counter that is not implemented, whereas Spike correctly triggers a trap.

Knowing that the NaxRiscv read access returns 0, which complies with the specification, but Spike raises an exception, this behavior does not adhere to the specifications.

For reference, here’s a related discussion on a similar issue: CSR reads fail for unimplemented features like HPM counters.

Best regards,
Billal

@Dolu1990
Copy link
Member

Hmmm right, should read zero and trap on write.
I don't know why the spec doesn't specify "trap on read, trap on write" instead of "read only zero" to keep things simple :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants