SPI overflow in CAN bus HAT communication - IRQ handler mcp251xfd_handle_tefif() #6644
Comments
Attached is a code snippet that seems to replicate the bug (in my setup). The longer it runs, the more overflows occur, although I'm not 100% sure, as it's difficult to gather statistics.
|
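One way to make the overflow statistics easier to gather is to sample the error counters that ifconfig prints and compare them over time. The sketch below is only an illustration, not part of the attached snippet: the helper names `parse_error_counters` and `counter_delta` are hypothetical, and the sample text copies the counter lines posted later in this thread.

```python
import re

# Matches the RX/TX counter lines printed by ifconfig, e.g.
# "RX errors 20 dropped 5989 overruns 0 frame 0"
_COUNTER_LINE = re.compile(
    r"(RX|TX)\s+errors\s+(\d+)\s+dropped\s+(\d+)\s+overruns\s+(\d+)"
)

# Counter lines as posted later in this thread (after a hiccup).
SAMPLE = """\
RX packets 151705 bytes 1076459 (1.0 MiB)
RX errors 20 dropped 5989 overruns 0 frame 0
TX packets 33412 bytes 226254 (220.9 KiB)
TX errors 0 dropped 0 overruns 20 carrier 0 collisions 0
"""


def parse_error_counters(ifconfig_text):
    """Return {"RX": {...}, "TX": {...}} with errors/dropped/overruns as ints."""
    counters = {}
    for direction, errors, dropped, overruns in _COUNTER_LINE.findall(ifconfig_text):
        counters[direction] = {
            "errors": int(errors),
            "dropped": int(dropped),
            "overruns": int(overruns),
        }
    return counters


def counter_delta(before, after):
    """Per-counter increase between two parsed snapshots of the same interface."""
    return {
        direction: {
            name: after[direction][name] - before[direction][name]
            for name in after[direction]
        }
        for direction in after
    }
```

Calling `parse_error_counters` on the output of `ifconfig can0` at fixed intervals and logging `counter_delta` between consecutive snapshots would show whether the overruns really grow with run time.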
Can you please copy https://github.com/linux-can/can-utils/tree/master/mcp251xfd/99-devcoredump.rules to If the IRQ handler fails with an error, the driver will generate a dump of the driver and chip state and write it to |
I also ran into this problem. I'm on the latest available kernel. It doesn't happen every time; it occurs randomly. The driver also feels a bit unstable at the moment: most of the time I randomly hit "timeouts" where I'm unable to send or read, but the IRQ handler doesn't fail every time. |
Did it without success: no file has been generated, but the IRQ handler message is present in dmesg | grep can.
filogold@raspberrypi:/etc/udev/rules.d $ ls
filogold@raspberrypi:/usr/sbin $ ls
filogold@raspberrypi:/usr/sbin $ ifconfig
filogold@raspberrypi:~ $ dmesg | grep can
|
Did you run chmod +x /usr/sbin/devcoredump? |
...and you have to restart the system, after you initially copied |
Sorry guys, my bad. I still need to learn how to properly handle Linux. Upload the file here @marckleinebudde |
You can upload files here, too. For reference: devcoredump-20250203-234629.dump.gz |
Can you explain your HW setup? Which Raspi? Which MCP2518FD is connected to which SPI and CS? |
devcoredump-20250203-234629.dump.gz devcoredump-20250204-112314.dump.gz I attached 2 log reports. |
Yes, sure. I'm currently using a Raspberry Pi 5 with a PiCAN FD HAT. It's configured as described in their instructions.
I'm currently using only classic CAN, not FD, at 1 Mbit/s with about 40% bus load. As mentioned, I don't always hit the IRQ issue, but I sometimes see short dropouts and overruns (TX and RX) that increase the longer it runs. During dropouts I sometimes see a huge number of overruns. |
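For a rough sense of what 40% load at 1 Mbit/s means for the driver, the frame rate can be estimated from the classic CAN frame length. This is only a back-of-the-envelope sketch: it assumes 11-bit identifiers and 8-byte payloads, neither of which is stated in the thread, and the function names are made up for illustration.

```python
def classic_can_frame_bits(data_bytes, worst_case_stuffing=False):
    """Bits on the wire for one classic CAN data frame with an 11-bit ID.

    44 fixed frame bits plus 3 bits of interframe space, plus 8 bits per
    data byte. Worst-case bit stuffing adds floor((34 + 8n - 1) / 4) bits,
    because only the 34 + 8n bits from SOF through the CRC sequence are
    subject to stuffing.
    """
    bits = 47 + 8 * data_bytes
    if worst_case_stuffing:
        bits += (34 + 8 * data_bytes - 1) // 4
    return bits


def frames_per_second(bitrate, load, data_bytes=8, worst_case_stuffing=False):
    """Approximate frame rate for a given bitrate and bus load (0.0 to 1.0)."""
    return bitrate * load / classic_can_frame_bits(data_bytes, worst_case_stuffing)


# Assumed: 1 Mbit/s, 40% load, 8-byte frames, standard identifiers.
low = frames_per_second(1_000_000, 0.40, worst_case_stuffing=True)
high = frames_per_second(1_000_000, 0.40)
print(f"roughly {low:.0f} to {high:.0f} frames per second")
```

Under these assumptions the bus carries on the order of three thousand frames per second. Shorter payloads at the same load would push the frame rate higher.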
@Vinz1911, do you have something else attached to the SPI? |
No, I just have this HAT attached. Additionally, I use the onboard Bluetooth and I have a display over I2C (small OLED, 128x32). No USB devices or anything else. |
You can rule out a class of Pi 5-specific problems by installing the patch in pull request #6646: |
Unfortunately, that did not solve my problem.
I also found that this happens, or is much more reproducible, after a fresh start of the Raspberry Pi. My program then runs for about half a minute, after which I get a lot of overruns and am unable to read or write on the CAN bus. Restarting the program then works more or less as expected, but the overrun count for RX and TX still slowly increases. Since your patch, though, the overruns seem to increase much more slowly. |
That's a shame - I'll just observe while Marc gets on with it. |
I did some additional testing today and found a way to reliably reproduce the issue I'm seeing. My program reads and writes on the CAN bus; I've configured a SocketCAN read timeout of 100 ms, and the program stops after 3 failed read attempts. This hiccup, where the Pi is unable to interact with the CAN bus, only appears after a cold boot (also warm boot) (Pi without any power source connected). When the Pi has finished booting, I enable the interface with:
sudo ip link set can0 up type can bitrate 1000000 restart-ms 1000 berr-reporting on fd off
sudo ifconfig can0 txqueuelen 2500
It takes around 20 seconds, then the hiccup occurs and my program exits because it is unable to read or write. Restarting the program then works, and it runs without "any" issues (the overrun count increases slightly, but that does not seem to affect my program). After a reboot everything also works fine and the issue does not appear again, unless I do a cold/warm boot: if I remove the power source so the Pi is completely off and then turn it on again, the hiccup appears once more and then things work as described. After the hiccup it looks mostly like this (overruns and error count vary a lot):
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 2500 (UNSPEC)
RX packets 151705 bytes 1076459 (1.0 MiB)
RX errors 20 dropped 5989 overruns 0 frame 0
TX packets 33412 bytes 226254 (220.9 KiB)
TX errors 0 dropped 0 overruns 20 carrier 0 collisions 0
device interrupt 171
I don't see any issues when using
can0: flags=193<UP,RUNNING,NOARP> mtu 16
unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 2500 (UNSPEC)
RX packets 1679156 bytes 11911832 (11.3 MiB)
RX errors 0 dropped 7049 overruns 0 frame 0
TX packets 306671 bytes 2066722 (1.9 MiB)
TX errors 0 dropped 0 overruns 1 carrier 0 collisions 0
device interrupt 186
It seems something strange happens after the interface is enabled for the first time after a cold/warm boot, which leads to this hiccup while the Pi is interacting with the CAN HAT.
EDIT: I forgot to mention that the CAN bus is not restarted. I also attached an additional Raspberry Pi with a Waveshare CAN HAT (MCP2515) to check whether something weird happens or the entire bus gets stuck during the hiccup, but this is not the case. |
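The reader behaviour described in this comment (100 ms read timeout, stop after 3 failed reads) can be sketched with Python's standard `socket` module. This is not the commenter's program. It assumes the 3 failed attempts are consecutive, and the names `open_can_socket` and `read_until_stalled` are hypothetical.

```python
import socket

READ_TIMEOUT_S = 0.1   # 100 ms read timeout, as described in the comment
MAX_FAILED_READS = 3   # assumed to mean three timeouts in a row


def open_can_socket(ifname="can0", timeout=READ_TIMEOUT_S):
    """Open a raw SocketCAN socket with a read timeout (Linux only)."""
    sock = socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW)
    sock.bind((ifname,))
    sock.settimeout(timeout)
    return sock


def read_until_stalled(recv, max_failed=MAX_FAILED_READS):
    """Call recv() until it times out max_failed times in a row.

    Returns the number of frames read before the bus was considered stalled.
    """
    frames = 0
    failed = 0
    while failed < max_failed:
        try:
            recv()
        except (socket.timeout, TimeoutError):
            failed += 1   # nothing arrived within the timeout
        else:
            frames += 1
            failed = 0    # any successful read resets the failure count
    return frames
```

On the Pi this could be driven with `sock = open_can_socket("can0")` followed by `read_until_stalled(lambda: sock.recv(16))`, since a classic CAN frame read from a raw socket is 16 bytes long. Logging a timestamp when the function returns would help pin down the roughly 20-second delay before the hiccup.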
Tried with this Kernel version, still same behavior. |
Good to know - it was worth a try. |
The community and I don't see this kind of error on non-Raspi-5 systems; even the Raspi 4 is fine. Maybe the SPI host driver is more optimized and triggers a race condition in the mcp251xfd driver. I'm on holiday next week and I don't think I'll find time to look into this issue before then. |
Timing will certainly be different. I doubt it will be more optimised though - the Synopsys had several missing features, some of which I've added, and has received much less attention than the highly polished bcm2835 equivalent. I also wouldn't rule out a real bug. |
Describe the bug
Hi everyone,
I'm encountering an issue with the IRQ handler. When I simulate multiple periodic messages over the CAN bus, at a certain point I suspect that SPI communication fails to properly manage data exchange with the two HAT modules (https://www.waveshare.com/wiki/2-CH_CAN_FD_HAT) I am using.
I updated the kernel to the latest version (see below), which seemed to improve the situation slightly. However, it only mitigated the buffer issue by delaying the occurrence of the fault rather than fully resolving it.
My setup consists of a Raspberry Pi 5 with two CAN HAT 2CH FD modules, each using the MCP251XFD chip.
To ensure the problem is related to the transmitter and not the receiver, I tested two different scenarios:
Using two channels of the Raspberry Pi to transmit and the remaining two channels to receive (all configured with the same parameters: bitrate, data rate, and sampling time).
Using two channels of the Raspberry Pi to transmit and two channels of a Vector VN1640 (with the 1057Gcap installed) to receive. Also in this case, I carefully verified the CAN configuration.
In both cases the termination has been verified, giving 60 ohms on each bus.
I am aware that I am stressing the module, as I am sending multiple periodic messages (128 periodic messages per CAN FD, generating a 40% bus load). The messages require some processing to calculate the internal CRC and MC in the payload, which is correctly handled by the Python code.
Based on the information provided, the issue appears to be related to overrun packets. I suspect this is the cause of the failure.
ifconfig:
The information I provided above is essentially the same as in the second case, where I use the Vector hardware as the receiver. One observation I have made is that after a bus failure, the overruns seem to stop or decrease. However, after a certain period (several hours), at some point, the second bus also fails.
Thanks in advance
Steps to reproduce the behaviour
I can't find a piece of code that fully replicates the bug. I wrote a program that sends periodic messages, which generates some overruns, but not as many as my main code.
I use Bluetooth serial communication to share information about the message I want to simulate, including the ID, initial payload, and CAN bus settings. This communication is not constant.
I could try running the code without using Bluetooth, but I still need to work on it.
How I Initialize the bus:
To run a periodic message, I create a thread using the library function:
The message object is defined as follows:
Device (s)
Raspberry Pi 5
System
uname -a:
modinfo can:
modinfo mcp251xfd:
Logs
dmesg | grep can:
Additional context
@marckleinebudde
Thanks in advance