
Qube hanging/freezing...coma? #8145

Open
the-moog opened this issue Apr 16, 2023 · 16 comments
Labels
affects-4.1 - This issue affects Qubes OS 4.1.
affects-4.2 - This issue affects Qubes OS 4.2.
C: other
needs diagnosis - Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed.
P: major - Priority: major. Between "default" and "critical" in severity.

Comments

@the-moog

the-moog commented Apr 16, 2023

I've seen other similar issues that may be closely related to this one. I was going to comment on those, but my notes all got a bit too long.

e.g. I think this may be exactly the same issue #7695

Qubes OS release

4.1.1

Brief summary

I've had issues with an AppVM qube randomly hanging every few days. The machine has 32 GB of RAM, and the AppVM has 12 GB allocated. The app causing the issues has an obvious memory leak, and that is what kills the machine every 3-4 days.

While doing some data processing work I noticed something interesting that could be related.
When I ran a Jupyter notebook I could get it to hang in the same way as the faulty app (for which I am waiting for a bug fix, but that's unrelated).
But when I ran the same Python module from the CLI, it did not hang.
I tried a few experiments and came up with a minimal, reasonably repeatable way of triggering the 'hang'.
I've documented that here.

Steps to reproduce

(Collapsed due to length)

Make any Qube freeze with a handful of lines of Python

On an AppVM:

cat <<EOF > memoryhog.py
import random, sys
array = list(random.randbytes(1024))
while True:
    array.extend(array)
    print(sys.getsizeof(array))
EOF

sudo -i
apt install python3 python3-pip jupyter
pip3 install pyqt5 qtconsole

Test1: Same VM as a user

python3 memoryhog.py
# runs for a bit then says 'Killed'

Test2: Try this in qtconsole

sudo -i
cat <<EOF > /usr/share/applications/qtconsole.desktop
[Desktop Entry]
Version=1.0
Type=Application
Terminal=false
Icon=/usr/share/icons/HighContrast/256x256/actions/edit-copy.png
Name=Jupyter QtConsole
GenericName=Python GUI Console
Comment=Python QtConsole
Exec=jupyter qtconsole
EOF

Then, as a user in a terminal:

jupyter qtconsole
# In the 1st cell
import memoryhog
# Runs for a bit and says jupyter kernel restarted

Test3: Same from dom0 console

# Note this is a cut and paste from how Firefox starts, no idea if it's correct for qtconsole.
qvm-run -q -a --service -- testvm qubes.StartApp+qtconsole

Expected behavior

  • No hanging/freezing.
  • Informative feedback when there is an OOM.
  • No single user app should be able to take out a whole VM.

Actual behavior

  • The opposite of the above, with the following notes:
    • Command-line Python never seems to hang.
    • The same seems true of qtconsole started from the command line.
    • Starting qtconsole via remote execution hangs more often, but not every time.
    • This seems to be worse if I have a lot of other windows open on the same VM.

Other observations that may help:
When it's broken, the xentop command in dom0 shows the offending VM's CPU at 100%.
If you know which app is causing the issue (which the logs indicate), you can sometimes recover using qvm-run --pass-io vmname -- killall <appname> (or kill -TERM <pids of related processes>).
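
For example, from dom0 something like this, where testvm and python3 are placeholders for the qube name and the runaway process; just a sketch of the recovery path described above:

# Find the biggest memory consumer in the stuck qube, then try to kill it.
qvm-run --pass-io testvm -- 'ps aux --sort=-%mem | head -n 5'
qvm-run --pass-io testvm -- 'killall -TERM python3'   # polite first
qvm-run --pass-io testvm -- 'killall -KILL python3'   # force it if SIGTERM is ignored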

My thoughts

(Collapsed section due to length)

Why is this? Looking at the VM logs, I see that whenever this happens there is something called `oom-killer`, which spews out a lot of diagnostic data and seems responsible for the 'Killed' message on the console. If you are in a console it just prints `Killed`, then the culprit exits and the VM carries on regardless, which is fine. Why not always? But for a GUI app, perhaps only when started via the remote execution mechanism from dom0 (i.e. where there is no console?), the oom-killer seems not to work properly, and may be making things worse. I don't know enough about this, but I think maybe parts of the windowing system are taken out by oom-killer (either directly or indirectly), as there are a lot of error messages in the GUI logs too.

Possible reasons (guesses)?
  • The app itself is uncooperative with oom-killer and restarts all or part of itself as fast as oom can kill it? Whack-a-mole.
  • Bits of some mechanisms related to GUI apps in the OS are left partially broken?
  • Some interdependent threads are left running when they should be stopped or restarted so the resources can be freed?

Whatever is happening, it's clear, at least for me, that when you trigger this using the above script, oom-killer sometimes starts breaking other things, making the situation worse instead of better. It would be great if others could confirm the same link.

Additionally, and perhaps related, I notice another issue (again, confirmation needed): when you trigger this, dom0 is completely unaware that anything is awry. Any command you type or status you view says everything is fine. You can even interact with the container dialogs of the broken app. In reality, if you dig deeper, the AppVM is sitting at 100% CPU and either killing things as fast as they start or failing to recover critical components.

The usual solution is often to kill that AppVM completely, as there is no way to log in and fix the issue once no more apps (i.e. a terminal) can start.
I did resolve the issue a couple of times by sending the remote kill command to take out processes, but only because I happened to be running top in another window when it hung, so I had those PIDs to hand.

I should read up, research and understand better how this all works, so I apologise if I am making assumptions; right now I don't have time to dig into this more. So, having made an excuse, I will draw the following (uneducated) conclusion:

Any offending app should just fail in a memory allocation, as it would on any other OS, and then quit gracefully or segfault. The result of an impending OOM state surely should not be to trigger a mechanism that randomly kills off tasks, as that can create chaos ("core wars", they used to call it) and compromises the stability of the entire system instead of just one app. What I mean is: just because (in the above) it is Python using all the memory, how can oom-killer know for certain which app was the straw that broke the camel's back, and not wrongly pick on some other poor process that just happens to be doing a malloc at about the same time?

Thus maybe this is a fundamental issue and it is oom-killer that is the cause. Perhaps what oom-killer should do is just send a message to dom0, which would simply pause/freeze the stressed qube and report the issue to the user in a message dialog. This would happen when some threshold is exceeded, not when it's too late, allowing an immediate unfreeze and user intervention; indeed, the next action would be to present some sort of process manager and a dice.

Final thought:
In reality I think it's just the graphics and IO that can't respond, so I guess this could be described as a coma, not a freeze!
There is no way I've found (so far) to create a terminal login to inspect and resolve the issue once it is triggered.
The VM is clearly still alive, just not visibly.
Hopefully the few hours/days I've put into this will help others with better knowledge come up with a simple resolution. For me it's a tad annoying, and I have lost work to it, but not enough to give up on Qubes OS. It could even be just a common config error that I and a few others have fallen into, and nothing to do with oom-killer at all?

@the-moog the-moog added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug labels Apr 16, 2023
@andrewdavidwong
Member

I've seen other similar issues that may be closely related to this one. I was going to comment on those, but my notes all got a bit too long.

e.g. I think this may be exactly the same issue #7695

We don't keep duplicate issues open for comment length reasons. Brevity is a virtue, but if you feel that you must include large amounts of content, I suggest using collapsed sections or providing the content elsewhere (e.g., in a Gist), then linking to it.


This appears to be a duplicate of an existing issue. If so, please comment on the appropriate existing issue instead. If anyone believes this is not really a duplicate, please leave a comment briefly explaining why. We'll be happy to take another look and, if appropriate, reopen this issue. Thank you.

@andrewdavidwong andrewdavidwong closed this as not planned Won't fix, can't repro, duplicate, stale Apr 17, 2023
@andrewdavidwong andrewdavidwong added the R: duplicate Resolution: Another issue exists that is very similar to or subsumes this one. label Apr 17, 2023
@the-moog
Author

@andrewdavidwong Thanks, I did not know about the <details> tag. I have added it. Though the "not planned" response and closing it is surely an error? This seems to be a serious issue?

Have you repeated my tests?

I can see from other posts that people have dropped Qubes because of similar, if not the same, issues. There is a security implication too, as it allows a simple DoS of Qubes by getting a JS script to allocate memory.

@andrewdavidwong
Member

andrewdavidwong commented Apr 17, 2023

Though the "not planned" response and closing it is surely an error?

You wrote, "I think this may be exactly the same issue #7695." That would mean that this issue is a duplicate of #7695. Please read our policy on duplicate issues. Instead of opening a new issue that is a duplicate of an existing issue, you should instead comment on the existing issue if you have anything to add. You said you were going to do this, but then the length of your would-be comment motivated you to open this new issue instead. I pointed out that the length of a comment is not a reason to open a duplicate issue and provided some ways to handle lengthy comments.

This seems to be a serious issue?

It may very well be, but we do not keep duplicate issues open based on the seriousness of bugs. If we did that, we might end up with dozens of duplicate issues for extremely serious bugs, which would be confusing and illogical. This would probably hinder work on those bugs rather than help it. Instead, we have other ways of representing the seriousness of bugs, such as our issue priority system.

@the-moog
Author

the-moog commented Apr 18, 2023

I only said MAY BE, I did not say IS. It's just a similar and perhaps completely unrelated issue. Or are you saying somebody has repeated my tests and concurred that this issue is the same as #7695, or conversely that what I see is just me and not related to the other issue at all? The other issue explicitly mentions Firefox (which I've had no problems with) and I am saying there could be a much more general issue that should be investigated. If this can be confirmed then this level of detail will help resolve multiple issues. Rather than just closing one without merit.

@andrewdavidwong
Member

I only said MAY BE, I did not say IS.

Indeed, but when an issue reporter believes that their issue may be a duplicate of an existing issue, commenting on the existing issue is generally the best place to start. We regularly have people who aren't aware of this create duplicate issues, using almost exactly the type of language you used, which strongly suggested that this was another one of those cases.

It's just a similar and perhaps completely unrelated issue. [...] The other issue explicitly mentions Firefox (which I've had no problems with) and I am saying there could be a much more general issue that should be investigated.

Ah, I see. In that case, it makes sense to reopen this and have it as a separate issue. (If, after a technical diagnosis, it turns out that they're both reports of the same bug, then it'll probably make sense to keep just one.)

Rather than just closing one without merit.

Well, it's not that closing suspected duplicates is without merit. (It's necessary to keep the issue tracker organized and useful for our developers, and closed issues can always be reopened with a single click.) Rather, it's just that there was a miscommunication in which I took you to be conveying the suspicion that this may be a duplicate of #7695, whereas you intended to convey merely that it's similar and possibly related to #7695.

@andrewdavidwong andrewdavidwong added C: other needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. and removed R: duplicate Resolution: Another issue exists that is very similar to or subsumes this one. labels Apr 19, 2023
@andrewdavidwong andrewdavidwong added this to the Release 4.1 updates milestone Apr 19, 2023
@the-moog
Author

the-moog commented Apr 21, 2023

Some more information and possible root cause.

I don't completely understand how this works yet, and I think perhaps I am reading out-of-date documentation, or maybe the way it's done in Qubes is not the same as in a 'normal' Linux VM, or maybe this has changed with kernel versions?

I think perhaps the root cause is a system/service configuration issue: key services are not being differentiated from user apps or 'toys' in the AppVM qubes, and the OOM system is enabled in the kernel.

For OOM to work, processes should have a priority that reflects their importance, and that seems to be missing.

Fundamentally, services in Qubes AppVMs run with the same OOM adjustment as user programs (i.e. value zero), and only a handful of lower-level Linux services are configured differently. On a VM I have (which will hang as shown previously), most services and apps sit at an OOM score of 666 (or only a little more, perhaps 680), even when they are starting to hog the system.

Observations:
I have /proc/sys/vm/overcommit_memory = 0; this means apps can ask the kernel for as much allocation as they want and assume infinite resources, and the kernel will never say no. I believe this should be 1 to disable over provision, or 2 so that /proc/sys/vm/overcommit_ratio has a meaning. Ideally this should be set such that the system prevents apps from allocating, rather than relying on the oom-killer.

Other services, core to keeping dom0 in control of a qube, should have a very large negative OOM adjustment, either via the service unit settings

[Service]
OOMScoreAdjust=score

or by using the choom utility to start daemons with a negative OOM adjustment.
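
As a concrete sketch of the sort of thing I mean (qubes-qrexec-agent / qrexec-agent are just my guess at one agent worth protecting, and on an AppVM a drop-in under /etc would need to go into the template or bind-dirs to survive a restart, so treat this as illustrative only):

# Give a (hypothetical) critical service a strongly negative adjustment via a systemd drop-in:
sudo mkdir -p /etc/systemd/system/qubes-qrexec-agent.service.d
printf '[Service]\nOOMScoreAdjust=-1000\n' | \
    sudo tee /etc/systemd/system/qubes-qrexec-agent.service.d/oom.conf
sudo systemctl daemon-reload

# Or adjust an already-running process with choom (util-linux),
# which amounts to writing /proc/<pid>/oom_score_adj directly:
sudo choom -n -1000 -p "$(pidof qrexec-agent)"
echo -1000 | sudo tee /proc/"$(pidof qrexec-agent)"/oom_score_adj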

That, to me, sounds like a lot of configuration to manage, and I don't know of a way to do it in a manageable way.
Perhaps there should be a daemon running on dom0 that dynamically adjusts AppVM tasks via /proc/<pid>/oom_score_adj, allowing a bias between damage and control.
Or a setting in Qubes to provide at least a negative value for some key services.

I am not sure of the best way to set these appropriately for all Qubes guest services, or which of the several dozen services are more important. If anybody could provide a minimal list, that would help.

IMO, regardless of CLI or GUI, being able to start a dispVM terminal or interact with an AppVM dialog and try to save work is more important than being able to print, make a noise or open a new app (other than a task manager).
Perhaps the GUI could be compromised safely by turning off only some part of it? But I think what happens is that parts of X get killed, don't recover on their own, and leave the X server in a muddle. And what if, e.g., journald dies? (I am guessing it does, since when resources are low error reports occur, so journald allocates memory and is killed in response.) If journald dies, does it take e.g. systemd with it? Are log writes blocking or queued somehow?

Things I don't understand:

  • In the docs I have read, I get conflicting values for the OOM adjustment and OOM score. I have references for both, with ranges of +/-20, +/-1000 and +/-2^31. I'm wondering if this is just configured wrong. Any insight on this would help? (A sketch for inspecting the current values follows this list.)
    Update:
    /proc/<pid>/oom_adj ranges from -17 to +15 (-17 means never kill) but is deprecated in favour of
    /proc/<pid>/oom_score_adj, which ranges from -1000 to +1000, with -1000 meaning never kill. Writing the latter scales the former.
    I don't know which scale choom and systemd are using.

  • The docs mention /proc/sys/oom-killer = 0 or 1. That proc node does not exist for me, yet things still get killed all too randomly by OOM. Again, guessing: perhaps in the Qubes kernel build the settings are altered, or that knob is now non-adjustable?
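
As a sketch, the current values can be listed from inside the AppVM like this, sorted so the kernel's most likely victims come first:

# Show oom_score_adj and oom_score for every process, worst OOM candidates first.
for pid in /proc/[0-9]*; do
    printf '%-8s %6s %6s %s\n' "${pid#/proc/}" \
        "$(cat "$pid/oom_score_adj" 2>/dev/null)" \
        "$(cat "$pid/oom_score" 2>/dev/null)" \
        "$(cat "$pid/comm" 2>/dev/null)"
done | sort -k3,3 -nr | head -n 20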

As a test I will alter my settings as follows and see what happens:
Increase the swap, as for a VM with 12G it only has 1G of swap.
Set /proc/sys/vm/overcommit_memory = 2
and /proc/sys/vm/overcommit_ratio = 10
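
For anyone wanting to try the same, a sketch of how those settings can be applied inside the VM (the file name under /etc/sysctl.d is arbitrary, and on an AppVM the persistent copy would need to live in the template or bind-dirs to survive a restart):

# Apply immediately (lost on qube restart):
sudo sysctl vm.overcommit_memory=2 vm.overcommit_ratio=10

# Persist within this root filesystem:
echo 'vm.overcommit_memory = 2' | sudo tee /etc/sysctl.d/90-overcommit.conf
echo 'vm.overcommit_ratio = 10' | sudo tee -a /etc/sysctl.d/90-overcommit.conf
sudo sysctl --system   # reload all sysctl configuration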

According to one doc, swappiness makes a difference, because disk (or network) virtual memory is slow, so freeing or re-allocating takes a long time compared to RAM. A bad app can ask for all of the system memory faster than memory can be freed or paged out, and thus trigger the killing before the system has responded. Again, it's not good that a user-space app can take down an OS.

@brendanhoar

Observations:
I have /proc/sys/vm/overcommit_memory = 0; this means apps can ask the kernel for as much allocation as they want and assume infinite resources, and the kernel will never say no. I believe this should be 1 to disable over provision, or 2 so that /proc/sys/vm/overcommit_ratio has a meaning.

I think you swapped 1 and 2 above: 2 disables over provisioning entirely.

B

@brendanhoar

In addition I suspect that memory ballooning tends to reduce the ability to use swap efficiently when needed.

Might want to try adding a lot of swap and disabling ballooning at the same time to your test strategies.
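
A sketch of how one might try both at once. The swapfile path and size are arbitrary, and I am assuming that setting maxmem to 0 with qvm-prefs is what takes a qube out of memory balancing; worth double-checking against the Qubes documentation:

# Inside the qube: add 8 GiB of swap somewhere persistent (re-run swapon after
# each qube start, e.g. from /rw/config/rc.local).
sudo dd if=/dev/zero of=/home/user/swapfile bs=1M count=8192 status=progress
sudo chmod 600 /home/user/swapfile
sudo mkswap /home/user/swapfile
sudo swapon /home/user/swapfile

# In dom0: give the qube a fixed allocation; "testvm" is a placeholder name.
qvm-shutdown --wait testvm
qvm-prefs testvm memory 8192    # static allocation in MiB
qvm-prefs testvm maxmem 0       # maxmem 0 should exclude it from balancing
qvm-start testvm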

B

@the-moog
Author

I think you swapped 1 and 2 above: 2 disables over provisioning entirely.

LOL, sorry yes I did.

@the-moog
Author

I've been running with my changes for a couple of days with much more stability. I think some programs are, shall we say, liberal with memory usage; e.g. Element (the Matrix client) asks for 1 TB of RAM (bug raised). The OOM heuristics take into account type, duration, history and a whole bunch of other stuff. A user program that abuses memory allocation/reservation could preempt those heuristics without correct OOM configuration of the important system-level tasks.
Conversely, Firefox seems to work hard in the background to page out and unload the content of unused tabs. Though it seems some sites have pages with JS that causes them to be less nice. I'm guessing you can do stuff in JS that prevents paging?

Though I've been using Linux/Unix since Slackware and Yggdrasil back in the early '90s, I have to admit I am new to the OOM features of the Linux kernel.

Improving paging?
From this link https://unix.stackexchange.com/questions/10214/how-to-set-per-process-swapiness-for-linux it seems there may be a per-process (cgroup-based) swappiness control. Perhaps a mechanism should increase that as the system runs lower on physical resources?
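
I have not tried this, but as a sketch of what that control looks like, assuming the legacy cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory (cgroup v2, the default on newer templates, does not expose per-group swappiness):

# Put a memory-hungry process in its own cgroup and let it swap aggressively.
# $PID is a placeholder for the process to move.
sudo mkdir /sys/fs/cgroup/memory/hogs
echo 100    | sudo tee /sys/fs/cgroup/memory/hogs/memory.swappiness
echo "$PID" | sudo tee /sys/fs/cgroup/memory/hogs/cgroup.procs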

@the-moog
Author

I got some results, but they were mixed.
It worked very well for a few days. Then this evening it broke very badly, without warning or any obvious trigger.
There is little I can diagnose, as the final solution was a reboot, but here is what I noticed.
I left my desk for a short while and when I came back I only had two windows open and no desktop. I am using KDE with multiple desktops and activities, one per application. There was now no means to switch desktops or activities. The window manager (Plasma) had died; the logs show it was killed by oom-killer. So oom had decided the best thing to kill off was the window manager!
The interesting thing is that the VM I made the previously mentioned changes to was fine and survived (though I could not get to its dialogs). It was using <1/4 of its swap and <30% of its RAM, but its RAM had been reduced by 30%, so levelling had kicked in, though I have no idea why. dom0 has a max of 4G (compared to 12 for the other VM) and would normally use <40% of that.
I think a means to set a system-wide policy is required. Any suggestions?

@mfc
Member

mfc commented May 31, 2023

Hi, I would just like to chip in on this issue because it may actually describe my issue more accurately than #7695. In the other ticket, the Firefox window is freezing; however, my issue is that the qube itself freezes, requiring me to kill the qube itself and restart it. This seems to be more accurately described in this ticket.

I get it once every day or two, in a qube dedicated to using Firefox that I use daily. Currently FF in fedora-37, initial memory 553 MB, max memory 3 GB, 4 VCPUs, included in memory balancing. I will obviously try switching to the newly released fedora-38 and see if the issues continue. Let me know if there are nice ways to get logs that may be helpful. I would like to keep using Firefox for this purpose, but the issue is quite annoying.

@the-moog
Author

the-moog commented Jun 1, 2023

@mfc My first experience of this was that the qube appeared to be frozen. I then started probing around and could see IO and CPU activity: evidence of brain function = not dead!! I would say that may not be enough RAM, though as I said, I have managed to crash a VM with >8G (and no swap). The issue is the OOM-killer system: instead of letting one app fail, it heads towards disaster for the whole system.

What was happening is that the daemon that sits on the end of the pipe joining the DOMs had been killed, which meant that no commands from the Qube Manager were getting through, and neither were any status messages. I looked at the log and could see it trying to restart, but at the same time as a bunch of other processes, and it kept losing out.

The other channel is the GUI, which uses X over the same interface, so there are no GUI updates either. There is also a shared-memory video buffer, but I don't know how that works.

I explored some of the Xen tools (like xl and xentop) and that proved there was activity. The final proof was in the log files. I could not even get a Xen console into the 'dead' qube. Hence why I called this a coma.

Another time I monitored the situation using top and could see it going wrong in real time.
As final proof, a while back I posted a simple Python script that forces this to happen in just a few seconds.

Dom0 is surprisingly memory hungry and suffers exactly the same issues.

Things to do:
Make sure swap is working in both dom0 and the qube. Have at least 1x your RAM as swap, ideally 2x.
When you start the qube, immediately start an xterm and run top in it as root. Learn, use and save (press W) the top settings (press ? for help). Make sure you have the necessary columns visible (press f). You can also change the display to show memory and CPU graphically.

Then run FF as normal and see what happens.
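
If you want a record that survives the GUI dying, something like this (log path and interval are arbitrary):

# Inside the qube: append a memory snapshot every 10 seconds so the culprit
# is on disk even if the display stops updating. Kill the loop when done.
while sleep 10; do
    date
    ps aux --sort=-%mem | head -n 6
done >> /home/user/memlog.txt &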

Logs:
On the Qube you have the normal logs.

sudo tail -f /var/log/syslog   # on Debian-based templates
journalctl -k -b 0             # kernel messages from the current boot (use -b -1 for the previous boot)
journalctl -xe                 # extra detail, recent errors at the bottom
journalctl -f                  # continuous output (add -x or -u <unit> to change detail)

The problem with journalctl is that without -u it gets noisy, and unless you know the name of the unit(s) you care about, you don't know which name to use. systemctl list-units will help.
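
For example, to follow just the agents that link the qube to dom0 (qubes-qrexec-agent and qubes-gui-agent are my guess at the relevant unit names; confirm with systemctl list-units 'qubes*'):

journalctl -f -u qubes-qrexec-agent -u qubes-gui-agent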

Then you have the exported logs (if you can't get at the system ones) by right-clicking on the qube in the Qube Manager
and selecting Logs. The Qubes-specific logs are in the same place. This is where I would expect the OOM message to appear, assuming it had time to arrive before the message system broke.

Do the same on dom0, as that is sending the commands and expecting responses.

There is a completely separate logging system for Xen; xl dmesg and xl top were the most useful.

@andrewdavidwong andrewdavidwong added P: major Priority: major. Between "default" and "critical" in severity. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Jun 1, 2023
@andrewdavidwong andrewdavidwong added the affects-4.1 This issue affects Qubes OS 4.1. label Aug 8, 2023
@andrewdavidwong andrewdavidwong removed this from the Release 4.1 updates milestone Aug 13, 2023
@heinrich-ulbricht

Found this while searching for reasons for frequent qube "comas" in 4.1. It appeared to happen at high memory load, and searching in this direction I quickly found this issue. In my case I'm running Jira, Confluence, Postgres, Visual Studio Code, and browser tabs galore in a 12 GB AppVM, on a machine with 16 GB of RAM. "Own fault", you might say. But having to kill the AppVM regularly is not particularly convenient.

@andrewdavidwong andrewdavidwong added the eol-4.1 Closed because Qubes 4.1 has reached end-of-life (EOL) label Dec 7, 2024
@andrewdavidwong andrewdavidwong removed the needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. label Dec 7, 2024


@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Dec 7, 2024
@codemath3000

Affects 4.2

@andrewdavidwong andrewdavidwong added needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. affects-4.2 This issue affects Qubes OS 4.2. and removed eol-4.1 Closed because Qubes 4.1 has reached end-of-life (EOL) labels Jan 3, 2025