Qube hanging/freezing...coma? #8145
Comments
We don't keep duplicate issues open for comment length reasons. Brevity is a virtue, but if you feel that you must include large amounts of content, I suggest using collapsed sections or providing the content elsewhere (e.g., in a Gist), then linking to it. This appears to be a duplicate of an existing issue. If so, please comment on the appropriate existing issue instead. If anyone believes this is not really a duplicate, please leave a comment briefly explaining why. We'll be happy to take another look and, if appropriate, reopen this issue. Thank you.
@andrewdavidwong Thanks, I did not know about the <details> tag. I have added it. Though the "not planned" response and closing it is surely an error? This seems to be a serious issue. Have you repeated my tests? I can see from other posts that people have dropped Qubes because of a similar, if not the same, issue. There is a security implication too, as it leads to a simple denial of service (DoS) of Qubes by getting a JS script to allocate memory.
You wrote, "I think this may be exactly the same issue #7695." That would mean that this issue is a duplicate of #7695. Please read our policy on duplicate issues. Instead of opening a new issue that is a duplicate of an existing issue, you should instead comment on the existing issue if you have anything to add. You said you were going to do this, but then the length of your would-be comment motivated you to open this new issue instead. I pointed out that the length of a comment is not a reason to open a duplicate issue and provided some ways to handle lengthy comments.
It may very well be, but we do not keep duplicate issues open based on the seriousness of bugs. If we did that, we might end up with dozens of duplicate issues for extremely serious bugs, which would be confusing and illogical. This would probably hinder work on those bugs rather than help it. Instead, we have other ways of representing the seriousness of bugs, such as our issue priority system.
I only said MAY be; I did not say IS. It's just a similar and perhaps completely unrelated issue. Or are you saying that somebody has repeated my tests and concurred that this issue is the same as #7695, or conversely that what I see is specific to me and not related to the other issue at all? The other issue explicitly mentions Firefox (which I've had no problems with), and I am saying there could be a much more general issue that should be investigated. If this can be confirmed, then this level of detail will help resolve multiple issues, rather than just closing one without merit.
Indeed, but when an issue reporter believes that their issue may be a duplicate of an existing issue, commenting on the existing issue is generally the best place to start. We regularly have people who aren't aware of this create duplicate issues, using almost exactly the type of language you used, which strongly suggested that this was another one of those cases.
Ah, I see. In that case, it makes sense to reopen this and have it as a separate issue. (If, after a technical diagnosis, it turns out that they're both reports of the same bug, then it'll probably make sense to keep just one.)
Well, it's not that closing suspected duplicates is without merit. (It's necessary to keep the issue tracker organized and useful for our developers, and closed issues can always be reopened with a single click.) Rather, it's just that there was a miscommunication in which I took you to be conveying the suspicion that this may be a duplicate of #7695, whereas you intended to convey merely that it's similar and possibly related to #7695.
Some more information and a possible root cause. I don't completely understand how this works yet, and perhaps I am reading out-of-date documentation, or maybe the way it's done in Qubes is not the same as in a 'normal' Linux VM, or maybe this has changed with kernel versions.

I think the root cause may be a system/service configuration issue: key services are not being differentiated from user apps or 'toys' in the AppVM qubes, and the OOM system is enabled in the kernel. For OOM to work well, processes should have a priority that reflects their importance, and that seems to be missing. Fundamentally, services in Qubes app VMs run with the same OOM adjustment as user programs (i.e. value zero), and only a handful of lower-level Linux services are configured differently. On a VM I have (which will hang as shown previously), most services and apps sit at an OOM score of around 666 (or only a little more, perhaps 680), even when they are starting to hog the system.

Observations: services that are core to keeping dom0 control over a qube should have a very large negative OOM adjustment, either via the service unit settings or by using the choom utility to start daemons with a negative adjustment. That sounds like a lot of configuration to manage, and I don't know of a way to do it in a manageable way. I am not sure of the best way to set these appropriately for all Qubes guest services, or which of the several dozen services are more important; if anybody could provide a minimal list, that would help. IMO, regardless of CLI or GUI, being able to start a dispVM terminal or interact with an appVM dialog and try to save work is better than being able to print, make a noise, or open a new app (other than a task manager). There are still things here I don't understand.
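For concreteness, a minimal sketch of the kind of per-service override described above, assuming the services are systemd-managed inside the qube. The unit name `qubes-qrexec-agent` is just one example of a service you might want to protect, and -1000 is the strongest possible exemption:

```bash
# Sketch only: give one critical in-VM service a strongly negative OOM score adjustment
# via a systemd drop-in. OOMScoreAdjust is a standard systemd [Service] directive;
# -1000 makes the process effectively unkillable by oom-killer.
sudo mkdir -p /etc/systemd/system/qubes-qrexec-agent.service.d
sudo tee /etc/systemd/system/qubes-qrexec-agent.service.d/50-oom.conf <<'EOF'
[Service]
OOMScoreAdjust=-1000
EOF
sudo systemctl daemon-reload
sudo systemctl restart qubes-qrexec-agent

# For daemons started by hand rather than by systemd, util-linux choom does the same job:
#   choom -n -1000 -- some-daemon --its-args
```

In an AppVM this kind of change would need to live in the template (or in bind-dirs) to survive a restart.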
As a test I will alter my settings and see what happens. According to one doc, swappiness makes a difference: disk (or network) virtual memory is slow, so freeing or re-allocating it takes a long time compared to RAM. A bad app can ask for all of the system's memory faster than memory can be freed or paged out, and thus trigger the killing before the system has responded. Again, it's not good that a user-space app can take down an OS.
I think you swapped 1 and 2 above: 2 disables over-provisioning entirely. B
In addition I suspect that memory ballooning tends to reduce the ability to use swap efficiently when needed. Might want to try adding a lot of swap and disabling ballooning at the same time to your test strategies. B
LOL, sorry yes I did.
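For reference, a sketch of the two knobs being discussed; the specific values here are illustrative, not a recommendation:

```bash
# vm.overcommit_memory: 0 = heuristic overcommit (default), 1 = always overcommit,
# 2 = disable overcommit (allocations are refused once the commit limit is reached).
sudo sysctl vm.overcommit_memory=2
sudo sysctl vm.overcommit_ratio=80   # only used in mode 2; commit limit = swap + 80% of RAM
sudo sysctl vm.swappiness=10         # lower values make the kernel less eager to swap

# To persist across qube restarts this would have to go into the template, e.g.:
#   echo 'vm.overcommit_memory = 2' | sudo tee /etc/sysctl.d/90-overcommit.conf
```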
I've been running my changes for a couple of days with much more stability. I think some programs are, shall we say, liberal with memory usage; e.g. Element (the Matrix client) asks for 1 TB of RAM (bug raised). The OOM heuristics take into account type, duration, history, and a whole bunch of other stuff, but a user program which abuses memory allocation/reservation could preempt those heuristics without correct OOM configuration of the important system-level tasks. Though I've been using Linux/Unix since Slackware and Yggdrasil back in the early 90s, I have to admit I am new to the OOM features of the Linux kernel. Improving paging?
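To see how those heuristics are currently scoring processes inside a qube, the per-process values can be read straight out of procfs (nothing Qubes-specific here):

```bash
# Print oom_score, oom_score_adj and process name, worst OOM candidates first.
for p in /proc/[0-9]*; do
  printf '%8s %6s %s\n' "$(cat "$p/oom_score")" "$(cat "$p/oom_score_adj")" "$(cat "$p/comm")"
done 2>/dev/null | sort -rn | head -n 20
```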
Got some results but they were mixed.
Hi, I would just like to chip in on this issue because it may actually describe my issue more accurately than #7695. In the other ticket, the Firefox window is freezing; however, my issue is that the qube itself freezes, requiring me to kill the qube itself and restart it. That seems to be more accurately described in this ticket. I get it once every day or two, in a qube dedicated to using Firefox that I use daily. Currently FF in fedora-37, initial memory 553 MB, max memory 3 GB, 4 VCPUs, included in memory balancing. Will obviously try to switch to the newly released fedora-38 and see if the issues continue. Let me know if there are nice ways to get logs that may be helpful. Would like to keep using Firefox for this purpose, but the issue is quite annoying.
@mfc My first experience of this was that the Qube appeared to be frozen. I then started probing around and could see IO and CPU activity - evidence of brain function = not dead!! I would say that may not be enough RAM; as said, I have managed to crash it with >8G (and no swap). The issue is the OOM killer system: instead of letting one app fail, it heads towards disaster for the whole system.

What was happening is that the daemon that sits on the end of the pipe that joins the DOMs had been killed, which meant that no commands from the Qube Manager were getting through, and neither were any status messages. I looked at the log and you could see it trying to restart, but at the same time as a bunch of other processes, and it kept losing out. The other channel is the GUI, and that uses X over the same interface, so no GUI updates either. There is also a shared-memory video buffer, but I don't know how that works.

I explored some of the Xen tools. Another time I monitored the situation using top and could see it going wrong in real time. Dom0 is surprisingly memory hungry and suffers exactly the same issues.

Things to do: Then run FF as normal and see what happens.

Logs:
sudo tail -f /var/log/syslog
journalctl -kb -0   # kernel messages from the current (most recent) boot
journalctl -xe      # extra detail, with the most recent errors at the bottom
journalctl -f       # continuous output (add -x or -u <unit> to change the detail)

The problem with journalctl is that without -u it gets noisy, and unless you already know the name of the unit(s) you need to look at, you don't know which name to use. Then you have the exported logs (if you can't get to the system ones) by right-clicking on the qube in the Qube Manager. Do the same on dom0, as that is the side sending the commands and expecting responses. There is also a completely separate logging system for Xen.
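If the qube's GUI is already wedged, the same logs can usually still be pulled from dom0 over qrexec; a sketch, with the VM name as a placeholder:

```bash
# dom0: grab the guest journal and kernel messages even when no window will open.
qvm-run --pass-io --no-gui vmname 'journalctl -b -0 --no-pager | tail -n 500' > vmname-journal.txt
qvm-run --pass-io --no-gui vmname 'sudo dmesg | tail -n 200' > vmname-dmesg.txt

# dom0-side views of the same event:
sudo xl dmesg | tail -n 100      # Xen hypervisor log
xentop                           # live per-domain CPU/memory usage
journalctl -u qubesd -b -0       # qubesd's view of the qube (e.g. failed qrexec calls)
```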
Found this while searching for reasons for frequent qube "comas" in 4.1. It appeared to happen at high memory load and searching in this direction I quickly found this issue. In my case I'm running Jira, Confluence, Postgres, Visual Studio Code, and browser tabs galore in a 12 Gig app VM, on a machine with 16 Gig RAM. "Own fault" you might say. But having to kill the App VM regularly is not particularly convenient.
Affects 4.2
I've seen other similar issues that may be closely related to this one.
And I was going to comment on those, but my notes all got a bit too long.
e.g. I think this may be exactly the same issue as #7695.
Qubes OS release
4.1.1
Brief summary
I've had issues with an appVM qube randomly hanging every few days. The machine has 32G RAM. The app causing issues has an obvious memory leak and that is what kills the machine every 3-4 days. The appvm has 12G allocated.
When doing some data processing work I noticed something interesting that could be related.
When I ran a Jupyter notebook I could get it to hang in the same way as the faulty app (for which I am awaiting a bug fix, but that's unrelated).
But then, if I ran that same Python module from the CLI, it did not.
I tried a few experiments and came up with a minimal, reasonably repeatable means of triggering the 'hang'.
I've documented that here.
Steps to reproduce
(Collapsed due to length)
Make any Qube freeze with a handful of lines of Python (a rough sketch of such a script follows the test list below)
On an app vm
Test1: Same VM as a user
python3 memoryhog.py # runs for a bit then says 'Killed'
Test2: Try this in qtconsole
Then as a user in a terminal
Test3: Same from dom0 console
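The actual memoryhog.py isn't reproduced above, so the following is only an assumption of roughly what such a script does; the chunk size and loop are made up for illustration:

```bash
# Rough stand-in for memoryhog.py: keep appending 100 MiB chunks until something gives.
python3 -c '
chunks = []
while True:
    chunks.append(b"x" * (100 * 1024 * 1024))   # hold references so nothing is freed
    print(len(chunks) * 100, "MiB allocated", flush=True)
'
```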
Expected behavior
Actual behavior
Other observations that may help:
When it's broken, the `xentop` command on dom0 shows the offending VM's CPU at 100%. If you know which app is causing the issue (which the logs indicate), you can sometimes recover using `qvm-run --pass-io vmname -- killall <appname>` (or `kill -SIGTERM <pids of related processes>`).
My thoughts
(Collapsed section due to length)
Possible reasons (guesses)?
The app itself is uncooperative with oom-killer and is restarting all or part of itself as fast as oom-killer can kill it? Whack-a-mole.
Bits of some mechanisms related to gui apps in the OS are left partially broken?
Some interdependent threads are left running when they should be stopped or restarted so the resources can be freed?
Whatever is happening, it's clear, at least for me, that when you trigger this using the above script, oom-killer sometimes starts breaking other things, making the situation worse instead of better. It would be great if others could confirm the same behaviour.
Additionally, and perhaps related, I notice there is another issue (again, confirmation needed): when you trigger this, dom0 is completely unaware that anything is awry. Any command you type or status you view just says everything is fine. You can even interact with the container dialogs of the broken app. In reality, if you dig deeper, the app VM is sitting at 100% CPU and either killing things as fast as they are starting or failing to recover critical components.
The usual solution is often to kill that appVM completely, as there is no way to log in to fix the issue because no more apps (i.e. a terminal) can start.
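Two approaches that may still give a foothold in a wedged qube without starting a new GUI app, though whether they work depends on what oom-killer has already shot (VM and app names are placeholders):

```bash
# dom0: run one-off commands in the qube over qrexec, no GUI involved.
qvm-run --pass-io --no-gui vmname 'ps aux --sort=-%mem | head -n 15'
qvm-run --pass-io --no-gui vmname 'sudo killall <appname>'

# dom0: attach to the qube's text console (Ctrl-] detaches); qvm-console-dispvm
# wraps the same thing in a disposable terminal window.
sudo xl console vmname
```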
I did resolve the issue a couple of times by sending the remote kill command to take out processes, but only because I happened to be running `top` in another window when it hung, so I had those pids to hand.

I should read up, research, and understand better how this all works, so I apologise if I am making assumptions; right now I don't have time to dig into this more. So, having made an excuse, I will make the following (uneducated) conclusion:
Any offending app should just fail in a memory allocation, as it would normally on any other OS, and then quit gracefully or segfault. The result of the impending OOM state surely should not be to trigger a mechanism that semi-randomly kills off tasks, as that might create chaos (core wars, they used to call it) and compromises the stability of the entire system instead of just one app. What I mean is: just because (in the above) it is Python using all the memory, how can oom-killer know for certain which app was the straw that broke the camel's back, and not wrongly pick on some other poor process that just happens to do a malloc at about the same time?
Thus maybe this is a fundamental issue and it is oom-killer that is the cause. Perhaps what `oom-killer` should do is just send a message to dom0, which would simply pause/freeze the stressed qube and report the issue to the user in a message dialog. This would happen when some threshold is exceeded, not when it's too late, allowing an immediate unfreeze and user intervention; indeed, the next action would be to present some sort of process manager and a dice.
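Nothing like this exists in Qubes today as far as I know, but as a sketch of the idea, a dom0-side watchdog could poll the qube's available memory over qrexec and pause the domain before the guest oom-killer gets involved. The VM name and threshold below are invented for illustration:

```bash
#!/bin/bash
# Hypothetical dom0 watchdog: pause a qube when its free memory drops below a threshold,
# so the user can intervene instead of letting oom-killer pick victims.
VM=work                          # assumed qube name
MIN_AVAIL_KB=$((200 * 1024))     # pause when under ~200 MiB MemAvailable

while sleep 5; do
  avail=$(qvm-run --pass-io --no-gui "$VM" \
          "awk '/MemAvailable/ {print \$2}' /proc/meminfo") || break
  if [ "${avail:-0}" -lt "$MIN_AVAIL_KB" ]; then
    qvm-pause "$VM"              # or: sudo xl pause <domain>
    notify-send "Qubes" "$VM paused with only ${avail} kB available - inspect and resume"
    break
  fi
done
```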
Final thought:
In reality I think it's just the graphics and IO that can't respond, so I guess this could be described as a coma, not a freeze!!
There is no way I've found (so far) to create a terminal login to inspect and resolve the issue once it's triggered.
The VM is clearly still alive, just not visibly so.
Hopefully the few hours/days I've put into this will help others with better knowledge come up with a simple resolution. For me it's a tad annoying, and I have lost work to it, but not enough to give up on Qubes OS. It could even be just a common config error that I and a few others have fallen into, and nothing to do with oom-killer at all?