External USB drives possibly interrupting backup by going to sleep #7797

Closed
AlbertGoma opened this issue Sep 30, 2022 · 16 comments
Labels
affects-4.1, C: core, C: usb proxy, eol-4.1, hardware support, P: critical

Comments

@AlbertGoma

Qubes OS release

Upgrade from 4.0 to 4.1.1

Brief summary

When backing up relatively large VMs to an external USB drive, the drive motor stops spinning at some point and the resulting file contains only a sequential portion of the data.

I already mentioned this in my comment on issue #7567, as it might be a cause of the I/O errors there, and a fix could probably solve both issues.

Steps to reproduce

  1. Attach an external USB hard drive (in this particular scenario, a 3.5" SATA hard drive on a USB 3.0 dock) with a valid partition table, enough space and a healthy filesystem (in this case GPT and ext4).
  2. Start a non-networked disposable VM and mount the drive's block device sys-usb:sda (not the USB device) on it.
  3. Start the Qubes Backup tool and select around 40 VMs with a few of them having a storage use over 200GiB each, exceeding 900GiB in total. Have some of those GiB filled from /dev/zero and some others from /dev/urandom, just in case.
  4. Uncheck the Compress backup checkbox and click Next.
  5. Set the disposable VM as the Target qube and choose a Backup directory in the external hard drive's filesystem.
  6. Click Next until the backup starts.
  7. Wait until the backup is apparently finished and the hard drive motor has stopped spinning.

Expected behavior

The restored VMs' logical volumes' storage byte count is identical to the original one before starting the backup.

Actual behavior

In the Qubes Backup Restore tool an I/O error popped up, and half of the VMs showed 0 bytes of Disk Usage in the Qube Manager.

When doing an emergency recovery, all of those 0-byte VMs had an Unexpected EOF error in all of their chunks when decrypting them with scrypt. One VM's chunks were readable up to the 490th.

@AlbertGoma added the P: default and T: bug labels Sep 30, 2022
@DemiMarie

First, Qubes Backup should definitely fail. That means it should indicate an error and return a non-zero exit code. If it does not, that is a bug.
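
A wrapper script can only catch a failed backup if it checks that exit code. The following is a minimal sketch of such a check; `run_and_check` is a hypothetical helper, and the child process is a stand-in that simply exits with code 3, not the real backup command.

```python
import subprocess
import sys

def run_and_check(cmd):
    """Run cmd and return (ok, returncode); ok is True only for exit code 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.returncode

# Stand-in child process that fails with exit code 3 (NOT the real backup tool).
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
ok, code = run_and_check(failing)
if not ok:
    print(f"backup command failed with exit code {code}")
```

If the tool exits with 0 despite a truncated backup, a check like this cannot help, which is why a silent failure would be a bug in its own right.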

Second, your hardware might have problems. One possibility is a failing hard drive, but another is that it uses device-managed shingled magnetic recording (SMR). SMR drives can freeze for long periods of time during garbage collection, and this can cause Linux to treat them as failed and disconnect them.

@DemiMarie added the P: critical label and removed the P: default label Oct 1, 2022
@andrewdavidwong added the C: core, C: usb proxy, needs diagnosis, and hardware support labels Oct 1, 2022
@andrewdavidwong added this to the Release 4.1 updates milestone Oct 1, 2022
@AlbertGoma
Author

AlbertGoma commented Oct 1, 2022

Regarding the hard drive, I looked at the manufacturer's technical specification sheet, and all the 3.5" versions in that category use CMR (which I assume stands for Conventional Magnetic Recording). It could be failing, but it's supposed to be a high-end one, it is within its 5-year warranty, and it hasn't shown any other signs of failure yet. I could scan it for bad sectors if that would be useful.

Today I tried to reproduce the error on the same hard drive using the same USB 3.0 docking station but unfortunately the backup finished and verified successfully. However:

  • My current R4.1.1's sys-usb and the disposable VM where I mounted the drive are both based on the current debian-11 template rather than R4.0's old fedora-32. The kernel version of the disposable VM under both Qubes releases must have been different as well, as it uses PVH virtualization, but I don't remember which was the last version I had in R4.0. As sys-usb uses HVM I understand it uses the latest kernel installed on the template.
  • According to my phone's Clock app's stopwatch, 20 minutes and a few seconds after python3 -m qubes.tarwriter started, the hard drive's motor stopped, but when scrypt enc - /tmp/randomname/vmXX/private.img.XXX.enc appeared in Dom0's Task Manager the motor started spinning again.
  • I only tried to back up a single AppVM with the following data in the /home/user directory:
-rw-r--r-- 1 user user  50G Oct  1 09:21 urandom.1
-rw-r--r-- 1 user user 100G Oct  1 09:47 urandom.3
-rw-r--r-- 1 user user 200G Oct  1 09:37 zero.2
-rw-r--r-- 1 user user  50G Oct  1 09:49 zero.4
  • The qubes-backup file size is 153,656.37 MiB while the Disk Usage displayed by the Qube Manager is 419,471.36 MiB, therefore sparse zeroes have been left out.

So sleep happens, although not causing any I/O errors under these settings. The old fedora-32 template was saved from the disaster, so maybe I should try again using it for both sys-usb and dispVM in HVM mode and with enough backed up VMs to almost fill the entire drive so it causes multiple sleep events within the same backup session.

@rustybird

  • The qubes-backup file size is 153,656.37 MiB while the Disk Usage displayed by the Qube Manager is 419,471.36 MiB, therefore sparse zeroes have been left out.

So sleep happens, although not causing any I/O errors under these settings.

That's normal on LVM.
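
The numbers quoted above are consistent with the zero-filled data being skipped: the two urandom files total 150 GiB, which is 153,600 MiB, close to the reported 153,656.37 MiB backup size. As a loose analogy only, the sketch below shows how a sparse file on an ordinary POSIX filesystem can report a large apparent size while occupying very little storage. It does not reproduce how LVM or the backup tool handle zeroes, and `sizes` and `make_sparse_file` are hypothetical helper names.

```python
import os
import tempfile

def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for path (POSIX only)."""
    st = os.stat(path)
    # st_blocks is counted in 512-byte units regardless of filesystem block size.
    return st.st_size, st.st_blocks * 512

def make_sparse_file(path, size):
    """Create a file of `size` bytes: a hole followed by a single zero byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")

with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "zero.img")
    make_sparse_file(demo, 16 * 1024 * 1024)
    apparent, allocated = sizes(demo)
    print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On filesystems that support sparse files, the allocated figure is typically far smaller than the apparent one; the exact value depends on the filesystem.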

@ddevz

ddevz commented Oct 7, 2022

... I made the risky decision of not verifying the backup's integrity, as it would have required a similar amount of hours ...

While I recommend verifying backups in the future, don't feel too bad about that decision, because "verify" does not actually seem to verify that the backup happened. You could have run the verify, gotten "everything backed up fine", and still had the same problem. (The EOF message implies to me that this would have happened to you.) (Note: I've just turned the verify problem into its own issue at #7809.)

@AlbertGoma
Author

In case it may be useful, the USB 3.0 dock I used to perform the backup was a Sharkoon QuickPort Combo USB3.0. Both the computer and the dock were plugged into an Uninterruptible Power Supply.

@andrewdavidwong
Member

To be clear: This happens only when using the dock; it does not happen when bypassing the dock and plugging the external USB hard drive directly into the computer?

@AlbertGoma
Author

This dock's function is to allow using internal drives as external ones. After the failure I kept doing backups to that hard drive, but bypassing the dock and plugging the drive directly into the motherboard's SATA port. When I do backups like this, the motor never stops and the verify seems to work fine. (However, I have not yet dared to restore them on my PC. I could install Qubes on another drive and try to restore them there, to confirm the verification process didn't give a false success message, if that would be useful.)

@andrewdavidwong added the affects-4.1 label Aug 8, 2023
@andrewdavidwong removed this from the Release 4.1 updates milestone Aug 13, 2023
@andrewdavidwong added the eol-4.1 label Dec 7, 2024

@github-actions bot closed this as not planned Dec 7, 2024
@andrewdavidwong removed the needs diagnosis label Dec 7, 2024
@SarneWeber

Affects R4.2.3

@SarneWeber

In the qube that the drive is attached to, I ran a script that created and deleted a file every 5 seconds, hoping that would make the drive stay awake, but the backup still got stuck, unfortunately.

@andrewdavidwong added the needs diagnosis and affects-4.2 labels and removed the eol-4.1 and T: bug labels Jan 22, 2025
@DemiMarie

In the qube that the drive is attached to, I ran a script that created and deleted a file every 5 seconds, hoping that would make the drive stay awake, but the backup still got stuck, unfortunately.

You need to call sync or fsync to ensure the changes make it all the way out to storage, rather than just staying in a cache somewhere.
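
A minimal sketch of such a keep-alive loop follows, assuming a POSIX system. The names `touch_and_flush` and `keep_awake` are hypothetical, and the temporary directory stands in for the mount point of the external drive. Whether this actually prevents spin-down depends on the drive and dock firmware.

```python
import os
import tempfile
import time

def touch_and_flush(directory, name="keepalive.tmp"):
    """Write a small file, force it out to storage with fsync, then remove it."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(str(time.time()).encode())
        f.flush()              # push Python's userspace buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to write the data to the device
    os.remove(path)
    return path

def keep_awake(directory, interval=5.0, rounds=3):
    """Call touch_and_flush every `interval` seconds, `rounds` times."""
    for _ in range(rounds):
        touch_and_flush(directory)
        time.sleep(interval)

# Demo with a temporary directory standing in for the drive's mount point.
with tempfile.TemporaryDirectory() as tmp:
    keep_awake(tmp, interval=0.01, rounds=2)
```

Without the `os.fsync` call, the write may sit in the page cache and never reach the drive before the file is deleted.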

@rustybird

rustybird commented Jan 23, 2025

Is it even a problem if the drive spins down? Normally it should just transparently spin up again when data transfer resumes. This is not supposed to upset filesystem mounts etc.

I think it's usually not that spinning down causes a pause in the backup, but the reverse: Some temporary or indefinite pause in the backup causes the drive to spin down.

@SarneWeber Do you see any kernel errors?

@SarneWeber

SarneWeber commented Jan 24, 2025

Thanks for reminding me to call sync. I tried it with that, but got the same results unfortunately.

Then I tried something else: I used ssh to store the backup on a different computer, with a different (this time internal) drive. The drive has not gone to sleep, and yet the backup is still stuck at the same number of bytes written as was the case with my external USB drive. Therefore, I believe I was mistaken, as rustybird suggested.

I am still curious about what causes my backups to get stuck. But I'm unsure whether a comment on this issue is an appropriate place, as the description of the issue likely does not match my problem. So please tell me if it is inappropriate.

Here's some details about my issue:

  • When the backup has failed and has gotten stuck (always after roughly the same number of bytes written, with less than 0.1% difference), neither the gzip nor the scrypt process shows up in top.
  • If I try to back up only shut-down qubes (and if I exclude dom0 as well), the backup still gets stuck.
  • If I turn off compression, the backup still gets stuck, this time with a larger file size.
  • I see no concerning messages in journalctl or dmesg in dom0, the disposable VM that the drive is attached to, or sys-usb. That is, apart from journalctl messages in dom0 that say [Time+date] dom0 qubesd[3004]: socket.send() raised exception.
  • Small backups do work
  • The backup still hangs if I keep using my laptop throughout the backup process
  • I have one particularly large qube. If I make a backup that excludes that qube and all running VMs, the backup still fails.
  • I'm currently in the process of testing the backup with only the large qube. The large qube is significantly larger than the failed backup files.
  • If I try to verify these failed backups, the recovery tool successfully recognises there's an unexpected end of file.
  • Sometimes the Qubes Backup window becomes unresponsive
  • I tried the CLI utility once in verbose mode, but got the same result.
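
A hang of this kind, where the output stops growing at a fixed size, can be detected from outside the backup tool by sampling the size of the output file. The following is a rough sketch under that assumption; `is_stalled` is a hypothetical helper, and the temporary file stands in for the partially written backup file.

```python
import os
import tempfile
import time

def is_stalled(path, interval=30.0, checks=4):
    """Return True if the file size stays unchanged across `checks` samples."""
    last = os.path.getsize(path)
    for _ in range(checks):
        time.sleep(interval)
        current = os.path.getsize(path)
        if current != last:
            return False       # the file is still changing, so not stalled
        last = current
    return True

# Demo on a temporary file that nobody writes to (stand-in for a backup file).
with tempfile.NamedTemporaryFile() as f:
    f.write(b"partial backup data")
    f.flush()
    stalled = is_stalled(f.name, interval=0.01, checks=3)
    print(stalled)
```

In practice the interval would need to be long enough to ride out legitimate pauses, such as a drive spinning up again.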

@andrewdavidwong
Member

@SarneWeber, if I understand you correctly, you're saying that this issue does not really affect Qubes 4.2 after all and therefore should not have been reopened. In that case, I'll re-close this issue.

Regarding your related problem, please note that this issue tracker (qubes-issues) is not intended to serve as a help desk or tech support center. Instead, we've set up other venues where you can ask for help and support, ask questions, and have discussions. (By contrast, the issue tracker is more of a technical tool intended to support our developers in their work.) Thank you for your understanding.

@andrewdavidwong added the eol-4.1 label and removed the needs diagnosis and affects-4.2 labels Jan 25, 2025

This issue is being closed because:

If anyone believes that this issue should be reopened, please leave a comment saying so.
(For example, if a bug still affects Qubes OS 4.2, then the comment "Affects 4.2" will suffice.)

@rustybird

rustybird commented Jan 25, 2025

@SarneWeber This could be a combination of two problems:

  1. Some of your VMs are causing an error when they are being backed up. Try to narrow it down to one individual VM by doing a binary search (i.e. divide the VMs to be backed up in half to see which half hangs forever and repeat the process with that half). Then maybe create a new ticket or forum post?

  2. What should be a noisy fatal error sometimes causes the backup system to silently hang forever. I've opened #9739, "Some fatal backup errors cause the backup system to silently hang forever (unless backing up to dom0?)", to track this aspect.
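
The binary search from point 1 can be sketched as follows. This assumes that exactly one VM is responsible and that the hang is reproducible; if the hang instead depends on the total backup size, halving the set will not isolate a single VM. `find_failing_vm` and the `backup_hangs` callback are hypothetical, and the VM names are made up for the simulated run.

```python
def find_failing_vm(vms, backup_hangs):
    """Narrow a list of VM names down to one whose backup hangs.

    backup_hangs(subset) is a hypothetical callback that backs up `subset`
    and returns True if that backup hangs or fails.
    """
    candidates = list(vms)
    if not candidates or not backup_hangs(candidates):
        return None            # nothing to search, or the full set works
    while len(candidates) > 1:
        half = len(candidates) // 2
        first, second = candidates[:half], candidates[half:]
        # Keep whichever half still reproduces the hang.
        candidates = first if backup_hangs(first) else second
    return candidates[0]

# Simulated run: pretend the qube named "vault-big" is the culprit.
vms = ["work", "personal", "vault-big", "untrusted", "sys-net-clone"]
print(find_failing_vm(vms, lambda subset: "vault-big" in subset))
```

With n VMs this needs roughly log2(n) test backups after the initial one, which matters when each attempt takes hours.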

6 participants