Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Race condition when running backup and monitor at ~ the same time will result in failed backup #4496

Open
MicLieg opened this issue Feb 12, 2024 · 2 comments

Comments

@MicLieg
Copy link
Contributor

MicLieg commented Feb 12, 2024

User story

As a server admin, i want the monitor command to not interrupt currently running backups.

Game

All ?!

Linux distro

Ubuntu 22.04

Command

command: backup

Further information

I noticed a bug when my Minecraft server got mistakenly restarted during a backup because of an automated cronjob running the monitor command (while the server was creating a backup) but essentially at the same time. This occured becasuse i have the monitor command set to run every 5 minutes and the backup command to run every day at 4:00 am.

After looking into what's happening i think i found the issue:

If you start a backup and immediatley run the monitor command after the server has already been stopped by the backup command but before the backup.lock file was created, the monitor command will pass the backup.lock filecheck and run regularly (start querying the gameserver) and restart the server because the backup.lock file didn't exist. The monitor command will send out the appropriate alert Unable to query mcserver. Game server has been restarted. and restart the server. This will possibly result in the backup not finishing becasue of a tar error: Backing up mcserver: Backup (9,7G) mcserver-2024-02-12-015407.tar.gz, in progress...tar: ././serverfiles/world: file changed as we read it FAIL This will also lead to the backup.lock file not being deleted.

Note that this will not be noticed if the backup finishes within 60 seconds of the monitor command being called, in which case the monitor command will simply assume that the server is online and thus won't restart the server.

If creating the backup takes a little longer the monitor command will continue querying and spew out errors depending how far the backup command got:

[ ./mcserver backup is already running ]
testgameserver@HostMachine:~/minecraft$ ./mcserver m
[  OK  ] Monitoring mcserver: Checking session: OK
[ .... ] Monitoring mcserver: Querying port: gamedig: 192.168.1.60:25566 : 0/1: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 127.0.0.1:25566 : 15/2: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 192.168.1.60:25566 : 15/2: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 127.0.0.1:25566 : 30/3: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 192.168.1.60:25566 : 30/3: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 127.0.0.1:25566 : 45/4: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 192.168.1.60:25566 : 45/4: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ .... ] Monitoring mcserver: Querying port: gamedig: 127.0.0.1:25566 : 60/5: QUERYINGhead: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
/home/testgameserver/minecraft/lgsm/modules/command_monitor.sh: line 273: [: : integer expression expected
[ FAIL ] Monitoring mcserver: Querying port: gamedig: 127.0.0.1:25566 : 60/5: FAIL
[  OK  ] Monitoring mcserver: Sending Discord alert
fetching GitHub command_restart.sh...OK

Another error that can occur depending on timing:

[ ./mcserver backup is already running ]
testgameserver@HostMachine:~/minecraft$ ./mcserver m
[  OK  ] Monitoring mcserver: Checking session: OK
[  OK  ] Monitoring mcserver: Querying port: gamedig: 0.0.0.0:25566 : 0/1: DELAY
head: cannot open '/home/testgameserver/minecraft/lgsm/lock/mcserver-started.lock' for reading: No such file or directory
date: invalid date ‘@’

core_exit.sh exiting with code: 0

Possible Solutions:

  • Create the backup.lock file earlier in the backup flow.
  • Re-Check the existance of lockfiles whilst querying the server in the monitor flow.
  • Add a gameserver-monitor.lock file that forces commands that affect uptime to wait until the lock file is removed.

Relevant log output

Feb 12 01:54:00.784 mcserver: START: PASS: Started TestServer
Feb 12 01:54:00.887 mcserver: START: PASS: Started TestServer
Feb 12 01:54:00.889 mcserver: START: INFO: Sending alert: mcserver has started
Feb 12 01:54:02.618 mcserver: START: PASS: Sending Discord alert
Feb 12 01:54:02.624 mcserver: START: INFO: LinuxGSM version: v24.1.3
Feb 12 01:54:02.625 mcserver: START: PASS: core_exit.sh exiting with code: 0
Feb 12 01:54:07.788 mcserver: BACKUP: INFO: Starting backup
Feb 12 01:54:07.890 mcserver: BACKUP: INFO: No previous backups found
Feb 12 01:54:07.995 mcserver: BACKUP: WARN: mcserver will be restarted
Feb 12 01:54:13.036 mcserver: STOP: INFO: Graceful: sending "stop"
Feb 12 01:54:14.023 mcserver: MONITOR: INFO: Checking session: CHECKING
Feb 12 01:54:14.236 mcserver: MONITOR: PASS: Checking session: OK
Feb 12 01:54:15.740 mcserver: MONITOR: INFO: Querying port: gamedig: 127.0.0.1:25566 : 1 : QUERYING
Feb 12 01:54:16.844 mcserver: MONITOR: INFO: Querying port: gamedig: 0.0.0.0:25566 : 1 : DELAY
Feb 12 01:54:16.845 mcserver: MONITOR: INFO: Query bypassed: mcserver started less than 1 minutes ago
Feb 12 01:54:16.847 mcserver: MONITOR: INFO: Server started: Mo 12. Feb 01:54:00 CET 2024
Feb 12 01:54:16.848 mcserver: MONITOR: INFO: Current time: Mo 12. Feb 01:54:16 CET 2024
Feb 12 01:54:16.851 mcserver: MONITOR: INFO: LinuxGSM version: v24.1.3
Feb 12 01:54:16.852 mcserver: MONITOR: PASS: core_exit.sh exiting with code: 0
Feb 12 01:54:19.965 mcserver: MONITOR: INFO: Checking session: CHECKING
Feb 12 01:54:20.178 mcserver: MONITOR: PASS: Checking session: OK
Feb 12 01:54:20.774 mcserver: STOP: PASS: Graceful: sending "stop": OK: 6 seconds
Feb 12 01:54:20.784 mcserver: BACKUP: INFO: Backup lockfile generated
Feb 12 01:54:20.785 mcserver: BACKUP: INFO: /home/testgameserver/minecraft/lgsm/lock/backup.lock
Feb 12 01:54:20.887 mcserver: BACKUP: INFO: A total of 9,7G will be compressed: /home/testgameserver/minecraft/lgsm/backup/mcserver-2024-02-12-015407.tar.gz
Feb 12 01:54:21.389 mcserver: BACKUP: INFO: Backup 9,7G mcserver-2024-02-12-015407.tar.gz, in progress
Feb 12 01:54:21.681 mcserver: MONITOR: INFO: Querying port: gamedig: 127.0.0.1:25566 : 1 : QUERYING
Feb 12 01:54:21.891 mcserver: MONITOR: WARN: Querying port: gamedig: 127.0.0.1:25566 : 1 : FAIL
Feb 12 01:54:23.393 mcserver: MONITOR: INFO: Querying port: gamedig: 192.168.1.60:25566 : 1 : QUERYING
Feb 12 01:54:23.603 mcserver: MONITOR: WARN: Querying port: gamedig: 192.168.1.60:25566 : 1 : FAIL
Feb 12 01:54:41.632 mcserver: MONITOR: INFO: Querying port: gamedig: 127.0.0.1:25566 : 2 : QUERYING
Feb 12 01:54:41.841 mcserver: MONITOR: WARN: Querying port: gamedig: 127.0.0.1:25566 : 2 : FAIL
Feb 12 01:54:43.344 mcserver: MONITOR: INFO: Querying port: gamedig: 192.168.1.60:25566 : 2 : QUERYING
Feb 12 01:54:43.553 mcserver: MONITOR: WARN: Querying port: gamedig: 192.168.1.60:25566 : 2 : FAIL
Feb 12 01:55:01.582 mcserver: MONITOR: INFO: Querying port: gamedig: 127.0.0.1:25566 : 3 : QUERYING
Feb 12 01:55:01.792 mcserver: MONITOR: WARN: Querying port: gamedig: 127.0.0.1:25566 : 3 : FAIL
Feb 12 01:55:03.295 mcserver: MONITOR: INFO: Querying port: gamedig: 192.168.1.60:25566 : 3 : QUERYING
Feb 12 01:55:03.505 mcserver: MONITOR: WARN: Querying port: gamedig: 192.168.1.60:25566 : 3 : FAIL
Feb 12 01:55:21.532 mcserver: MONITOR: INFO: Querying port: gamedig: 127.0.0.1:25566 : 4 : QUERYING
Feb 12 01:55:21.742 mcserver: MONITOR: WARN: Querying port: gamedig: 127.0.0.1:25566 : 4 : FAIL
Feb 12 01:55:23.245 mcserver: MONITOR: INFO: Querying port: gamedig: 192.168.1.60:25566 : 4 : QUERYING
Feb 12 01:55:23.454 mcserver: MONITOR: WARN: Querying port: gamedig: 192.168.1.60:25566 : 4 : FAIL
Feb 12 01:55:41.481 mcserver: MONITOR: INFO: Querying port: gamedig: 127.0.0.1:25566 : 5 : QUERYING
Feb 12 01:55:41.691 mcserver: MONITOR: WARN: Querying port: gamedig: 127.0.0.1:25566 : 5 : FAIL
Feb 12 01:55:41.894 mcserver: MONITOR: WARN: Querying port: gamedig: 127.0.0.1:25566 : 5 : FAIL
Feb 12 01:55:41.897 mcserver: MONITOR: INFO: Sending alert: Unable to query mcserver. Game server has been restarted
Feb 12 01:55:43.635 mcserver: MONITOR: PASS: Sending Discord alert
Feb 12 01:55:44.937 mcserver: STOP: INFO: TestServer is already stopped
Feb 12 01:55:46.589 mcserver: START: INFO: Rotating log files

Feb 12 01:55:47.710 mcserver: START: PASS: Started TestServer
Feb 12 01:55:47.812 mcserver: START: PASS: Started TestServer
Feb 12 01:55:47.820 mcserver: MONITOR: INFO: LinuxGSM version: v24.1.3
Feb 12 01:55:47.820 mcserver: MONITOR: PASS: core_exit.sh exiting with code: 0
Feb 12 01:57:25.220 mcserver: BACKUP: FATAL: Backup in progress: FAIL

Feb 12 01:57:25.323 mcserver: BACKUP: FATAL: Starting backup
Feb 12 01:57:27.148 mcserver: START: ERROR: TestServer is already running
Feb 12 01:57:27.725 mcserver: START: INFO: Rotating log files

Feb 12 01:55:47.710 mcserver: START: PASS: Started TestServer
Feb 12 01:55:47.812 mcserver: START: PASS: Started TestServer
Feb 12 01:55:47.820 mcserver: MONITOR: INFO: LinuxGSM version: v24.1.3
Feb 12 01:55:47.820 mcserver: MONITOR: PASS: core_exit.sh exiting with code: 0
Feb 12 01:57:25.220 mcserver: BACKUP: FATAL: Backup in progress: FAIL

Feb 12 01:57:25.323 mcserver: BACKUP: FATAL: Starting backup
Feb 12 01:57:27.148 mcserver: START: ERROR: TestServer is already running
Feb 12 01:57:27.725 mcserver: START: INFO: Rotating log files

Feb 12 01:57:28.842 mcserver: START: PASS: Started TestServer
Feb 12 01:57:28.945 mcserver: START: PASS: Started TestServer
Feb 12 01:57:28.953 mcserver: BACKUP: INFO: LinuxGSM version: v24.1.3
Feb 12 01:57:28.954 mcserver: BACKUP: PASS: core_exit.sh exiting with code: 0
Feb 12 01:58:04.750 mcserver: DETAILS: INFO: LinuxGSM version: v24.1.3
Feb 12 01:58:04.751 mcserver: DETAILS: PASS: core_exit.sh exiting with code: 0

Steps to reproduce

  1. Install any gameserver
  2. Enable stoponbackup
  3. Start gameserver
  4. Run ./gameserver backup
  5. Wait until the backup command is trying to stop the gameserver
  6. Run ./gameserver monitor
    • If monitor command notices a backup.lock file, you were too slow and the backup command has gotten too far already
    • If monitor command finds that the server is running, the backup command didn't stop the gameserver yet (quickly re-run the monitor command)
  7. The monitor command should find that the server is not online and start to repeatedly query the gameserver for 60 seconds
    • If the backup takes less than 60 seconds, the monitor command will acept the gameserver to be online (again)
    • If the backup takes more than 60 seconds, the monitor will assume the server is offline and will restart the gameserver
      • Force this by adding a big file to the serverfiles directory
  8. If the gameserver got restarted by the monitor command the backup can fail

Keep in mind to manually delete the backup.lock file in between trying to reproduce this bug becasue it won't be deleted sometimes.

@MicLieg
Copy link
Contributor Author

MicLieg commented Feb 24, 2024

Possibly related to #3972

@MicLieg
Copy link
Contributor Author

MicLieg commented Feb 25, 2024

link #2106

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: 🆕 New Issues
Development

No branches or pull requests

1 participant