How to handle zstor temporary outage/SIGKILL #21

Open
OmarElawady opened this issue Jun 27, 2022 · 3 comments

@OmarElawady
Contributor

If zdb calls the hook while zstor is down, nothing ensures the data file is eventually uploaded. Also, a recent change in zstor made the store commands non-blocking: it queues these commands internally. If zstor is SIGKILLed, that queued data is never uploaded.
After a discussion with @LeeSmet @maxux, two approaches were suggested. The first is to make the zstor client persist the store commands in an on-disk queue, from which zstor can pick them up and execute them. The second is to make zdb use the zstor check command to verify that files were uploaded successfully. For example, it can keep track of the last uploaded data file, periodically check whether the file after it was uploaded successfully, and reissue the store command if it was not.
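For illustration, a minimal sketch of the first approach (persisting store requests outside of zstor's memory), assuming a `zstor` binary whose `store` subcommand takes a file path; the `-f` flag, the queue file location, and the worker loop are assumptions, not the actual zstor client behavior:

```python
# Hedged sketch: a persistent queue for store requests that survives a
# zstor SIGKILL. The hook only appends a path; a separate worker retries
# until zstor accepts the file. Locking between the two is omitted here.
import subprocess
import time
from pathlib import Path

QUEUE_FILE = Path("/var/lib/zstor/store-queue")  # hypothetical location

def enqueue(path: str) -> None:
    """Called from the zdb hook: append the file path and return immediately."""
    QUEUE_FILE.parent.mkdir(parents=True, exist_ok=True)
    with QUEUE_FILE.open("a") as q:
        q.write(path + "\n")

def drain(interval: int = 60) -> None:
    """Worker loop: retry every queued file until zstor stores it."""
    while True:
        if QUEUE_FILE.exists():
            pending = [p for p in QUEUE_FILE.read_text().splitlines() if p]
            remaining = []
            for path in pending:
                # flag name is an assumption about the zstor CLI
                result = subprocess.run(["zstor", "store", "-f", path])
                if result.returncode != 0:
                    remaining.append(path)  # keep it for the next pass
            QUEUE_FILE.write_text("".join(p + "\n" for p in remaining))
        time.sleep(interval)
```

The point of the split is that the hook never blocks on the upload itself: the queue file is the durable record, so a crash of either zstor or the worker only delays the upload instead of losing it.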

@maxux
Collaborator

maxux commented Jun 27, 2022

Commands issued in the zdb hook are usually blocking (not all of them). It's better to keep the hook as fast as possible, so adding check-and-retry logic there is really not a good idea :)

@scottyeager
Contributor

A quick and easy solution would be to add a script that runs as a cron job and does the check/store routine. Maybe some mechanism is needed to prevent a duplicate call of store when the same file is already being stored (not sure what happens if this is attempted). Otherwise, I think a very simple approach of just iterating over the existing data files and checking all of them except for the last (open) one can achieve most of what's needed. Namespace files shouldn't be hard either.

The index files are more tricky though, since they can be mutated and then need to be uploaded again. In that case, we could check that the hash returned by zstor matches the hash of the local copy. Another approach would be storing the set of modified files when the relevant hook runs, then clearing them when a check succeeds.

Checking all the files on each run isn't ideal. So we could keep a list of all the file names that were already checked. That adds some complication but is probably an acceptable trade off.
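To illustrate the hash-comparison idea for the mutable index files, here is a rough sketch. It assumes `zstor check -f <path>` prints the checksum of the stored copy on stdout and that a local blake2b digest is comparable to it; the flag names, output format, and hash algorithm are assumptions about the zstor CLI, not confirmed behavior:

```python
# Hedged sketch: re-upload an index file only when the remote copy is
# missing or its checksum no longer matches the local file.
import hashlib
import subprocess
from pathlib import Path
from typing import Optional

def remote_checksum(path: Path) -> Optional[str]:
    proc = subprocess.run(
        ["zstor", "check", "-f", str(path)],  # flag name is an assumption
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return None  # not stored (or zstor unreachable)
    return proc.stdout.strip()

def local_checksum(path: Path) -> str:
    # blake2b-16 is an assumption about the digest zstor reports
    return hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()

def ensure_index_uploaded(path: Path) -> None:
    if remote_checksum(path) != local_checksum(path):
        subprocess.run(["zstor", "store", "-f", str(path)], check=True)
```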

@scottyeager
Contributor

I put together a script implementing my idea above.

It has the following behaviors:

  1. Checks that the namespace files are stored
  2. Checks that the data files are stored, except for the highest index number (the one zdb is still writing into)
  3. Writes a list of checked data files into a local text file, and skips checking in the future if the file name is present
  4. Index files get checked every time, since they are mutable and might have changed
  5. By default runs in an infinite loop with a sleep after every cycle. Sleep time is an arg and negative values cause only a single run (for use with cron for example)
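For reference, a rough sketch of what such a loop could look like (this is not the actual script; the zdb directory layout, the d*/i*/zdb-namespace file naming, and the zstor flags are assumptions):

```python
#!/usr/bin/env python3
# Hedged sketch of the check/store loop with behaviors 1-5 above.
import subprocess
import sys
import time
from pathlib import Path

DATA_DIR = Path("/data/data")       # assumed zdb data root
INDEX_DIR = Path("/data/index")     # assumed zdb index root
CHECKED_LIST = Path("/var/lib/zstor/checked-files")  # hypothetical

def is_stored(path: Path) -> bool:
    return subprocess.run(["zstor", "check", "-f", str(path)]).returncode == 0

def ensure_stored(path: Path) -> bool:
    """Return True once zstor reports the file as stored."""
    if is_stored(path):
        return True
    subprocess.run(["zstor", "store", "-f", str(path)])
    return is_stored(path)

def already_checked() -> set:
    return set(CHECKED_LIST.read_text().split()) if CHECKED_LIST.exists() else set()

def remember(path: Path) -> None:
    CHECKED_LIST.parent.mkdir(parents=True, exist_ok=True)
    with CHECKED_LIST.open("a") as f:
        f.write(f"{path}\n")

def run_once() -> None:
    done = already_checked()
    for ns in INDEX_DIR.iterdir():
        # 1. namespace descriptor
        ensure_stored(ns / "zdb-namespace")
        # 4. index files are mutable, so check them on every pass
        for index_file in sorted(ns.glob("i[0-9]*")):
            ensure_stored(index_file)
    for ns in DATA_DIR.iterdir():
        data_files = sorted(ns.glob("d[0-9]*"), key=lambda p: int(p.name[1:]))
        # 2. skip the highest-numbered (still open) data file
        for data_file in data_files[:-1]:
            # 3. immutable data files only need to be verified once
            if str(data_file) in done:
                continue
            if ensure_stored(data_file):
                remember(data_file)

if __name__ == "__main__":
    # 5. sleep seconds as an argument; a negative value means a single run (cron)
    sleep_time = float(sys.argv[1]) if len(sys.argv) > 1 else 600
    while True:
        run_once()
        if sleep_time < 0:
            break
        time.sleep(sleep_time)
```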
