How to handle zstor temporary outage/SIGKILL #21

Open
OmarElawady opened this issue Jun 27, 2022 · 3 comments

@OmarElawady
Contributor

If zdb calls the hook while zstor is down, nothing ensures the data file is eventually uploaded. Also, a recent change in zstor made the store commands non-blocking: it queues these commands internally. If zstor is SIGKILLed, that queued data is never uploaded.
After a discussion with @LeeSmet @maxux, two approaches were suggested. The first is to make the zstor client persist the store commands in an on-disk queue, from which zstor can pick them up and execute them. The second is to make zdb use the zstor check command to verify that files were uploaded successfully. For example, it can keep track of the last uploaded data file, periodically check whether the file after it was uploaded successfully, and reissue the store command if it was not.
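For illustration, a minimal sketch of the first approach (persisting store requests outside of zstor's memory), assuming a `zstor` binary whose `store` subcommand takes a file path; the `-f` flag, the queue file location, and the worker loop are assumptions, not the actual zstor client behavior:

```python
# Hedged sketch: a persistent queue for store requests that survives a
# zstor SIGKILL. The hook only appends a path; a separate worker retries
# until zstor accepts the file. Locking between the two is omitted here.
import subprocess
import time
from pathlib import Path

QUEUE_FILE = Path("/var/lib/zstor/store-queue")  # hypothetical location

def enqueue(path: str) -> None:
    """Called from the zdb hook: append the file path and return immediately."""
    QUEUE_FILE.parent.mkdir(parents=True, exist_ok=True)
    with QUEUE_FILE.open("a") as q:
        q.write(path + "\n")

def drain(interval: int = 60) -> None:
    """Worker loop: retry every queued file until zstor stores it."""
    while True:
        if QUEUE_FILE.exists():
            pending = [p for p in QUEUE_FILE.read_text().splitlines() if p]
            remaining = []
            for path in pending:
                # flag name is an assumption about the zstor CLI
                result = subprocess.run(["zstor", "store", "-f", path])
                if result.returncode != 0:
                    remaining.append(path)  # keep it for the next pass
            QUEUE_FILE.write_text("".join(p + "\n" for p in remaining))
        time.sleep(interval)
```

The point of the split is that the hook never blocks on the upload itself: the queue file is the durable record, so a crash of either zstor or the worker only delays the upload instead of losing it.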

@maxux
Collaborator

maxux commented Jun 27, 2022

Commands issued in the zdb hook are usually blocking (not all of them). It's better to keep the hook as fast as possible, so adding check-and-retry logic there is really not a good idea :)

@scottyeager
Contributor

A quick and easy solution would be to add a script that runs as a cron job and does the check/store routine. Maybe some mechanism is needed to prevent a duplicate call of store when the same file is already being stored (not sure what happens if this is attempted). Otherwise, I think a very simple approach of just iterating over the existing data files and checking all of them except for the last (open) one can achieve most of what's needed. Namespace files shouldn't be hard either.

The index files are more tricky though, since they can be mutated and then need to be uploaded again. In that case, we could check that the hash returned by zstor matches the hash of the local copy. Another approach would be storing the set of modified files when the relevant hook runs, then clearing them when a check succeeds.

Checking all the files on each run isn't ideal. So we could keep a list of all the file names that were already checked. That adds some complication but is probably an acceptable trade off.
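To illustrate the hash-comparison idea for the mutable index files, here is a rough sketch. It assumes `zstor check -f <path>` prints the checksum of the stored copy on stdout and that a local blake2b digest is comparable to it; the flag names, output format, and hash algorithm are assumptions about the zstor CLI, not confirmed behavior:

```python
# Hedged sketch: re-upload an index file only when the remote copy is
# missing or its checksum no longer matches the local file.
import hashlib
import subprocess
from pathlib import Path
from typing import Optional

def remote_checksum(path: Path) -> Optional[str]:
    proc = subprocess.run(
        ["zstor", "check", "-f", str(path)],  # flag name is an assumption
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return None  # not stored (or zstor unreachable)
    return proc.stdout.strip()

def local_checksum(path: Path) -> str:
    # blake2b-16 is an assumption about the digest zstor reports
    return hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()

def ensure_index_uploaded(path: Path) -> None:
    if remote_checksum(path) != local_checksum(path):
        subprocess.run(["zstor", "store", "-f", str(path)], check=True)
```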

@scottyeager
Contributor

I put together a script implementing my idea above.

It has the following behaviors:

  1. Checks that the namespace files are stored
  2. Checks that the data files are stored, except for the highest index number (the one zdb is still writing into)
  3. Writes a list of checked data files into a local text file, and skips checking in the future if the file name is present
  4. Index files get checked every time, since they are mutable and might have changed
  5. By default runs in an infinite loop with a sleep after every cycle. Sleep time is an arg and negative values cause only a single run (for use with cron for example)
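For reference, a rough sketch of what such a loop could look like (this is not the actual script; the zdb directory layout, the d*/i*/zdb-namespace file naming, and the zstor flags are assumptions):

```python
#!/usr/bin/env python3
# Hedged sketch of the check/store loop with behaviors 1-5 above.
import subprocess
import sys
import time
from pathlib import Path

DATA_DIR = Path("/data/data")       # assumed zdb data root
INDEX_DIR = Path("/data/index")     # assumed zdb index root
CHECKED_LIST = Path("/var/lib/zstor/checked-files")  # hypothetical

def is_stored(path: Path) -> bool:
    return subprocess.run(["zstor", "check", "-f", str(path)]).returncode == 0

def ensure_stored(path: Path) -> bool:
    """Return True once zstor reports the file as stored."""
    if is_stored(path):
        return True
    subprocess.run(["zstor", "store", "-f", str(path)])
    return is_stored(path)

def already_checked() -> set:
    return set(CHECKED_LIST.read_text().split()) if CHECKED_LIST.exists() else set()

def remember(path: Path) -> None:
    CHECKED_LIST.parent.mkdir(parents=True, exist_ok=True)
    with CHECKED_LIST.open("a") as f:
        f.write(f"{path}\n")

def run_once() -> None:
    done = already_checked()
    for ns in INDEX_DIR.iterdir():
        # 1. namespace descriptor
        ensure_stored(ns / "zdb-namespace")
        # 4. index files are mutable, so check them on every pass
        for index_file in sorted(ns.glob("i[0-9]*")):
            ensure_stored(index_file)
    for ns in DATA_DIR.iterdir():
        data_files = sorted(ns.glob("d[0-9]*"), key=lambda p: int(p.name[1:]))
        # 2. skip the highest-numbered (still open) data file
        for data_file in data_files[:-1]:
            # 3. immutable data files only need to be verified once
            if str(data_file) in done:
                continue
            if ensure_stored(data_file):
                remember(data_file)

if __name__ == "__main__":
    # 5. sleep seconds as an argument; a negative value means a single run (cron)
    sleep_time = float(sys.argv[1]) if len(sys.argv) > 1 else 600
    while True:
        run_once()
        if sleep_time < 0:
            break
        time.sleep(sleep_time)
```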
