Incremental mode #11

Open
xpqz opened this issue Apr 27, 2017 · 17 comments

Comments

@xpqz
Contributor

xpqz commented Apr 27, 2017

Incremental backup mode: retain the seq id from the last completed run, allow the backup tool to start from there.

Incremental restore mode: (this may already be possible) restore from a set of incremental backups.

@ricellis ricellis added this to the Later milestone Apr 27, 2017
@mikerhodes
Member

I think the core design concern here is safety. Can we have an incremental backup file list its "parent" file, for example via something like a GUID?

The restore tool could then construct a restore path from a complete backup via intermediary incremental backups, and refuse if that chain wasn't complete.

Other sanity checks are possible such as:

  • Store a completion seq in a backup file's header/footer. The backup tool can resume from a previous backup file then, rather than the user having to sort the seq values.
  • Store a start seq in a backup file's header/footer. The restore tool can check that the chain of completion seq to start seq is complete for a given restore chain of files.
  • We should encourage regular complete backups. As the incremental chain gets longer, its fidelity will decrease (as backup files may get lost) and restore time will increase.

A --force option could override these checks, so that a partial restore is still possible when a backup file has been lost.
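
The chain check described above could be sketched roughly as follows. This is only an illustration under assumptions: the `BackupHeader` fields, the function name and the `force` argument are hypothetical and not part of couchbackup, and the sketch assumes seq values can be compared as opaque strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackupHeader:
    """Hypothetical metadata stored in a backup file's header or footer."""
    backup_id: str             # GUID of this backup file
    parent_id: Optional[str]   # GUID of the parent file; None for a complete backup
    start_seq: Optional[str]   # seq the backup started from; None for a complete backup
    completion_seq: str        # seq reached when the backup finished


def build_restore_chain(headers, target_id, force=False):
    """Walk parent links from target_id back to a complete backup.

    Returns the headers in restore order (complete backup first).
    Raises ValueError if the chain is broken, unless force is True,
    in which case the partial chain found so far is returned.
    """
    by_id = {h.backup_id: h for h in headers}
    chain = []
    seen = set()  # guards against a cycle in the parent links
    current = by_id.get(target_id)
    while current is not None and current.backup_id not in seen:
        seen.add(current.backup_id)
        chain.append(current)
        if current.parent_id is None:
            return list(reversed(chain))  # reached the complete backup
        parent = by_id.get(current.parent_id)
        if parent is not None and parent.completion_seq != current.start_seq:
            parent = None  # seq values do not line up, so treat it as a gap
        current = parent
    if force and chain:
        return list(reversed(chain))
    raise ValueError("incomplete restore chain for " + target_id)
```

With `force=True` the caller gets whatever part of the chain could be followed, which corresponds to the partial restore case.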

@pulkitanchalia

Is this feature coming in near future?

@jareware

Is this not what --resume true is for..?

@ricellis
Member

no, --resume is different, see #218 (comment)

@wmbutler
Contributor

wmbutler commented Jul 2, 2018

Any word on planned support for incremental backups?

@ricellis
Member

ricellis commented Jul 3, 2018

Not at this time, no.

@wmbutler
Contributor

We have some databases that are taking 8-12 hours to back up. Can we get a concerted effort to look at this?

@baversjo

baversjo commented Oct 17, 2018

We're streaming backups up to S3. Would it be possible to implement a solution that doesn't require download of the entire backup file from S3, to start a new (incremental) backup? Maybe there could be an "index" file that has information about all incremental backups and what backup is the last full backup (base for all increments). Also, maybe the backup system will max store 14 incremental backups and force a full backup after that?
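
A minimal sketch of the "index" idea, assuming a small JSON document stored next to the backups so that only the index needs to be downloaded. The index layout, field names and function name are hypothetical; only the limit of 14 comes from the suggestion above.

```python
import json

MAX_INCREMENTALS = 14  # the limit suggested above


def next_backup_plan(index_text, max_incrementals=MAX_INCREMENTALS):
    """Decide the next backup type from a small JSON index document.

    The index layout is hypothetical:
    {"backups": [{"file": ..., "type": "full" | "incremental", "last_seq": ...}]}
    """
    backups = json.loads(index_text).get("backups", []) if index_text else []
    # Count the incrementals taken since the most recent full backup.
    since_full = 0
    for entry in reversed(backups):
        if entry["type"] == "full":
            break
        since_full += 1
    else:
        # No full backup is recorded at all, so start with one.
        return {"type": "full", "since": None}
    if since_full >= max_incrementals:
        return {"type": "full", "since": None}
    return {"type": "incremental", "since": backups[-1]["last_seq"]}
```

Because the decision reads only the index, the previous backup files themselves never have to be fetched from S3 to start a new run.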

@wmbutler
Contributor

wmbutler commented Dec 2, 2018

I've been playing with the /_changes endpoint. Seems to me we could fork this repo and start playing around with the idea of adding a last-event-id flag to pass through. Only two things are needed really:

  1. The backup must record the seq from the /_changes endpoint return
  2. We need to store the seq so we know where to pick up from

Seems that it might make sense to store this in a text file with the backup. The file could be called:

last-event-id.txt and could contain the seq value.

{
    "results": [
        {
            "seq": "6-g1AAAAMteJyl0TEOwiAUgGG0JuopdHNrSii2nfQm-h4PU03FxFRXvYneRG-iN6kgi91qu0BCwvcHXsEYG-UBsRmhOhz1kpCHQGdtylxDUeZRxENVHE4EpgyNLgt7oQ8MJ1VV7fIA2N4eDBWpVPCY2PhkSG-2RtPfKE7tiou6CxKyjHdzl85d1dxUCJJI3dy1cy81N5GY2G_o5JqBXdnVbpa-ObvnbZGkBNBca1m--_LDlfvfcjxHrQRvWxZNy09ffrly4MsbOZepbK61fPPbl38mGSFEGdUmufsAPT4DhQ",
            "id": "a0234f3e12be9d3faec1510e356a2257",
            "changes": [
                {
                    "rev": "1-42fd8ac8d7af38c45be37171b344f6a5"
                }
            ],
            "doc": {
                "_id": "a0234f3e12be9d3faec1510e356a2257",
                "_rev": "1-42fd8ac8d7af38c45be37171b344f6a5",
                "first": "frank"
            }
        }
    ],
    "last_seq": "6-g1AAAAMteJyl0TEOwiAUgGG0JuopdHNrSii2nfQm-h4PU03FxFRXvYneRG-iN6kgi91qu0BCwvcHXsEYG-UBsRmhOhz1kpCHQGdtylxDUeZRxENVHE4EpgyNLgt7oQ8MJ1VV7fIA2N4eDBWpVPCY2PhkSG-2RtPfKE7tiou6CxKyjHdzl85d1dxUCJJI3dy1cy81N5GY2G_o5JqBXdnVbpa-ObvnbZGkBNBca1m--_LDlfvfcjxHrQRvWxZNy09ffrly4MsbOZepbK61fPPbl38mGSFEGdUmufsAPT4DhQ",
    "pending": 0
}
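
The two steps above could look roughly like this. It is a sketch, not couchbackup's code: the function names are made up for illustration, and it passes the stored value through the `since` query parameter of `/_changes` rather than through a last-event-id flag.

```python
import json
from urllib.parse import urlencode


def extract_last_seq(changes_body):
    """Return the last_seq field from a /_changes response body (JSON text)."""
    return json.loads(changes_body)["last_seq"]


def changes_url(db_url, since=None):
    """Build a /_changes URL, resuming from `since` when one is given."""
    params = {"include_docs": "true"}
    if since is not None:
        params["since"] = since
    return db_url.rstrip("/") + "/_changes?" + urlencode(params)
```

For example, `changes_url(db_url, extract_last_seq(previous_body))` would request only the changes made after the previous run finished.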

@wmbutler
Contributor

wmbutler commented Dec 2, 2018

> We're streaming backups up to S3. Would it be possible to implement a solution that doesn't require download of the entire backup file from S3, to start a new (incremental) backup? Maybe there could be an "index" file that has information about all incremental backups and what backup is the last full backup (base for all increments). Also, maybe the backup system will max store 14 incremental backups and force a full backup after that?

I think an incremental would be assumed if the user didn't pass in the last-event-id flag discussed above.

@wmbutler
Contributor

wmbutler commented Dec 2, 2018

Looks like you are already capturing the lastSeq value upon successful backup.

https://github.com/cloudant/couchbackup/blob/e5517bad8559e44e72d5a0a43b1ad9df064fcf77/includes/spoolchanges.js#L90

Seems to me each database could have a manifest for managing incremental backups. Each line of the file could just be the latest successful lastSeq.

-last-event-id.txt

1-g1AAAAKJeJyl0UEOgjAQBdBRTNRT6M4doUECrOQmOu3UIKklMcWt3kRvojfRm2CbrtgJbKZJk_9mplUAsCgDgg1xUV9kQZyFSFepTSlRmTKKWChU3RBqE2pplA1MEfiqbduqDBDO9mIuSGQx2xIsG03yeNKSeqN8bSvfdV1MMM_ZOLdw7t65CP3TB5e-daZKE57aZUdNpWe2wt0eln44e-LtOM0I8X9tYOen7_wa8iZeeHvhM1z4esH9S_UDtxTSKQ
3-g1AAAALbeJy10EEOgjAQBdBRTNRT6M4doQFCWclNdNqpQYIlMcWt3kRvojfRm2ArK3YKcTNNJun7P1MCwCz3CFYkZHVUGQnmI52UNrnC0uRBwHxZVjWhNr5WprQfxghi0TRNkXsIB7uYSpI8ZBHBvNakdnut6GdULO0U666LMaYpG-Zmzt10XB6GFAsa5m6de-64SSwSe4ZBrp7YCRf7WPrq7FFrhwknxO-1nsm3NvleIPQUHq3wdN3Hn-4UIeco_9791SY3tnvxBvS-6xc
5-g1AAAAMTeJy10EEOgjAQQNERTNRT6M4doSlIWclNdNqpQYIlMeBWb6I30ZvoTbDIwrBTiZtpMknfTyYHgHHqEsxJqmKvE5LMQzpoU6Ya8zL1feapvKgITekZXeb2g4Mgp3VdZ6mLsLOLkSIlOAsIJpUhvdkaTV-jcmanXHZdDDGOWT83adxVxxWcUyipn7tu3GPHjUIZ2TP0cs3QTjjZx9Lnxh60No8EIX6u_Vi-tOVrU3Ze5WAhteLs7-VbW76_yxSgEKj-Xn605TpDyJ6HkfzV

With this in place, the script could look for -last-event-id.txt in the same location where the target file is to be written, take the last line of that file, and use it as the value for last-event-id.
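
A minimal sketch of that manifest handling, assuming one seq value per line with the newest at the end. The function names are hypothetical and the manifest path is supplied by the caller.

```python
import os


def read_last_seq(manifest_path):
    """Return the last non-empty line of the manifest, or None if absent."""
    if not os.path.exists(manifest_path):
        return None
    with open(manifest_path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[-1] if lines else None


def append_seq(manifest_path, seq):
    """Record the seq of a successful backup as a new manifest line."""
    with open(manifest_path, "a", encoding="utf-8") as f:
        f.write(seq + "\n")
```

A return value of `None` would mean there is no earlier run to continue from, so the caller would fall back to a full backup.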

@wmbutler
Contributor

wmbutler commented Dec 3, 2018

I spent the weekend working on this. Instructions are in the Readme. Anyone interested in further collaboration would be welcome. I've tested it and it works. It keeps mostly with the spirit of existing functionality but the code might be a little sloppy in places. Hoping the Cloudant team will get a developer to review and fine tune things.

As it stands, it creates a new log with _0, _1, _2 appended for each occurrence where there is a revision. This means that end users can set the recurrence interval in crontab to whatever they like: 1 hr, 6 hrs, 1 day etc.
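
The numbering scheme could be sketched like this. It is an illustration only, not the code in the fork: the function name is hypothetical and it assumes the suffix is appended directly to the file name.

```python
import os
import re


def next_backup_name(directory, base_name):
    """Return base_name with the next free _N suffix in `directory`."""
    pattern = re.compile(re.escape(base_name) + r"_(\d+)$")
    used = [
        int(m.group(1))
        for name in os.listdir(directory)
        if (m := pattern.match(name))
    ]
    # Start at _0 when no earlier file exists, otherwise continue the sequence.
    return f"{base_name}_{max(used) + 1 if used else 0}"
```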

It's not published as an npm package, but I included instructions for forking my repo and installing it in the readme.

https://github.com/wmbutler/couchbackup

@emlaver
Contributor

emlaver commented Dec 3, 2018

@wmbutler Please open a PR and follow our contributing guidelines (e.g. added tests for code changes) to have our team review your changes.

@ricellis
Member

ricellis commented Dec 3, 2018

One of the reasons this feature has been outstanding for a long while is that the simple solutions do not offer guarantees of completeness. That is:
i) guaranteeing that a set of incremental backup files is restorable
ii) guaranteeing that a restore from a set of incremental files results in a complete database

Meeting these criteria would likely mean implementing a significant part of the replication protocol. I don't think we'd be able to accept an incremental backup solution that didn't offer this level of robustness.

@wmbutler
Contributor

wmbutler commented Dec 3, 2018

@ricellis Would love to hear more detail. In reviewing the backup file, it appears to be an array of documents. All I'm doing is creating a series of text files each with an array of docs. It's basically a changelog driven means of creating multiple backup files. I'm not aware of any additional complexity regarding your statement:

i) guaranteeing that a set of incremental backup files is restorable

file_1

[
{},
{},
{}
]

file_2

[
{},
{}
]

I don't see how this is much different from
file

[
{},
{},
{},
{},
{}
]
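
The comparison above can be sketched as a merge of the per-file arrays. This assumes each file is a JSON array of documents that carry an `_id`, and that a later file's copy of a document replaces an earlier one; the function name is hypothetical. The sketch does not handle deletions or conflicting revisions, so it does not by itself provide the completeness guarantees discussed earlier in the thread.

```python
import json


def merge_backup_files(file_contents):
    """Merge JSON arrays of documents, oldest file first.

    When the same _id appears more than once, the later entry replaces
    the earlier one. Deletions and conflicting revisions are not handled.
    """
    latest = {}
    for text in file_contents:
        for doc in json.loads(text):
            latest[doc["_id"]] = doc
    return list(latest.values())
```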

It just seems to me that, for a large company (IBM), it might make sense to dedicate a couple of hundred man-hours to this pursuit. Failure to do so will mean losing customers to solutions that offer modern backup practices.

@baversjo

baversjo commented Dec 3, 2018

Backing up our databases takes hours and consumes a lot of resources on our cluster. I agree with @wmbutler: for my company, an incremental backup solution is probably the most important missing feature in Cloudant, especially as we've been told by our AEs to migrate from the managed, built-in backup tool in the Cloudant dedicated cluster to this open source solution. So it would indeed be nice for Cloudant to dedicate some resources to this. You'd think it can't be that complex, as the Couch database is essentially an "append only" log file.

@ghost

ghost commented Apr 24, 2019

Jumping in to say that would be nice too; maybe we can start with a basic implementation as @wmbutler proposed, activated with a flag and a warning in the README regarding that flag?

@ricellis ricellis removed this from the Later milestone Feb 14, 2024