Resource life cycle tracking #5

Open
tokee opened this issue Dec 5, 2016 · 1 comment

tokee commented Dec 5, 2016

Some sort of life-cycle handling should be added, so that 404s or similar can be queried. De-duplication is trickier, as it seems a waste to index the same content twice. One idea is to make documents updatable by supporting some sort of pseudo-delete functionality for the frozen shards. Maybe a shared inverse filter with the IDs of deleted documents? Or maybe reserve space for the delete file (and its temporary twin when deletes are added) and make a contract that says that only deletes are allowed for frozen shards?
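To make the shared inverse filter idea concrete, here is a minimal sketch in plain Python. It is not tied to any search engine: `FrozenShard`, `Index`, `delete` and `search` are hypothetical names made up for illustration, and the shards are modelled as immutable tuples rather than real index files. The only mutation it permits is recording a deleted ID, which mirrors the "only deletes are allowed" contract.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenShard:
    """An immutable shard: its documents never change after creation."""
    docs: tuple  # tuple of (doc_id, text) pairs


@dataclass
class Index:
    """Frozen shards plus one shared set of pseudo-deleted document IDs."""
    shards: list = field(default_factory=list)
    deleted_ids: set = field(default_factory=set)

    def delete(self, doc_id):
        # The only mutation allowed against frozen shards: record a delete.
        self.deleted_ids.add(doc_id)

    def search(self, term):
        # Apply the shared inverse filter: skip any ID marked as deleted.
        return [
            doc_id
            for shard in self.shards
            for doc_id, text in shard.docs
            if term in text and doc_id not in self.deleted_ids
        ]
```

An update would then be a `delete` of the old ID followed by adding the new version to a shard that is still writable, so the frozen shards themselves are never rewritten.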

@thomasegense

Latest plan for revisits:
Index revisit records and use the hash from the WARC header. It is identical to the one calculated in the warc-indexer code.
New fields:
warc-id (for all records)
warc-refers-to (only for revisits)
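A rough sketch of how the two new fields could be filled from a record's headers. This is not the warc-indexer code: `revisit_fields` is a hypothetical helper, the headers are assumed to arrive as a plain dict, and it is an assumption that the "hash from warc-header" means the `WARC-Payload-Digest` header. The `hash` field name is likewise only illustrative.

```python
def revisit_fields(headers):
    """Map a dict of WARC record headers to the proposed index fields."""
    # warc-id is set for all records.
    doc = {"warc-id": headers["WARC-Record-ID"]}

    # Assumption: the header hash is WARC-Payload-Digest; "hash" is a
    # hypothetical field name, not one taken from the indexer.
    if "WARC-Payload-Digest" in headers:
        doc["hash"] = headers["WARC-Payload-Digest"]

    # warc-refers-to is set only for revisit records.
    if headers.get("WARC-Type") == "revisit" and "WARC-Refers-To" in headers:
        doc["warc-refers-to"] = headers["WARC-Refers-To"]

    return doc
```

With this shape, a query on `warc-refers-to` finds all revisits of a given original record, while `warc-id` identifies any record directly.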
