Some form of life-cycle handling should be added so that 404s and similar responses can be queried. De-duplication is trickier, as it seems wasteful to index the same content twice. One idea is to make documents updatable by supporting some form of pseudo-delete for the frozen shards. Perhaps a shared inverse filter holding the IDs of deleted documents? Or reserve space for the delete file (and its temporary twin, used when deletes are added) and establish a contract that only deletes are allowed for frozen shards?
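To make the shared-filter idea concrete, here is a minimal sketch in Python. It is not how the indexer or the search backend actually works: `FrozenShard`, `DeleteFilter` and `search` are hypothetical names, and the shards are modelled as plain in-memory dictionaries. The point is only that the shards stay untouched while a single shared set of deleted IDs hides documents at query time.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenShard:
    """Hypothetical read-only shard: its documents are never rewritten."""
    docs: dict  # document ID -> indexed content


@dataclass
class DeleteFilter:
    """Hypothetical filter shared by all frozen shards (IDs to hide)."""
    deleted_ids: set = field(default_factory=set)

    def mark_deleted(self, doc_id: str) -> None:
        # The only "write" allowed against frozen shards: record a delete.
        self.deleted_ids.add(doc_id)

    def is_visible(self, doc_id: str) -> bool:
        return doc_id not in self.deleted_ids


def search(shards, delete_filter, term):
    """Return the sorted IDs of visible documents whose content contains term."""
    hits = []
    for shard in shards:
        for doc_id, content in shard.docs.items():
            if term in content and delete_filter.is_visible(doc_id):
                hits.append(doc_id)
    return sorted(hits)


shards = [
    FrozenShard({"a": "old page", "b": "other page"}),
    FrozenShard({"c": "new page"}),
]
deletes = DeleteFilter()
deletes.mark_deleted("a")  # pseudo-delete: shard contents are unchanged
print(search(shards, deletes, "page"))
```

An "update" under this scheme would be a pseudo-delete of the old ID followed by indexing the new version into a shard that is still open for writes.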
Latest plan for revisits:
Index revisits, using the hash from the WARC header. It is identical to the one calculated in the warc-indexer code.
New fields:
warc-id (for all records)
warc-refers-to (only for revisits)
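The plan above can be sketched roughly as follows, in Python for illustration only. `build_index_fields` is a hypothetical helper, not part of warc-indexer. It assumes that the "hash from warc-header" is the standard `WARC-Payload-Digest` header, that `warc-id` maps to `WARC-Record-ID`, and that the digest is stored under a field named `hash`; the issue does not confirm any of these.

```python
def build_index_fields(headers: dict) -> dict:
    """Map WARC record headers to the index fields from the plan.

    Hypothetical helper: the header-to-field mapping is an assumption.
    """
    # warc-id is written for all records.
    fields = {"warc-id": headers["WARC-Record-ID"]}

    # Assumption: reuse the digest from the header instead of recomputing it.
    digest = headers.get("WARC-Payload-Digest")
    if digest is not None:
        fields["hash"] = digest

    # warc-refers-to is written only for revisit records.
    if headers.get("WARC-Type") == "revisit" and "WARC-Refers-To" in headers:
        fields["warc-refers-to"] = headers["WARC-Refers-To"]

    return fields


# Illustrative header values, not taken from a real crawl.
response = {
    "WARC-Type": "response",
    "WARC-Record-ID": "<urn:uuid:response-1>",
    "WARC-Payload-Digest": "sha1:EXAMPLEDIGEST",
}
revisit = {
    "WARC-Type": "revisit",
    "WARC-Record-ID": "<urn:uuid:revisit-1>",
    "WARC-Refers-To": "<urn:uuid:response-1>",
    "WARC-Payload-Digest": "sha1:EXAMPLEDIGEST",
}
print(build_index_fields(response))
print(build_index_fields(revisit))
```

Because the revisit carries the same digest as the record it refers to, the two index documents share a `hash` value and can be linked through `warc-refers-to`.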