Keep and upload database of finished jobs #465
Comments
This is a great idea; I think size is not a problem here. I'm not sure what exactly is stored in the DB, though. Any sensitive information? If not, we should definitely preserve the DB (also on finished jobs).
Nothing sensitive at all. It contains only information on the crawl itself that could in theory be regenerated using the source code and the WARCs (but you might turn suicidal trying to do that): URLs, their relations (parent, root), recursion info (level, inline level), crawl info (status, try count, priority [currently unused]), and some info on the content ('link type', status code). POST data and local filenames would also end up there but are not used by AB. Sometime in the future, cookies will also be there, but again, nothing that couldn't be reconstructed from the WARCs anyway (and the IRC commands, if we add manual cookie control).
It would be pretty difficult (and would require processing a ton of data) to reconstruct this. Size is not a problem (relative to total WARC size). Since there's nothing sensitive in the DB, let's do it. We could gzip it and upload it together with the JSON and WARCs.
Yep. It's possible in theory but completely infeasible in practice. I'll play around with gzip vs zstd a bit. It'll be a .db.gz or .db.zst file with the same filename structure as everything else.
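(For illustration: a sketch of how such a gzip-vs-zstd comparison might be scripted. This is not ArchiveBot code; it assumes the third-party zstandard package, and wpull.db is a placeholder path.)

```python
import gzip
import os
import shutil
import time

import zstandard  # third-party: pip install zstandard

DB_PATH = "wpull.db"  # placeholder path to a job database
original = os.path.getsize(DB_PATH)

# Compress at a few levels of each codec and report wall time and ratio.
for level in (1, 5, 9):
    start = time.perf_counter()
    with open(DB_PATH, "rb") as src, \
            gzip.open(DB_PATH + ".gz", "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    size = os.path.getsize(DB_PATH + ".gz")
    print(f"gzip -{level}: {time.perf_counter() - start:6.1f} s, "
          f"{size / original:6.1%} of original")

for level in (1, 10, 19):
    start = time.perf_counter()
    with open(DB_PATH, "rb") as src, open(DB_PATH + ".zst", "wb") as dst:
        zstandard.ZstdCompressor(level=level).copy_stream(src, dst)
    size = os.path.getsize(DB_PATH + ".zst")
    print(f"zstd -{level}: {time.perf_counter() - start:6.1f} s, "
          f"{size / original:6.1%} of original")
```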
I ran a few tests on large-ish databases on a busy pipeline in a terminal. The implications are pretty obvious. I'll probably switch the log compression (on crashed/aborted jobs) to zstd as well. Although zstd actually produces a larger file than even …
I looked into this a bit again. I took the DB from 5nbpflkse0rs1tlgch8n4efud (2.94 GB, 13 million URLs, runtime before crashing about a week) and the partial log file from 3pwf0useacbmua9uwp4idpale (3.64 GB, 12 million URLs, runtime about a month so far) and compressed them at most levels of zstd and gzip. I ran this on a fairly busy AB pipeline (jap-kakapo), so it should be representative of what the runtime might look like in reality. The jobs are obviously among the larger ones running through AB. My analysis consisted of staring at shitty graphs of user time vs compression ratio in LibreOffice Calc.

Test results: Database (zstd, gzip) and Log (zstd – only ran it up to level 15 because it was getting ridiculous… – and gzip), plus the raw terminal output in case I screwed up the tabulation somewhere.
My conclusion: the sweet spot with zstd seems to be level 10 for databases and 8 for logs. Up to that point, there is an acceptable increase in runtime with significant space savings; beyond it, the large increase in compression time outweighs the relatively small size reduction. Unless someone yells at me, that's what I'll implement soon™. Fun side note: even …
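(A minimal sketch of what that final step could look like, with those levels hard-coded; a hypothetical helper, not the actual ArchiveBot implementation.)

```python
import zstandard  # third-party: pip install zstandard

# Sweet-spot levels from the benchmark above.
LEVELS = {"db": 10, "log": 8}

def compress_to_zst(path: str, kind: str) -> str:
    """Compress path to path + '.zst' at the level chosen for its kind."""
    cctx = zstandard.ZstdCompressor(level=LEVELS[kind])
    with open(path, "rb") as src, open(path + ".zst", "wb") as dst:
        cctx.copy_stream(src, dst)
    return path + ".zst"

# e.g. compress_to_zst("wpull.db", "db") -> "wpull.db.zst"
```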
A complication is SQLite's write-ahead log (WAL), which records changes to the DB that aren't merged into the main database file yet. When the DB gets closed cleanly, the WAL gets merged and only wpull.db remains (but is this guaranteed behaviour?). This is what happens on aborting, for example. But when wpull crashes, …
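(For reference, SQLite recovers a leftover WAL the next time the database is opened, so forcing a checkpoint before compression would only take a few lines. A sketch, assuming nothing else has the DB open:)

```python
import sqlite3

# Open the (possibly crash-orphaned) database; SQLite recovers the WAL on
# open, and the TRUNCATE checkpoint folds it into wpull.db and empties the
# -wal file, so only the main file needs to be kept and compressed.
conn = sqlite3.connect("wpull.db")
try:
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
finally:
    conn.close()
```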
Might be worth dumping the SQLite databases to SQL and compressing that instead of compressing the raw binary database files.
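(A sketch of that variant using Python's built-in iterdump(); the zstd level and filenames are assumptions.)

```python
import sqlite3

import zstandard  # third-party: pip install zstandard

# Dump the database as SQL text and compress the text instead of the raw
# binary file; textual dumps usually compress noticeably better.
conn = sqlite3.connect("wpull.db")
cctx = zstandard.ZstdCompressor(level=10)
with open("wpull.sql.zst", "wb") as raw:
    with cctx.stream_writer(raw) as writer:
        for line in conn.iterdump():
            writer.write((line + "\n").encode("utf-8"))
conn.close()
```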
When a job crashes or is aborted, its DB – and therefore its queue – is simply deleted, while the other things (the WARC written so far, the JSON, and, since #396, the log file) are kept. I think we should retain the database as well. It may even be worth considering keeping it for all jobs.
Besides preserving the remaining queue of crashed and aborted jobs, it would also allow easier access to the crawl information. For example, extracting all URLs that failed three times or that returned a particular status code is much easier from the DB than by painfully processing the log file (see the sketch below). It could also enable 'update crawls' (outside of ArchiveBot) at a later time, reusing a job's DB to skip (some) URLs that were already retrieved, without having to reconstruct such a DB from the log file.
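(To illustrate – hypothetical queries; the actual table and column names would have to be checked against wpull's schema.)

```python
import sqlite3

# Hypothetical examples; 'urls', 'url', 'try_count', and 'status_code' are
# guesses at wpull's schema, not verified against a real job DB.
conn = sqlite3.connect("wpull.db")
failed = conn.execute(
    "SELECT url FROM urls WHERE try_count >= ?", (3,)).fetchall()
errors = conn.execute(
    "SELECT url FROM urls WHERE status_code = ?", (404,)).fetchall()
conn.close()
```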
The obvious downside is the data/storage size. However, in the grand scheme of things, this doesn't make a big difference. As a point of reference, job 6recrrotn072khaaje73k60kh – one of the largest jobs currently running, at 65 million URLs – has a DB file of 15.8 GiB. This is pretty much insignificant compared to the job's data size of 4.8 TiB, especially as compression shrinks it further by a factor of 4-5 (zstd without tuning: 3.57 GiB, or 22.6 % of the original size). So this is an increase in data per job on the order of 1 ‰ (3.57 GiB of 4.8 TiB is roughly 0.07 %), except in the rare extreme cases where the vast majority of URLs is ignored.