Keep and upload database of finished jobs #465
Comments
This is a great idea; I think size is not a problem here. I'm not sure what exactly is stored in the DB, though. Any sensitive information? If not, we should definitely preserve the DB (also on finished jobs).
Nothing sensitive at all. It contains only information on the crawl itself that could in theory be regenerated using the source code and the WARCs (but you might turn suicidal trying to do that): URLs, their relations (parent, root), recursion info (level, inline level), crawl info (status, try count, priority [currently unused]), and some info on the content ('link type', status code). POST data and local filenames would also end up there but are not used by AB. Sometime in the future, cookies will also be there, but again, nothing that couldn't be reconstructed from the WARCs anyway (and the IRC commands, if we add manual cookie control).
It would be pretty difficult (and would require processing a ton of data) to reconstruct this. Size is not a problem (relative to total WARC size). Since there's nothing sensitive in the DB, let's do it. We could gzip it and upload it together with the JSON and WARCs.
Yep. It's possible in theory but completely infeasible in practice. I'll play around with gzip vs zstd a bit. It'll be a .db.gz or .db.zst file with the same filename structure as everything else.
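(For illustration: a sketch of how such a gzip-vs-zstd comparison might be scripted. This is not ArchiveBot code; it assumes the third-party zstandard package, and wpull.db is a placeholder path.)

```python
import gzip
import os
import shutil
import time

import zstandard  # third-party: pip install zstandard

DB_PATH = "wpull.db"  # placeholder path to a job database
original = os.path.getsize(DB_PATH)

# Compress at a few levels of each codec and report wall time and ratio.
for level in (1, 5, 9):
    start = time.perf_counter()
    with open(DB_PATH, "rb") as src, \
            gzip.open(DB_PATH + ".gz", "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    size = os.path.getsize(DB_PATH + ".gz")
    print(f"gzip -{level}: {time.perf_counter() - start:6.1f} s, "
          f"{size / original:6.1%} of original")

for level in (1, 10, 19):
    start = time.perf_counter()
    with open(DB_PATH, "rb") as src, open(DB_PATH + ".zst", "wb") as dst:
        zstandard.ZstdCompressor(level=level).copy_stream(src, dst)
    size = os.path.getsize(DB_PATH + ".zst")
    print(f"zstd -{level}: {time.perf_counter() - start:6.1f} s, "
          f"{size / original:6.1%} of original")
```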
I ran a few tests on large-ish databases on a busy pipeline in a terminal. The implications are pretty obvious. I'll probably switch the log compression (on crashed/aborted jobs) to zstd as well. Although zstd actually produces a larger file than even …
I looked into this a bit again. I took the DB from 5nbpflkse0rs1tlgch8n4efud (2.94 GB, 13 million URLs, runtime before crashing about a week) and the partial log file from 3pwf0useacbmua9uwp4idpale (3.64 GB, 12 million URLs, runtime about a month so far) and compressed them at most levels of zstd and gzip. I ran this on a fairly busy AB pipeline (jap-kakapo), so it should be representative of what the runtime might look like in reality. The jobs are obviously among the larger ones running through AB. My analysis consisted of staring at shitty graphs of user time vs compression ratio in LibreOffice Calc.

Test results: Database (zstd, gzip) and Log (zstd – only ran it up to level 15 because it was getting ridiculous… – and gzip), plus the raw terminal output in case I screwed up the tabulation somewhere.
My conclusion: the sweet spot with zstd seems to be level 10 for databases and 8 for logs. Up to that point, there is an acceptable increase in runtime with significant space savings; beyond it, the large increase in compression time outweighs the relatively small size reduction. Unless someone yells at me, that's what I'll implement soon™. Fun side note: even …
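(A minimal sketch of what that final step could look like, with those levels hard-coded; a hypothetical helper, not the actual ArchiveBot implementation.)

```python
import zstandard  # third-party: pip install zstandard

# Sweet-spot levels from the benchmark above.
LEVELS = {"db": 10, "log": 8}

def compress_to_zst(path: str, kind: str) -> str:
    """Compress path to path + '.zst' at the level chosen for its kind."""
    cctx = zstandard.ZstdCompressor(level=LEVELS[kind])
    with open(path, "rb") as src, open(path + ".zst", "wb") as dst:
        cctx.copy_stream(src, dst)
    return path + ".zst"

# e.g. compress_to_zst("wpull.db", "db") -> "wpull.db.zst"
```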
A complication is SQLite's write-ahead log (WAL), which records changes to the DB that aren't merged into the main database file yet. When the DB gets closed cleanly, the WAL gets merged and only wpull.db remains (but is this guaranteed behaviour?). This is what happens on aborting, for example. But when wpull crashes, …
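(For reference, SQLite recovers a leftover WAL the next time the database is opened, so forcing a checkpoint before compression would only take a few lines. A sketch, assuming nothing else has the DB open:)

```python
import sqlite3

# Open the (possibly crash-orphaned) database; SQLite recovers the WAL on
# open, and the TRUNCATE checkpoint folds it into wpull.db and empties the
# -wal file, so only the main file needs to be kept and compressed.
conn = sqlite3.connect("wpull.db")
try:
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE);")
finally:
    conn.close()
```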
Might be worth dumping the SQLite databases to SQL and compressing that instead of compressing the raw binary database files.
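(A sketch of that variant using Python's built-in iterdump(); the zstd level and filenames are assumptions.)

```python
import sqlite3

import zstandard  # third-party: pip install zstandard

# Dump the database as SQL text and compress the text instead of the raw
# binary file; textual dumps usually compress noticeably better.
conn = sqlite3.connect("wpull.db")
cctx = zstandard.ZstdCompressor(level=10)
with open("wpull.sql.zst", "wb") as raw:
    with cctx.stream_writer(raw) as writer:
        for line in conn.iterdump():
            writer.write((line + "\n").encode("utf-8"))
conn.close()
```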
When a job crashes or is aborted, its DB – and therefore its queue – is simply deleted, while the other things (the WARC written so far, the JSON, and, since #396, the log file) are kept. I think we should retain the database as well. It may even be worth considering keeping it for all jobs.
Besides preserving the remaining queue of crashed and aborted jobs, it would also allow easier access to the crawl information. For example, extracting all URLs that failed three times or that returned a particular status code is much easier from the DB than by painfully processing the log file (see the sketch below). It could also enable 'update crawls' (outside of ArchiveBot) at a later time, reusing a job's DB to skip (some) URLs that were already retrieved, without having to reconstruct such a DB from the log file.
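(To illustrate – hypothetical queries; the actual table and column names would have to be checked against wpull's schema.)

```python
import sqlite3

# Hypothetical examples; 'urls', 'url', 'try_count', and 'status_code' are
# guesses at wpull's schema, not verified against a real job DB.
conn = sqlite3.connect("wpull.db")
failed = conn.execute(
    "SELECT url FROM urls WHERE try_count >= ?", (3,)).fetchall()
errors = conn.execute(
    "SELECT url FROM urls WHERE status_code = ?", (404,)).fetchall()
conn.close()
```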
The obvious downside is the data/storage size. However, in the grand scheme of things, this doesn't make a big difference. As a point of reference, job 6recrrotn072khaaje73k60kh – one of the largest jobs currently running, at 65 million URLs – has a DB file of 15.8 GiB. This is pretty much insignificant compared to the job's data size of 4.8 TiB, especially as compression shrinks it further by a factor of 4-5 (zstd without tuning: 3.57 GiB, or 22.6 % of the original size). So this is an increase in data per job on the order of 1 ‰ (3.57 GiB of 4.8 TiB is roughly 0.07 %), except in the rare extreme cases where the vast majority of URLs is ignored.