ezdedup: deduplication workload, made easy.

Overview
========
The ezdedup workload is taken from the original Princeton Application Repository
for Shared-Memory Computers (PARSEC 3.0) suite: http://parsec.cs.princeton.edu

It has been simplified for kernel development and performance purposes, and its
usage does not rely on anything from PARSEC, which makes running the program
significantly more straightforward.
As described in Christian Bienia's 2011 Ph.D. thesis, "Benchmarking Modern
Multiprocessors", deduplication compresses a data stream with a combination of
global compression and local compression in order to achieve high compression
ratios. The dedup workload uses a pipeline model for function-level
parallelism, with five stages:

(1) Reads the input file from disk and determines the locations where the
    data is to be split up by jumping a fixed length in the buffer for each
    chunk. The resulting data blocks are enqueued for the next stages. These
    are coarse-grained chunks.
(2) Identifies brief sequences in the data stream that are identical with
    sufficiently high probability (anchoring) by using a rolling hash to
    segment the data based on its contents. The data is then broken up into
    two separate blocks at the determined location (see the rolling-hash
    sketch below).
(3) Computes a SHA1 checksum for each chunk and checks for duplicate blocks
    with the use of a global database (see the fingerprint-lookup sketch
    below).
(4) Compresses each data segment with the Ziv-Lempel algorithm and builds a
    global hash table that maps hash values to data. Every data block is
    compressed only once because the previous stage does not send duplicates
    to the compression stage (see the compression sketch below).
(5) Assembles the deduplicated output stream consisting of hash values and
    compressed data segments.

Note that stages (1) and (5) are serial. Please refer to the thesis referenced
above for complete details.
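
For illustration, the sketch below shows the kind of rolling-hash anchoring
used in stage (2): a fixed-size window is hashed as it slides over the buffer,
and the block is split wherever the low bits of the hash match a fixed pattern.
The window size, base, and mask are made-up values for the example and do not
reproduce dedup's actual fingerprinting parameters.

    /* Content-defined chunking with a rolling hash (stage 2 sketch).
     * WINDOW, BASE and ANCHOR_MASK are illustrative values only. */
    #include <stddef.h>
    #include <stdint.h>

    #define WINDOW       48         /* bytes covered by the rolling hash */
    #define BASE         257u       /* polynomial base (arithmetic mod 2^32) */
    #define ANCHOR_MASK  0x0FFFu    /* ~4 KB expected chunk size */

    /* Return the offset at which buf should be split, or len if no anchor
     * is found and the block is left whole. */
    static size_t find_anchor(const unsigned char *buf, size_t len)
    {
        if (len < WINDOW)
            return len;

        uint32_t pow = 1, hash = 0;
        for (size_t i = 0; i < WINDOW - 1; i++)
            pow *= BASE;                       /* BASE^(WINDOW-1) mod 2^32 */
        for (size_t i = 0; i < WINDOW; i++)
            hash = hash * BASE + buf[i];       /* hash of the first window */

        for (size_t i = WINDOW; ; i++) {
            if ((hash & ANCHOR_MASK) == 0)
                return i;                      /* anchor hit: split here */
            if (i == len)
                return len;                    /* no anchor in this block */
            /* slide the window: drop buf[i - WINDOW], add buf[i] */
            hash = (hash - (uint32_t)buf[i - WINDOW] * pow) * BASE + buf[i];
        }
    }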
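
Stage (3) can be pictured as follows: fingerprint the chunk with SHA1 and look
it up in a shared table. The sketch uses OpenSSL's SHA1() for the digest; the
fixed-size chained table and the chunk_t type are simplifications invented for
this example, and the real stage additionally needs to lock the table because
several threads query it concurrently.

    /* Fingerprint lookup in a global database (stage 3 sketch).
     * Link with -lcrypto for OpenSSL's SHA1(). */
    #include <openssl/sha.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 65536

    typedef struct chunk {
        unsigned char  digest[SHA_DIGEST_LENGTH];
        struct chunk  *next;
        /* compressed payload, reference count, etc. would live here */
    } chunk_t;

    static chunk_t *table[NBUCKETS];    /* global fingerprint database */

    /* Return the already-seen chunk with the same fingerprint, or NULL
     * after inserting a new entry (NULL means "first occurrence"). */
    static chunk_t *dedup_lookup(const unsigned char *data, size_t len)
    {
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(data, len, digest);

        unsigned idx = ((unsigned)digest[0] << 8 | digest[1]) % NBUCKETS;
        for (chunk_t *c = table[idx]; c != NULL; c = c->next)
            if (memcmp(c->digest, digest, SHA_DIGEST_LENGTH) == 0)
                return c;                       /* duplicate chunk */

        chunk_t *c = calloc(1, sizeof(*c));
        if (c == NULL)
            return NULL;                        /* out of memory: treat as new */
        memcpy(c->digest, digest, SHA_DIGEST_LENGTH);
        c->next = table[idx];
        table[idx] = c;
        return NULL;                            /* new, unique chunk */
    }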
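
Stage (4) only sees unique chunks. Assuming the gzip option maps onto zlib,
compressing a single chunk looks roughly like the sketch below (buffer sizing
and error handling are simplified, and compress_chunk is a made-up helper
name):

    /* Compress one unique chunk (stage 4 sketch).  Link with -lz. */
    #include <stdlib.h>
    #include <zlib.h>

    /* Returns a malloc'd buffer holding the compressed chunk and stores its
     * size in *outlen; returns NULL on failure. */
    static unsigned char *compress_chunk(const unsigned char *in, size_t inlen,
                                         size_t *outlen)
    {
        uLongf bound = compressBound(inlen);
        unsigned char *out = malloc(bound);
        if (out == NULL)
            return NULL;

        if (compress2(out, &bound, in, inlen, Z_DEFAULT_COMPRESSION) != Z_OK) {
            free(out);
            return NULL;
        }
        *outlen = bound;        /* compress2 updates bound to the actual size */
        return out;
    }
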
Usage
=====
dedup [-cupvh] [-w gzip/bzip2/none] [-i file] [-o file] [-t number_of_threads]

  -c                     compress
  -u                     uncompress
  -p                     preloading (for benchmarking purposes)
  -w                     compression type: gzip/bzip2/none
  -i file                the input file
  -o file                the output file
  -t number_of_threads   number of threads per stage
  -v                     verbose output
  -h                     help

Examples:

o Compress a qemu image with qcow2 compression; each parallel
  stage will use two threads:

$> dedup -c -v -p -t 2 -i linux.qcow2 -o outfile
PARSEC Benchmark Suite
Total input size: 2433.81 MB
Total output size: 1108.74 MB
Effective compression factor: 2.20x
Mean data chunk size: 0.22 KB (stddev: 4022.08 KB)
Amount of duplicate chunks: 95.99%
Data size after deduplication: 2028.62 MB (compression factor: 1.20x)
Data size after compression: 762.95 MB (compression factor: 2.66x)
Output overhead: 31.19%

o Uncompress the output file and restore the original data:

$> dedup -u -v -p -t 2 -i outfile -o originalfile