ezdedup: deduplication workload, made easy.

Overview
========

The ezdedup workload is taken from the original Princeton Application Repository
for Shared-Memory Computers (PARSEC 3.0) suite: http://parsec.cs.princeton.edu.
It has been simplified for kernel development/performance purposes and does not
rely on anything else from PARSEC, which makes using the program significantly
more straightforward.

As described in Christian Bienia's 2011 Ph.D. thesis "Benchmarking Modern
Multiprocessors", deduplication compresses a data stream with a combination of
global and local compression in order to achieve high compression ratios. The
dedup workload uses a pipeline model for function-level parallelism, with five
stages:

   (1) Reads the input file from disk and determines the locations where the
       data is to be split up by jumping ahead a fixed length in the buffer for
       each chunk. The resulting coarse-grained blocks are enqueued for the
       next stage.

   (2) Identifies brief sequences in the data stream that are identical with
       sufficiently high probability (anchoring) by using a rolling hash to
       segment the data based on its contents. The block is then split into
       two separate blocks at the determined location (see the first sketch
       after this list).

   (3) Computes a SHA1 checksum for each chunk and checks for duplicate blocks
       against a global database (see the second sketch after this list).

   (4) Compresses each data segment with the Ziv-Lempel algorithm and builds a
       global hash table that maps hash values to data. Every data block is
       compressed only once because the previous stage does not send duplicates
       to the compression stage.

   (5) Assembles the deduplicated output stream consisting of hash values and
       compressed data segments.
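
A minimal sketch of the anchoring idea in stage (2) is shown below. It uses a
simple Karp-Rabin style rolling hash; the window size, base and anchor mask are
illustrative assumptions, not the actual fingerprint parameters used by the
workload.

#include <stddef.h>
#include <stdint.h>

#define WINDOW 48                /* rolling-hash window in bytes (assumed) */
#define BASE   257ULL            /* polynomial base (assumed) */
#define MASK   0x1fffULL         /* low-bit mask, ~8 KB average chunks (assumed) */

/*
 * Return the offset at which a block should be cut (the first anchor),
 * or len if no anchor is found.
 */
static size_t find_anchor(const unsigned char *buf, size_t len)
{
    uint64_t hash = 0, pow = 1;
    size_t i;

    if (len < WINDOW)
        return len;

    /* BASE^(WINDOW-1), used to remove the byte that leaves the window */
    for (i = 0; i < WINDOW - 1; i++)
        pow *= BASE;

    /* hash of the first window */
    for (i = 0; i < WINDOW; i++)
        hash = hash * BASE + buf[i];

    for (i = WINDOW; i < len; i++) {
        /*
         * Cut where the low bits of the hash are zero: anchor positions
         * depend only on the local contents, so identical data is cut
         * identically no matter where it sits in the stream.
         */
        if ((hash & MASK) == 0)
            return i;

        /* slide the window forward by one byte */
        hash = (hash - (uint64_t)buf[i - WINDOW] * pow) * BASE + buf[i];
    }
    return len;
}

Stage (2) would repeatedly apply such a cut to each coarse-grained block from
stage (1), producing the fine-grained, content-defined chunks that the later
stages operate on.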

Note that stages (1) and (5) are serial. Please refer to the thesis mentioned
above for complete details.
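
The duplicate check in stage (3) is conceptually a lookup of the chunk's SHA1
digest in a global hash table. The single-threaded sketch below illustrates
this; it assumes OpenSSL's SHA1() is available (link with -lcrypto), and the
bucket count and table layout are illustrative rather than the workload's
actual database.

#include <stdlib.h>
#include <string.h>
#include <openssl/sha.h>

#define NBUCKETS 4096                   /* number of hash buckets (assumed) */

struct entry {
    unsigned char digest[SHA_DIGEST_LENGTH];
    struct entry *next;
};

static struct entry *db[NBUCKETS];      /* the "global database" */

/* Return 1 if the chunk was seen before, 0 if it is new (and record it). */
static int is_duplicate(const void *chunk, size_t len)
{
    unsigned char digest[SHA_DIGEST_LENGTH];
    unsigned int bucket;
    struct entry *e;

    SHA1(chunk, len, digest);
    memcpy(&bucket, digest, sizeof(bucket));
    bucket %= NBUCKETS;

    for (e = db[bucket]; e; e = e->next)
        if (!memcmp(e->digest, digest, SHA_DIGEST_LENGTH))
            return 1;           /* duplicate: only its hash goes downstream */

    e = malloc(sizeof(*e));     /* error handling omitted for brevity */
    memcpy(e->digest, digest, SHA_DIGEST_LENGTH);
    e->next = db[bucket];
    db[bucket] = e;             /* a threaded version needs per-bucket locking */
    return 0;                   /* new chunk: pass it on to compression */
}

Because duplicates never reach stage (4), each unique block is compressed
exactly once, which is where most of the workload's savings come from.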

Usage
=====

dedup [-cupvh] [-w gzip/bzip2/none] [-i file] [-o file] [-t number_of_threads]
-c                      compress
-u                      uncompress
-p                      preloading (for benchmarking purposes)
-w                      compression type: gzip/bzip2/none
-i file                 the input file
-o file                 the output file
-t                      number of threads per stage
-v                      verbose output
-h                      help

Examples:

o Compress a QEMU qcow2 image, with each parallel stage using two threads:

$> dedup -c -v -p -t 2 -i linux.qcow2 -o outfile
PARSEC Benchmark Suite
Total input size:                     2433.81 MB
Total output size:                    1108.74 MB
Effective compression factor:            2.20x

Mean data chunk size:                    0.22 KB (stddev: 4022.08 KB)
Amount of duplicate chunks:             95.99%
Data size after deduplication:        2028.62 MB (compression factor: 1.20x)
Data size after compression:           762.95 MB (compression factor: 2.66x)
Output overhead:                        31.19%

o Uncompress the output file and restore its original size:

$> dedup -u -v -p -t 2 -i outfile -o originalfile
