From d7d0574378196d349ee2f9c4ea4ff50318a15534 Mon Sep 17 00:00:00 2001
From: Bar David
Date: Wed, 27 Apr 2022 09:44:12 +0300
Subject: [PATCH] adding some roadmap to dedupe

Signed-off-by: Bar David
---
 DEDUPE-TODO | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/DEDUPE-TODO b/DEDUPE-TODO
index 4b0bfd1d62..29b940b5b9 100644
--- a/DEDUPE-TODO
+++ b/DEDUPE-TODO
@@ -14,3 +14,22 @@
 The storage subsystem usually identifies the similar buffers using
 locality-sensitive hashing or other methods.
 
+- Varying compression ratios on a single job.
+  We could accept a list of 2-tuples in the form
+  [(probability, compression_ratio), ...] such that compression ratios
+  are generated according to their configured probability.
+
+- Rework verification with dedupe and compression.
+
+- Reduce the memory required to manage the dedupe_working_set.
+  Currently we maintain a seed (12-16 bytes) per page in the
+  working set, so large files waste a lot of memory.
+  Either leverage disk space for that, or recalculate the seeds during
+  the buffer generation phase.
+
+- Dedupe hot spots.
+  Maintain different probabilities within the dedupe_working_set so that, when
+  generating dedupe buffers, we choose the seeds non-uniformly in order to
+  better simulate real-world use cases.
+
+- Add examples of fio jobs utilizing deduplication and/or compression.
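
Note on the "varying compression ratios" item: below is a minimal sketch of the proposed weighted selection, assuming percentage-based probabilities. The names comp_ratio_entry and pick_compress_pct are hypothetical, not existing fio code.

#include <stddef.h>

/* One hypothetical (probability, compression_ratio) entry.
 * Probabilities are in percent and are expected to sum to 100. */
struct comp_ratio_entry {
	unsigned int prob_pct;       /* selection probability, in percent */
	unsigned int compress_pct;   /* compression ratio to apply to the buffer */
};

/*
 * Pick a compression ratio for the next buffer by walking the cumulative
 * probability distribution. 'r' is a uniform random value in [0, 100).
 */
static unsigned int pick_compress_pct(const struct comp_ratio_entry *entries,
				       size_t nr, unsigned int r)
{
	unsigned int cumulative = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		cumulative += entries[i].prob_pct;
		if (r < cumulative)
			return entries[i].compress_pct;
	}

	/* Fall back to the last entry if the probabilities don't sum to 100. */
	return nr ? entries[nr - 1].compress_pct : 0;
}

With an input such as [(20, 80), (80, 20)], roughly 20% of buffers would be generated at an 80% compression ratio and the remaining 80% at a 20% ratio.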
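
Note on the "reduce memory" item: recalculating seeds instead of storing them only requires that a page's seed be a deterministic function of a job-level base seed and the page index. A minimal sketch, using a splitmix64-style mix for illustration; this is not fio's actual seed derivation.

#include <stdint.h>

/*
 * Illustrative only: derive the seed for page 'index' of the dedupe working
 * set from a single job-level base seed instead of storing one seed per page.
 */
static uint64_t working_set_seed(uint64_t base_seed, uint64_t index)
{
	uint64_t z = base_seed + index * 0x9e3779b97f4a7c15ULL;

	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
	return z ^ (z >> 31);
}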
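
Note on the "dedupe hot spots" item: a minimal sketch of a non-uniform index pick with a tunable skew parameter. pick_hot_seed_index is hypothetical; fio's existing zipf/pareto generators for random offsets could equally well drive the seed choice.

#include <math.h>
#include <stddef.h>

/*
 * Illustrative only: map a uniform random value u in [0, 1) to an index in
 * [0, nr) with a simple power-law skew, so low indexes ("hot" seeds) are
 * picked much more often than high ones. theta = 0 degenerates to a uniform
 * choice; larger theta concentrates more picks on the first entries.
 */
static size_t pick_hot_seed_index(double u, size_t nr, double theta)
{
	double skewed = pow(u, 1.0 + theta);
	size_t idx = (size_t)(skewed * nr);

	return idx < nr ? idx : nr - 1;
}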