A C++ library for summarizing data (data streams, in particular).
StreamingCC implements various streaming algorithms and probabilistic data structures. They can be used to summarize a data stream effectively even when the data is too large to fit into memory.
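To make "summarizing a stream in small memory" concrete, here is a minimal, self-contained Count-Min Sketch. It is only an illustration of the technique and is not StreamingCC's API: the names `TinyCountMin`, `Add`, `Estimate` and `Bucket` are hypothetical, and the row hash is a simple 64-bit mixer chosen for brevity rather than the pairwise-independent hash family that the formal error guarantees assume.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical illustration class; not part of StreamingCC.
// Keeps depth x width counters regardless of how long the stream is.
class TinyCountMin {
 public:
  TinyCountMin(std::size_t depth, std::size_t width)
      : depth_(depth), width_(width), table_(depth * width, 0) {
    assert(depth > 0 && width > 0);
  }

  // Record `count` occurrences of `key`: bump one counter in every row.
  void Add(std::uint64_t key, std::uint64_t count = 1) {
    for (std::size_t row = 0; row < depth_; ++row)
      table_[row * width_ + Bucket(key, row)] += count;
  }

  // Estimated frequency of `key`: the minimum over its counters.
  // Collisions can only inflate a counter, so this never underestimates.
  std::uint64_t Estimate(std::uint64_t key) const {
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (std::size_t row = 0; row < depth_; ++row)
      best = std::min(best, table_[row * width_ + Bucket(key, row)]);
    return best;
  }

 private:
  // Assumed hash: a 64-bit bit mixer with a different offset per row.
  std::size_t Bucket(std::uint64_t key, std::size_t row) const {
    std::uint64_t x = key + 0x9e3779b97f4a7c15ULL * (row + 1);
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return static_cast<std::size_t>(x % width_);
  }

  std::size_t depth_;
  std::size_t width_;
  std::vector<std::uint64_t> table_;
};
```

Calling `Add(key)` once per stream item and `Estimate(key)` at any time gives an upper bound on the true count of `key`; a larger `width` reduces the overestimate and a larger `depth` makes a bad estimate less likely.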
Algorithms/Data Structures (will be) included in StreamingCC:
- Count-Min Sketch
- Count-Sketch
- AMS Sketch
- Distinct Elements Counter (section 3)
- Reservoir sampling
- Streaming Submodular Maximization
- Bloom Filter and its variants
- ...
Requirements:
- CMake (>= 2.8.7)
- A compiler with C++11 support
- Armadillo (optional, required by some features)
The source code compiles to a static library.
Step 1: clone the repository to your local machine
$ git clone https://github.com/jiecchen/StreamingCC.git
Step 2: compile the library
$ cd StreamingCC/
$ cmake .
$ make
Step 3: install the library system-wide
$ sudo make install
Suppose you have sampling.cc with the following code:
#include <streamingcc>
#include <iostream>
using namespace streamingcc;
int main() {
    // create an object which will maintain
    // 10 samples (with replacement) dynamically
    ReservoirSampler<int> rsmp(10);
    // sample from a data stream with length 1,000,000
    for (int i = 0; i < 1000000; i++)
        rsmp.ProcessItem(i);
    // print the samples
    for (auto sample : rsmp.GetSamples())
        std::cout << sample << " ";
    std::cout << std::endl;
    return 0;
}
Now compile the code:
$ g++ -std=c++11 -O3 -o sampling sampling.cc -lstreamingcc
This generates an executable file named sampling. Run the binary with:
$ ./sampling
916749 93283 843814 534073 877348 445467 369729 163394 67058 212209
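The idea behind this example can be sketched without the library. The code below is a minimal illustration of reservoir sampling with replacement, not StreamingCC's actual implementation: `SimpleReservoir`, `Process`, `Samples` and `Seen` are hypothetical names, and the fixed default seed is an assumption made so that runs are repeatable. Each of the k slots is treated as an independent size-one reservoir: when the n-th item arrives, a slot replaces its content with probability 1/n, which keeps every slot a uniform sample of the items seen so far.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical illustration class; not part of StreamingCC.
// T must be default-constructible, since slots exist before any item arrives.
template <typename T>
class SimpleReservoir {
 public:
  explicit SimpleReservoir(std::size_t num_samples, unsigned seed = 42)
      : samples_(num_samples), seen_(0), rng_(seed) {
    assert(num_samples > 0);
  }

  // Feed one stream item. Every slot independently takes the new item
  // with probability 1 / (number of items seen so far, including this one).
  void Process(const T& item) {
    ++seen_;
    std::uniform_int_distribution<std::size_t> pick(1, seen_);
    for (auto& slot : samples_) {
      if (pick(rng_) == 1) slot = item;
    }
  }

  // Current samples; only meaningful after at least one call to Process.
  const std::vector<T>& Samples() const { return samples_; }

  // Number of items processed so far.
  std::size_t Seen() const { return seen_; }

 private:
  std::vector<T> samples_;
  std::size_t seen_;
  std::mt19937 rng_;
};
```

Memory use depends only on the number of slots, not on the stream length. The per-item cost in this sketch is proportional to the number of slots, which a production implementation would likely reduce.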
- A library for sampling from noisy data; see its readme for more information.
- A library for constructing a coreset of a dataset so that one can do clustering with outliers; see its readme for more information.
- A library for embeddings that preserve edit distance; see its readme for details.
- Add more docs
- Add more python wrappers
- Add more examples
- Unify the interface for the recently added libraries
- Jiecao Chen (currently supported by NSF CCF-1525024)
- Qin Zhang
- Haoyu Zhang
MIT License