Compress indexes #43
 * Will only be called after prepareToRead().
 * @return number of block offsets that will be read back.
 */
public int numBlocks();
This one is a bit problematic to implement, and it forces us to process the whole file at once.
Fixed.
Rewrote LzoTinyOffsets to use the VarInt implementation from Mahout, and got rid of the numBlocks() method in the interface.
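For readers unfamiliar with the encoding, here is a minimal sketch of a zig-zag variable-length int codec of the kind mentioned above. It is not the Mahout code or the code in this PR; the class name `VarIntSketch` and its helpers are made up for illustration. It also shows why a block count is not needed up front: the reader just keeps decoding until the input runs out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the Mahout or hadoop-lzo implementation.
final class VarIntSketch {

    // Writes 7 bits per byte, low bits first; the high bit marks "more bytes follow".
    static void writeUnsignedVarInt(int value, DataOutputStream out) throws IOException {
        while ((value & 0xFFFFFF80) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value & 0x7F);
    }

    // Zig-zag maps small negative and small positive ints to small unsigned ints.
    static void writeSignedVarInt(int value, DataOutputStream out) throws IOException {
        writeUnsignedVarInt((value << 1) ^ (value >> 31), out);
    }

    static int readUnsignedVarInt(DataInputStream in) throws IOException {
        int value = 0;
        int shift = 0;
        int b;
        while (((b = in.readByte()) & 0x80) != 0) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if (shift > 28) {
                throw new IOException("Varint is longer than 5 bytes");
            }
        }
        return value | (b << shift);
    }

    static int readSignedVarInt(DataInputStream in) throws IOException {
        int raw = readUnsignedVarInt(in);
        return (raw >>> 1) ^ -(raw & 1);
    }

    // Convenience wrapper: encodes all values into an in-memory buffer.
    static byte[] encode(int... values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int v : values) {
                writeSignedVarInt(v, out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decodes until the buffer is exhausted, so no count is stored or needed.
    // available() is only a reliable end-of-input test here because the stream is in memory.
    static List<Integer> decode(byte[] data) {
        List<Integer> values = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                values.add(readSignedVarInt(in));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }
}
```

With this scheme, any value whose zig-zag form fits in 7 bits (that is, -64 through 63) takes a single byte, which is where the size savings for small deltas would come from.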
  os.writeInt(firstBlockSize);
  wroteFirstBlock = true;
} else {
  int delta = ((int) (offset - currOffset)) - firstBlockSize;
How about writing the delta from the previous block size?
This will also adapt well to changes in compressibility over a large file (extremely rare).
Then we don't need wroteFirstBlock, since prevBlockSize would start at zero.
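A rough sketch of that suggestion, to make the bookkeeping concrete. The class `PrevBlockDeltaSketch` and its method names are hypothetical, and it assumes the first offset is stored separately; it is not the code in this PR. Because `prevBlockSize` starts at zero, the first delta is simply the first block size, so no special case is needed.

```java
// Hypothetical illustration of "delta from previous block size"; not the PR's code.
final class PrevBlockDeltaSketch {

    // offsets[i] is the byte offset where block i starts (assumed strictly increasing).
    // Returns one delta per block boundary after the first offset.
    static int[] encode(long[] offsets) {
        if (offsets.length == 0) {
            return new int[0];
        }
        int[] deltas = new int[offsets.length - 1];
        long currOffset = offsets[0];
        int prevBlockSize = 0; // zero at the start, so the first delta equals the first block size
        for (int i = 1; i < offsets.length; i++) {
            int blockSize = (int) (offsets[i] - currOffset);
            deltas[i - 1] = blockSize - prevBlockSize;
            prevBlockSize = blockSize;
            currOffset = offsets[i];
        }
        return deltas;
    }

    // Rebuilds the offsets from the first offset and the deltas.
    static long[] decode(long firstOffset, int[] deltas) {
        long[] offsets = new long[deltas.length + 1];
        offsets[0] = firstOffset;
        int prevBlockSize = 0;
        for (int i = 0; i < deltas.length; i++) {
            int blockSize = prevBlockSize + deltas[i];
            offsets[i + 1] = offsets[i] + blockSize;
            prevBlockSize = blockSize;
        }
        return offsets;
    }
}
```

For block starts at 0, 100, 200, 305, 405 the block sizes are 100, 100, 105, 100, so the deltas come out as 100, 0, 5, -5. If the typical block size drifts partway through a file, the deltas stay near zero instead of carrying a constant offset from the first block.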
@sjlee check out this ancient pull request. The goal here is to make LZO indexes significantly smaller, making split calculation, etc., much faster. It's meant to be backwards-compatible (new hadoop-lzo can read both new and old indexes; old hadoop-lzo can't read new indexes, of course). It also introduces versioning, in case we want to mess with this further. If this is interesting, I can take a pass at making this mergeable with current master.
It does sound interesting. Could you give it a shot and let me know? Thanks.
I added an interface for index reading/writing and provided an alternate representation of the index, which should shrink our index files by about 4x. I haven't tested on real data, but unit tests pass. Please comment.
TODO: make the order in which index serdes are tried configurable via properties, à la Hadoop's compression codecs, and make the writer configurable as well (right now I just hardcode the writer implementation).
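One possible shape for that TODO, sketched under assumptions: every name here (`SerdeOrderSketch`, `IndexSerde`, `accepts`, the `-1` marker, the `"tiny"`/`"basic"` labels) is hypothetical and not taken from this PR. The sketch assumes a reader can tell formats apart from the first long in the file, with old indexes starting with a non-negative offset and the new format starting with a negative marker. The ordered list would come from a comma-separated property value, similar in spirit to how Hadoop lists compression codecs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a configurable serde order; not the PR's interface.
final class SerdeOrderSketch {

    interface IndexSerde {
        String name();
        // Decides from the first 8 bytes of the index whether this serde can read it.
        boolean accepts(long firstLong);
    }

    static final class TinySerde implements IndexSerde {
        static final long MAGIC = -1L; // assumed marker; never a valid block offset
        public String name() { return "tiny"; }
        public boolean accepts(long firstLong) { return firstLong == MAGIC; }
    }

    static final class BasicSerde implements IndexSerde {
        public String name() { return "basic"; }
        // Assumption: the old format has no header and starts with a non-negative offset.
        public boolean accepts(long firstLong) { return firstLong >= 0; }
    }

    static Map<String, IndexSerde> knownSerdes() {
        Map<String, IndexSerde> known = new LinkedHashMap<>();
        known.put("tiny", new TinySerde());
        known.put("basic", new BasicSerde());
        return known;
    }

    // Parses a comma-separated property value such as "tiny, basic" into an ordered list.
    static List<IndexSerde> parseOrder(String csv) {
        Map<String, IndexSerde> known = knownSerdes();
        List<IndexSerde> order = new ArrayList<>();
        for (String raw : csv.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            IndexSerde serde = known.get(name);
            if (serde == null) {
                throw new IllegalArgumentException("Unknown index serde: " + name);
            }
            order.add(serde);
        }
        return order;
    }

    // Returns the first serde in the configured order that accepts the file.
    static IndexSerde select(List<IndexSerde> order, long firstLong) {
        for (IndexSerde serde : order) {
            if (serde.accepts(firstLong)) {
                return serde;
            }
        }
        throw new IllegalArgumentException("No configured serde accepts this index");
    }
}
```

A real version would probably load classes by name from the job configuration rather than from a fixed map; the fixed map just keeps the sketch self-contained.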