
user provided compression/decompression for record batches #81

Merged: 7 commits into tychedelia:main on Sep 20, 2024

Conversation

@pdeva (Contributor) commented Sep 15, 2024

PR for #80

This makes #79 a non-issue and allows this library to focus on the Kafka protocol instead of various decompression routines.

@pdeva pdeva changed the title user provided decompression for record batches user provided compression/decompression for record batches Sep 15, 2024
@tychedelia tychedelia requested a review from rukai September 16, 2024 19:08
@tychedelia tychedelia added the enhancement New feature or request label Sep 16, 2024
@tychedelia (Owner) commented
I'm not sure I agree that compression is unrelated to the library's function. Is there a reason we can't improve the ability to configure the particular codecs, rather than rip everything out? Alternatively, we could add a "bring your own compression lib" option without removing the existing options.

As it stands, this PR would cause churn for our existing production users without much benefit to them.

@pdeva (Contributor, Author) commented Sep 16, 2024

This library has never fully supported compression. Two of the four compression algorithms were contributed by members of our team, and we are constantly having to file bug reports to get fixes in.

The compression code simply isn't production ready: it doesn't have the testing it needs, and scenarios like choosing custom compression levels are not covered.

Making compression user provided is the best way to go.

@rukai (Collaborator) commented Sep 16, 2024

I think that compression is very much a part of the Kafka protocol. And since this crate allows users to encode and decode the entire Kafka protocol, it should support all Kafka-supported compression out of the box.

However:

  • We should put each compression algorithm behind a feature flag, enabled by default (see the Cargo.toml sketch below)
  • Exposing a way for users to provide custom compression algorithms could be nice; I can see use cases where a service might be Kafka compatible but support extra compression algorithms.

For your team this should be equivalent to ripping out the existing compression and adding custom compression logic: just disable all the compression features and then provide a custom compression algorithm.
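
For example, if the codecs sat behind default-on Cargo features as suggested above, opting out might look like this (a sketch only; the feature names shown are illustrative, not the crate's actual flags):

```toml
[dependencies]
# Assumed feature layout: disable all built-in codecs and rely on a
# user-provided compression hook instead.
kafka-protocol = { version = "*", default-features = false }

# Or re-enable only the codecs a deployment actually needs:
# kafka-protocol = { version = "*", default-features = false, features = ["gzip", "zstd"] }
```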

If collaborating with upstream is too much overhead for your team, feel free to fork the project and implement it yourself. Since there is a clear split between the Kafka protocol and the record-level protocol (where compression occurs), you can continue using the upstream version of this library for the Kafka protocol and use a fork of this library for the record-level protocol.

FWIW, my project doesn't currently make use of the record-level protocol encoding/decoding, so that's why we haven't encountered any issues with it. We might start using it in the future though.

@pdeva (Contributor, Author) commented Sep 17, 2024

I have modified the PR so the change should work for all parties.

The provided custom compression/decompression method is now an Option. If the user wants to use the library's built-in compress/decompress, they can just pass None; otherwise, they can provide a function via Some(my_func) to do custom compression/decompression as needed.

This way, existing users only have to pass None when upgrading.
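
To make that concrete, here is a minimal sketch of the shape of such an Option-based hook. All names here (`decode_records`, `DecompressFn`, the `Compression` enum as written) are hypothetical stand-ins for illustration, not the exact signatures merged in this PR:

```rust
use bytes::{Bytes, BytesMut};

// Hypothetical stand-in for the crate's compression enum.
#[derive(Clone, Copy)]
pub enum Compression {
    None,
    Gzip,
    Snappy,
    Lz4,
    Zstd,
}

// A user-supplied decompression routine: gets the raw record payload
// and the codec the batch claims to use, returns the decompressed bytes.
pub type DecompressFn =
    fn(&Bytes, Compression) -> Result<BytesMut, Box<dyn std::error::Error>>;

// Sketch of a decode entry point that takes an optional user hook.
pub fn decode_records(
    buf: &Bytes,
    compression: Compression,
    decompress: Option<DecompressFn>,
) -> Result<BytesMut, Box<dyn std::error::Error>> {
    match decompress {
        // The caller supplied a custom routine: delegate to it.
        Some(f) => f(buf, compression),
        // No hook provided: fall back to the library's built-in codecs.
        None => builtin_decompress(buf, compression),
    }
}

fn builtin_decompress(
    buf: &Bytes,
    _compression: Compression,
) -> Result<BytesMut, Box<dyn std::error::Error>> {
    // Placeholder for the built-in codec dispatch.
    Ok(BytesMut::from(&buf[..]))
}
```

With this shape, existing users upgrade by passing `None`, while a team like pdeva's passes `Some(my_decompress)` and never touches the built-in paths.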

@rukai (Collaborator) left a comment

This largely seems fine, but I've left a few inline comments.

Also, there are a few variations on the API I can think of:

  • encode/decode methods could rely on a generic type to provide the custom compression logic without needing an argument to be passed in (sketched after this comment)
    • downside is more complexity
  • we could match the compression API to take in the decode/encode function like we currently do for the default encode/decode logic
    • I suspect the current implementation might have more room for optimization, but it's hard to say without actually attempting such optimization
    • I think there might actually be a third option that would be better than both: maybe we just want to be able to preallocate the second BytesMut to the same size + buffer as the original BytesMut?

But I don't see any of these variations as clearly better than the current PR, so assuming @tychedelia is OK with it, I think let's just go ahead with what we have (after addressing the inline comments I've left).
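
To illustrate the first variation above (the generic-type approach), here is a hedged sketch; all names are hypothetical rather than taken from the crate:

```rust
use bytes::{Bytes, BytesMut};

// Hypothetical trait capturing the compression hook as a type parameter
// instead of a runtime Option argument.
pub trait Decompressor {
    fn decompress(&self, input: &Bytes) -> Result<BytesMut, Box<dyn std::error::Error>>;
}

// The built-in codecs would implement the trait...
pub struct Builtin;

impl Decompressor for Builtin {
    fn decompress(&self, input: &Bytes) -> Result<BytesMut, Box<dyn std::error::Error>> {
        // Placeholder for real codec dispatch (gzip/snappy/lz4/zstd).
        Ok(BytesMut::from(&input[..]))
    }
}

// ...and decode would be generic over it, so the hook is chosen at
// compile time rather than passed as a runtime argument.
pub fn decode_records<D: Decompressor>(
    buf: &Bytes,
    decompressor: &D,
) -> Result<BytesMut, Box<dyn std::error::Error>> {
    decompressor.decompress(buf)
}
```

This shows the trade-off rukai names: the hook is resolved at compile time with no per-call branch, but every call site now carries a type parameter, which is the extra complexity.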

@rukai (Collaborator) left a comment

lgtm, I'll give @tychedelia some time to give any input before merging.

@tychedelia (Owner) left a comment

Thanks for making these changes to support both paths. While I understand you view the existing compression options as too buggy to use, I hope that you'll consider upstreaming any fixes as you learn more.

@tychedelia tychedelia merged commit cabe835 into tychedelia:main Sep 20, 2024
3 checks passed
jshearer added a commit to estuary/flow that referenced this pull request Sep 23, 2024
While investigating the cause of LZ4 compression issues related to franz-go (see comments here #1651), I found `lz4_flex`, a pure-Rust LZ4 implementation that appears to be safer and faster than the `lz4`/`lz4-sys` crates `kafka-protocol` is using. Now that tychedelia/kafka-protocol-rs#81 allows us to use our own compression, and `lz4`'s configuration of block checksums is broken (fix here 10XGenomics/lz4-rs#52), I thought it would be a good time to swap to `lz4_flex`.