Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

14044/enhancement/add xxhash algorithms in expression api #14367

Draft
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

Spaarsh
Copy link
Contributor

@Spaarsh Spaarsh commented Jan 30, 2025

WIP

Which issue does this PR close?

Closes #14044.

Rationale for this change

Lack of xxhash (a quick, non-cryptographic hashing technique) functions.

What changes are included in this PR?

Are these changes tested?

Yes.

Are there any user-facing changes?

The users shall now be able to use xxhash32 and xxhash64 functions.

@Omega359
Copy link
Contributor

Omega359 commented Feb 3, 2025

Can you fix the clippy issue and generate the documentation (./dev/update_function_docs.sh) ?


fn process_array<T: Hasher>(array: &dyn Array, mut hasher: T, hash_type: HashType) -> Result<Vec<String>> {
let mut hash_results: Vec<String> = Vec::with_capacity(array.len());
for i in 0..array.len() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of downcasting the array for every single element in it see if you can do it once up front.

Self {
signature: Signature::uniform(
1,
vec![Utf8View, Utf8, LargeUtf8, Binary, LargeBinary],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The types you say you accept here do not match the types handled in process_array or to_string_from_scalar. You should likely handle all these and the int ones you are handling in those.

Copy link
Contributor Author

@Spaarsh Spaarsh Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I'll fix this. I had actually made several changes that had not been pushed. Due do a glitch in my device they got wiped off 😅

I removed the Int types from being accepted as input since I was thinking that allowing any other form of input other than a string wouldn't be ideal and I am anyway casting them to the Utf8 type. And all the hash functions expect their input to be of that form only. Even the crate I am using expects the same. Allowing for multiple types doesn't serve much of a purpose now that I think about it. I would love your thoughts on this!

And should I convert this issue to draft? Since I am still adding the optional seed argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's still a wip, sure, mark as draft with a WIP in the description helps a lot. Some hashes do allow binary input (See md5 for example, so that is normally reasonable.

@Omega359
Copy link
Contributor

Omega359 commented Feb 3, 2025

Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests.

@Spaarsh Spaarsh marked this pull request as draft February 4, 2025 09:55
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 4, 2025
@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 4, 2025

I will add support for Null values and write the tests as well!

@Spaarsh Spaarsh marked this pull request as ready for review February 5, 2025 04:04
@alamb alamb marked this pull request as draft February 5, 2025 11:02
@alamb
Copy link
Contributor

alamb commented Feb 5, 2025

This PR isn't passing CI checks so I am marking it as draft to clear the review queue.

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 5, 2025

Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests.

@Omega359 I found out that the md5 tests are found in functions.slt and expr.slt as well. While there are many other functions that have their own .slt file as well as tests in expr.slt

Since I am planning to add wywash hash functions soon, should I make a separate hash.slt and include a few tests in the expr.slt as well?

Thanks!

@Omega359
Copy link
Contributor

Omega359 commented Feb 5, 2025

Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests.

@Omega359 I found out that the md5 tests are found in functions.slt and expr.slt as well. While there are many other functions that have their own .slt file as well as tests in expr.slt

Since I am planning to add wywash hash functions soon, should I make a separate hash.slt and include a few tests in the expr.slt as well?

The slt tests are known to need a bit of a cleanup in spots. A hash.slt I think is a good idea and we can have a followup PR to move the other hash functions in there for their tests.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 5, 2025
@Spaarsh
Copy link
Contributor Author

Spaarsh commented Feb 5, 2025

@Omega359 I had not pushed the updated cargo.lock file, hence the tests were failing. I think now this PR is ready for review. The hash.stl file has been added as well. There is a small conflict in the Cargo.lock that needs resolving.

Utf8 | Utf8View | LargeUtf8 => {
let string_array: &dyn Array =
if array.data_type() == &Utf8 || array.data_type() == &Utf8View {
array.as_any().downcast_ref::<StringArray>().unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this works for a StringViewArray which is what an array with data_type Utf8View should be. Can you add a test where the array is of type Utf8View to show that this works?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may want to look at using StringArrayType instead of this pattern. See an example @

match args[0].data_type() {

Copy link
Contributor Author

@Spaarsh Spaarsh Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I entered an extremely long input string (assuming that it is long enough so that some of it might be stored inline or externally depending on its length) and it ran successfully. I will add a test for this.

mut hasher: T,
hash_type: HashType,
) -> Result<Vec<String>> {
let mut hash_results: Vec<String> = Vec::with_capacity(array.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let mut hash_results = StringBuilder::new();

Using the above to build the results and returning that is a bit more efficient than using an intermediate vec and building the StringArray from that.

@Omega359
Copy link
Contributor

Omega359 commented Feb 6, 2025

Thanks for continuing to work through this feature! ❤️

I left a number of feedback items. I think the largest concern I have is the downcast_ref code not handling StringViewArray where you likely want to switch to using StringArrayType.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement xxhash algorithms as part of the expression API
3 participants