-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
14044/enhancement/add xxhash algorithms in expression api #14367
base: main
Are you sure you want to change the base?
14044/enhancement/add xxhash algorithms in expression api #14367
Conversation
Can you fix the clippy issue and generate the documentation (./dev/update_function_docs.sh) ? |
|
||
fn process_array<T: Hasher>(array: &dyn Array, mut hasher: T, hash_type: HashType) -> Result<Vec<String>> { | ||
let mut hash_results: Vec<String> = Vec::with_capacity(array.len()); | ||
for i in 0..array.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of downcasting the array for every single element in it see if you can do it once up front.
Self { | ||
signature: Signature::uniform( | ||
1, | ||
vec![Utf8View, Utf8, LargeUtf8, Binary, LargeBinary], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The types you say you accept here do not match the types handled in process_array
or to_string_from_scalar
. You should likely handle all these and the int ones you are handling in those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I'll fix this. I had actually made several changes that had not been pushed. Due do a glitch in my device they got wiped off 😅
I removed the Int types from being accepted as input since I was thinking that allowing any other form of input other than a string wouldn't be ideal and I am anyway casting them to the Utf8 type. And all the hash functions expect their input to be of that form only. Even the crate I am using expects the same. Allowing for multiple types doesn't serve much of a purpose now that I think about it. I would love your thoughts on this!
And should I convert this issue to draft? Since I am still adding the optional seed argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's still a wip, sure, mark as draft with a WIP in the description helps a lot. Some hashes do allow binary input (See md5 for example, so that is normally reasonable.
Beyond the comments I've added it would be very welcome to include .slt tests for these hash functions. Search for md5 or sha256 for examples of current tests. |
I will add support for Null values and write the tests as well! |
This PR isn't passing CI checks so I am marking it as draft to clear the review queue. |
@Omega359 I found out that the md5 tests are found in functions.slt and expr.slt as well. While there are many other functions that have their own .slt file as well as tests in expr.slt Since I am planning to add wywash hash functions soon, should I make a separate hash.slt and include a few tests in the expr.slt as well? Thanks! |
The slt tests are known to need a bit of a cleanup in spots. A hash.slt I think is a good idea and we can have a followup PR to move the other hash functions in there for their tests. |
@Omega359 I had not pushed the updated cargo.lock file, hence the tests were failing. I think now this PR is ready for review. The hash.stl file has been added as well. There is a small conflict in the Cargo.lock that needs resolving. |
Utf8 | Utf8View | LargeUtf8 => { | ||
let string_array: &dyn Array = | ||
if array.data_type() == &Utf8 || array.data_type() == &Utf8View { | ||
array.as_any().downcast_ref::<StringArray>().unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe this works for a StringViewArray which is what an array with data_type Utf8View should be. Can you add a test where the array is of type Utf8View to show that this works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may want to look at using StringArrayType instead of this pattern. See an example @
match args[0].data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I entered an extremely long input string (assuming that it is long enough so that some of it might be stored inline or externally depending on its length) and it ran successfully. I will add a test for this.
mut hasher: T, | ||
hash_type: HashType, | ||
) -> Result<Vec<String>> { | ||
let mut hash_results: Vec<String> = Vec::with_capacity(array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let mut hash_results = StringBuilder::new();
Using the above to build the results and returning that is a bit more efficient than using an intermediate vec and building the StringArray from that.
Thanks for continuing to work through this feature! ❤️ I left a number of feedback items. I think the largest concern I have is the downcast_ref code not handling StringViewArray where you likely want to switch to using StringArrayType. |
WIP
Which issue does this PR close?
Closes #14044.
Rationale for this change
Lack of xxhash (a quick, non-cryptographic hashing technique) functions.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
The users shall now be able to use xxhash32 and xxhash64 functions.