The `dict_id` was lost when constructing the logical plan #6784
Comments
I would love to know what you think about `dict_id` handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with

I am also not 100% clear if `dict_id` is supposed to be (potentially) different per record batch or if it would be the same for the entire plan.

One thing that might be possible is to compare the pointer for the dictionary array to decide whether it was the same dictionary, rather than trying to keep
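The pointer-comparison idea above can be sketched with plain `std` types standing in for Arrow arrays (the `DictArray` struct here is a hypothetical stand-in, not the real arrow-rs type): two arrays "share a dictionary" exactly when their values buffers are the same allocation, which `Arc::ptr_eq` checks without comparing contents.

```rust
use std::sync::Arc;

// Hypothetical stand-in for an Arrow dictionary array: `values` is the
// shared dictionary, `keys` index into it. Real arrow-rs types differ.
struct DictArray {
    values: Arc<Vec<String>>,
    keys: Vec<usize>,
}

// Pointer identity, not content equality: same allocation => same dictionary.
fn same_dictionary(a: &DictArray, b: &DictArray) -> bool {
    Arc::ptr_eq(&a.values, &b.values)
}

fn main() {
    let dict = Arc::new(vec!["a".to_string(), "b".to_string()]);
    let x = DictArray { values: Arc::clone(&dict), keys: vec![0, 1] };
    let y = DictArray { values: Arc::clone(&dict), keys: vec![1, 0] };
    // Equal contents, but a different allocation:
    let z = DictArray { values: Arc::new(vec!["a".to_string(), "b".to_string()]), keys: vec![0] };

    assert!(same_dictionary(&x, &y));
    assert!(!same_dictionary(&x, &z));
}
```

Note the trade-off: identity-based sharing avoids tracking ids through the plan, but it only works within one process; IPC still needs a stable id to tie dictionary batches to fields on the wire.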
Thank you very much for your reply. I think that after providing a reasonable

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.
I wonder if you can explain how you are using
Thank you for your efforts so far -- I did look (briefly) at the code, and while it follows the existing patterns for metadata / nullable, it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use).
Thank you for your reply. I got what you mean. Currently
Got it -- thank you for the clarification @tanruixiang
I agree -- I think the right place to start might be in arrow-rs (aka where the kernels that implement
I'm not sure this is something that can be handled at this level, as the `dict_id` is part of the schema, not the arrays. I think it needs to be handled at the DF level -- perhaps by selecting specialized dictionary operators.
🤔 it might be time to write up some sort of description of how this would work more generally |
## Rationale
When a schema has more than one dictionary field, DataFusion may lose the `dict_id`, which causes an error when encoding the record batch; a client decoding the record batch via IPC will then throw the following error:
```
DecodeArrowPayload(InvalidArgumentError("Value at position 0 out of bounds: 0 (should be in [0, -1])"))
```
More context: apache/datafusion#6784

## Detailed Changes
- Assign a unique dict id when there is more than one dictionary field.

## Test Plan
UT and integration tests.
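The workaround in that PR -- give every dictionary field its own id -- can be sketched as follows. The `Field` struct below is a minimal hand-rolled stand-in, not `arrow_schema::Field`; the point is only that two distinct dictionary fields must not both end up with id 0 after plan construction.

```rust
// Minimal stand-in for a schema field carrying a dictionary id.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    is_dictionary: bool,
    dict_id: i64,
}

// Reassign a unique dict_id to every dictionary field in the schema,
// so IPC encoding can tell the dictionaries apart.
fn assign_unique_dict_ids(fields: &mut [Field]) {
    let mut next_id = 0i64;
    for f in fields.iter_mut() {
        if f.is_dictionary {
            f.dict_id = next_id;
            next_id += 1;
        }
    }
}

fn main() {
    // Both dictionary fields came out of the logical plan with dict_id 0,
    // which is what breaks IPC encoding when there is more than one.
    let mut fields = vec![
        Field { name: "tag1".into(), is_dictionary: true, dict_id: 0 },
        Field { name: "value".into(), is_dictionary: false, dict_id: 0 },
        Field { name: "tag2".into(), is_dictionary: true, dict_id: 0 },
    ];
    assign_unique_dict_ids(&mut fields);
    assert_eq!(fields[0].dict_id, 0);
    assert_eq!(fields[2].dict_id, 1);
}
```

This papers over the loss downstream rather than preserving the original ids through the plan, which is why the thread keeps discussing a proper fix in `to_field`/`ExprSchema`.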
Describe the bug
One of the simplest SQL statements:
```sql
select * from table;
```
In the construction of the logical plan, `Projection` uses `to_field` at the bottom to construct `DFField`, and `to_field` ignores the `dict_id`. This will lead to encoding errors when using IPC if there are dictionary columns. We are glad to contribute to the community and solve this problem.

To solve this problem, it may be necessary to add interfaces to `ExprSchema`, for example by adding `dict_is_ordered` and `dict_id` interfaces, or by adding a direct `get_dffield` interface. Both methods involve a certain amount of work, and we are not sure which one to use or whether there is a better way than the two mentioned above. We hope the community can provide some comments and help.

Here is a draft of one of the methods:
CeresDB#3
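The bug mechanism described above can be modeled in a few lines. `SrcField` and the two conversion functions below are hypothetical stand-ins for DataFusion's `Expr::to_field` / `DFField`, not the real API; they only illustrate how a lowering step that rebuilds fields from name/nullability alone silently drops the dictionary id.

```rust
// Hypothetical stand-in for a schema field; not DataFusion's DFField.
#[derive(Clone, Debug, PartialEq)]
struct SrcField {
    name: String,
    dict_id: Option<i64>,
}

// Buggy lowering: rebuilds the field but resets dict_id, as the issue describes.
fn to_field_lossy(f: &SrcField) -> SrcField {
    SrcField { name: f.name.clone(), dict_id: None }
}

// Proposed fix: carry dict_id through, e.g. via a dict_id() accessor on ExprSchema.
fn to_field_preserving(f: &SrcField) -> SrcField {
    SrcField { name: f.name.clone(), dict_id: f.dict_id }
}

fn main() {
    let src = SrcField { name: "tag".into(), dict_id: Some(7) };
    assert_eq!(to_field_lossy(&src).dict_id, None);         // the bug
    assert_eq!(to_field_preserving(&src).dict_id, Some(7)); // the fix
}
```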
To Reproduce
No response
Expected behavior
No response
Additional context
No response