
The dict_id was lost when constructing the logical plan. #6784

Open
tanruixiang opened this issue Jun 28, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@tanruixiang
Member

tanruixiang commented Jun 28, 2023

Describe the bug

Consider one of the simplest SQL statements: `select * from table;`. When constructing the logical plan, Projection ultimately uses to_field to build each DFField, and to_field ignores the dict_id. If the schema contains dictionary columns, this leads to encoding errors when using IPC.
We would be glad to contribute a fix for this problem.
Solving it may require adding interfaces to ExprSchema, for example dict_is_ordered and dict_id accessors, or a direct get_dffield interface; there may also be a better approach than these two. Both methods involve a certain amount of work, and we are not sure which one to use or whether there is a better way. We hope the community can provide some comments and help.

Here is a draft of one of the methods:
CeresDB#3
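
To illustrate the loss, here is a minimal sketch (assuming arrow's `Field` API; the `rebuilt` conversion below is a stand-in for what `to_field` effectively does, not the actual DataFusion code):

```rust
use arrow_schema::{DataType, Field};

fn main() {
    // A dictionary-encoded column carrying an explicit dict_id.
    let original = Field::new_dict(
        "tag",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
        42,    // dict_id
        false, // dict_is_ordered
    );
    assert_eq!(original.dict_id(), Some(42));

    // Rebuilding the field from only name / type / nullability silently
    // resets dict_id to the default of 0, so the original id is gone.
    let rebuilt = Field::new(
        original.name().clone(),
        original.data_type().clone(),
        original.is_nullable(),
    );
    assert_eq!(rebuilt.dict_id(), Some(0));
}
```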

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@tanruixiang tanruixiang added the bug Something isn't working label Jun 28, 2023
@alamb
Contributor

alamb commented Jun 30, 2023

I would love to know what you think about dict_id handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with metadata which can be hung off a schema or a field and gets lost frequently

I am also not 100% clear if dict_id is supposed to be (potentially) different per record batch or if it would be the same for the entire plan

One thing that might be possible is to compare the pointer for the dictionary array to decide if it was the same dictionary rather than trying to keep dict_id all the way through the plan 🤔
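
For what it's worth, a sketch of that pointer-comparison idea (hypothetical helper; `DictionaryArray::values` returns an `ArrayRef`, so `Arc::ptr_eq` can test whether two arrays physically share a dictionary):

```rust
use std::sync::Arc;

use arrow_array::{types::Int32Type, DictionaryArray};

/// Hypothetical helper: true only if both arrays point at the exact same
/// dictionary values buffer (pointer equality on the underlying Arc).
fn same_dictionary(a: &DictionaryArray<Int32Type>, b: &DictionaryArray<Int32Type>) -> bool {
    Arc::ptr_eq(a.values(), b.values())
}

fn main() {
    let a: DictionaryArray<Int32Type> = vec!["x", "y", "x"].into_iter().collect();
    let b = a.clone(); // clones bump the Arc, so the dictionary is shared
    assert!(same_dictionary(&a, &b));

    let c: DictionaryArray<Int32Type> = vec!["x", "y", "x"].into_iter().collect();
    assert!(!same_dictionary(&a, &c)); // equal contents, different allocation
}
```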

@tanruixiang
Member Author

tanruixiang commented Jul 2, 2023

I would love to know what you think about dict_id handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with metadata which can be hung off a schema or a field and gets lost frequently

I am also not 100% clear if dict_id is supposed to be (potentially) different per record batch or if it would be the same for the entire plan

One thing that might be possible is to compare the pointer for the dictionary array to decide if it was the same dictionary rather than trying to keep dict_id all the way through the plan 🤔

Thank you very much for your reply. I think that, given a reasonable provider, dict_id should not be lost when building the plan: dict_id and other variables such as name live at the same level (in `pub struct Field`), and since name is not lost, dict_id should not be lost either.
At the point where we get the name we can also get the dict_id, so we should keep the dict_id at least for the whole process of building the plan. (In particular, I think it is always better to keep the dict_id whenever possible than to discard it and fall back to the default value.)
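
For reference, a simplified excerpt of arrow-rs's `Field` (illustrative, not the exact definition), showing that `dict_id` sits right next to `name`:

```rust
use std::collections::HashMap;

use arrow_schema::DataType;

// Simplified excerpt: `dict_id` lives at the same level as `name`, so a
// conversion that preserves `name` could just as easily preserve `dict_id`.
#[allow(dead_code)]
pub struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    dict_id: i64,
    dict_is_ordered: bool,
    metadata: HashMap<String, String>,
}
```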

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.

@alamb
Contributor

alamb commented Jul 2, 2023

I wonder if you can explain how you are using dict_id in Schema? That field in particular is quite old in the code, and @tustvold and I have discussed at various times removing it, as it doesn't seem to be widely used / supported. However, we couldn't convince ourselves that was a good idea -- mostly because we don't know what people are using it for

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.

Thank you for your efforts so far -- I did look (briefly) at the code, and while it follows the existing patterns for metadata / nullable, it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use)

@tanruixiang
Member Author

tanruixiang commented Jul 9, 2023

I wonder if you can explain how you are using dict_id in Schema? That field in particular is quite old in the code, and @tustvold and I have discussed at various times removing it, as it doesn't seem to be widely used / supported. However, we couldn't convince ourselves that was a good idea -- mostly because we don't know what people are using it for

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.

Thank you for your efforts so far -- I did look (briefly) at the code, and while it follows the existing patterns for metadata / nullable, it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use)

Thank you for your reply, I see what you mean. Currently dict_id is only used for IPC; my point is that we could make further use of dict_id and dict_is_ordered. For some operations, such as Eq, if the left and right sides have the same dict_id, we only need to compare whether the keys are the same (for string types, the per-comparison complexity drops from O(min(left_string_length, right_string_length)) to O(1)). For sorting operations, further speedups could be built on dict_is_ordered.
Of course, this is a relatively big feature.
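
A rough sketch of that fast path (hypothetical function; it assumes both arrays are already known to share the same dictionary, e.g. via a matching dict_id, which is exactly the information that gets lost today):

```rust
use arrow_array::{types::Int32Type, BooleanArray, DictionaryArray};

/// Hypothetical fast path: when two dictionary arrays share the same
/// dictionary, element equality reduces to comparing the integer keys,
/// with no string comparison at all.
fn eq_same_dict(
    left: &DictionaryArray<Int32Type>,
    right: &DictionaryArray<Int32Type>,
) -> BooleanArray {
    left.keys()
        .iter()
        .zip(right.keys().iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(l == r), // O(1) per row
            _ => None,                          // propagate nulls
        })
        .collect()
}
```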

@alamb
Contributor

alamb commented Jul 10, 2023

Got it -- thank you for the clarification @tanruixiang

Of course, this is a relatively big feature.

I agree -- I think the right place to start might be in arrow-rs (aka where the kernels that implement Eq live), and then once we have a good story about how that will work we can update/adjust DataFusion to account for them

@tustvold
Contributor

I agree -- I think the right place to start might be in arrow-rs (aka where the kernels that implement Eq live)

I'm not sure this is something that can be handled at that level, as the dict_id is part of the schema, not the arrays. I think it needs to be handled at the DF level - perhaps by selecting specialized dictionary operators

@alamb
Contributor

alamb commented Jul 10, 2023

🤔 it might be time to write up some sort of description of how this would work more generally

jiacai2050 added a commit to apache/horaedb that referenced this issue Jul 14, 2023
## Rationale
When a schema has more than one dictionary field, DataFusion may lose the
`dict_id`, which causes an error when encoding the record batch; a client
decoding the record batch via IPC then throws the following error:
```
DecodeArrowPayload(InvalidArgumentError("Value at position 0 out of bounds: 0 (should be in [0, -1])"))
```
More context: apache/datafusion#6784
## Detailed Changes
- Assign a unique dict_id when there is more than one dictionary field.

## Test Plan
UT and integration tests.
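
(A sketch of that kind of workaround, as a hypothetical helper rather than HoraeDB's actual code: rebuild the schema so each dictionary field gets its own dict_id instead of all defaulting to 0.)

```rust
use arrow_schema::{DataType, Field, Schema};

/// Hypothetical workaround: give every dictionary field in a schema its
/// own distinct dict_id, so IPC dictionary tracking stays unambiguous.
fn assign_unique_dict_ids(schema: &Schema) -> Schema {
    let mut next_id = 0i64;
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .map(|f| {
            if matches!(f.data_type(), DataType::Dictionary(_, _)) {
                let id = next_id;
                next_id += 1;
                Field::new_dict(
                    f.name().clone(),
                    f.data_type().clone(),
                    f.is_nullable(),
                    id,
                    f.dict_is_ordered().unwrap_or(false),
                )
            } else {
                f.as_ref().clone()
            }
        })
        .collect();
    Schema::new(fields)
}
```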