
The dict_id was lost when constructing the logical plan. #6784

Open
tanruixiang opened this issue Jun 28, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@tanruixiang
Member

tanruixiang commented Jun 28, 2023

Describe the bug

Consider one of the simplest SQL statements: `select * from table;`. When constructing the logical plan, Projection ultimately uses to_field to build each DFField, and to_field ignores the dict_id. If the schema contains dictionary columns, this leads to encoding errors when using IPC.
We would be glad to contribute a fix for this problem.
Solving it may require adding interfaces to ExprSchema, for example dict_is_ordered and dict_id accessors, or a direct get_dffield interface; there may also be a better approach than these two. Both methods involve a certain amount of work, and we are not sure which one to use or whether there is a better way. We hope the community can provide some comments and help.

Here is a draft of one of the methods:
CeresDB#3
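
To illustrate the loss, here is a minimal sketch (assuming arrow's `Field` API; the `rebuilt` conversion below is a stand-in for what `to_field` effectively does, not the actual DataFusion code):

```rust
use arrow_schema::{DataType, Field};

fn main() {
    // A dictionary-encoded column carrying an explicit dict_id.
    let original = Field::new_dict(
        "tag",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        true,
        42,    // dict_id
        false, // dict_is_ordered
    );
    assert_eq!(original.dict_id(), Some(42));

    // Rebuilding the field from only name / type / nullability silently
    // resets dict_id to the default of 0, so the original id is gone.
    let rebuilt = Field::new(
        original.name().clone(),
        original.data_type().clone(),
        original.is_nullable(),
    );
    assert_eq!(rebuilt.dict_id(), Some(0));
}
```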

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@tanruixiang tanruixiang added the bug Something isn't working label Jun 28, 2023
@alamb
Contributor

alamb commented Jun 30, 2023

I would love to know what you think about dict_id handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with metadata which can be hung off a schema or a field and gets lost frequently

I am also not 100% clear if dict_id is supposed to be (potentially) different per record batch or if it would be the same for the entire plan

One thing that might be possible is to compare the pointer for the dictionary array to decide if it was the same dictionary rather than trying to keep dict_id all the way through the plan 🤔
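
For what it's worth, a sketch of that pointer-comparison idea (hypothetical helper; `DictionaryArray::values` returns an `ArrayRef`, so `Arc::ptr_eq` can test whether two arrays physically share a dictionary):

```rust
use std::sync::Arc;

use arrow_array::{types::Int32Type, DictionaryArray};

/// Hypothetical helper: true only if both arrays point at the exact same
/// dictionary values buffer (pointer equality on the underlying Arc).
fn same_dictionary(a: &DictionaryArray<Int32Type>, b: &DictionaryArray<Int32Type>) -> bool {
    Arc::ptr_eq(a.values(), b.values())
}

fn main() {
    let a: DictionaryArray<Int32Type> = vec!["x", "y", "x"].into_iter().collect();
    let b = a.clone(); // clones bump the Arc, so the dictionary is shared
    assert!(same_dictionary(&a, &b));

    let c: DictionaryArray<Int32Type> = vec!["x", "y", "x"].into_iter().collect();
    assert!(!same_dictionary(&a, &c)); // equal contents, different allocation
}
```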

@tanruixiang
Member Author

tanruixiang commented Jul 2, 2023

I would love to know what you think about dict_id handling in general -- from what I can see so far it is not well supported in arrow-rs. We have similar problems with metadata which can be hung off a schema or a field and gets lost frequently

I am also not 100% clear if dict_id is supposed to be (potentially) different per record batch or if it would be the same for the entire plan

One thing that might be possible is to compare the pointer for the dictionary array to decide if it was the same dictionary rather than trying to keep dict_id all the way through the plan 🤔

Thank you very much for your reply. I think that, given a reasonable provider, dict_id should not be lost when building the plan: dict_id and other variables such as name live at the same level (in `pub struct Field`), and since name is not lost, dict_id should not be lost either.
At the point where we get the name we can also get the dict_id, so we should keep the dict_id at least for the whole process of building the plan. (In particular, I think it is always better to keep the dict_id whenever possible than to discard it and fall back to the default value.)
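
For reference, a simplified excerpt of arrow-rs's `Field` (illustrative, not the exact definition), showing that `dict_id` sits right next to `name`:

```rust
use std::collections::HashMap;

use arrow_schema::DataType;

// Simplified excerpt: `dict_id` lives at the same level as `name`, so a
// conversion that preserves `name` could just as easily preserve `dict_id`.
#[allow(dead_code)]
pub struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    dict_id: i64,
    dict_is_ordered: bool,
    metadata: HashMap<String, String>,
}
```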

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.

@alamb
Contributor

alamb commented Jul 2, 2023

I wonder if you can explain how you are using dict_id in Schema? That field in particular is quite old in the code, and @tustvold and I have discussed at various times removing it, as it doesn't seem to be widely used / supported. However, we couldn't convince ourselves that was a good idea -- mostly because we don't know what people are using it for

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.

Thank you for your efforts so far -- I did look (briefly) at the code, and while it follows the existing patterns for metadata / nullable, it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use)

@tanruixiang
Member Author

tanruixiang commented Jul 9, 2023

I wonder if you can explain how you are using dict_id in Schema? That field in particular is quite old in the code, and @tustvold and I have discussed at various times removing it, as it doesn't seem to be widely used / supported. However, we couldn't convince ourselves that was a good idea -- mostly because we don't know what people are using it for

By the way, if you think my draft implementation works, I'd be happy to continue implementing it and contributing to the community.

Thank you for your efforts so far -- I did look (briefly) at the code, and while it follows the existing patterns for metadata / nullable, it felt to me like it was going to be hard to ensure all cases were covered properly (aka it was going to be hard to use)

Thank you for your reply, I see what you mean. Currently dict_id is only used for IPC; my point is that we could make further use of dict_id and dict_is_ordered. For some operations, such as Eq, if the left and right sides have the same dict_id, we only need to compare whether the keys are the same (for string types, the per-comparison complexity drops from O(min(left_string_length, right_string_length)) to O(1)). For sorting operations, further speedups could be built on dict_is_ordered.
Of course, this is a relatively big feature.
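
A rough sketch of that fast path (hypothetical function; it assumes both arrays are already known to share the same dictionary, e.g. via a matching dict_id, which is exactly the information that gets lost today):

```rust
use arrow_array::{types::Int32Type, BooleanArray, DictionaryArray};

/// Hypothetical fast path: when two dictionary arrays share the same
/// dictionary, element equality reduces to comparing the integer keys,
/// with no string comparison at all.
fn eq_same_dict(
    left: &DictionaryArray<Int32Type>,
    right: &DictionaryArray<Int32Type>,
) -> BooleanArray {
    left.keys()
        .iter()
        .zip(right.keys().iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => Some(l == r), // O(1) per row
            _ => None,                          // propagate nulls
        })
        .collect()
}
```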

@alamb
Contributor

alamb commented Jul 10, 2023

Got it -- thank you for the clarification @tanruixiang

Of course, this is a relatively big feature.

I agree -- I think the right place to start might be in arrow-rs (aka where the kernels that implement Eq live), and then once we have a good story about how that will work we can update/adjust DataFusion to account for them

@tustvold
Contributor

I agree -- I think the right place to start might be in arrow-rs (aka where the kernels that implement Eq live)

I'm not sure this is something that can be handled at that level, as the dict_id is part of the schema, not the arrays. I think it needs to be handled at the DF level - perhaps by selecting specialized dictionary operators

@alamb
Contributor

alamb commented Jul 10, 2023

🤔 it might be time to write up some sort of description of how this would work more generally

jiacai2050 added a commit to apache/horaedb that referenced this issue Jul 14, 2023
## Rationale
When a schema has more than one dictionary field, DataFusion may lose the
`dict_id`, which causes an error when encoding the record batch; a client
decoding the record batch via IPC then throws the following error:
```
DecodeArrowPayload(InvalidArgumentError("Value at position 0 out of bounds: 0 (should be in [0, -1])"))
```
More context: apache/datafusion#6784
## Detailed Changes
- Assign a unique dict_id when there is more than one dictionary field.

## Test Plan
UT and integration tests.
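
(A sketch of that kind of workaround, as a hypothetical helper rather than HoraeDB's actual code: rebuild the schema so each dictionary field gets its own dict_id instead of all defaulting to 0.)

```rust
use arrow_schema::{DataType, Field, Schema};

/// Hypothetical workaround: give every dictionary field in a schema its
/// own distinct dict_id, so IPC dictionary tracking stays unambiguous.
fn assign_unique_dict_ids(schema: &Schema) -> Schema {
    let mut next_id = 0i64;
    let fields: Vec<Field> = schema
        .fields()
        .iter()
        .map(|f| {
            if matches!(f.data_type(), DataType::Dictionary(_, _)) {
                let id = next_id;
                next_id += 1;
                Field::new_dict(
                    f.name().clone(),
                    f.data_type().clone(),
                    f.is_nullable(),
                    id,
                    f.dict_is_ordered().unwrap_or(false),
                )
            } else {
                f.as_ref().clone()
            }
        })
        .collect();
    Schema::new(fields)
}
```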