Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating indexes on compressed columns #17

Open
Hakkin opened this issue Sep 12, 2022 · 3 comments
Open

Creating indexes on compressed columns #17

Hakkin opened this issue Sep 12, 2022 · 3 comments

Comments

@Hakkin
Copy link

Hakkin commented Sep 12, 2022

I tried using this on a database I'm using to store a lot of raw JSON objects, which I then have a view that uses SQLite's JSON support to extract the contents for querying. I originally had a few indexes on JSON expressions (something like CREATE INDEX idx_data_uuid ON data(json_extract(body, '$.data.uuid'))). SQLite's query planner uses this index even when querying from the view, but after compressing the data, there's no more obvious way to create indexes like this since you can't create indexes on views.

I tried creating an index like _data_zstd(json_extract(zstd_decompress_col(body, 1, _body_dict, true), '$.data.uuid')), but SQLite's query planner won't select this when querying from a view. I can make it use the index by replacing my data_view.uuid with the literal json_extract(zstd_decompress_col(body, 1, _body_dict, true), '$.data.uuid') in the SELECT statement, but that kind of defeats the purpose of the view

I see proper index support is on the "Future Work" section, so I assume there may not be any way to do this currently, but I wonder if you have any hints or experience with being able to coerce the query planner into using an index like the one above, since I believe it "should" be possible in theory.

In any case, I hope work on this extension continues, since the results I got were extremely promising. My database compressed to around 6% the original size, and for queries that didn't need those indexes, there was seemingly no loss in performance.

@Hakkin
Copy link
Author

Hakkin commented Sep 12, 2022

While looking around I actually just stumbled on your original blog post for this project which more or less describes this exact situation to a tee, you would think I was copying straight from the blog post.

I see your suggested solution in the blog is to just break out the values you want to index into real columns, which I suppose works well enough, though proper indexing support would definitely be nice. I wonder if you ever manged to get anything else working with GENERATED columns?

@phiresky
Copy link
Owner

Sadly I don't think there's any solution for this right now except storing a copy of the extracted values separately. I'm not sure why SQLite doesn't use the existing index when querying the view even though the expression should be the same. It might be useful to ask this in the SQLite forum, it might be something that's easy to fix for the SQLite devs in an update.

With GENERATED columns the idea is that the whole thing would basically not use a view and backing table, but instead it would just add the _xxxx_dict to the original table, rename the xxxx column to _xxxx_zstd, and add add column xxxx generated always as (zstd_decompress_col(_xxxx_zstd, 1, _xxxx_dict, true). Then an index on json_extract(xxxx, ...) might work, and it should have lower overhead compared to using views.

I haven't tried this yet (should be easy to create an example manually even with the extension in it's current state), but I think there was some logical issue with it that I don't remember right now.

@Hakkin
Copy link
Author

Hakkin commented Sep 14, 2022

Just tested it out, can confirm adding (virtual) generated columns to the _zstd table (ALTER TABLE _data_zstd ADD COLUMN uuid TEXT GENERATED ALWAYS AS (json_extract(zstd_decompress_col(body, 1, _body_dict, true), '$.data.uuid')) VIRTUAL), creating an index on that column (CREATE INDEX idx_data_uuid ON _data_zstd(uuid)), modifying the view the extension creates to also SELECT the new row, then modify my own view to SELECT that row instead of doing json_extract on the body row, does work!

It's a little contrived, but it definitely works and allows me to use this more or less as is with a little housework on the schema.
The fact that the generated column is VIRTUAL as opposed to STORED also means there should be no additional size overhead besides the index itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants