Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sorting of the *-bucket consistent and controllable #439

Open
pdobacz opened this issue Aug 23, 2022 · 3 comments
Open

Make sorting of the *-bucket consistent and controllable #439

pdobacz opened this issue Aug 23, 2022 · 3 comments

Comments

@pdobacz
Copy link
Contributor

pdobacz commented Aug 23, 2022

The spot taken by the *-bucket row in the results of a query can vary from query to query (and sometimes for the same query issued before or after ANALYZE;!).

In a nutshell, the Sort plan node may or may not be pushed down by the planner, to a spot before the Agg / BucketScan nodes. If it is then the ORDER BY sorting will not impact it, meaning it will come first. In other circumstances, if the Sort node comes last, it is impacted by ORDER BY.

Example query where it didn't get pushed down:

SELECT age, count(*), diffix.is_suppress_bin(*) FROM customers GROUP BY 1 ORDER BY 1 NULLS FIRST
 Sort
   Sort Key: age
   ->  Custom Scan (BucketScan)
         ->  HashAggregate
               Group Key: age
               ->  Seq Scan on customers

And where it did (because id has many unique columns per ANALYZE;, and the HashAggregate has been changed to Sort -> GroupAggregate:

prop_test=# explain SELECT id, count(*), diffix.is_suppress_bin(*) FROM customers GROUP BY 1 ORDER BY 1 NULLS LAST;
 Custom Scan (BucketScan)
   ->  GroupAggregate
         Group Key: id
         ->  Sort
               Sort Key: id
               ->  Seq Scan on customers

Ideas from slack thread:

  1. ORDER BY volatile_func(id) where that function does nothing but is marked as VOLATILE
  2. Extract the ORDER BY clause from the query plan and insert the *-bucket conformant to this order. As in, whilst we emit the buckets, we "wait" for the first bucket which doesn't satisfy the sorting criteria, and then emit the *-bucket
  3. prepend Append an ORDER BY is_suppress_bin(*) and let the planner figure it out
@pdobacz
Copy link
Contributor Author

pdobacz commented Aug 23, 2022

Unfortunately, when trying (3.) manually, I ran into more problems, where the *-bucket was missorted, likely due to an Incremental Sort node being used:

SET pg_diffix.session_access_level = 'direct'; DELETE FROM pg_seclabel WHERE provider = 'pg_diffix' AND label = 'aid'; SECURITY LABEL FOR pg_diffix ON COLUMN taxi.hack IS 'aid'; SET pg_diffix.strict = off; SET pg_diffix.noise_layer_sd = 0.0;SET pg_diffix.low_count_layer_sd = 0.5 ; SET pg_diffix.low_count_min_threshold = 1;SET pg_diffix.low_count_mean_gap = 0;SET pg_diffix.outlier_count_min = 0;SET pg_diffix.outlier_count_max = 0;SET pg_diffix.top_count_min = 1;SET pg_diffix.top_count_max = 1; SET pg_diffix.salt = "diffix"; SET pg_diffix.session_access_level = 'anonymized_trusted'; \pset pager off

SELECT sf_flag, dropoff_datetime FROM taxi GROUP BY 1, 2 ORDER BY 1 NULLS FIRST, 2 NULLS FIRST, diffix.is_suppress_bin(*) DESC LIMIT 40;

(...SNIP...)
         | 2013-01-08 00:24:00
         | 2013-01-08 00:24:21
         | 2013-01-08 00:25:00
         | 2013-01-08 00:26:00
         | 2013-01-08 00:27:00
 *       | *
         | 2013-01-08 00:28:00
         | 2013-01-08 00:29:00
         | 2013-01-08 00:30:00
         | 2013-01-08 00:31:00
         | 2013-01-08 00:32:00
         | 2013-01-08 00:33:00
         | 2013-01-08 00:34:00
         | 2013-01-08 00:35:00
(40 rows)

As you can see above, the *-bucket row doesn't observe the requested sorting. reference is getting this right.

The Incremental Sort node:

prop_test=# explain SELECT sf_flag, dropoff_datetime FROM taxi GROUP BY 1, 2 ORDER BY 1 NULLS FIRST, 2 NULLS FIRST, diffix.is_suppress_bin(*) DESC LIMIT 40;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Limit
   ->  Incremental Sort
         Sort Key: sf_flag NULLS FIRST, dropoff_datetime NULLS FIRST, (diffix.is_suppress_bin(*)) DESC
         Presorted Key: sf_flag, dropoff_datetime
         ->  Custom Scan (BucketScan)
               ->  GroupAggregate
                     Group Key: sf_flag, dropoff_datetime
                     ->  Sort
                           Sort Key: sf_flag NULLS FIRST, dropoff_datetime NULLS FIRST
                           ->  Seq Scan on taxi

@edongashi
Copy link
Member

Can we use get_relation_stats hook to give the column a distribution that does not trigger that plan path?

@pdobacz
Copy link
Contributor Author

pdobacz commented Aug 24, 2022

Can we use get_relation_stats hook to give the column a distribution that does not trigger that plan path?

This sounds a little complex and might have some unexpected side-effects - judging just after throwing a first glance. Let's keep it as another possible solution.

Also: SET enable_incremental_sort=off; seems to fix the Incremental Sort problem, but might not be the solution we're looking for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants