
Hash Join of a hypertable with an index over an integer field with fewer than 200 join keys is suboptimal #2426

Open
leventov opened this issue Sep 21, 2020 · 1 comment


This is extracted from timescale/pg_prometheus#55, but is not specific to pg_prometheus.

Schema

CREATE TABLE IF NOT EXISTS series_values (
    time TIMESTAMPTZ NOT NULL, 
    value DOUBLE PRECISION NOT NULL, 
    series_id INTEGER NOT NULL,
    seq BIGSERIAL
);

SELECT create_hypertable('series_values', 'time');

ALTER TABLE series_values SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'series_id',
  timescaledb.compress_orderby = 'time DESC, seq DESC'
);

CREATE INDEX IF NOT EXISTS series_values_series_id_idx ON series_values USING BTREE (series_id, time desc);

Our data has about 250 unique series, each with 1 data point per second, i.e. about 250 data points per second in total.

Problem - Hash Join when selecting just 192 indexed series_id values is suboptimal

EXPLAIN ANALYZE SELECT series_id, avg("value")
from series_values
where series_id IN (select id from series where metric = 'some_metric_name')
  and time between '2020-09-19 09:00:00' and '2020-09-19 11:00:00'
group by series_id;

                                                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=65310.35..65386.85 rows=200 width=12) (actual time=24017.786..24023.399 rows=192 loops=1)
   Group Key: _hyper_1_40_chunk.series_id
   ->  Gather Merge  (cost=65310.35..65381.35 rows=600 width=36) (actual time=24017.712..24106.440 rows=768 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Sort  (cost=64310.31..64310.81 rows=200 width=36) (actual time=23901.632..23901.857 rows=192 loops=4)
               Sort Key: _hyper_1_40_chunk.series_id
               Sort Method: quicksort  Memory: 42kB
               Worker 0:  Sort Method: quicksort  Memory: 42kB
               Worker 1:  Sort Method: quicksort  Memory: 42kB
               Worker 2:  Sort Method: quicksort  Memory: 42kB
               ->  Partial HashAggregate  (cost=64300.67..64302.67 rows=200 width=36) (actual time=23892.381..23893.048 rows=192 loops=4)
                     Group Key: _hyper_1_40_chunk.series_id
                     ->  Hash Join  (cost=15.29..62614.73 rows=337187 width=12) (actual time=5.554..22098.783 rows=325488 loops=4)
                           Hash Cond: (_hyper_1_40_chunk.series_id = series.id)
                           ->  Parallel Append  (cost=0.43..61307.99 rows=484707 width=12) (actual time=4.074..20498.857 rows=349109 loops=4)
                                 ->  Parallel Index Scan using _hyper_1_40_chunk_series_values_time_idx on _hyper_1_40_chunk  (cost=0.43..58884.45 rows=484707 width=12) (actual time=4.070..19998.534 rows=349109 loops=4)
                                       Index Cond: (("time" >= '2020-09-19 09:00:00+00'::timestamp with time zone) AND ("time" <= '2020-09-19 11:00:00+00'::timestamp with time zone))
                           ->  Hash  (cost=12.45..12.45 rows=192 width=4) (actual time=1.217..1.218 rows=192 loops=4)
                                 Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                 ->  Seq Scan on series  (cost=0.00..12.45 rows=192 width=4) (actual time=0.143..0.865 rows=192 loops=4)
                                       Filter: (metric = 'some_metric_name'::text)
                                       Rows Removed by Filter: 84
 Planning Time: 3.240 ms
 Execution Time: 24109.908 ms

The query planner decided to do a Hash Join even though the series side produces just 192 rows, which doesn't make sense.
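
For anyone reproducing this: enable_hashjoin is a standard PostgreSQL planner setting, so a quick diagnostic (not a fix) is to disable hash joins for the session and compare the plan the planner falls back to:

-- Diagnostic only: turn off hash joins for this session and re-run the query.
SET enable_hashjoin = off;
EXPLAIN ANALYZE SELECT series_id, avg("value")
from series_values
where series_id IN (select id from series where metric = 'some_metric_name')
  and time between '2020-09-19 09:00:00' and '2020-09-19 11:00:00'
group by series_id;
RESET enable_hashjoin;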

When I replace the nested SELECT query with series_id IN (21,22,...212), i.e. literally all the selected values, Timescale does the same (I don't post the query and its result because the query string is very long).

But I can make Timescale do an Index Scan instead of a Hash Join by using a series_id between 21 and 212 clause. This is 40% faster:

EXPLAIN ANALYZE SELECT series_id, avg("value")
from series_values
where series_id BETWEEN 21 and 212
  and time between '2020-09-19 09:00:00' and '2020-09-19 11:00:00'
group by series_id;
                                                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=66793.95..66870.45 rows=200 width=12) (actual time=14220.642..14227.494 rows=192 loops=1)
   Group Key: _hyper_1_40_chunk.series_id
   ->  Gather Merge  (cost=66793.95..66864.95 rows=600 width=36) (actual time=14220.567..14299.768 rows=768 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Sort  (cost=65793.91..65794.41 rows=200 width=36) (actual time=14098.952..14099.158 rows=192 loops=4)
               Sort Key: _hyper_1_40_chunk.series_id
               Sort Method: quicksort  Memory: 42kB
               Worker 0:  Sort Method: quicksort  Memory: 42kB
               Worker 1:  Sort Method: quicksort  Memory: 42kB
               Worker 2:  Sort Method: quicksort  Memory: 42kB
               ->  Partial HashAggregate  (cost=65784.27..65786.27 rows=200 width=36) (actual time=14096.069..14096.652 rows=192 loops=4)
                     Group Key: _hyper_1_40_chunk.series_id
                     ->  Parallel Append  (cost=0.43..63525.22 rows=451809 width=12) (actual time=4.246..12750.656 rows=325488 loops=4)
                           ->  Parallel Index Scan using _hyper_1_40_chunk_series_values_time_idx on _hyper_1_40_chunk  (cost=0.43..61266.18 rows=451809 width=12) (actual time=4.241..12429.488 rows=325488 loops=4)
                                 Index Cond: (("time" >= '2020-09-19 09:00:00+00'::timestamp with time zone) AND ("time" <= '2020-09-19 11:00:00+00'::timestamp with time zone))
                                 Filter: ((series_id >= 21) AND (series_id <= 212))
                                 Rows Removed by Filter: 23621
 Planning Time: 11.483 ms
 Execution Time: 14304.284 ms

q3kep commented Sep 11, 2024

Hi! I guess this is quite a generic issue with joins on hypertables: joins tend to go to hash, which leads to no partition pruning, or to mass decompression if the hypertable is compressed.
Even when the planner chooses Nested Loop, the plan still contains all chunks and skipping happens at runtime, which is less efficient than startup exclusion.
A few test cases:

create table test_tags(tag_id bigint, time timestamptz);
create index on test_tags (tag_id);
insert into test_tags(tag_id, time) values (1, '2024-06-25 04:00:00+00');
insert into test_tags(tag_id, time) values (2, '2024-06-25 08:00:00+00');
insert into test_tags(tag_id, time) values (3, '2024-06-25 08:59:00+00');
--select * from test_tags;

--drop table test;
create table test (time timestamptz, tag_id bigint, a int, b int, c int, d int, e int, f int, g int, h int, i int, j int, k int, l int, m int, n int, o int, p int, q int, r int, s int, t int, u int, v int, w int, x int, y int, z int);

--make it hypertable
--select create_hypertable('test', 'time', chunk_time_interval => interval '4h');

insert into test (select '2024-06-25T00:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T01:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T02:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T03:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T04:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T05:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T06:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T07:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test (select '2024-06-25T08:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
create index on test (tag_id, time);

If test is just a plain table, the next query gets a nested loop with index access:

explain (analyze, verbose, costs, buffers)
select tst.* 
from test tst, test_tags tst_t
where tst.tag_id = tst_t.tag_id and tst.time = tst_t.time
  and tst_t.tag_id = 1
;

If we make it a hypertable (by enabling the commented create_hypertable call above before inserting), we get an inefficient hash join.
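
(Side note: if test is already populated when you convert it, create_hypertable needs migrate_data => true, which is a documented option but can be slow on large tables:)

-- Convert the already-loaded table in place; migrate_data moves the
-- existing rows into chunks.
select create_hypertable('test', 'time',
  chunk_time_interval => interval '4h',
  migrate_data => true);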

A lateral join over an aggregated subquery, or a CTE with limit, helps avoid the hash join:

explain (analyze)
SELECT mt.* FROM 
(
 select min(tag_id) tid, min(time) tm from test_tags
	where tag_id = 1
) AS tst_t
inner JOIN lateral (select * from test t
where t.tag_id = tst_t.tid and t.time = tst_t.tm) mt on true
;
--or
explain (analyze, verbose, costs, buffers)
with tst_t as (
 select tag_id, time from test_tags
	where tag_id = 1
	limit 1
)
SELECT tst.* FROM test tst, tst_t
where tst.tag_id = tst_t.tag_id and tst.time = tst_t.time
;

But if we compress

alter table test 
set (
  timescaledb.compress, 
  timescaledb.compress_segmentby='tag_id',
  timescaledb.compress_orderby='time'
);

both of these Nested Loop queries will lead to decompression of all chunks.
Only a scalar subquery helps to skip chunks at runtime:

explain (analyze, verbose, costs, buffers)
select tst.* from test tst
where (tst.tag_id, tst.time) = (select tag_id, time from test_tags where tag_id = 1)
;
--"  Chunks excluded during runtime: 2"

But on big tables with parallel queries we noticed it performs much slower (at least 10x) than a function with two separate queries.
And if we need more than one row, a scalar subquery is not an option: we would have to write our own nested-loop implementation. That is not only awkward; a native solution could perform better (at least there would be no SQL/PL/pgSQL context switch).
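
For illustration, a minimal sketch of such a hand-rolled nested loop as a PL/pgSQL set-returning function (the function name is made up; the point is one query for the join keys plus one parameterized probe per key, so each probe can use chunk exclusion on (tag_id, time)):

-- Hypothetical helper: fetch the join keys, then probe the hypertable
-- once per key so every probe is a simple, excludable index lookup.
create or replace function lookup_test_rows(p_tag_id bigint)
returns setof test
language plpgsql stable as $$
declare
  r record;
begin
  for r in select tag_id, time from test_tags where tag_id = p_tag_id loop
    return query
      select * from test t
      where t.tag_id = r.tag_id and t.time = r.time;
  end loop;
end;
$$;

-- usage:
-- select * from lookup_test_rows(1);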

Update: I checked the same with a plain partitioned (via pg_partman) PostgreSQL table; for all join queries we get a Nested Loop with runtime exclusion:

drop table if exists test2;
create table test2 
(time timestamptz not null, tag_id bigint, a int, b int, c int, d int, e int, 
f int, g int, h int, i int, j int, k int, l int, m int, n int, o int, 
p int, q int, r int, s int, t int, u int, v int, w int, x int, y int, z int)
partition by range (time)
;
create index on test2 (tag_id, time);

SELECT partman.create_parent(
  p_parent_table => 'public.test2',
  p_control => 'time',
  p_interval => '4h'
);

insert into test2 (select '2024-06-25T00:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T01:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T02:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T03:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T04:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T05:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T06:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T07:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);
insert into test2 (select '2024-06-25T08:00:00Z'::timestamptz + make_interval(mins => i/10000), i%10000, i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10, i+11, i+12, i+13, i+14, i+15, i+16, i+17, i+18, i+19, i+20, i+21, i+22, i+23, i+24, i+25 from (select generate_series(0,60*10000-1) as i) a);

CALL partman.run_maintenance_proc();
CALL partman.partition_data_proc('public.test2');
VACUUM ANALYZE public.test2;
/*
select * from test2_default;
\d+ public.test2
Partition key: RANGE ("time")
Indexes:
    "test2_tag_id_time_idx" btree (tag_id, "time")
Partitions: test2_p20240624_230000 FOR VALUES FROM ('2024-06-24 23:00:00+00') TO ('2024-06-25 03:00:00+00'),
            test2_p20240625_030000 FOR VALUES FROM ('2024-06-25 03:00:00+00') TO ('2024-06-25 07:00:00+00'),
            test2_p20240625_070000 FOR VALUES FROM ('2024-06-25 07:00:00+00') TO ('2024-06-25 11:00:00+00')
*/

create table test_tags(tag_id bigint, time timestamptz);
create index on test_tags (tag_id);
insert into test_tags(tag_id, time) values (1, '2024-06-25 04:00:00+00');
insert into test_tags(tag_id, time) values (2, '2024-06-25 08:00:00+00');
insert into test_tags(tag_id, time) values (3, '2024-06-25 08:59:00+00');
select * from test_tags;

--one partition, as expected
explain (analyze, verbose, costs, buffers)
select tst.* from test2 tst
where tst.tag_id = 1 and tst.time = '2024-06-25 04:00:00+00'
;

--runtime exclusion
explain (analyze, verbose, costs, buffers)
select tst.* 
from test2 tst, test_tags tst_t
where tst.tag_id = tst_t.tag_id and tst.time = tst_t.time
  and tst_t.tag_id = 1
; 

I'm not sure whether the bad performance (for the scalar subquery, where exclusion does work) is a runtime-exclusion problem with compressed chunks or a generic runtime-exclusion problem; I don't have a big, loaded plain partitioned table to compare against. But plans without this exclusion, or even with a hash join, are clearly an issue connected with Timescale.
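
To help isolate the compression variable, a small sketch using the documented show_chunks/compress_chunk API: capture the scalar-subquery plan while the chunks are uncompressed, then compress every chunk and capture it again.

-- Assumes test is a hypertable with the compression settings from the
-- ALTER TABLE above, and that its chunks are not yet compressed.
select compress_chunk(c) from show_chunks('test') c;

-- Re-run and compare "Chunks excluded during runtime" and timings:
explain (analyze, verbose, costs, buffers)
select tst.* from test tst
where (tst.tag_id, tst.time) = (select tag_id, time from test_tags where tag_id = 1);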
