Add a query troubleshooting guide to the documentation
Improve logging of the errors that cause PageTransportTimeoutException
TL;DR
Many of the errors reported in Trino look like this:
io.trino.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://11.111.111.111:8080/v1/task/20241225_173448_00002_rhmz3.0.7.4/results/0/129 - 7 failures, failure duration 62.26s, total failed request time 71.62s)
at io.trino.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:505)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
Yet the worker with IP 11.111.111.111 was never shut down, so the reason the query failed is unclear.
My case
I have a dbt project that runs insert into table select * from view_tmp queries. One such query had worked without any problems for a long time until today (no Trino updates or configuration changes), when it started to fail with the message above. At first glance it looks like a network error, but why would a network error appear only for this single query? I started to examine the issue and found:
There are no useful logs. All I have is:
2024-12-25T17:35:20.215Z WARN async-http-response-5 io.trino.server.IoExceptionSuppressingWriterInterceptor Could not write to output: EofException(null)
which appeared during other queries and does not seem to be related to my query's problem, and
2024-12-25T17:34:59.236Z INFO Notification Thread io.airlift.stats.JmxGcMonitor Major GC: application 0ms, stopped 424ms: 804.31MB -> 693.54MB
a message about a major GC that appeared on one worker (out of 10 workers); a 424 ms pause is much smaller than the 62 s in the exception.
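To rule GC pauses in or out more firmly, GC logging can be enabled on the workers. A minimal sketch for jvm.config, assuming JDK unified logging is available (the log path is an example; adjust it to your environment):

```
-Xlog:gc*:file=/var/log/trino/gc.log:time,uptime:filecount=5,filesize=10M
```

This writes timestamped GC events to a rotating log, so long pauses around the failure time would be visible even if JMX monitoring missed them.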
No workers went down, so why one became unavailable is unclear.
CPU load on the worker with IP 11.111.111.111 showed three spikes, up to 30%, 25%, and 20%. That is not very high, and the spikes were also very narrow.
This is not the first time I have faced this error. In some cases decreasing the data per worker (by adding a partition) fixed the issue. Sometimes it happens when a small amount of data (maybe several GB) goes to one worker. In any case the error is very unclear: if too much data going to a single worker is the cause, it should be reported differently.
Total timeout
I searched the source code and increased some timeouts, but my settings had no effect on Total timeout 10000 ms elapsed. Does Trino support increasing it?
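The 10 s total timeout looks like it comes from the exchange HTTP client. As an assumption to verify against the documentation for your Trino version (I have not confirmed these property names or their defaults for Trino 464), the relevant knobs would live in config.properties on each node, along these lines:

```
# Assumed property names - verify against your Trino version's docs.
# Per-request timeout of the exchange HTTP client (the "Total timeout" in the error):
exchange.http-client.request-timeout=30s
# How long repeated page-transport failures are tolerated before the query fails:
exchange.max-error-duration=5m
```

If such properties exist and take effect, raising them would only hide the symptom; the underlying question of why the worker stopped responding for over a minute would remain.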
Cluster parameters
16 cores and 64 GB RAM, 1 coordinator and 10 workers
Trino version is 464
Thanks in advance.
Any links or help with debugging in the comments are extremely welcome.
nikita-sheremet-java-developer changed the title from "Better error description and/or documetation for Trino troubleshooting" to "Create Trino query Troubleshooting guide / Log causes that lead to PageTransportTimeoutException" on Dec 25, 2024