Add a query troubleshooting guide to the documentation
Improve logging of the errors that cause PageTransportTimeoutException
TL;DR
Many of the errors reported in Trino look like this:
io.trino.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://11.111.111.111:8080/v1/task/20241225_173448_00002_rhmz3.0.7.4/results/0/129 - 7 failures, failure duration 62.26s, total failed request time 71.62s)
at io.trino.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:505)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed
Yet the worker with IP 11.111.111.111 was never shut down, so the reason the query failed is unclear.
My case
I have a dbt project that runs insert into table select * from view_tmp queries. One such query had worked without any problems for a long time until today (no Trino updates or configuration changes), when it started to fail with the message above. At first glance it looks like a network error, but why would a network error appear only for this single query? I started to examine the issue and found:
There are no useful logs. All I have is:
2024-12-25T17:35:20.215Z WARN async-http-response-5 io.trino.server.IoExceptionSuppressingWriterInterceptor Could not write to output: EofException(null)
which appeared during other queries and does not seem to be related to my query's problem, and
2024-12-25T17:34:59.236Z INFO Notification Thread io.airlift.stats.JmxGcMonitor Major GC: application 0ms, stopped 424ms: 804.31MB -> 693.54MB
a message about a major GC that appeared on one worker (out of 10 workers); a 424 ms pause is much smaller than the 62 s in the exception.
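To rule GC pauses in or out more firmly, GC logging can be enabled on the workers. A minimal sketch for jvm.config, assuming JDK unified logging is available (the log path is an example; adjust it to your environment):

```
-Xlog:gc*:file=/var/log/trino/gc.log:time,uptime:filecount=5,filesize=10M
```

This writes timestamped GC events to a rotating log, so long pauses around the failure time would be visible even if JMX monitoring missed them.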
No workers went down, so why one became unavailable is unclear.
CPU load on the worker with IP 11.111.111.111 showed three spikes, up to 30%, 25%, and 20%. That is not very high, and the spikes were also very narrow.
This is not the first time I have faced this error. In some cases decreasing the data per worker (by adding a partition) fixed the issue. Sometimes it happens when a small amount of data (maybe several GB) goes to one worker. In any case the error is very unclear: if too much data going to a single worker is the cause, it should be reported differently.
Total timeout
I searched the source code and increased some timeouts, but my settings had no effect on Total timeout 10000 ms elapsed. Does Trino support increasing it?
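The 10 s total timeout looks like it comes from the exchange HTTP client. As an assumption to verify against the documentation for your Trino version (I have not confirmed these property names or their defaults for Trino 464), the relevant knobs would live in config.properties on each node, along these lines:

```
# Assumed property names - verify against your Trino version's docs.
# Per-request timeout of the exchange HTTP client (the "Total timeout" in the error):
exchange.http-client.request-timeout=30s
# How long repeated page-transport failures are tolerated before the query fails:
exchange.max-error-duration=5m
```

If such properties exist and take effect, raising them would only hide the symptom; the underlying question of why the worker stopped responding for over a minute would remain.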
Cluster parameters
16 cores and 64 GB RAM, 1 coordinator and 10 workers
Trino version is 464
Thanks in advance.
Any links or help with debugging in the comments are extremely welcome.
nikita-sheremet-java-developer changed the title from "Better error description and/or documetation for Trino troubleshooting" to "Create Trino query Troubleshooting guide / Log causes that lead to PageTransportTimeoutException" on Dec 25, 2024