Create Trino query Troubleshooting guide / Log causes that lead to PageTransportTimeoutException #24579

Open
nikita-sheremet-java-developer opened this issue Dec 25, 2024 · 0 comments

nikita-sheremet-java-developer commented Dec 25, 2024

  1. Add a query troubleshooting guide to the documentation
  2. Improve logging of the errors that lead to PageTransportTimeoutException

TL;DR

Many of the errors reported in Trino look like this:

io.trino.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes. (http://11.111.111.111:8080/v1/task/20241225_173448_00002_rhmz3.0.7.4/results/0/129 - 7 failures, failure duration 62.26s, total failed request time 71.62s)
	at io.trino.operator.HttpPageBufferClient$1.onFailure(HttpPageBufferClient.java:505)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1575)
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException: Total timeout 10000 ms elapsed

However, the worker with IP 11.111.111.111 was never shut down, so the reason why the query failed is unclear.
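When collecting these failures from several queries, it helps to pull the worker address, task ID, and failure counters out of the message so they can be matched against worker logs. Below is a minimal sketch, assuming the message keeps the exact shape shown above. The function name `parse_page_transport_timeout` is hypothetical, and treating the part of the task ID before the first dot as the query ID is my assumption based on this one sample.

```python
import re

# Matches the parenthesised suffix of the PageTransportTimeoutException message.
# The pattern is derived only from the single sample message in this issue.
PATTERN = re.compile(
    r"\(https?://(?P<host>[^:/\s]+)(?::(?P<port>\d+))?"
    r"/v1/task/(?P<task>[^/\s]+)/results/\S+"
    r" - (?P<failures>\d+) failures, "
    r"failure duration (?P<duration>[\d.]+)s, "
    r"total failed request time (?P<request_time>[\d.]+)s\)"
)


def parse_page_transport_timeout(message):
    """Return the worker host, task ID and failure counters, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    task_id = m.group("task")
    return {
        "host": m.group("host"),
        "task_id": task_id,
        # Assumption: the query ID is the task ID up to the first dot.
        "query_id": task_id.split(".")[0],
        "failures": int(m.group("failures")),
        "failure_duration_s": float(m.group("duration")),
        "total_failed_request_time_s": float(m.group("request_time")),
    }


if __name__ == "__main__":
    sample = (
        "io.trino.operator.PageTransportTimeoutException: Encountered too many "
        "errors talking to a worker node. "
        "(http://11.111.111.111:8080/v1/task/20241225_173448_00002_rhmz3.0.7.4"
        "/results/0/129 - 7 failures, failure duration 62.26s, "
        "total failed request time 71.62s)"
    )
    print(parse_page_transport_timeout(sample))
```

Grouping the parsed results by `host` would show whether the failures always point at the same worker or are spread across the cluster.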

My case of error

I have a dbt project that runs some insert into table select * from view_tmp queries. One such query worked without any problems for a long time until today (with no Trino updates or configuration changes), when it started to fail with the message above. At first glance it looks like a network error, but why would a network error appear only for this single query? I started to examine the issue and found:

  1. No relevant logs.
    All I have is:
2024-12-25T17:35:20.215Z	WARN	async-http-response-5	io.trino.server.IoExceptionSuppressingWriterInterceptor	Could not write to output: EofException(null)

This warning also appeared during other queries, so it does not seem to be related to my query's problem.

2024-12-25T17:34:59.236Z	INFO	Notification Thread	io.airlift.stats.JmxGcMonitor	Major GC: application 0ms, stopped 424ms: 804.31MB -> 693.54MB

This is a GC message that appeared on one worker (there are 10 workers), and the 424 ms pause is much shorter than the 62 s failure duration in the exception.

  2. No workers went down.
    So it is unclear why the worker was unavailable.
  3. CPU load on the worker with IP 11.111.111.111 shows three spikes, up to 30%, 25%, and 20%. These are not very high, and the spikes are also very narrow.
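The GC comparison from the first item can be repeated across all workers with a small script. This is only a sketch: the regular expression is based on the single JmxGcMonitor line quoted above, the function name `major_gc_pause_ms` is hypothetical, and the time window (query start taken from the query ID, plus 90 seconds) is my assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# Shape assumed from the one "Major GC" line quoted in this issue.
GC_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\s+INFO\s+"
    r".*?JmxGcMonitor\s+"
    r"Major GC: application \d+ms, stopped (?P<stopped>\d+)ms"
)


def major_gc_pause_ms(log_lines, window_start, window_end):
    """Sum the 'stopped' time of Major GC events logged inside the window."""
    total = 0
    for line in log_lines:
        m = GC_LINE.match(line)
        if m is None:
            continue  # not a Major GC line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        ts = ts.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
        if window_start <= ts <= window_end:
            total += int(m.group("stopped"))
    return total


if __name__ == "__main__":
    lines = [
        "2024-12-25T17:34:59.236Z\tINFO\tNotification Thread\t"
        "io.airlift.stats.JmxGcMonitor\t"
        "Major GC: application 0ms, stopped 424ms: 804.31MB -> 693.54MB",
    ]
    # Assumption: the query ID prefix 20241225_173448 is the query start in UTC.
    start = datetime(2024, 12, 25, 17, 34, 48, tzinfo=timezone.utc)
    end = start + timedelta(seconds=90)
    paused = major_gc_pause_ms(lines, start, end)
    print(f"Major GC paused the JVM for {paused} ms in the window")
```

If the summed pause per worker stays far below the failure duration reported in the exception, as it does here, GC is unlikely to be the whole explanation.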

This is not the first time I have faced this error. In some cases, reducing the amount of data (by adding a partition) fixed the issue. Sometimes it happens when a small amount of data (maybe several GB) goes to one worker. In any case, the error is very unclear: if too much data is going to a single worker, that should be reported in a different way.

Total timeout

I searched the source code and increased some timeouts, but my settings had no effect on Total timeout 10000 ms elapsed. Does Trino support increasing it?

Cluster parameters

16 cores and 64 GB RAM, 1 coordinator and 10 workers
Trino version is 464

Thanks in advance

Any links or help with debugging in the comments are extremely welcome.

nikita-sheremet-java-developer changed the title from "Better error description and/or documetation for Trino troubleshooting" to "Create Trino query Troubleshooting guide / Log causes that lead to PageTransportTimeoutException" on Dec 25, 2024