Connection Reset Error Occurs Occasionally with Client-v2 0.7.2 #2070

Open
lelewolf opened this issue Jan 7, 2025 · 10 comments
Labels
bug client-api-v2 investigating Investigation of a root cause is on going network network and IO related issues

Comments

@lelewolf

lelewolf commented Jan 7, 2025

Describe the bug

There is an issue where the ClickHouse client fails to execute a query, resulting in a “Connection reset” error. The request is being terminated unexpectedly.

Steps to reproduce

1.	Run the query on the ClickHouse client with a high load or specific network conditions.
2.	Observe that the connection resets with a SocketException: Connection reset error.
3.	The issue happens after the socket connection times out or gets interrupted.

Expected behaviour

The client should execute the query successfully without encountering a connection reset error, even with high traffic or under timeout conditions.

Code example

package com.opay.finder.analysis.config;

import com.clickhouse.client.api.Client;
import com.clickhouse.client.config.ClickHouseClientOption;
import com.clickhouse.client.config.ClickHouseHealthCheckMethod;
import java.time.temporal.ChronoUnit;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * @author lang
 * @description clickhouse-client configuration class
 * @date 2024/10/24 10:58
 */
@Configuration
public class ClickHouseClientConfig {

  private final ClickHouseConfig clickHouseConfig;

  public ClickHouseClientConfig(ClickHouseConfig clickHouseConfig) {
    this.clickHouseConfig = clickHouseConfig;
  }

  @Bean
  public Client clickhouseClient() {
    return new Client.Builder()
        .addEndpoint(clickHouseConfig.getUrl())
        .setUsername(clickHouseConfig.getUsername())
        .setPassword(clickHouseConfig.getPassword())
        .setSocketTimeout(clickHouseConfig.getSocketTimeout(), ChronoUnit.HOURS)
        .setSocketKeepAlive(Boolean.TRUE)
        .setConnectTimeout(clickHouseConfig.getConnectionTimeout(), ChronoUnit.HOURS)
        .setConnectionTTL(clickHouseConfig.getConnectionTtl(), ChronoUnit.MINUTES)
        .setMaxConnections(clickHouseConfig.getMaxConnection())
        .enableConnectionPool(Boolean.TRUE)
        .setMaxRetries(3)
        .setOption(ClickHouseClientOption.ASYNC.getKey(), "false")
        .setOption(ClickHouseClientOption.AUTO_DISCOVERY.getKey(), "true")
//        .setSocketKeepAlive(true)
        .setOption(ClickHouseClientOption.LOAD_BALANCING_POLICY.getKey(), "roundRobin")
        .setOption(ClickHouseClientOption.HEALTH_CHECK_INTERVAL.getKey(), "60000")
        // Look into this: change it to a check based on the current system load; the default is SELECT 1
        .setOption(ClickHouseClientOption.HEALTH_CHECK_METHOD.getKey(), ClickHouseHealthCheckMethod.SELECT_ONE.name())
        .setConnectionRequestTimeout(clickHouseConfig.getConnectionRequestTimeout(), ChronoUnit.MINUTES)
        .build();
  }
}

Error log

2025-01-07 07:10:43.143 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG org.apache.hc.client5.http.wire [wire:106]- http-outgoing-399 << "[read] I/O error: Connection reset"
2025-01-07 07:10:43.144 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG o.a.h.c.h.i.i.DefaultManagedHttpClientConnection [close:155]- http-outgoing-399 Close connection
2025-01-07 07:10:43.144 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG o.a.h.c.h.i.c.InternalHttpClient [discardEndpoint:261]- ep-0000001120 endpoint closed
2025-01-07 07:10:43.144 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG o.a.h.c.h.i.c.InternalHttpClient [discardEndpoint:265]- ep-0000001120 discarding endpoint
2025-01-07 07:10:43.144 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG o.a.h.c.h.i.i.PoolingHttpClientConnectionManager [release:424]- ep-0000001120 releasing endpoint
2025-01-07 07:10:43.144 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG o.a.h.c.h.i.i.PoolingHttpClientConnectionManager [release:455]- ep-0000001120 connection is not kept alive)
2025-01-07 07:10:43.144 opay-finder-web  [ForkJoinPool.commonPool-worker-129] DEBUG o.a.h.c.h.i.i.PoolingHttpClientConnectionManager [release:465]- ep-0000001120 connection released [route: {}->[http://10.166.16.117:8123]][total available: 1; route allocated: 3 of 15; total allocated: 3 of 15]
2025-01-07 07:10:43.145 opay-finder-web  [ForkJoinPool.commonPool-worker-129] ERROR c.o.f.a.b.i.QuerySchedulerServiceImpl [lambda$executeScheduledQuery$0:221]- 查询 [FX_PRO_10000009_FUNNEL_dF717CsZbAzhEql174CG9AUCYYSWXUWI_t00] 执行失败: Failed to execute request com.clickhouse.client.api.ClientException: Failed to execute request
	at com.clickhouse.client.api.internal.HttpAPIClientHelper.executeRequest(HttpAPIClientHelper.java:404)
	at com.clickhouse.client.api.Client.lambda$query$11(Client.java:1706)
	at com.clickhouse.client.api.Client.runAsyncOperation(Client.java:2116)
	at com.clickhouse.client.api.Client.query(Client.java:1782)
	at com.clickhouse.client.api.Client.query(Client.java:1647)
	at com.opay.finder.analysis.biz.impl.ClickHouseQueryServiceImpl.executeQuery(ClickHouseQueryServiceImpl.java:57)
	at com.opay.finder.analysis.biz.impl.ClickHouseQueryServiceImpl.executeSQLQuery(ClickHouseQueryServiceImpl.java:165)
	at com.opay.finder.analysis.biz.impl.ClickHouseQueryServiceImpl.executeSQL(ClickHouseQueryServiceImpl.java:125)
	at com.opay.finder.analysis.biz.impl.QuerySchedulerServiceImpl.lambda$executeScheduledQuery$0(QuerySchedulerServiceImpl.java:217)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1796)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:318)
	at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
	at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
	at java.base/java.net.Socket$SocketInputStream.read(Socket.java:1099)
	at org.apache.hc.client5.http.impl.io.LoggingInputStream.read(LoggingInputStream.java:83)
	at org.apache.hc.core5.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:149)
	at org.apache.hc.core5.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
	at org.apache.hc.core5.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:250)
	at org.apache.hc.core5.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:56)
	at org.apache.hc.core5.http.impl.io.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:331)
	at org.apache.hc.core5.http.impl.io.HttpRequestExecutor.execute(HttpRequestExecutor.java:193)
	at org.apache.hc.client5.http.impl.classic.InternalExecRuntime.lambda$execute$0(InternalExecRuntime.java:236)
	at org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager$InternalConnectionEndpoint.execute(PoolingHttpClientConnectionManager.java:791)
	at org.apache.hc.client5.http.impl.classic.InternalExecRuntime.execute(InternalExecRuntime.java:233)
	at org.apache.hc.client5.http.impl.classic.MainClientExec.execute(MainClientExec.java:121)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ConnectExec.execute(ConnectExec.java:199)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ProtocolExec.execute(ProtocolExec.java:192)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ContentCompressionExec.execute(ContentCompressionExec.java:150)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.HttpRequestRetryExec.execute(HttpRequestRetryExec.java:113)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.InternalHttpClient.doExecute(InternalHttpClient.java:174)
	at org.apache.hc.client5.http.impl.classic.CloseableHttpClient.execute(CloseableHttpClient.java:87)
	at org.apache.hc.client5.http.impl.classic.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at org.apache.hc.client5.http.classic.HttpClient.executeOpen(HttpClient.java:183)
	at com.clickhouse.client.api.internal.HttpAPIClientHelper.executeRequest(HttpAPIClientHelper.java:377)
	... 15 common frames omitted

Configuration

Environment

  • Client version: client-v2 0.7.2
  • Language version: Java version “21.0.5” (LTS, 2024-10-15)
  • OS: CentOS Linux 7 (Core)

ClickHouse server

  • ClickHouse Server version: version 23.3.2.1
  • ClickHouse Server non-default settings, if any:
  • CREATE TABLE statements for tables involved:
  • Sample data for all these tables, use clickhouse-obfuscator if necessary
@lelewolf lelewolf added the bug label Jan 7, 2025
@chernser
Contributor

chernser commented Jan 7, 2025

Good day, @lelewolf !
Thank you for reporting! I will take a look into it.

@chernser chernser self-assigned this Jan 7, 2025
@chernser
Contributor

chernser commented Jan 7, 2025

@lelewolf
I've got a few questions:

  • Does the problem occur only when the client is under load? Do you have a GC activity graph, by any chance?
  • In what environment is the application running?

Note:
The client cannot prevent the system from resetting a connection, but it can retry. It seems the retry is not triggered on timeout.
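
The retry idea in the note above can also be sketched at the application level. This is a minimal sketch, not the client's internal retry logic: `RetryOnReset`, `callWithRetry`, the predicate, and the backoff values are hypothetical names chosen for illustration, and it assumes the wrapped call is safe to repeat (as a SELECT is) and fails with an unchecked exception.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical helper: retries a call while a caller-supplied predicate marks the failure as transient. */
final class RetryOnReset {

    private RetryOnReset() {}

    static <T> T callWithRetry(Supplier<T> call,
                               Predicate<RuntimeException> isTransient,
                               int maxAttempts,
                               long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                if (!isTransient.test(e)) {
                    throw e; // not a transient failure: fail immediately
                }
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis * attempt); // linear backoff between attempts
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // preserve the interrupt flag
                        throw e; // stop retrying when interrupted
                    }
                }
            }
        }
        throw last; // every attempt failed with a transient error
    }
}
```

A caller could pass a predicate such as `e -> e.getCause() instanceof java.net.SocketException` so that only socket-level failures are retried.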

@chernser chernser added this to the Priority Backlog milestone Jan 7, 2025
@lelewolf
Author

lelewolf commented Jan 8, 2025

Hi @chernser,

I’m running the application in a Spring Boot environment, and the issue does not occur during client loading. Below, I’ve provided the gc.log, although I’m not sure if the problem is GC-related.

Currently, there’s a specific SQL query that always triggers this issue. However, when I execute the same SQL in DBeaver, it works fine and returns results as expected. This leads me to suspect that the issue might be related to the client or its configuration.
SELECT time_index, level_index, count(DISTINCT user_id) as event_users
FROM (
  SELECT
    (toUInt32((toUInt64(server_time / 1000) - 1733007600) / 86400)) AS time_index,
    hash_uid AS user_id,
    windowFunnel(86400)(
      toUInt64(server_time / 1000),
      event = 'ac_home_show',
      (
        (event = 'ac_home_search_success_bene_click')
        OR (event = 'ac_home_recent_bene_click')
        OR (event = 'ac_home_saved_bene_click')
        OR (event = 'ac_home_next_click')
        OR (event = 'ac_bene_list_recent_click')
        OR (event = 'ac_bene_list_saved_click')
        OR (event = 'ac_bene_list_search_suc_click')
      ),
      event = 'ac_enter_amount_show',
      event = 'COMMON_pay_window_show',
      event = 'COMMON_order_response'
    ) AS level,
    arrayJoin(arrayEnumerate(arrayWithConstant(level, 1))) AS level_index
  FROM events_all
  WHERE app_id IN (10000009)
    AND (
      event_date >= '2024-12-01'
      AND event_date <= '2024-12-31'
      AND toUInt64(server_time / 1000) >= 1733007600
      AND toUInt64(server_time / 1000) <= 1735689599
    )
    AND (
      (event = 'ac_home_show' AND (ifNull(string_params['user_type'], 'null') IN ('old_ac_user')))
      OR (
        (event = 'ac_home_search_success_bene_click')
        OR (event = 'ac_home_recent_bene_click')
        OR (event = 'ac_home_saved_bene_click')
        OR (event = 'ac_home_next_click')
        OR (event = 'ac_bene_list_recent_click')
        OR (event = 'ac_bene_list_saved_click')
        OR (event = 'ac_bene_list_search_suc_click')
        OR (event = 'ac_enter_amount_show')
        OR (
          event = 'COMMON_pay_window_show'
          AND (ifNull(string_params['service_type'], 'null') IN ('bank'))
        )
        OR (
          event = 'COMMON_order_response'
          AND (ifNull(string_params['st'], 'null') IN ('0'))
        )
      )
    )
  GROUP BY user_id, time_index
)
GROUP BY time_index, level_index
ORDER BY level_index, time_index ASC;

Here’s the gc.log file and the analysis result:
GC log:
gc.log

Analysis Result:

Let me know if you need additional details!

@lelewolf
Author

lelewolf commented Jan 8, 2025

Hi @chernser,

After comparing DBeaver’s configuration with my own client configuration, I made some adjustments, and the “Connection reset” issue no longer occurs.

DBeaver’s configuration:
(Provide the relevant DBeaver configuration details here)
[image: DBeaver configuration screenshot]

My latest configuration:
@Bean
public Client clickhouseClient() {
  return new Client.Builder()
      .addEndpoint(clickHouseConfig.getUrl())
      .setUsername(clickHouseConfig.getUsername())
      .setPassword(clickHouseConfig.getPassword())
      .setSocketTimeout(clickHouseConfig.getSocketTimeout(), ChronoUnit.HOURS)
      .setSocketKeepAlive(Boolean.TRUE)
      .setConnectTimeout(clickHouseConfig.getConnectionTimeout(), ChronoUnit.HOURS)
      .setConnectionTTL(clickHouseConfig.getConnectionTtl(), ChronoUnit.MINUTES)
      .setMaxConnections(clickHouseConfig.getMaxConnection())
      .enableConnectionPool(Boolean.TRUE)
      .setMaxRetries(3)
      .setOption(ClickHouseClientOption.ASYNC.getKey(), "false")
      .setOption(ClickHouseClientOption.AUTO_DISCOVERY.getKey(), "true")
      .setOption(ClickHouseClientOption.LOAD_BALANCING_POLICY.getKey(), "roundRobin")
      .setOption(ClickHouseClientOption.HEALTH_CHECK_INTERVAL.getKey(), "60000")
      .useHttpCompression(Boolean.TRUE)
      .compressClientRequest(Boolean.TRUE)
      .setOption(ClickHouseHttpOption.CONNECTION_PROVIDER.getKey(), HttpConnectionProvider.HTTP_URL_CONNECTION.name())
      // Look into this: change it to a check based on the current system load; the default is SELECT 1
      .setOption(ClickHouseClientOption.HEALTH_CHECK_METHOD.getKey(), ClickHouseHealthCheckMethod.SELECT_ONE.name())
      .setConnectionRequestTimeout(clickHouseConfig.getConnectionRequestTimeout(), ChronoUnit.MINUTES)
      .build();
}
I downgraded the clickhouse-java version from 0.7.2 to 0.7.1-patch1.

Let me know if you need further details or assistance!

@abcfy2
Contributor

abcfy2 commented Jan 9, 2025

I get the same issue when setMaxConnections(<number>) is set to a large number (like 1000).

It happens with INSERT INTO <mytable> .... SETTINGS async_insert=1, wait_for_async_insert=0: executing a large number of these async inserts causes the error.

Using bulk inserts instead of many async inserts avoids the error.
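
The bulk-insert workaround amounts to grouping rows on the client so that far fewer requests are sent. Below is a minimal sketch of that grouping step only; `RowBatcher`, `partition`, and the batch size are hypothetical, and how each batch is then written to ClickHouse is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits rows into fixed-size batches so each batch can be sent as one insert. */
final class RowBatcher {

    private RowBatcher() {}

    static <T> List<List<T>> partition(List<T> rows, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(rows.subList(start, end)));
        }
        return batches;
    }
}
```

With this shape, 1000 rows and a batch size of 100 would lead to 10 requests instead of 1000, which keeps the number of concurrently open connections low.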

@abcfy2
Contributor

abcfy2 commented Jan 9, 2025

The client cannot prevent the system from resetting a connection, but it can retry. It seems the retry is not triggered on timeout.

It seems we don't catch java.net.SocketException in:

    try {
        ClassicHttpResponse httpResponse = httpClient.executeOpen(null, req, context);
        boolean serverCompression = MapUtils.getFlag(requestConfig, chConfiguration, ClientConfigProperties.COMPRESS_SERVER_RESPONSE.getKey());
        httpResponse.setEntity(wrapResponseEntity(httpResponse.getEntity(), httpResponse.getCode(), serverCompression, useHttpCompression));
        if (httpResponse.getCode() == HttpStatus.SC_PROXY_AUTHENTICATION_REQUIRED) {
            throw new ClientMisconfigurationException("Proxy authentication required. Please check your proxy settings.");
        } else if (httpResponse.getCode() == HttpStatus.SC_BAD_GATEWAY) {
            httpResponse.close();
            throw new ClientException("Server returned '502 Bad gateway'. Check network and proxy settings.");
        } else if (httpResponse.getCode() >= HttpStatus.SC_BAD_REQUEST || httpResponse.containsHeader(ClickHouseHttpProto.HEADER_EXCEPTION_CODE)) {
            try {
                throw readError(httpResponse);
            } finally {
                httpResponse.close();
            }
        }
        return httpResponse;
    } catch (UnknownHostException e) {
        LOG.warn("Host '{}' unknown", server.getHost());
        throw new ClientException("Unknown host", e);
    } catch (ConnectException | NoRouteToHostException e) {
        LOG.warn("Failed to connect to '{}': {}", server.getHost(), e.getMessage());
        throw new ClientException("Failed to connect", e);
    } catch (ConnectionRequestTimeoutException | ServerException | NoHttpResponseException | ClientException e) {
        throw e;
    } catch (Exception e) {
        throw new ClientException("Failed to execute request", e);
    }
}

I have also submitted a separate PR to fix "Connection refused" not being retried: #2063
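
Because the excerpt above wraps any otherwise unhandled exception in a generic "Failed to execute request" error, a caller that wants to react to a reset has to inspect the cause chain. The following is a minimal sketch of such a check using only the JDK; `CauseChain`, `find`, and `isConnectionReset` are hypothetical names, the depth limit of 32 is an arbitrary guard, and a plain RuntimeException stands in for the client's exception type.

```java
import java.net.SocketException;
import java.util.Optional;

/** Hypothetical helper: looks through an exception's cause chain for a socket-level reset. */
final class CauseChain {

    private CauseChain() {}

    /** Returns the first throwable in the cause chain that is an instance of the given type. */
    static <E extends Throwable> Optional<E> find(Throwable error, Class<E> type) {
        int depth = 0;
        // The depth limit guards against unusually long or cyclic cause chains.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
        }
        return Optional.empty();
    }

    /** True when the chain contains a SocketException whose message mentions "Connection reset". */
    static boolean isConnectionReset(Throwable error) {
        return find(error, SocketException.class)
                .map(Throwable::getMessage) // an absent message yields an empty Optional
                .filter(message -> message.contains("Connection reset"))
                .isPresent();
    }
}
```

Matching on the message text is brittle, so this is best treated as a diagnostic aid rather than as a contract of the JDK or of the client.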

@chernser chernser added the network network and IO related issues label Jan 9, 2025
@chernser chernser removed their assignment Jan 10, 2025
@chernser
Contributor

@lelewolf thank you for the information!
Is the downgrade required to solve the issue?

Just to highlight: here is your configuration, but client-v2 ignores some of the options:

return new Client.Builder()
    .addEndpoint(clickHouseConfig.getUrl())
    .setUsername(clickHouseConfig.getUsername())
    .setPassword(clickHouseConfig.getPassword())
    .setSocketTimeout(clickHouseConfig.getSocketTimeout(), ChronoUnit.HOURS)
    .setSocketKeepAlive(Boolean.TRUE)
    .setConnectTimeout(clickHouseConfig.getConnectionTimeout(), ChronoUnit.HOURS)
    .setConnectionTTL(clickHouseConfig.getConnectionTtl(), ChronoUnit.MINUTES)
    .setMaxConnections(clickHouseConfig.getMaxConnection())
    .enableConnectionPool(Boolean.TRUE)
    .setMaxRetries(3)
    .setOption(ClickHouseClientOption.ASYNC.getKey(), "false") // by default
    .setOption(ClickHouseClientOption.AUTO_DISCOVERY.getKey(), "true") // ignored
    .setOption(ClickHouseClientOption.LOAD_BALANCING_POLICY.getKey(), "roundRobin") // ignored
    .setOption(ClickHouseClientOption.HEALTH_CHECK_INTERVAL.getKey(), "60000") // ignored
    .useHttpCompression(Boolean.TRUE)
    .compressClientRequest(Boolean.TRUE)
    .setOption(ClickHouseHttpOption.CONNECTION_PROVIDER.getKey(), HttpConnectionProvider.HTTP_URL_CONNECTION.name()) // ignored

So the question is: which DBeaver configuration property resolved the issue?

As for the query:

  • DBeaver uses the JDBC driver, which still uses client-v1.
  • That is very helpful and we will look into it. Your assumption is correct: there is something in the new client.

@chernser
Contributor

Good day, @lelewolf!

One more question: how do you read result data?

Let me explain why I'm asking:
A connection reset happens because the server closes the socket. If the server were merely slow, a read timeout would happen instead.
In your case you mentioned load, and in the logs I see ForkJoinPool.commonPool-worker-129, which means you have many workers.
At the same time, only 3 of 15 connections are in use according to ep-0000001120 connection released [route: {}->[http://10.166.16.117:8123]][total available: 1; route allocated: 3 of 15; total allocated: 3 of 15], so it is not a connection limit problem.
Under such conditions workers may be parked for too long, so data is not read from the socket. In that case the server will time out on write and reset the connection.
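
One way to test this hypothesis is to move query execution off ForkJoinPool.commonPool and onto a small dedicated pool, so that the thread sending a request is also free to read the response promptly. The sketch below is an assumption-laden illustration, not a confirmed fix: `QueryExecutors`, the pool size, and the thread-name prefix are hypothetical, and the Supplier stands in for whatever code runs the query and fully consumes its result.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper: runs query-and-read work on a dedicated, bounded thread pool. */
final class QueryExecutors {

    private QueryExecutors() {}

    /** Creates a fixed-size pool whose threads carry a recognizable name in logs and stack traces. */
    static ExecutorService newQueryPool(int size, String namePrefix) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            Thread thread = new Thread(runnable, namePrefix + "-" + counter.incrementAndGet());
            thread.setDaemon(true); // do not keep the JVM alive because of idle workers
            return thread;
        };
        return Executors.newFixedThreadPool(size, factory);
    }

    /** Submits work to the given pool instead of the default common pool. */
    static <T> CompletableFuture<T> submit(ExecutorService pool, Supplier<T> queryAndRead) {
        return CompletableFuture.supplyAsync(queryAndRead, pool);
    }
}
```

Sizing the pool at or below the client's connection limit (15 in the log line above) would keep every running worker paired with a connection it can actively read from.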

Thanks!

@lelewolf
Author

Hi @chernser,

Thank you for your detailed analysis and insights!

The purpose of downgrading was to enable the configuration:
.setOption(ClickHouseHttpOption.CONNECTION_PROVIDER.getKey(), HttpConnectionProvider.HTTP_URL_CONNECTION.name()) // ignored
In the newer version, I couldn’t find this configuration option. I believe this configuration is critical to the issue. Other configurations haven’t significantly changed.

During the troubleshooting process, there were no changes made to the server configuration. Additionally, the application code was not updated.

Let me know if you need any further details.

Thanks!

@chernser chernser added the investigating Investigation of a root cause is on going label Jan 16, 2025
@chernser chernser modified the milestones: 0.8.0, Priority Backlog Jan 16, 2025
@chernser
Contributor

Good day, @lelewolf !
Thank you for additional information.

How do you read the data from a response?
Is it read on the same thread that makes the request?
How long does such a query take to execute?
What read/write timeouts are set on the server?

Thanks!
