-
Notifications
You must be signed in to change notification settings - Fork 994
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
None recovering Redis command timed out #3010
Comments
This would be a hard problem to help with. I can't comment on why you are getting timeouts, as this seems to be related to the infrastructure (especially since this started to happen after changing the infra). In this case an increased latency is unavoidable, as the application does not have a way to handle this differently. Periodic refresh would help for new requests, but the ones that have been sent already need to all timeout. A TCP_USER_TIMEOUT would probably help, especially if you are not getting an ACK from the server (you could verify that by sniffing your traffic). Now depending on the load you might get your commands queue getting filled quite rapidly and reach OOM issues. Again this is avoidable with configuration tweaking (lower timeout, increase memory). All of these are generic advises though. To help more we would need details picture of what happens and ideally a way to reproduce it. |
Thanks for the hints, this is hard to reproduce, and we tried multiple times in our dev env to reproduce and no luck. We see it only happening in prod env. Also, AWS have improved, so this issue was happening 2–3 times a week, reduced to 1 a month and did not happen now for about 6 weeks. One thing to clarify, is that when timeouts start it never ends and recover it this special case, there are multiple instances of a task running, eg 12 task in 3 Availability Zones, and there are even other services with multiple tasks connecting to same Redis cluster in some cases and only one task faces this issue, and this happens after we see some reconnect logs. My expectation is that with periodic refresh, the new request should start to work, but they don't, so somehow periodic refresh does not work and get stocked. We have enabled compression and tweaked some configs around memory, and we don't see OOM any more. |
Periodic refresh does not affect already queued commands. So if the service experienced a slowdown just before stopping the queue might be full with a lot of commands (and this is likely true due to the fact you experienced OOM). In this case even if the periodic refresh detects the node is down already queued commands are still executed against it. The periodic refresh would only affects newly executed commands. This behavior is by design - we are not sure why a command ended up being sent to a specific node (whether it was a MOVED response, design choice by the app developer or random) so we can't be sure where we need to resend it. |
In this case, when timeouts start, it does not stop. All new requests are also timing out, and there is no reconnect. There were cases this was ongoing for 3-4 hours. |
How do you verify that? Did you use the event bus to monitor for reconnection events? What is your strategy to trace the problem? Did you start recording more verbose (debug) log messages? |
We log reconnect events, when timeouts start in a specific container, I don't see any reconnect after that in that container. all requests put/get are getting timed out, and I know they are new, as I see other application logs of new event with initiate timestamp being processed and after that Redis time out, and then I see s3 call, that is in case of timeout or cache miss. Furthermore, old event should expire after few minutes, as I said, when this happens it does not stop, and I see timeout continuously. For tracing, I have enabled profiling, sth similar to getting heap dump when we had the issue with memory. As this was happening in production env only, I can't put in debug mode, both from cost and PCI perspective. |
Reconnect would happen only if the remote service recovers. I assume the correct thing to expect would be for the driver to - instead - route newly issued commands to other nodes, after it refreshes the topology.
If newly issued commands (after the topology change) are getting expired than it is reasonable to assume that the driver never picks up the new topology and thus continues to send commands to the old (failed) node. Have you verified if the remote address for the commands is changing or is it the same? Have you enabled some debug logging to check if the topology is actually refreshed? You can do that when you subscribe to the event bus and listen for events such as public static void main(String[] args) {
RedisURI redisURI = RedisURI.Builder
.redis("acme.com")
.withPort(1234)
.build();
try (RedisClient client = RedisClient.create(redisURI)) {
EventBus eventBus = client.getResources().eventBus();
eventBus.get().subscribe(e -> {
logger.info(e.toString());
});
... At this point this is the most reasonable assumption (for some reason topology is not being refreshed). Does the driver report topology refresh events? Enabling debug messages should output |
Bug Report
One container starts to have "Redis command timed out" and it does not recover.
Current Behavior
we have started to use aws redis serveless, since the switch from redis cluster-mode enabled to redis serverless, out of the blue one client from many tasks running starts to have command timed out continuously. Depending on the service and size of objects and if put or get commands are being timed out, we see different side effects, sometimes it results in #705, sometimes only increased latency.
Note that in the services this happens, usually there are more than 12 tasks running, and only one container has this issue.
I could correlate any CPU spike with the event causing the timeouts.
Expected behavior/code
the expected behaviour is that with periodic_refresh, the issue does not persist.
Environment
aws redis serverless 7.1
<lettuce.version>6.3.2.RELEASE</lettuce.version>
<netty.version>4.1.109.Final</netty.version>
related configurations:
ssl: true
timeout: 5s
connect-timeout: 500ms
additional:
dns-ttl: 5s
periodic-refresh: 10s
reconnect-delay-min: 100ms
reconnect-delay-max: 5s
read-from: replicaPreferred
Possible Solution
RedisCommandTimeoutException intermittently:
in some cases this could be related, as we sometimes see timeouts for a minute, and they are recovered, however in this case it persists.
Lettuce cannot recover from connection problems:
maybe setting TCP_USER_TIMEOUT would help, default is 1 minutes, so for the recovering case, maybe this helps, but for the cases that this continues and gets stacked probably not.
The text was updated successfully, but these errors were encountered: