-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug] mirrorAttempts does not exit out of resolveCtx early when exceeded #607
Comments
I have had a look at the code and I am not sure if this is a bug. There are two or three situations that can really lead to a 404 response. This first is that we are not able to resolve enough peers for the peer count within the timeout. By default this would be 3 peers within 20ms. In this situation the lookup context would timeout and the peer channel is closed. The second option is that we resolve three peers but every request to those peers returns a non 200 response because the layer has since been deleted from the peer but the TTL has not expired. I think this is a less likely scenario to happen. There is a third situation that can happen which is that libp2p has searched the whole network and concluded that the key cannot be found within any peer before the timeout happens. This does happen at times but is less likely in large clusters. In my head it is pretty reasonable that most 404 response happen at the timeout duration. This is one reason why I have set it to so low by default. In clusters with 100 nodes keys that exist are resolved in 1-3 ms so setting it to 20 ms gives a pretty good buffer for most people without adding too much latency. Could you share where you think the code would continue running past all attempts? |
Ah! Yep! So, now that you bring this up, which I should have mentioned, is this is exactly what is happening here for my/our use case and I think why I spotted / see it so regularly! Right now as we're trialing it out, we're on a rather decent sized (~50 node, ~3-5,000 digest advertised) development cluster and it's getting deploys nearly constantly, so there are quite a few new layers being pulled all the time.
This is really useful information to know :D Thank you! I completely agree that your current default of 20ms is completely fine / acceptable - in fact, I felt a bit silly opening this issue for this exact reason~ however I was curious if people would run into this who had a higher timeout set (well, and lets be honest, its fun to tune thing to the max at times 😄) Thanks as always for the awesome information and quick replies! |
The timeout was a lot higher initially. I had to set a default so pulled 5 seconds out from nowhere without really thinking about it. In small test clusters this worked, because the lookup would exhaust all hosts before the timeout would actually happen. It wasn't until I started looking at larger clusters and eventually the benchmarks that I discovered that the timeout actually mattered and I started looking for a better default. Fine tuning the timeout could be useful at times but you are really saving a couple of milliseconds at the risk of higher cache misses. All good, I will take some of the information from this issue and add it to the FAQ as it could be useful for future users to understand the significance of the timeout. |
Spegel version
v0.0.27
Kubernetes distribution
N/A
Kubernetes version
N/A
CNI
N/A
Describe the bug
Pretty benign issue - but wanted to bring it up for investigation. While looking through logs, I noticed a lot of 404 requests showing at the 20.x ms mark, even though they were of 404 type.
Through some looking around, I think there may be a bug where if the mirrorAttempts is met/exceeded, the function will still hang around until the timeout is met; meaning the resolve time for the 404 could have been for example 5ms to reach all 3 attempts on other nodes instead of 20ms in the event of images not being anywhere.
The text was updated successfully, but these errors were encountered: