`test_max_size` test is failing #2102
My initial suspicion was that it's another bug with … Here's the test file: https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs It seems like there are issues with topology shutdown.
@stbrody would any of your changes affect this? Or, could you weigh in since you're familiar with this code? I'm curious if you could spot something.
Basically, this test fails if this time https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs#L147 is not long enough for all events to pass through the system and hit the max size. The assert fails because it looks like none of the events were written to the disk buffer in time. This can happen when a machine is noisy: a lot of other tests are eating up CPU time while the timeout keeps ticking. Our best bet would be to increase the timeout, imo. It may also be that these tests are old and we want to refactor them.
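For illustration, here is a hypothetical sketch of the sleep-then-assert pattern being described (made-up names and durations, not the actual test code): nothing guarantees the events have been accepted and written before the fixed delay elapses on a busy CI machine.

```rust
use std::{thread, time::Duration};

// Hypothetical shape of the racy synchronization: events are sent, the test
// sleeps for a fixed duration, and then it asserts on the disk buffer.
fn racy_wait_then_assert(buffered_bytes: impl Fn() -> usize, max_size: usize) {
    // send_events_over_tcp(); // events are still in flight at this point
    thread::sleep(Duration::from_millis(100)); // fixed timeout stands in for real synchronization
    assert!(buffered_bytes() >= max_size); // fails when events arrive "too slowly"
}
```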
Yeah, I see what's going on. Why do we even need a timeout there in the first place? From the looks of it, we could replace it with `block_on(topology.stop()).unwrap(); shutdown_on_idle(rt);` just the same way as a couple of lines below. It doesn't seem to affect the semantics of the test, and I think it will even improve the reliability and quality of the test.
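Roughly, the proposed replacement looks like this (a sketch based on the snippet above; `block_on`, `shutdown_on_idle`, `rt`, and `topology` are the helpers already used by the surrounding test, so the exact signatures may differ):

```rust
// Graceful variant: stop the topology (flushing sources and sinks) and then
// let the runtime drain, instead of sleeping for a fixed duration. The wait
// becomes an explicit synchronization point rather than a timer race.
block_on(topology.stop()).unwrap(); // wait until the topology has fully stopped
shutdown_on_idle(rt);               // let outstanding tasks finish, then tear down the runtime
```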
By the way, we've encountered issues with these tests before: #1508
If we use topology stop, it should fully flush and give us what we want, but here we are testing what happens if Vector somehow crashes, which is why `shutdown_now` force-cancels all tasks.
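For contrast, a hedged sketch of the abrupt path, assuming a tokio 0.1-style runtime (the exact call site in the test may differ):

```rust
// Abrupt, "crash-like" termination: cancel every running task immediately
// instead of letting sources and sinks flush. With tokio 0.1,
// Runtime::shutdown_now() returns a future that resolves once the runtime
// has shut down; the futures 0.1 `Future` trait must be in scope for `.wait()`.
rt.shutdown_now().wait().unwrap();
```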
I understand this was the intention, however, I don't think this test is adequately designed to assert a crash scenario. At the same time, there are no tests that check that buffering actually performs as expected under proper conditions. To properly assert the crash scenario we should trigger an actual crash - otherwise, it's not a fair assertion. How about we add this separately to the test harness?
In this test - even if we do trigger a crash - we want to trigger it after all the events have reached the sink and it has put them into the buffer. Using a timeout for that is what's causing the race. We can probably leave this … However, in my practice, tests that rely on timeouts for synchronization are a constant source of trouble. They eat up maintenance time and don't provide the required confidence either way. This is why I still suggest this tradeoff: to lose some of the assertion coverage in this case, but to gain value in test reliability.
I've been experimenting with replacing sleeps and switching to …
While it does seem like using …
Also, FWIW, I don't think anything I've done that's already been committed is likely to have affected the timing here. The patch I'm planning on committing today could theoretically change the timing here slightly, but only once we actually switch to using …
You are 100% correct, and this is what I'm currently thinking about how to solve! While we can easily eliminate all the other …

To solve this, we can switch from the TCP source to an in-process source; that would eliminate this race. It makes sense IMO, because involving this TCP race in the test doesn't help with anything - it only complicates things. It's a race condition of its own that's unrelated to the ones that would be corrected by switching to …

We could also get rid of the TCP sink too, replacing it with a specially crafted test sink that would allow more accurate behavior assertions. Buffers are attached at the topology level, outside of the sink itself, so the test composition semantics would remain the same.

I think adding those test source/sinks would be very beneficial in general - they'll provide reliable and solid ground for unit testing. They should also make tests more concise.
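To make the idea concrete, here's a minimal conceptual sketch of such test endpoints. These are not Vector's actual source/sink traits, just channel-based stand-ins to show the shape: the test injects events directly (no TCP accept race to worry about) and the sink records what it receives so delivery can be asserted exactly.

```rust
use std::sync::{
    mpsc::{channel, Receiver, Sender},
    Arc, Mutex,
};

/// Hypothetical in-process source: the test pushes events straight into a
/// channel, so there is no listener to race against.
struct TestSource {
    rx: Receiver<String>,
}

/// Hypothetical recording sink: keeps every event it was handed, so the test
/// can assert on exactly what reached it (and, by elimination, what stayed
/// behind in the buffer).
#[derive(Clone, Default)]
struct TestSink {
    received: Arc<Mutex<Vec<String>>>,
}

impl TestSink {
    fn received_count(&self) -> usize {
        self.received.lock().unwrap().len()
    }
}

/// Wire up a pair of test endpoints plus a handle the test uses to inject events.
fn test_endpoints() -> (Sender<String>, TestSource, TestSink) {
    let (tx, rx) = channel();
    (tx, TestSource { rx }, TestSink::default())
}
```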
The timings are skewed by the migration from …
@stbrody neat, I didn't realize we already have …
@stbrody I see what the problem is: apparently doing a … Although, I was expecting some guarantees from the source interface... Maybe we want to change that. I believe this is related to our broader efforts in making event handling reliable (i.e. sending acks when an event is processed doesn't help if the topology doesn't even recognize the event's existence), and may be relevant for #807.
So, I ended up digging deep into these tests and applying corrections and reliability improvements. Most of the work was done on graceful termination, and then, once test reliability without crashes was back up, I re-added abrupt termination.
Very nice. Appreciate you digging into these problems. |
The `test_max_size` test is failing, and we should investigate. This started happening more frequently with the `tokio-compat` migration; however, the migration is suspected not to be the root cause, as it seems like this issue was happening before. We need to investigate the problem with `test_max_size` and the surrounding tests and fix it.

A sample failed CI run: https://circleci.com/gh/timberio/vector/90109