
test_max_size test is failing #2102

Closed · MOZGIII opened this issue Mar 20, 2020 · 17 comments · Fixed by #2106

Labels: domain: tests (Anything related to Vector's internal tests), type: tech debt (A code change that does not add user value.)

MOZGIII (Contributor) commented Mar 20, 2020

The test_max_size test is failing; we should investigate. Failures became more frequent with the tokio-compat migration, but the migration is suspected not to be the root cause, as it seems this issue was happening before it.
We need to investigate the problem with test_max_size and the surrounding tests and fix it.

A sample failed CI run: https://circleci.com/gh/timberio/vector/90109

MOZGIII (Contributor, Author) commented Mar 20, 2020

My initial suspicion was that it's another bug with the tokio-compat Runtime. After examining the test code, I wasn't able to find direct confirmation of that, but I'm not ruling it out yet.

Here’s the test file:

https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs

It seems like there are issues with topology shutdown.
I think this is where the problem comes from, and I’ll try to reproduce it in a dedicated test case.

MOZGIII self-assigned this on Mar 20, 2020
binarylogic (Contributor) commented:

@stbrody would any of your changes affect this? Or, could you weigh in since you're familiar with this code? I'm curious if you could spot something.

LucioFranco (Contributor) commented:

Basically, this test fails when this timeout, https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs#L147, is not long enough for all events to pass through the system and hit the max size. The assert fails because it looks like none of the events got written to the disk buffer in time. This can happen when a machine is noisy and a lot of other tests are eating CPU time while the timeout keeps ticking. Our best bet would be to increase the timeout, IMO. It may also be that these tests are old and we want to refactor them.
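
To make the failure mode concrete, here's a minimal standalone Rust sketch (not Vector code; all names are illustrative) of the sleep-then-assert pattern and why it's racy under CPU contention:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the pipeline: a worker that "writes events to the buffer".
    let written = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&written);
    thread::spawn(move || {
        for _ in 0..100 {
            // Per-event processing time varies with machine load.
            thread::sleep(Duration::from_millis(1));
            counter.fetch_add(1, Ordering::SeqCst);
        }
    });

    // The racy pattern: sleep for a fixed duration, then assert. On a noisy
    // machine the worker may not have written anything yet, and this fails.
    thread::sleep(Duration::from_millis(50));
    assert!(written.load(Ordering::SeqCst) > 0, "no events written in time");
}
```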

MOZGIII (Contributor, Author) commented Mar 20, 2020

Yeah, I see what's going on. Why do we even need a timeout there in the first place?

From the looks of it, we could replace

https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs#L149

with

```rust
block_on(topology.stop()).unwrap();
shutdown_on_idle(rt);
```

the same way it's done a couple of lines below.

It doesn't seem to affect the semantics of the test; I think it will even improve the test's reliability and quality.
It looks like this race condition was added intentionally, but it doesn't add value to the test and just makes it racy.
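
As a generic illustration of why draining on shutdown removes the race (a standalone sketch, not Vector's actual topology API): closing the input and joining the worker makes the assertion deterministic, with no timeout involved:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Stand-in for the topology: a worker draining an input channel.
    let (tx, rx) = mpsc::channel::<u64>();
    let worker = thread::spawn(move || rx.iter().count());

    for i in 0..100 {
        tx.send(i).unwrap();
    }

    // Graceful "stop": close the input, let the worker drain everything
    // that's in flight, and wait for it to finish.
    drop(tx);
    let processed = worker.join().unwrap();

    // Deterministic: every sent event has been processed, no sleep needed.
    assert_eq!(processed, 100);
}
```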

MOZGIII (Contributor, Author) commented Mar 20, 2020

By the way, we've encountered issues with these tests before: #1508
It might not be related to tokio-compat, but now is a good time to revisit it as well, while we're at it.

LucioFranco (Contributor) commented:

If we use topology.stop() it should fully flush and give us what we want. Here we are testing what happens if Vector somehow crashes, which shutdown_now simulates by force-cancelling all tasks.

MOZGIII (Contributor, Author) commented Mar 20, 2020

I understand this was the intention; however, I don't think this test is adequately designed to assert a crash scenario. At the same time, there are no tests that check that buffering actually performs as expected under proper conditions.
I'd suggest we convert those tests to assert buffering behavior under proper conditions, as that would at least be reliable and sane.

To properly assert the crash scenario we should trigger an actual crash; otherwise, it's not a fair assertion. How about we add that separately to the test harness?

MOZGIII (Contributor, Author) commented Mar 20, 2020

In this test, even if we do trigger a crash, we want to trigger it after all the events have reached the sink and it has put them into the buffer. Using a timeout for that synchronization is what's causing the race. We can probably keep this drop if we figure out a way to reliably wait for that to happen.

However, in my experience, tests that rely on timeouts for synchronization are a constant source of trouble. They eat up maintenance time and don't provide the required confidence either way.

This is why I still suggest this tradeoff: losing some assertion coverage in this case, but gaining test reliability.

MOZGIII (Contributor, Author) commented Mar 20, 2020

I've been experimenting with replacing the sleeps and switching to topology.stop(), and I found a bug in test_util along the way. The tests are clearly unreliable, and we definitely need to improve in that area.

stbrody (Contributor) commented Mar 20, 2020

While it does seem like using topology.stop() would improve things, don't you still have a race around whether or not the TCP source has actually read in the input lines off the socket before you start the shutdown? Using topology.stop() instead of rt.shutdown_now() ensures that any events that have made it to the TCP source will make it to the buffer, but it doesn't ensure that all messages sent to the socket make it to the TCP source.

stbrody (Contributor) commented Mar 20, 2020

Also, FWIW, I don't think anything I've done that's already been committed is likely to have affected the timing here. The patch I'm planning on committing today theoretically could change the timing here slightly, but only once we actually switch to using topology.stop(), and not in a way that actually affects the root cause of the failure.

MOZGIII (Contributor, Author) commented Mar 20, 2020

> While it does seem like using topology.stop() would improve things, don't you still have a race around whether or not the TCP source has actually read in the input lines off the socket before you start the shutdown? Using topology.stop() instead of rt.shutdown_now() ensures that any events that have made it to the TCP source will make it to the buffer, but it doesn't ensure that all messages sent to the socket make it to the TCP source.

You are 100% correct, and this is exactly what I'm currently thinking about how to solve! While we can easily eliminate all the other sleeps, this one seems to be the one that can't go away, because of how TCP operations work in general. We don't require an acknowledgment that data sent over TCP has been picked up by the source, and this is sort of a fundamental design limitation of Vector's TCP source. It may be a problem not just in tests, but in the real world too.

To solve this, we can switch from the TCP source to an in-process source, which would eliminate this race. That makes sense IMO, because involving this TCP race in the test doesn't help with anything; it only complicates things. It's a race condition of its own, unrelated to the ones that would be corrected by switching to topology.stop().

We could get rid of the TCP sink too, replacing it with a specially crafted test sink that would allow more accurate behavior assertions. Buffers are attached at the topology level, outside of the sink itself, so the test's composition semantics would remain the same.

I think adding such a test source/sink would be very beneficial in general: they'd provide solid and reliable grounds for unit testing. They should also make tests more concise.
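
For illustration, here's a standalone sketch of the idea (hypothetical names, not Vector's mock API): an in-process source backed by a channel, where handing an event to the source cannot race against a network read:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical in-process source: the test pushes events directly into a
// channel instead of writing to a TCP socket, so there is no network
// handoff to race against.
struct InProcessSource {
    rx: mpsc::Receiver<String>,
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let source = InProcessSource { rx };

    // Stand-in for the topology consuming from the source.
    let pipeline = thread::spawn(move || source.rx.iter().collect::<Vec<String>>());

    for i in 0..10 {
        tx.send(format!("event {}", i)).unwrap();
    }
    // Every sent event is already in-process; closing the sender lets the
    // pipeline drain and finish.
    drop(tx);

    assert_eq!(pipeline.join().unwrap().len(), 10);
}
```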

> Also, FWIW, I don't think anything I've done that's already been committed is likely to have affected the timing here. The patch I'm planning on committing today theoretically could change the timing here slightly, but only once we actually switch to using topology.stop(), and not in a way that actually affects the root cause of the failure.

The timings were skewed by the migration from tokio 0.1 to tokio-compat that we merged recently (#1922). We noticed these racy tests failing while testing it. Your work on shutdown might've also affected this, but we know for sure that #1922 caused major timing turbulence here.

stbrody (Contributor) commented Mar 20, 2020

@MOZGIII, check out the changes I made to the MockSourceConfig in this PR. You might be able to use the event_counter I added to wait for the source to receive the proper number of events. That's basically exactly what I use it for in the topology_shutdown_while_active test I added in that same PR.
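
For reference, the pattern looks roughly like this (a generic sketch with hypothetical names; see MockSourceConfig in the PR for the real implementation): poll an event counter until it reaches the expected count, using the deadline only as an upper bound rather than as the synchronization mechanism:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

// Wait until `counter` reaches `target`, or give up after `deadline`.
// The deadline only bounds the wait; it is not what synchronizes the test.
fn wait_for_count(counter: &AtomicUsize, target: usize, deadline: Duration) -> bool {
    let start = Instant::now();
    while start.elapsed() < deadline {
        if counter.load(Ordering::SeqCst) >= target {
            return true;
        }
        thread::sleep(Duration::from_millis(1));
    }
    false
}

fn main() {
    static EVENTS_RECEIVED: AtomicUsize = AtomicUsize::new(0);

    // Stand-in for the source receiving events on another thread.
    thread::spawn(|| {
        for _ in 0..100 {
            thread::sleep(Duration::from_millis(1));
            EVENTS_RECEIVED.fetch_add(1, Ordering::SeqCst);
        }
    });

    // The test proceeds (e.g. to shutdown) only once all events are in.
    assert!(wait_for_count(&EVENTS_RECEIVED, 100, Duration::from_secs(10)));
}
```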

MOZGIII (Contributor, Author) commented Mar 20, 2020

@stbrody neat, I didn't realize we already have MockSourceConfig and other mocks! Thanks for pointing me at them!
This is perfect, and I don't think we even need a counter here: we simply need a way to ingest the events without a race condition, and a blocking receiver should do the trick!

MOZGIII (Contributor, Author) commented Mar 20, 2020

@stbrody I see what the problem is: apparently, doing a block_on(send).unwrap() is not enough for the topology to take received events into account, even though I'm using a 0-capacity mpsc channel. I'll have to rely on the same trick you implemented with the counter.

Although I was expecting some guarantees from the source interface... Maybe we want to change that. I believe this is related to our broader effort to make event handling reliable (i.e. sending acks when an event is processed doesn't help if the topology doesn't even recognize the event's existence), and it may be relevant to #807.

MOZGIII (Contributor, Author) commented Mar 21, 2020

So, I ended up digging deep into these tests and applying fixes and reliability improvements. Most of the work was done with graceful termination; then, once test reliability without crashes was high again, I re-added abrupt termination.
Here's the PR: #2106

binarylogic (Contributor) commented:

Very nice. Appreciate you digging into these problems.

binarylogic added the labels domain: tests (Anything related to Vector's internal tests) and type: tech debt (A code change that does not add user value.) on Mar 23, 2020