
test_max_size test is failing #2102

Closed · MOZGIII opened this issue Mar 20, 2020 · 17 comments · Fixed by #2106

Labels: domain: tests (Anything related to Vector's internal tests), type: tech debt (A code change that does not add user value.)

MOZGIII (Contributor) commented Mar 20, 2020

The test_max_size test is failing; we should investigate. Failures became more frequent with the tokio-compat migration, but the migration is suspected not to be the root cause, as it seems this issue was happening before it.
We need to investigate the problem with test_max_size and the surrounding tests and fix it.

A sample failed CI run: https://circleci.com/gh/timberio/vector/90109

MOZGIII (Contributor, Author) commented Mar 20, 2020

My initial suspicion was that it's another bug with the tokio-compat Runtime. After examining the test code, I wasn't able to find direct confirmation of that, but I'm not ruling it out yet.

Here’s the test file:

https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs

It seems like there are issues with topology shutdown.
I think this is where the problem comes from, and I’ll try to reproduce it in a dedicated test case.

MOZGIII self-assigned this on Mar 20, 2020
binarylogic (Contributor) commented:

@stbrody would any of your changes affect this? Or, could you weigh in since you're familiar with this code? I'm curious if you could spot something.

LucioFranco (Contributor) commented:

Basically, this test fails when this timeout, https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs#L147, is not long enough for all events to pass through the system and hit the max size. The assert fails because it looks like none of the events got written to the disk buffer in time. This can happen when a machine is noisy and a lot of other tests are eating CPU time while the timeout keeps ticking. Our best bet would be to increase the timeout, IMO. It may also be that these tests are old and we want to refactor them.
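
To make the failure mode concrete, here's a minimal standalone Rust sketch (not Vector code; all names are illustrative) of the sleep-then-assert pattern and why it's racy under CPU contention:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the pipeline: a worker that "writes events to the buffer".
    let written = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&written);
    thread::spawn(move || {
        for _ in 0..100 {
            // Per-event processing time varies with machine load.
            thread::sleep(Duration::from_millis(1));
            counter.fetch_add(1, Ordering::SeqCst);
        }
    });

    // The racy pattern: sleep for a fixed duration, then assert. On a noisy
    // machine the worker may not have written anything yet, and this fails.
    thread::sleep(Duration::from_millis(50));
    assert!(written.load(Ordering::SeqCst) > 0, "no events written in time");
}
```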

MOZGIII (Contributor, Author) commented Mar 20, 2020

Yeah, I see what's going on. Why do we even need a timeout there in the first place?

From the looks of it, we could replace

https://github.com/timberio/vector/blob/0b92159518732a27837ca1142884ab9a7d34de84/tests/buffering.rs#L149

with

```rust
block_on(topology.stop()).unwrap();
shutdown_on_idle(rt);
```

the same way it's done a couple of lines below.

It doesn't seem to affect the semantics of the test; I think it will even improve the test's reliability and quality.
It looks like this race condition was added intentionally, but it doesn't add value to the test and just makes it racy.
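
As a generic illustration of why draining on shutdown removes the race (a standalone sketch, not Vector's actual topology API): closing the input and joining the worker makes the assertion deterministic, with no timeout involved:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // Stand-in for the topology: a worker draining an input channel.
    let (tx, rx) = mpsc::channel::<u64>();
    let worker = thread::spawn(move || rx.iter().count());

    for i in 0..100 {
        tx.send(i).unwrap();
    }

    // Graceful "stop": close the input, let the worker drain everything
    // that's in flight, and wait for it to finish.
    drop(tx);
    let processed = worker.join().unwrap();

    // Deterministic: every sent event has been processed, no sleep needed.
    assert_eq!(processed, 100);
}
```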

MOZGIII (Contributor, Author) commented Mar 20, 2020

By the way, we've encountered issues with these tests before: #1508
It might not be related to tokio-compat, but now is a good time to revisit it as well, while we're at it.

LucioFranco (Contributor) commented:

If we use topology.stop() it should fully flush and give us what we want. Here we are testing what happens if Vector somehow crashes, which shutdown_now simulates by force-cancelling all tasks.

MOZGIII (Contributor, Author) commented Mar 20, 2020

I understand this was the intention; however, I don't think this test is adequately designed to assert a crash scenario. At the same time, there are no tests that check that buffering actually performs as expected under proper conditions.
I'd suggest we convert those tests to assert buffering behavior under proper conditions, as that would at least be reliable and sane.

To properly assert the crash scenario we should trigger an actual crash; otherwise, it's not a fair assertion. How about we add that separately to the test harness?

MOZGIII (Contributor, Author) commented Mar 20, 2020

In this test, even if we do trigger a crash, we want to trigger it after all the events have reached the sink and it has put them into the buffer. Using a timeout for that synchronization is what's causing the race. We can probably keep this drop if we figure out a way to reliably wait for that to happen.

However, in my experience, tests that rely on timeouts for synchronization are a constant source of trouble. They eat up maintenance time and don't provide the required confidence either way.

This is why I still suggest this tradeoff: losing some assertion coverage in this case, but gaining test reliability.

MOZGIII (Contributor, Author) commented Mar 20, 2020

I've been experimenting with replacing the sleeps and switching to topology.stop(), and I found a bug in test_util along the way. The tests are clearly unreliable, and we definitely need to improve in that area.

stbrody (Contributor) commented Mar 20, 2020

While it does seem like using topology.stop() would improve things, don't you still have a race around whether or not the TCP source has actually read in the input lines off the socket before you start the shutdown? Using topology.stop() instead of rt.shutdown_now() ensures that any events that have made it to the TCP source will make it to the buffer, but it doesn't ensure that all messages sent to the socket make it to the TCP source.

stbrody (Contributor) commented Mar 20, 2020

Also, FWIW, I don't think anything I've done that's already been committed is likely to have affected the timing here. The patch I'm planning on committing today theoretically could change the timing here slightly, but only once we actually switch to using topology.stop(), and not in a way that actually affects the root cause of the failure.

MOZGIII (Contributor, Author) commented Mar 20, 2020

> While it does seem like using topology.stop() would improve things, don't you still have a race around whether or not the TCP source has actually read in the input lines off the socket before you start the shutdown? Using topology.stop() instead of rt.shutdown_now() ensures that any events that have made it to the TCP source will make it to the buffer, but it doesn't ensure that all messages sent to the socket make it to the TCP source.

You are 100% correct, and this is exactly what I'm currently thinking about how to solve! While we can easily eliminate all the other sleeps, this one seems to be the one that can't go away, because of how TCP operations work in general. We don't require an acknowledgment that data sent over TCP has been picked up by the source, and this is sort of a fundamental design limitation of Vector's TCP source. It may be a problem not just in tests, but in the real world too.

To solve this, we can switch from the TCP source to an in-process source, which would eliminate this race. That makes sense IMO, because involving this TCP race in the test doesn't help with anything; it only complicates things. It's a race condition of its own, unrelated to the ones that would be corrected by switching to topology.stop().

We could get rid of the TCP sink too, replacing it with a specially crafted test sink that would allow more accurate behavior assertions. Buffers are attached at the topology level, outside of the sink itself, so the test's composition semantics would remain the same.

I think adding such a test source/sink would be very beneficial in general: they'd provide solid and reliable grounds for unit testing. They should also make tests more concise.
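
For illustration, here's a standalone sketch of the idea (hypothetical names, not Vector's mock API): an in-process source backed by a channel, where handing an event to the source cannot race against a network read:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical in-process source: the test pushes events directly into a
// channel instead of writing to a TCP socket, so there is no network
// handoff to race against.
struct InProcessSource {
    rx: mpsc::Receiver<String>,
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let source = InProcessSource { rx };

    // Stand-in for the topology consuming from the source.
    let pipeline = thread::spawn(move || source.rx.iter().collect::<Vec<String>>());

    for i in 0..10 {
        tx.send(format!("event {}", i)).unwrap();
    }
    // Every sent event is already in-process; closing the sender lets the
    // pipeline drain and finish.
    drop(tx);

    assert_eq!(pipeline.join().unwrap().len(), 10);
}
```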

> Also, FWIW, I don't think anything I've done that's already been committed is likely to have affected the timing here. The patch I'm planning on committing today theoretically could change the timing here slightly, but only once we actually switch to using topology.stop(), and not in a way that actually affects the root cause of the failure.

The timings were skewed by the migration from tokio 0.1 to tokio-compat that we merged recently (#1922). We noticed these racy tests failing while testing it. Your work on shutdown might've also affected this, but we know for sure that #1922 caused major timing turbulence here.

stbrody (Contributor) commented Mar 20, 2020

@MOZGIII, check out the changes I made to the MockSourceConfig in this PR. You might be able to use the event_counter I added to wait for the source to receive the proper number of events. That's basically exactly what I use it for in the topology_shutdown_while_active test I added in that same PR.
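
For reference, the pattern looks roughly like this (a generic sketch with hypothetical names; see MockSourceConfig in the PR for the real implementation): poll an event counter until it reaches the expected count, using the deadline only as an upper bound rather than as the synchronization mechanism:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

// Wait until `counter` reaches `target`, or give up after `deadline`.
// The deadline only bounds the wait; it is not what synchronizes the test.
fn wait_for_count(counter: &AtomicUsize, target: usize, deadline: Duration) -> bool {
    let start = Instant::now();
    while start.elapsed() < deadline {
        if counter.load(Ordering::SeqCst) >= target {
            return true;
        }
        thread::sleep(Duration::from_millis(1));
    }
    false
}

fn main() {
    static EVENTS_RECEIVED: AtomicUsize = AtomicUsize::new(0);

    // Stand-in for the source receiving events on another thread.
    thread::spawn(|| {
        for _ in 0..100 {
            thread::sleep(Duration::from_millis(1));
            EVENTS_RECEIVED.fetch_add(1, Ordering::SeqCst);
        }
    });

    // The test proceeds (e.g. to shutdown) only once all events are in.
    assert!(wait_for_count(&EVENTS_RECEIVED, 100, Duration::from_secs(10)));
}
```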

MOZGIII (Contributor, Author) commented Mar 20, 2020

@stbrody neat, I didn't realize we already have MockSourceConfig and other mocks! Thanks for pointing me at them!
This is perfect, and I don't think we even need a counter here: we simply need a way to ingest the events without a race condition, and a blocking receiver should do the trick!

MOZGIII (Contributor, Author) commented Mar 20, 2020

@stbrody I see what the problem is: apparently, doing a block_on(send).unwrap() is not enough for the topology to take received events into account, even though I'm using a 0-capacity mpsc channel. I'll have to rely on the same trick you implemented with the counter.

Although I was expecting some guarantees from the source interface... Maybe we want to change that. I believe this is related to our broader effort to make event handling reliable (i.e. sending acks when an event is processed doesn't help if the topology doesn't even recognize the event's existence), and it may be relevant to #807.

MOZGIII (Contributor, Author) commented Mar 21, 2020

So, I ended up digging deep into these tests and applying fixes and reliability improvements. Most of the work was done with graceful termination; then, once test reliability without crashes was high again, I re-added abrupt termination.
Here's the PR: #2106

binarylogic (Contributor) commented:

Very nice. Appreciate you digging into these problems.

binarylogic added the labels domain: tests (Anything related to Vector's internal tests) and type: tech debt (A code change that does not add user value.) on Mar 23, 2020