-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TT-Mesh tests to N300 post commit #17474
Conversation
class T3000MeshDeviceFixture : public MeshDeviceFixture<2, 4> {}; | ||
class N300MeshDeviceFixture : public MeshDeviceFixture<2, 1> {}; | ||
|
||
using MeshFixtureTypes = ::testing::Types<T3000MeshDeviceFixture, N300MeshDeviceFixture>; | ||
template <typename T> | ||
class MeshTestSuite : public T {}; | ||
|
||
// Register the N300 and T3K fixture types | ||
TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of thoughts:
- I don't think testing all of mesh devices exhaustively makes sense universally. E.g. config validation test is device agnostic and can even be tested without a device connected to host. Given our shortage of CI resources, I'd avoid adopting this pattern universally. Same applies to exhaustively testing for all different combinations of inputs, in all different configurations. The particular issue with
TYPED_TEST
is also that it is harder to work with (worse integration with TEST_P, some templates need extratypename
). - We should move
TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes);
totest_.cpp
files, and instantiate the test suites there explicitly. Before you could usegtest_filter
to easily find all ofMeshBufferTest
. Also not sure how this would work with customization; e.g. if you create a subclass ofMeshTestSuite
, I assume you need anotherTYPED_TEST_SUITE
, and now it isn't obvious where the tests are instantiated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think testing all of mesh devices exhaustively makes sense universally. E.g. config validation test is device agnostic and can even be tested without a device connected to host. Given our shortage of CI resources, I'd avoid adopting this pattern universally. Same applies to exhaustively testing for all different combinations of inputs, in all different configurations. The particular issue with TYPED_TEST is also that it is harder to work with (worse integration with TEST_P, some templates need extra typename).
I agree that in general, there is no need to include certain tests in multiple suites if they are device agnostic. For the MeshBufferConfigValidation
test, this is not the case: the shape of the device shards is a function of the mesh shape and we want to verify that this particular config that we believe works on both a 2x4 grid and a 2x1 grid is actually valid according to our infra. Given the 1.6s runtime for this test on N300 I believe that this test serves as a lightweight sanity mechanism.
I also agree that exhaustively testing a feature in post-commit with multiple configurations, can overload our CI infra. For our case, most of the time is spent on the heavy-weight MeshWorkload
and MeshBuffer
sweep tests. This is why the tests were not included in post-commit. Currently, running the full distributed post-commit suite takes under 5 minutes: https://github.com/tenstorrent/tt-metal/actions/runs/13342768478/job/37269641118.
We should and will be aware of keeping our post commit suite lightweight as we add tests for more distributed features.
We should move TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes); to test_.cpp files, and instantiate the test suites there explicitly. Before you could use gtest_filter to easily find all of MeshBufferTest. Also not sure how this would work with customization; e.g. if you create a subclass of MeshTestSuite, I assume you need another TYPED_TEST_SUITE, and now it isn't obvious where the tests are instantiated.
Test suite instantiations have been moved to the cpp files and include the name of the feature being tested. I understand that the use of TYPED_TEST
can lead to complexities with gtest sweep functionality in the future. We can resolve these on a case-by-case basis and don't think this should block our efforts to get post-commit testing up.
EXPECT_ANY_THROW(MeshBuffer::create( | ||
ShardedBufferConfig{.global_size = 16 << 10, .global_buffer_shape = {64, 128}, .shard_shape = {32, 120}}, | ||
device_local_config, | ||
this->mesh_device_.get())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make the change to use this->
? Imo using this->
for identifying data members is a weak and unreliable signal, as it is entirely optional. Otoh, using _
suffix gives you better readability, regardless of whether anyone forgot to use this->
by accident or now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original setup, where we simply accessed mesh_device_
without this->
causes a build failure with the templated test suite. Unexpected, I'm aware but I couldn't see an easier way of getting this up without the use of this->
.
|
||
EXPECT_EQ(dst_vec, src_vec); | ||
} | ||
} | ||
} | ||
|
||
TEST_F(MeshBufferTest, RowMajorShardingAndReplication) { | ||
// MeshBuffer tests on N300 and T3000 | ||
TYPED_TEST(MeshTestSuite, ConfigValidation) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To illustrate the example from my other comment: before we had a descriptive MeshBufferTest.ConfigValidation
, and now we have MeshTestSuite.ConfigValidation
which isn't clear. It might be clear if you look at the file name, but in CI all of this runs as a one-shot "distributed" test suite, so it is valuable to get faster feedback there.
Not sure if this one is an actual issue or not - something might not work if we have conflicts by re-using the same names across 2 different files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have:
MeshBufferTestSuite
MeshEventsTestSuite
MeshSubDeviceTestSuite
MeshWorkloadTestSuite
in addition to our T3K tests. I think this should lead to better debuggability.
6ce941a
to
6580cc5
Compare
Passing post commit: https://github.com/tenstorrent/tt-metal/actions/runs/13342768478 |
// Individual test fixtures for T3K and N300 (single and multi-cq) | ||
class T3000MeshDeviceFixture : public MeshDeviceFixture<2, 4, 1> {}; | ||
class N300MeshDeviceFixture : public MeshDeviceFixture<2, 1, 1> {}; | ||
class T3000MeshDeviceMultiCQFixture : public MeshDeviceFixture<2, 4, 2> {}; | ||
class N300MeshDeviceMultiCQFixture : public MeshDeviceFixture<2, 1, 2> {}; | ||
// Generic fixtures allowing a test to run on both N300 and T3K | ||
using MeshFixtureTypes = ::testing::Types<T3000MeshDeviceFixture, N300MeshDeviceFixture>; | ||
using MultiCQMeshFixtureTypes = ::testing::Types<T3000MeshDeviceMultiCQFixture, N300MeshDeviceMultiCQFixture>; | ||
// Generic class that can be bound to specific test suites | ||
template <typename T> | ||
class MeshTestSuite : public T {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is an opportunity to simplify the set up, and make it more maintainable / easier to use. What do you think about the following approach:
- One base
MeshDeviceFixtureBase
fixture that takes in a number of CQs as a parameter, and another optional one for the device type. T3000MeshDeviceFixture
andN300MeshDeviceFixture
that inherit fromMeshDeviceFixtureBase
, and allow to skip the test if the connected mesh device does not correspond to either T3K or N300.MeshDeviceFixture
that you can use to run a test for either N300 or T3K. Maybe name itAnyMeshDeviceFixture
to emphasize?- The "MultiCQ" variants for all 3. Depending on the use, they can be added here, or perhaps they can be defined in the local file for
test_mesh_events.cpp
, if they aren't useful for other cases.
Something like https://pastecode.io/s/khkii1uh (sorry I got carried away with this a little bit).
A couple of tangential benefits:
- No unexpected surprises like that you had to add
this->
to make it compile, easier integration with TEST_P.TYPED_TEST
also makes some of the type names dependent, so you might be required to addtypename
in random places. - Faster to compile, as we no longer template any of the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion Oleg! Requested changes have been made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad it worked!
6580cc5
to
9fc5acd
Compare
9fc5acd
to
b3abc77
Compare
Ticket
No Ticket.
Problem description
MeshWorkload
andMeshBuffer
tests were not running on post-commit, which can cause breaking commits to get merged.What's changed
Add N300 Post-Commit TT-Mesh test suite and disable tests running
MeshWorkloads
on active eth (currently broken and a legacy feature ... fix staged on this PR: #17462).Checklist