Add TT-Mesh tests to N300 post commit #17474

tt-asaigal · 2025-02-01T20:35:03Z

Ticket

No Ticket.

Problem description

MeshWorkload and MeshBuffer tests were not running on post-commit, which can cause breaking commits to get merged.

What's changed

Add N300 Post-Commit TT-Mesh test suite and disable tests running MeshWorkloads on active eth (currently broken and a legacy feature ... fix staged on this PR: #17462).

Checklist

Post commit CI passes
Blackhole Post commit (if applicable)
Model regression CI testing passes (if applicable)
Device performance regression CI testing passes (if applicable)
(For models and ops writers) Full new models tests passes
New/Existing tests provide coverage for changes

omilyutin-tt · 2025-02-01T22:03:38Z

tests/tt_metal/tt_metal/common/multi_device_fixture.hpp

+class T3000MeshDeviceFixture : public MeshDeviceFixture<2, 4> {};
+class N300MeshDeviceFixture : public MeshDeviceFixture<2, 1> {};
+
+using MeshFixtureTypes = ::testing::Types<T3000MeshDeviceFixture, N300MeshDeviceFixture>;
+template <typename T>
+class MeshTestSuite : public T {};
+
+// Register the N300 and T3K fixture types
+TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes);


A couple of thoughts:

I don't think testing all of mesh devices exhaustively makes sense universally. E.g. config validation test is device agnostic and can even be tested without a device connected to host. Given our shortage of CI resources, I'd avoid adopting this pattern universally. Same applies to exhaustively testing for all different combinations of inputs, in all different configurations. The particular issue with TYPED_TEST is also that it is harder to work with (worse integration with TEST_P, some templates need extra typename).

We should move TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes); to test_.cpp files, and instantiate the test suites there explicitly. Before you could use gtest_filter to easily find all of MeshBufferTest. Also not sure how this would work with customization; e.g. if you create a subclass of MeshTestSuite, I assume you need another TYPED_TEST_SUITE, and now it isn't obvious where the tests are instantiated.

I don't think testing all of mesh devices exhaustively makes sense universally. E.g. config validation test is device agnostic and can even be tested without a device connected to host. Given our shortage of CI resources, I'd avoid adopting this pattern universally. Same applies to exhaustively testing for all different combinations of inputs, in all different configurations. The particular issue with TYPED_TEST is also that it is harder to work with (worse integration with TEST_P, some templates need extra typename).

I agree that in general, there is no need to include certain tests in multiple suites if they are device agnostic. For the MeshBufferConfigValidation test, this is not the case: the shape of the device shards is a function of the mesh shape and we want to verify that this particular config that we believe works on both a 2x4 grid and a 2x1 grid is actually valid according to our infra. Given the 1.6s runtime for this test on N300 I believe that this test serves as a lightweight sanity mechanism.

I also agree that exhaustively testing a feature in post-commit with multiple configurations, can overload our CI infra. For our case, most of the time is spent on the heavy-weight MeshWorkload and MeshBuffer sweep tests. This is why the tests were not included in post-commit. Currently, running the full distributed post-commit suite takes under 5 minutes: https://github.com/tenstorrent/tt-metal/actions/runs/13342768478/job/37269641118.

We should and will be aware of keeping our post commit suite lightweight as we add tests for more distributed features.

We should move TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes); to test_.cpp files, and instantiate the test suites there explicitly. Before you could use gtest_filter to easily find all of MeshBufferTest. Also not sure how this would work with customization; e.g. if you create a subclass of MeshTestSuite, I assume you need another TYPED_TEST_SUITE, and now it isn't obvious where the tests are instantiated.

Test suite instantiations have been moved to the cpp files and include the name of the feature being tested. I understand that the use of TYPED_TEST can lead to complexities with gtest sweep functionality in the future. We can resolve these on a case-by-case basis and don't think this should block our efforts to get post-commit testing up.

omilyutin-tt · 2025-02-01T22:06:03Z

tests/tt_metal/distributed/test_mesh_buffer.cpp

+    EXPECT_ANY_THROW(MeshBuffer::create(
+        ShardedBufferConfig{.global_size = 16 << 10, .global_buffer_shape = {64, 128}, .shard_shape = {32, 120}},
+        device_local_config,
+        this->mesh_device_.get()));


Why make the change to use this->? Imo using this-> for identifying data members is a weak and unreliable signal, as it is entirely optional. Otoh, using _ suffix gives you better readability, regardless of whether anyone forgot to use this-> by accident or now.

The original setup, where we simply accessed mesh_device_ without this-> causes a build failure with the templated test suite. Unexpected, I'm aware but I couldn't see an easier way of getting this up without the use of this->.

omilyutin-tt · 2025-02-01T22:07:43Z

tests/tt_metal/distributed/test_mesh_buffer.cpp


            EXPECT_EQ(dst_vec, src_vec);
        }
    }
 }

-TEST_F(MeshBufferTest, RowMajorShardingAndReplication) {
+// MeshBuffer tests on N300 and T3000
+TYPED_TEST(MeshTestSuite, ConfigValidation) {


To illustrate the example from my other comment: before we had a descriptive MeshBufferTest.ConfigValidation, and now we have MeshTestSuite.ConfigValidation which isn't clear. It might be clear if you look at the file name, but in CI all of this runs as a one-shot "distributed" test suite, so it is valuable to get faster feedback there.

Not sure if this one is an actual issue or not - something might not work if we have conflicts by re-using the same names across 2 different files.

We now have:
MeshBufferTestSuite
MeshEventsTestSuite
MeshSubDeviceTestSuite
MeshWorkloadTestSuite

in addition to our T3K tests. I think this should lead to better debuggability.

tt-asaigal · 2025-02-15T17:49:58Z

Passing post commit: https://github.com/tenstorrent/tt-metal/actions/runs/13342768478

omilyutin-tt · 2025-02-16T01:05:44Z

tests/tt_metal/tt_metal/common/multi_device_fixture.hpp

+// Individual test fixtures for T3K and N300 (single and multi-cq)
+class T3000MeshDeviceFixture : public MeshDeviceFixture<2, 4, 1> {};
+class N300MeshDeviceFixture : public MeshDeviceFixture<2, 1, 1> {};
+class T3000MeshDeviceMultiCQFixture : public MeshDeviceFixture<2, 4, 2> {};
+class N300MeshDeviceMultiCQFixture : public MeshDeviceFixture<2, 1, 2> {};
+// Generic fixtures allowing a test to run on both N300 and T3K
+using MeshFixtureTypes = ::testing::Types<T3000MeshDeviceFixture, N300MeshDeviceFixture>;
+using MultiCQMeshFixtureTypes = ::testing::Types<T3000MeshDeviceMultiCQFixture, N300MeshDeviceMultiCQFixture>;
+// Generic class that can be bound to specific test suites
+template <typename T>
+class MeshTestSuite : public T {};


I think there is an opportunity to simplify the set up, and make it more maintainable / easier to use. What do you think about the following approach:

One base MeshDeviceFixtureBase fixture that takes in a number of CQs as a parameter, and another optional one for the device type.

T3000MeshDeviceFixture and N300MeshDeviceFixture that inherit from MeshDeviceFixtureBase, and allow to skip the test if the connected mesh device does not correspond to either T3K or N300.

MeshDeviceFixture that you can use to run a test for either N300 or T3K. Maybe name it AnyMeshDeviceFixture to emphasize?

The "MultiCQ" variants for all 3. Depending on the use, they can be added here, or perhaps they can be defined in the local file for test_mesh_events.cpp, if they aren't useful for other cases.

Something like https://pastecode.io/s/khkii1uh (sorry I got carried away with this a little bit).

A couple of tangential benefits:

No unexpected surprises like that you had to add this-> to make it compile, easier integration with TEST_P. TYPED_TEST also makes some of the type names dependent, so you might be required to add typename in random places.

Faster to compile, as we no longer template any of the tests.

Great suggestion Oleg! Requested changes have been made.

Glad it worked!

tt-asaigal requested review from cfjchu, omilyutin-tt, abhullar-tt, pgkeller, aliuTT, tt-aho, tt-dma, ubcheema and a team as code owners February 1, 2025 20:35

blozano-tt approved these changes Feb 1, 2025

View reviewed changes

abhullar-tt approved these changes Feb 1, 2025

View reviewed changes

omilyutin-tt requested changes Feb 1, 2025

View reviewed changes

tt-asaigal force-pushed the asaigal/mesh_workload_post_commit branch from 6ce941a to 6580cc5 Compare February 15, 2025 17:30

omilyutin-tt requested changes Feb 16, 2025

View reviewed changes

tt-asaigal force-pushed the asaigal/mesh_workload_post_commit branch from 6580cc5 to 9fc5acd Compare February 16, 2025 04:16

omilyutin-tt approved these changes Feb 16, 2025

View reviewed changes

Add TT-Mesh tests to N300 post commit

b3abc77

tt-asaigal force-pushed the asaigal/mesh_workload_post_commit branch from 9fc5acd to b3abc77 Compare February 16, 2025 04:52

tt-asaigal merged commit 7ee3c89 into main Feb 16, 2025
211 checks passed

tt-asaigal deleted the asaigal/mesh_workload_post_commit branch February 16, 2025 05:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add TT-Mesh tests to N300 post commit #17474

Add TT-Mesh tests to N300 post commit #17474

tt-asaigal commented Feb 1, 2025 •

edited

Loading

omilyutin-tt Feb 1, 2025

tt-asaigal Feb 15, 2025

omilyutin-tt Feb 1, 2025

tt-asaigal Feb 2, 2025

omilyutin-tt Feb 1, 2025

tt-asaigal Feb 15, 2025

tt-asaigal commented Feb 15, 2025

omilyutin-tt Feb 16, 2025

tt-asaigal Feb 16, 2025

omilyutin-tt Feb 16, 2025

Add TT-Mesh tests to N300 post commit #17474

Add TT-Mesh tests to N300 post commit #17474

Conversation

tt-asaigal commented Feb 1, 2025 • edited Loading

Ticket

Problem description

What's changed

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tt-asaigal commented Feb 15, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tt-asaigal commented Feb 1, 2025 •

edited

Loading