Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TT-Mesh tests to N300 post commit #17474

Merged
merged 1 commit into from
Feb 16, 2025

Conversation

tt-asaigal
Copy link
Contributor

@tt-asaigal tt-asaigal commented Feb 1, 2025

Ticket

No Ticket.

Problem description

MeshWorkload and MeshBuffer tests were not running on post-commit, which can cause breaking commits to get merged.

What's changed

Add N300 Post-Commit TT-Mesh test suite and disable tests running MeshWorkloads on active eth (currently broken and a legacy feature ... fix staged on this PR: #17462).

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • (For models and ops writers) Full new models tests passes
  • New/Existing tests provide coverage for changes

Comment on lines 89 to 97
class T3000MeshDeviceFixture : public MeshDeviceFixture<2, 4> {};
class N300MeshDeviceFixture : public MeshDeviceFixture<2, 1> {};

using MeshFixtureTypes = ::testing::Types<T3000MeshDeviceFixture, N300MeshDeviceFixture>;
template <typename T>
class MeshTestSuite : public T {};

// Register the N300 and T3K fixture types
TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of thoughts:

  1. I don't think testing all of mesh devices exhaustively makes sense universally. E.g. config validation test is device agnostic and can even be tested without a device connected to host. Given our shortage of CI resources, I'd avoid adopting this pattern universally. Same applies to exhaustively testing for all different combinations of inputs, in all different configurations. The particular issue with TYPED_TEST is also that it is harder to work with (worse integration with TEST_P, some templates need extra typename).
  2. We should move TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes); to test_.cpp files, and instantiate the test suites there explicitly. Before you could use gtest_filter to easily find all of MeshBufferTest. Also not sure how this would work with customization; e.g. if you create a subclass of MeshTestSuite, I assume you need another TYPED_TEST_SUITE, and now it isn't obvious where the tests are instantiated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think testing all of mesh devices exhaustively makes sense universally. E.g. config validation test is device agnostic and can even be tested without a device connected to host. Given our shortage of CI resources, I'd avoid adopting this pattern universally. Same applies to exhaustively testing for all different combinations of inputs, in all different configurations. The particular issue with TYPED_TEST is also that it is harder to work with (worse integration with TEST_P, some templates need extra typename).

I agree that in general, there is no need to include certain tests in multiple suites if they are device agnostic. For the MeshBufferConfigValidation test, this is not the case: the shape of the device shards is a function of the mesh shape and we want to verify that this particular config that we believe works on both a 2x4 grid and a 2x1 grid is actually valid according to our infra. Given the 1.6s runtime for this test on N300 I believe that this test serves as a lightweight sanity mechanism.

I also agree that exhaustively testing a feature in post-commit with multiple configurations, can overload our CI infra. For our case, most of the time is spent on the heavy-weight MeshWorkload and MeshBuffer sweep tests. This is why the tests were not included in post-commit. Currently, running the full distributed post-commit suite takes under 5 minutes: https://github.com/tenstorrent/tt-metal/actions/runs/13342768478/job/37269641118.

We should and will be aware of keeping our post commit suite lightweight as we add tests for more distributed features.

We should move TYPED_TEST_SUITE(MeshTestSuite, MeshFixtureTypes); to test_.cpp files, and instantiate the test suites there explicitly. Before you could use gtest_filter to easily find all of MeshBufferTest. Also not sure how this would work with customization; e.g. if you create a subclass of MeshTestSuite, I assume you need another TYPED_TEST_SUITE, and now it isn't obvious where the tests are instantiated.

Test suite instantiations have been moved to the cpp files and include the name of the feature being tested. I understand that the use of TYPED_TEST can lead to complexities with gtest sweep functionality in the future. We can resolve these on a case-by-case basis and don't think this should block our efforts to get post-commit testing up.

EXPECT_ANY_THROW(MeshBuffer::create(
ShardedBufferConfig{.global_size = 16 << 10, .global_buffer_shape = {64, 128}, .shard_shape = {32, 120}},
device_local_config,
this->mesh_device_.get()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make the change to use this->? Imo using this-> for identifying data members is a weak and unreliable signal, as it is entirely optional. Otoh, using _ suffix gives you better readability, regardless of whether anyone forgot to use this-> by accident or now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original setup, where we simply accessed mesh_device_ without this-> causes a build failure with the templated test suite. Unexpected, I'm aware but I couldn't see an easier way of getting this up without the use of this->.


EXPECT_EQ(dst_vec, src_vec);
}
}
}

TEST_F(MeshBufferTest, RowMajorShardingAndReplication) {
// MeshBuffer tests on N300 and T3000
TYPED_TEST(MeshTestSuite, ConfigValidation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To illustrate the example from my other comment: before we had a descriptive MeshBufferTest.ConfigValidation, and now we have MeshTestSuite.ConfigValidation which isn't clear. It might be clear if you look at the file name, but in CI all of this runs as a one-shot "distributed" test suite, so it is valuable to get faster feedback there.

Not sure if this one is an actual issue or not - something might not work if we have conflicts by re-using the same names across 2 different files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have:
MeshBufferTestSuite
MeshEventsTestSuite
MeshSubDeviceTestSuite
MeshWorkloadTestSuite

in addition to our T3K tests. I think this should lead to better debuggability.

@tt-asaigal tt-asaigal force-pushed the asaigal/mesh_workload_post_commit branch from 6ce941a to 6580cc5 Compare February 15, 2025 17:30
@tt-asaigal
Copy link
Contributor Author

Comment on lines 97 to 107
// Individual test fixtures for T3K and N300 (single and multi-cq)
class T3000MeshDeviceFixture : public MeshDeviceFixture<2, 4, 1> {};
class N300MeshDeviceFixture : public MeshDeviceFixture<2, 1, 1> {};
class T3000MeshDeviceMultiCQFixture : public MeshDeviceFixture<2, 4, 2> {};
class N300MeshDeviceMultiCQFixture : public MeshDeviceFixture<2, 1, 2> {};
// Generic fixtures allowing a test to run on both N300 and T3K
using MeshFixtureTypes = ::testing::Types<T3000MeshDeviceFixture, N300MeshDeviceFixture>;
using MultiCQMeshFixtureTypes = ::testing::Types<T3000MeshDeviceMultiCQFixture, N300MeshDeviceMultiCQFixture>;
// Generic class that can be bound to specific test suites
template <typename T>
class MeshTestSuite : public T {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is an opportunity to simplify the set up, and make it more maintainable / easier to use. What do you think about the following approach:

  1. One base MeshDeviceFixtureBase fixture that takes in a number of CQs as a parameter, and another optional one for the device type.
  2. T3000MeshDeviceFixture and N300MeshDeviceFixture that inherit from MeshDeviceFixtureBase, and allow to skip the test if the connected mesh device does not correspond to either T3K or N300.
  3. MeshDeviceFixture that you can use to run a test for either N300 or T3K. Maybe name it AnyMeshDeviceFixture to emphasize?
  4. The "MultiCQ" variants for all 3. Depending on the use, they can be added here, or perhaps they can be defined in the local file for test_mesh_events.cpp, if they aren't useful for other cases.

Something like https://pastecode.io/s/khkii1uh (sorry I got carried away with this a little bit).

A couple of tangential benefits:

  • No unexpected surprises like that you had to add this-> to make it compile, easier integration with TEST_P. TYPED_TEST also makes some of the type names dependent, so you might be required to add typename in random places.
  • Faster to compile, as we no longer template any of the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion Oleg! Requested changes have been made.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad it worked!

@tt-asaigal tt-asaigal force-pushed the asaigal/mesh_workload_post_commit branch from 6580cc5 to 9fc5acd Compare February 16, 2025 04:16
@tt-asaigal tt-asaigal force-pushed the asaigal/mesh_workload_post_commit branch from 9fc5acd to b3abc77 Compare February 16, 2025 04:52
@tt-asaigal tt-asaigal merged commit 7ee3c89 into main Feb 16, 2025
211 checks passed
@tt-asaigal tt-asaigal deleted the asaigal/mesh_workload_post_commit branch February 16, 2025 05:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants