V6e support #912

kelvin-zou · 2025-01-08T23:08:33Z

Add v6e support. The changes include:

Add a dedicated mesh rule for v6e 1k and 2k.
Set the global tokens per batch to 16M for the llama-v3 model.
Refactor _save_and_offload_only_these_names_regex and move it into a util class so it can be accessed externally.
Tune the compiler options slightly for better performance.
Add an extra flag (megascale_grpc_enable_xor_tracer=False) to work around the megascale OOM issue.
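
For context on the remat helper mentioned above, selecting checkpoint names by regex can be sketched roughly as below. This is an illustration only, not the axlearn implementation: `partition_names_by_regex` and the example tensor names are hypothetical, and giving the save pattern precedence is an assumption.

```python
import re
from typing import Iterable, Optional


def partition_names_by_regex(
    names: Iterable[str],
    *,
    save_regex: Optional[str],
    offload_regex: Optional[str],
) -> tuple[list[str], list[str]]:
    """Splits checkpoint names into (saved, offloaded) lists by full regex match.

    Hypothetical helper for illustration. A name matching both patterns is
    treated as saved (assumed precedence); unmatched names are dropped.
    """
    saved, offloaded = [], []
    for name in names:
        if save_regex is not None and re.fullmatch(save_regex, name):
            saved.append(name)
        elif offload_regex is not None and re.fullmatch(offload_regex, name):
            offloaded.append(name)
    return saved, offloaded


# Made-up tensor names, used only to exercise the helper.
names = ["attn.q_proj", "attn.k_proj", "ffn.linear1", "ffn.linear2"]
saved, offloaded = partition_names_by_regex(
    names, save_regex=r"attn\..*", offload_regex=r"ffn\.linear1"
)
```

Lists like these could then be handed to `jax.checkpoint_policies.save_and_offload_only_these_names` as `names_which_can_be_saved` and `names_which_can_be_offloaded`. Whether the refactored helper is structured this way is not shown in this description.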

samos123 · 2025-01-08T23:46:03Z

axlearn/common/compiler_options.py

+            xla_sc_disjoint_spmem="false",
+            xla_tpu_enable_sparse_core_collective_offload_all_reduce="true",
+            # TODO(kelvinzou): temporary workaround to avoid memory leak in megascale.
+            megascale_grpc_enable_xor_tracer="false",


The current plan is to release a fix in jax 0.4.39, which is scheduled for Jan 15. The fix itself is in libtpu.
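
To make the workaround easier to find and remove once the fix lands, here is a minimal sketch of how the options in the diff above might be rendered as a flag string. `options_to_flags` is a hypothetical helper written for this comment, not necessarily the function in `compiler_options.py`.

```python
from typing import Union

# Options copied from the diff above; values are strings, as in the PR.
v6e_options: dict[str, Union[str, bool, int]] = {
    "xla_sc_disjoint_spmem": "false",
    "xla_tpu_enable_sparse_core_collective_offload_all_reduce": "true",
    # Temporary workaround; expected to become unnecessary after the libtpu fix.
    "megascale_grpc_enable_xor_tracer": "false",
}


def options_to_flags(options: dict[str, Union[str, bool, int]]) -> str:
    """Hypothetical helper: formats each option as --key=value, space-joined."""
    return " ".join(f"--{key}={value}" for key, value in options.items())


flags = options_to_flags(v6e_options)
```

A string like this is commonly passed to libtpu through the `LIBTPU_INIT_ARGS` environment variable. Whether this PR applies its options that way is not stated here.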

snapshot

228f34b

samos123 reviewed Jan 8, 2025
