instructlab/training maturation and improvements
This codebase has been functional from the beginning, in large part because it's a descendant of a non-open-source codebase that accomplished the same thing. Now, we have to do the extra work (lovingly called "schlep") required to make it more maintainable, understandable, and extensible.
This [Epic] issue
This should be the root node, and index for, efforts that attempt to close the gaps described below.
Areas of improvement
No project is without its tech debt, and that debt is typically painful. Thankfully, our tech debt is "payable" via green-field contributions. The following is a high-level summary of areas that need improvement:
Functional reorganization
The main training entrypoint, setup functions, and training loop are wound fairly tightly, with functions doing multiple tasks at once. We should refactor to unwind these functions.
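As an illustration of the direction (the function names below are hypothetical, not the current API), the entrypoint could be decomposed into narrowly scoped setup functions plus a loop that only trains:

```python
# Hypothetical decomposition, not the current code: each function does one thing,
# so it can be tested and reasoned about in isolation.

def setup_model(model_path: str):
    """Load and prepare the model (dtype, device placement, wrapping)."""
    ...

def setup_dataloader(data_path: str, batch_size: int):
    """Tokenize, pack, and batch the training data."""
    ...

def train(model, dataloader, epochs: int) -> None:
    """Run the training loop only; no setup or I/O concerns."""
    ...

def main() -> None:
    model = setup_model("path/to/model")          # placeholder paths
    dataloader = setup_dataloader("path/to/data", batch_size=8)
    train(model, dataloader, epochs=3)
```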
Testing
This repo itself has only basic smoke tests. In instructlab/instructlab, there are workflow tests that consume this library and confirm that training isn't outright broken, which is a good start.
There are multiple levels of testing that we should aspire to cover:
- Unit testing: these ought to prove that our utility functions (e.g. calculating packed batches with FFD) work, that our assertions (e.g. blocking unsupported model architectures) are obeyed, and that our organizational logic (e.g. loading checkpoints and restarting from a given epoch) functions correctly (see the sketch after this list).
- Correctness and performance testing: if we're making changes to the core training loop, we ought to be able to quickly invoke a test that checks indicators like (a) the behavior of the loss curve and (b) iteration and epoch training time.
- Hardware-stack testing: we now support five hardware runtime categories: CPU, MPS, Nvidia, AMD, and Intel. We should be able to run appropriate tests on appropriate hardware without having to manually access machines and invoke tests ourselves.
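As a concrete illustration of the unit-testing bullet above, a test for FFD batch packing might look like the sketch below. Note that `pack_ffd` is a hypothetical stand-in written inline for the example, not the library's actual packing utility.

```python
# Illustrative sketch only: `pack_ffd` is a hypothetical helper, shown to
# demonstrate the kind of unit test we want, not the real API.

def pack_ffd(lengths: list[int], max_tokens: int) -> list[list[int]]:
    """Pack sequence lengths into bins of at most `max_tokens` tokens
    using first-fit-decreasing (FFD)."""
    bins: list[list[int]] = []
    totals: list[int] = []
    for length in sorted(lengths, reverse=True):
        for i, total in enumerate(totals):
            if total + length <= max_tokens:
                bins[i].append(length)
                totals[i] += length
                break
        else:
            bins.append([length])
            totals.append(length)
    return bins


def test_ffd_respects_token_budget():
    bins = pack_ffd([900, 600, 500, 400, 100], max_tokens=1024)
    # Every packed batch stays within the token budget...
    assert all(sum(b) <= 1024 for b in bins)
    # ...and no sequence is dropped or duplicated.
    assert sorted(l for b in bins for l in b) == [100, 400, 500, 600, 900]
```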
Logging
We currently have no standard practice for logging. We regularly use print statements rather than a logger with appropriate logging levels.
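A minimal sketch of the direction, using the standard library `logging` module (the function and message below are placeholders):

```python
import logging

# One logger per module, configured once at the entrypoint.
logger = logging.getLogger(__name__)

def save_checkpoint(path: str) -> None:
    # INFO/DEBUG/WARNING levels replace ad-hoc print() calls,
    # so verbosity can be controlled from a single place.
    logger.info("Saving checkpoint to %s", path)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    save_checkpoint("/tmp/ckpt")
```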
Instrumentation
Similarly, we don't instrument the batch / epoch timings or track artifact generation with a tool like TensorBoard or Weights & Biases (W&B).
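As a hedged sketch, timing instrumentation with TensorBoard (via PyTorch's bundled `SummaryWriter`; the tags, log directory, and fake training step are placeholders) could look like:

```python
import time
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")  # placeholder directory

for epoch in range(2):
    epoch_start = time.perf_counter()
    for step in range(10):
        batch_start = time.perf_counter()
        loss = 1.0 / (step + 1)  # stand-in for the real training step
        global_step = epoch * 10 + step
        writer.add_scalar("train/loss", loss, global_step)
        writer.add_scalar("time/batch_seconds", time.perf_counter() - batch_start, global_step)
    writer.add_scalar("time/epoch_seconds", time.perf_counter() - epoch_start, epoch)

writer.close()
```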
Documentation
The success of a project in growing and onboarding newcomers is usually docs-dependent. We currently have minimal docs covering training rationale, interesting optimizations, or longer-term objectives, and we don't currently link to the resources that already exist (e.g. the padding-free Transformers blog post).
Code quality
We can get some easy wins by enforcing some CI/CD requirements on the following subjects:
Strict static type checking enforcement
Currently, we don't have strict requirements for static type checking with mypy. Enforcing them can help tremendously when working with deeply nested, complicated projects (a sketch of what strict checking catches follows this list).
Docstrings
A large number of functions and modules in the project don't have docstrings explaining their use.
Sphinx docs
Adopting a style guide
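Returning to the type-checking item above: as a small, illustrative example of what strict checking buys us, running `mypy --strict` against the snippet below flags the first function for its missing annotations, while the fully annotated version passes (the function names are made up for illustration):

```python
from typing import Optional

# Rejected under `mypy --strict`: parameters and return type are unannotated.
def load_config(path):
    return {"path": path}

# Accepted: fully annotated, with an explicit Optional default.
def load_config_strict(path: str, overrides: Optional[dict[str, str]] = None) -> dict[str, str]:
    config = {"path": path}
    if overrides is not None:
        config.update(overrides)
    return config
```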
Distributed framework streamlining
We use HF Accelerate to wrap the FSDP and DeepSpeed training frameworks. The original plan was to support only FSDP going forward, which would allow us to use FSDP directly rather than through Accelerate.
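For reference, a minimal sketch of wrapping a model with FSDP directly, without going through Accelerate (the model is a placeholder, and this assumes a distributed launcher such as torchrun has set up the usual environment variables):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# In practice this script is launched via torchrun, which provides the
# rank/world-size environment variables that init_process_group expects.
dist.init_process_group(backend="nccl")

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder for the real model

# Wrapping the model directly, instead of going through accelerate.Accelerator,
# removes one layer of indirection when FSDP is the only supported backend.
fsdp_model = FSDP(model)

optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```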
CPU, MPS, N>=1 GPU training consolidation
Currently, training loops exist both in instructlab/instructlab and in this repo. Bringing all training-related paths under one repo would be helpful, especially regarding testing.