Adds support for knowledge distillation #380
base: main
Conversation
Many different forms of model training exist. One popular form is knowledge distillation, where a student model learns the output distributions from a teacher model. This commit introduces support for knowledge distillation in the training library. This commit also exposes the `weight_decay` hyperparameter, which is often used to help deep learning models generalize. Lastly, this commit changes the usage from `torch.distributed` to just `dist`, as it is a common module alias used throughout the codebase. Signed-off-by: Oleg S <[email protected]>
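For context, a minimal sketch of the standard temperature-scaled distillation loss (not necessarily the exact formulation in this PR); the `temperature` and `alpha` names mirror the config fields introduced below, and the blending of soft and hard losses via `alpha` is an assumption:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_loss, temperature=1.0, alpha=1.0):
    # Soften both distributions with the temperature, then compare them with KL divergence.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature**2
    # alpha blends the distillation term with the usual cross-entropy (hard-label) loss.
    return alpha * kd_loss + (1.0 - alpha) * hard_loss
```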
temperature: float = Field(1.0, gt=0.0)
alpha: float = Field(1.0, le=1.0, ge=0.0)
teacher_path: str
If possible, I'd love to standardize on using `pathlib.Path` rather than `str` paths.
@JamesKunstle I see your point. Would it make sense for it to be a `Path` when it can also take an HF reference? I understand that references can technically still be paths, but to a consumer reading this it might sound like only local models are accepted. Would `str | Path` be satisfactory?
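For illustration, the proposed annotation might look like the sketch below; the field names come from the diff above, and whether HF references should stay as `str` or be coerced to `Path` is left open (the `|` union syntax needs Python 3.10+):

```python
from pathlib import Path

from pydantic import BaseModel, Field


class DistillationConfig(BaseModel):
    temperature: float = Field(1.0, gt=0.0)
    alpha: float = Field(1.0, le=1.0, ge=0.0)
    # Accepts either a local checkpoint directory or a Hugging Face model reference.
    teacher_path: str | Path
```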
teacher_model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.bfloat16
).to(device)
model_dev = next(teacher_model.parameters()).device
If you're calling `.to(device)` just above, could you make a note of why you also need to confirm the device placement below?
Yes of course.
src/instructlab/training/config.py
Outdated
weight_decay: float = Field(0.0, ge=0.0)

# settings for knowledge distillation
distillation_options: Optional[DistillationConfig] = None
I've seen that the `Optional[DistillationConfig]` syntax is being replaced by `DistillationConfig | None` in recent Python code, now that the union operator has been added to the language. This is a nit, not required to change.
Let's use the proposed syntax to be more consistent with how Python expects optionals going forward.
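A small sketch of the two equivalent spellings side by side (the enclosing `TrainingOptions` model is illustrative, not a name from this repo):

```python
from typing import Optional

from pydantic import BaseModel


class DistillationConfig(BaseModel):
    temperature: float = 1.0


class TrainingOptions(BaseModel):
    # Older spelling, still valid:
    legacy_options: Optional[DistillationConfig] = None
    # PEP 604 spelling, preferred in newer code (Python 3.10+):
    distillation_options: DistillationConfig | None = None
```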
loss = None
if args.distill:
    # teacher_model should always be provided when `args.distill` is enabled
    if TYPE_CHECKING:
I think this is supposed to be a runtime check but TYPE_CHECKING is always False at runtime.
https://docs.python.org/3/library/typing.html#typing.TYPE_CHECKING
I think we should fail much earlier if distillation is set but no teacher_model is provided, like before we do any data preprocessing or fire up the GPUs.
Yeah, it is. I believe I had type-checking errors here though, with the checker not knowing that `teacher_model` is properly set.
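For illustration, one pattern that gives both the early runtime failure and the type narrowing the checker needs (the helper name and signature here are hypothetical):

```python
from typing import Optional

import torch
from transformers import PreTrainedModel


def teacher_forward(
    teacher_model: Optional[PreTrainedModel],
    input_ids: torch.Tensor,
    distill: bool,
) -> Optional[torch.Tensor]:
    """Sketch: a runtime guard that also narrows the type for static checkers."""
    if not distill:
        return None
    if teacher_model is None:
        # An explicit exception replaces the `if TYPE_CHECKING:` block (which is
        # always False at runtime) and, unlike a bare assert, survives `python -O`.
        raise ValueError("teacher model cannot be None when `distill` is enabled")
    # After the `is None` check, mypy/pyright narrow `teacher_model` to
    # `PreTrainedModel`, so no cast or `type: ignore` is needed below.
    with torch.no_grad():
        output = teacher_model(input_ids=input_ids)
    return output.logits
```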
), "teacher model cannot be None when `distill` is enabled" | ||
|
||
with torch.no_grad(): | ||
teacher_output: CausalLMOutput = teacher_model( |
You turn off `requires_grad` on all the params in the teacher_model. You could just do this instead; I think it gives the same output.
So they're not fully the same. `requires_grad` ensures that a tensor never needs its gradient computed when `.backward()` is called at some point in the computation graph, and therefore doesn't need to store any additional data for it. Whereas `torch.no_grad` ensures that the tensor computations within the given context do not count towards the gradient calculation during backprop.

The reason we're doing both is so that:

- `requires_grad=False` --> The teacher model doesn't need to get updated, so we don't need to store any additional variables.
- `with torch.no_grad()` --> If any other tensors happen to participate in the computation for whatever reason (say, for example, someone updates this and includes them), their gradients are also not impacted by participating in this calculation.

Having this as an explicit context also allows us to communicate to other developers in the future that this is not intended to participate in backprop, which comes at essentially no extra cost.

You can probably get away without using `torch.no_grad` here, but it's just good practice to do both.
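As a concrete sketch of combining both mechanisms (the model reference string is a placeholder, not a real checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM

# "teacher-model-or-path" is a placeholder reference.
teacher_model = AutoModelForCausalLM.from_pretrained(
    "teacher-model-or-path", torch_dtype=torch.bfloat16
)
teacher_model.eval()

# 1) requires_grad=False: the teacher's parameters never store gradients,
#    so autograd keeps no extra state for them.
for param in teacher_model.parameters():
    param.requires_grad = False

# 2) torch.no_grad(): nothing computed inside this context is recorded in the
#    autograd graph, even if some other tensor with requires_grad=True later
#    slips into the computation.
dummy_input = torch.randint(0, teacher_model.config.vocab_size, (1, 8))
with torch.no_grad():
    teacher_output = teacher_model(input_ids=dummy_input)
```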
src/instructlab/training/main_ds.py
Outdated
else:
    loss = output.loss

assert loss is not None, "loss cannot be equal to None!"
`assert`s are typically not preferred in comparison to runtime exceptions.
Suggested change:
- assert loss is not None, "loss cannot be equal to None!"
+ if loss is None:
+     raise ValueError("loss was None during distillation training. Something unrecoverable went wrong.")
Mostly because they can be removed with `-O` when the interpreter is invoked, but we want to check for non-null all the time.
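For illustration, the difference is visible with a tiny script run once as `python demo.py` and once as `python -O demo.py` (the file name is just for the example):

```python
# demo.py
loss = None  # stand-in for a branch that failed to set the loss

# Compiled away entirely under -O, so the check silently disappears:
assert loss is not None, "loss cannot be equal to None!"

# Never stripped; always raises when loss is None:
if loss is None:
    raise ValueError("loss was None. Something unrecoverable went wrong.")
```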
Sure I can change this. I was using them as scaffolding when writing this.
I'm gonna make it not be specific to distillation training though, since it's more about how we branch out. I suspect that as we add other loss calculations (contrastive loss, preference tuning loss, etc.), we will start out by setting it to `None` and having this final check to ensure it was set to something.
@@ -511,6 +609,9 @@ def main(args):
    # Third Party
    import yaml

    if args.distill and not args.teacher_model_name_or_path:
Yeah this early check seems right.
Sweet :party-cat:
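For reference, a minimal sketch of what such an early guard could look like before any data preprocessing or GPU setup (the helper name is hypothetical; the argument names follow the diff above):

```python
def validate_distillation_args(args) -> None:
    """Fail fast when distillation is requested without a teacher model."""
    if args.distill and not args.teacher_model_name_or_path:
        raise ValueError(
            "`distill` was enabled but no `teacher_model_name_or_path` was provided."
        )
```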
Nearly ready to go. Just a couple of API questions.
Signed-off-by: Oleg S <[email protected]>