System Info

- `Accelerate` version: 1.2.0.dev0
- Platform: Linux-5.4.0-186-generic-x86_64-with-glibc2.31
- Python version: 3.12.0

Tasks

- One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
The following snippet minimally reproduces the behaviour that I am observing in my model:
```python
import torch.nn as nn

from accelerate.utils.modeling import get_module_size_with_ties

# N must be greater than 10 for this anomalous behaviour to appear.
# hidden_size and the module sizes are set to arbitrary values.
N = 15
hidden_size = 2
module_size = 1000
tied_params = [f"model.layers.{i}.weight" for i in range(1, N)]
module_sizes = dict(
    **{f"model.layers.{i}": module_size for i in range(N)},
    **{f"model.layers.{i}.weight": 10 for i in range(N)},
)
modules_to_treat = [(f"model.layers.{i}", nn.Linear(hidden_size, hidden_size)) for i in range(1, N)]
module_size_with_ties, tied_module_names, tied_modules = get_module_size_with_ties(
    tied_params, module_size, module_sizes, modules_to_treat
)
print(tied_module_names)
```
The output is:

```
tied_module_names = ['model.layers.1', 'model.layers.2', 'model.layers.3', 'model.layers.4', 'model.layers.5', 'model.layers.6', 'model.layers.7', 'model.layers.8', 'model.layers.9', 'model.layers.1', 'model.layers.1', 'model.layers.1', 'model.layers.1', 'model.layers.1']
```

Note that the last five elements of the list are all `'model.layers.1'`.
Expected behavior
When a model has child modules contained in a `ModuleList`, the names of these submodules include a number indicating their position in the container, e.g. `"model.layers.0"`. However, the module to which a tied parameter belongs is identified by checking that the module name is contained in the parameter name (see https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/modeling.py#L1186):

```python
tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if n in tied_param][0]
```
However, if, for instance, `n = "model.layers.1"` and `tied_param = "model.layers.11.weight"`, the check evaluates to `True` even though the parameter does not belong to the given module. Consequently, the `tied_modules` list returned by `get_module_size_with_ties` only contains the modules from `model.layers.1` up to `model.layers.9`, as illustrated in the reproduction snippet. As a result, not all of the modules in `modules_to_treat` that hold a tied parameter are placed on the appropriate device, which causes a crash when they are processed afterwards in `infer_auto_device_map`, since the list `[i for i, (n, _) in enumerate(modules_to_treat) if n in tied_param]` is then empty.
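To make the failure mode concrete, the check can be exercised directly on the names from the example above (a minimal sketch; Python's `in` on strings tests substring containment):

```python
# Substring containment, as performed by the original check `n in tied_param`:
n = "model.layers.1"
print(n in "model.layers.1.weight")   # True  -- genuine match
print(n in "model.layers.11.weight")  # True  -- false positive: layer 11 is not layer 1
print(n in "model.layers.2.weight")   # False
```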
Is it possible that changing the check to `(n + ".") in tied_param` would suffice to tackle this issue?

```python
tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if (n + ".") in tied_param][0]
```
By doing so, it would be guaranteed that the parameter belongs to the module. I tested this solution with the snippet above, and the result is:

```
tied_module_names = ['model.layers.1', 'model.layers.2', 'model.layers.3', 'model.layers.4', 'model.layers.5', 'model.layers.6', 'model.layers.7', 'model.layers.8', 'model.layers.9', 'model.layers.10', 'model.layers.11', 'model.layers.12', 'model.layers.13', 'model.layers.14']
```
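For comparison, the proposed check applied to the same names (again a minimal sketch):

```python
n = "model.layers.1"
print((n + ".") in "model.layers.1.weight")   # True  -- genuine match kept
print((n + ".") in "model.layers.11.weight")  # False -- false positive removed
```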
Thanks for reporting this issue. Indeed, what you show looks pretty much like a bug and the solution you propose looks sound. Would you be interested in creating a PR to fix this, also including a unit test based on your example?
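A unit test based on the reproduction above might look something like the following (a sketch only: the test name is hypothetical, and it assumes `get_module_size_with_ties` keeps the signature used in the snippet):

```python
import torch.nn as nn

from accelerate.utils.modeling import get_module_size_with_ties


def test_get_module_size_with_ties_ignores_name_prefix_collisions():
    # Hypothetical test based on the reproduction above: with more than
    # 10 layers, "model.layers.1" must not absorb layers 10 through 14.
    N = 15
    tied_params = [f"model.layers.{i}.weight" for i in range(1, N)]
    module_sizes = dict(
        **{f"model.layers.{i}": 1000 for i in range(N)},
        **{f"model.layers.{i}.weight": 10 for i in range(N)},
    )
    modules_to_treat = [(f"model.layers.{i}", nn.Linear(2, 2)) for i in range(1, N)]
    _, tied_module_names, _ = get_module_size_with_ties(
        tied_params, 1000, module_sizes, modules_to_treat
    )
    assert tied_module_names == [f"model.layers.{i}" for i in range(1, N)]
```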