Summary
When running in parallel on four V100 GPUs, the memory used by the program keeps increasing. As a result, the run gets through the minimization but terminates after only about 3000 steps of the relaxation phase.
DeePMD-kit Version
DeePMD-kit v3.0.1
Backend and its version
PyTorch
Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
Python 3.12
Details
ERROR on proc 3: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/torch/deepmd/pt/model/model/transform_output.py", line 161, in forward_lower
    vvi = split_vv1[_45]
    svvi = split_svv1[_45]
    _46 = _37(vvi, svvi, coord_ext, do_virial, do_atomic_virial, create_graph, )
          ~~~ <--- HERE
    ffi, aviri, = _46
    ffi0 = torch.unsqueeze(ffi, -2)
  File "code/torch/deepmd/pt/model/model/transform_output.py", line 196, in task_deriv_one
    faked_grad = torch.ones_like(energy)
    lst = annotate(List[Optional[Tensor]], [faked_grad])
    _53 = torch.autograd.grad([energy], [extended_coord], lst, True, create_graph)
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    extended_force = _53[0]
    if torch.isnot(extended_force, None):

Traceback of TorchScript, original code (most recent call last):
  File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/transform_output.py", line 128, in forward_lower
    for vvi, svvi in zip(split_vv1, split_svv1):
        # nf x nloc x 3, nf x nloc x 9
        ffi, aviri = task_deriv_one(
                     ~~~~~~~~~~~~~~ <--- HERE
            vvi,
            svvi,
  File "/root/deepmd-kit/lib/python3.12/site-packages/deepmd/pt/model/model/transform_output.py", line 78, in task_deriv_one
    faked_grad = torch.ones_like(energy)
    lst = torch.jit.annotate(list[Optional[torch.Tensor]], [faked_grad])
    extended_force = torch.autograd.grad(
                     ~~~~~~~~~~~~~~~~~~~ <--- HERE
        [energy],
        [extended_coord],
RuntimeError: CUDA out of memory. Tried to allocate 4.36 GiB. GPU 3 has a total capacity of 31.74 GiB of which 2.55 GiB is free. Including non-PyTorch memory, this process has 29.18 GiB memory in use. Of the allocated memory 21.53 GiB is allocated by PyTorch, and 6.53 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) (/home/conda/feedstock_root/build_artifacts/deepmd-kit_1735001361510/work/source/lmp/pair_deepmd.cpp:220)
Last command: run 10000
LAMMPS assigns atoms to spatial subdomains (one per MPI rank, and here one per GPU) according to their coordinates. As the simulation proceeds, the memory consumed on a given GPU card grows when more atoms migrate into that rank's subdomain, so an uneven atom distribution can exhaust one card while the others still have headroom; a load-balancing sketch is shown below.
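If that imbalance is what drives the OOM, LAMMPS's load-balancing commands can keep per-rank atom counts (and therefore per-GPU memory) closer to even. A minimal sketch of an input-script fragment follows, assuming a standard domain-decomposed run; the fix ID "lb", the 1.1 imbalance threshold, and the 1000-step interval are illustrative choices, not tuned recommendations:

balance 1.1 shift xyz 10 1.1                   # rebalance subdomain boundaries once, before the run
fix lb all balance 1000 1.1 shift xyz 10 1.1   # re-check and rebalance every 1000 steps during the run
run 10000

Independently of balancing, the error message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the environment before launching LAMMPS, which may reduce allocator fragmentation on the PyTorch side.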