Why can't I save the model? #97

Open
txye opened this issue Nov 14, 2023 · 4 comments

Comments

@txye

txye commented Nov 14, 2023

```
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
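For context: the `save_model` the message points to is `safetensors.torch.save_model`, which deduplicates shared tensors before writing, unlike `save_file`, which rejects them. A minimal sketch of the difference, with a toy module standing in for T5's tied embeddings (names here are illustrative, not from this repo):

```python
import torch
from safetensors.torch import save_file, save_model

class Tied(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = torch.nn.Embedding(10, 4)
        self.embed_tokens = self.shared  # same storage, like T5's shared/embed_tokens

m = Tied()
# save_file(m.state_dict(), "m.safetensors")  # raises the RuntimeError above
save_model(m, "m.safetensors")                # keeps a single copy of the tied weight
```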

@nprasanthi7

```
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.encoder.embed_tokens.weight', '0.auto_model.shared.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```

Could you please help me resolve this?

@hongjin-su
Collaborator

Hi, thanks a lot for your interest in the INSTRUCTOR model!

Could you provide a short script for me to reproduce the error?
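For later readers, a sketch that should reproduce the failure outside the Trainer. It mirrors what `Trainer._save` does (pass the wrapper's flat state dict straight to safetensors); `sentence-transformers/gtr-t5-base` is the model used in the logs below, and the output filename is a placeholder:

```python
from safetensors.torch import save_file
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/gtr-t5-base")
# T5 ties shared.weight to encoder.embed_tokens.weight, so the flat
# state dict contains two keys backed by the same storage:
save_file(model.state_dict(), "model.safetensors")  # raises the RuntimeError
```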

@tush05tgsingh

I am getting the same error and don't know how to solve it. @hongjin-su, I hope you can help me with this:

```
Traceback (most recent call last):
  File "/ClusterLLM/perspective/2_finetune/finetune.py", line 617, in <module>
    main()
  File "/ClusterLLM/perspective/2_finetune/finetune.py", line 598, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2499, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 3016, in save_model
    self._save(output_dir)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 3083, in _save
    safetensors.torch.save_file(
  File ".conda/envs/696ds/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File ".conda/envs/696ds/lib/python3.9/site-packages/safetensors/torch.py", line 477, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
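The failure happens when `Trainer._save` hands the wrapper's state dict to `safetensors.torch.save_file`. One workaround (a sketch, assuming transformers >= 4.35, where `save_safetensors` defaults to `True`; not a confirmed fix from the maintainers) is to fall back to `torch.save`-style checkpoints, which tolerate tied tensors:

```python
# With save_safetensors=False the Trainer writes pytorch_model.bin
# checkpoints instead of model.safetensors; torch.save accepts tied
# tensors like T5's shared/embed_tokens weights.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    save_safetensors=False,
)
```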

@deejayosamu

I have the same issue:
```
File "/home/gaya/group2/instructor-embedding-1/train.py", line 586, in main
trainer.train(resume_from_checkpoint=checkpoint)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3007, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3097, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3730, in save_model
self._save(output_dir)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3828, in _save
safetensors.torch.save_file(
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 286, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [01:25<00:00, 1.26it/s]
(group2-re) (base) gaya@aicoss-PowerEdge-T640:~/group2/instructor-embedding-1$ CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name_or_path sentence-transformers/gtr-t5-base --output_dir output --cache_dir cache --max_source_length 512 --num_train_epochs 1 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
one batch in task 303 is skipped
one batch in task 304 is skipped
one batch in task 305 is skipped
one batch in task 306 is skipped
one batch in task 307 is skipped
one batch in task 309 is skipped
one batch in task 310 is skipped
one batch in task 311 is skipped
one batch in task 312 is skipped
one batch in task 313 is skipped
one batch in task 314 is skipped
one batch in task 315 is skipped
one batch in task 316 is skipped
one batch in task 317 is skipped
one batch in task 318 is skipped
one batch in task 319 is skipped
one batch in task 320 is skipped
one batch in task 322 is skipped
one batch in task 323 is skipped
one batch in task 324 is skipped
one batch in task 325 is skipped
one batch in task 326 is skipped
one batch in task 328 is skipped
one batch in task 329 is skipped
one batch in task 1 is skipped
There are 856 pairs to train in total.
real_name_or_path:sentence-transformers/gtr-t5-base, model_args.cache_dir: cache
Using Hugging Face!!
Running tokenizer on train dataset: 100%|████████████████████████████████████████████████████████████████| 856/856 [00:02<00:00, 413.62 examples/s]
/home/gaya/group2/instructor-embedding-1/train.py:571: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for InstructorTrainer.__init__. Use processing_class instead.
trainer = InstructorTrainer(
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [01:18<00:00, 1.43it/s]
Traceback (most recent call last):
  File "/home/gaya/group2/instructor-embedding-1/train.py", line 602, in <module>
    main()
  File "/home/gaya/group2/instructor-embedding-1/train.py", line 586, in main
    trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3007, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3097, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3730, in save_model
    self._save(output_dir)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3828, in _save
    safetensors.torch.save_file(
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
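Another sketch of a possible workaround, if you'd rather keep safetensors checkpoints: break the tie named in the error before saving. The attribute path below comes straight from the error's tensor names (e.g. `0.auto_model.shared.weight`); whether untying is acceptable for your fine-tuning setup is an assumption to verify.

```python
import torch

# model is the SentenceTransformer being trained; model[0].auto_model is the
# underlying T5 model (path inferred from the error's tensor names). Giving
# the encoder embedding its own copy of the shared weight makes the two
# tensors distinct, so safetensors.torch.save_file no longer raises.
auto_model = model[0].auto_model
auto_model.encoder.embed_tokens.weight = torch.nn.Parameter(
    auto_model.shared.weight.detach().clone()
)
```

Note the trade-off: once untied, the two copies can drift apart during further training.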
