Why can't I save the model? #97

Open
txye opened this issue Nov 14, 2023 · 4 comments

Comments

@txye

txye commented Nov 14, 2023

```
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
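For context: the `save_model` the message points to is `safetensors.torch.save_model`, which deduplicates shared tensors before writing, unlike `save_file`, which rejects them. A minimal sketch of the difference, with a toy module standing in for T5's tied embeddings (names here are illustrative, not from this repo):

```python
import torch
from safetensors.torch import save_file, save_model

class Tied(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = torch.nn.Embedding(10, 4)
        self.embed_tokens = self.shared  # same storage, like T5's shared/embed_tokens

m = Tied()
# save_file(m.state_dict(), "m.safetensors")  # raises the RuntimeError above
save_model(m, "m.safetensors")                # keeps a single copy of the tied weight
```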

@nprasanthi7

```
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.encoder.embed_tokens.weight', '0.auto_model.shared.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```

Could you please help me resolve this?

@hongjin-su
Collaborator

Hi, thanks a lot for your interest in the INSTRUCTOR model!

Could you provide a short script for me to reproduce the error?
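For later readers, a sketch that should reproduce the failure outside the Trainer. It mirrors what `Trainer._save` does (pass the wrapper's flat state dict straight to safetensors); `sentence-transformers/gtr-t5-base` is the model used in the logs below, and the output filename is a placeholder:

```python
from safetensors.torch import save_file
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/gtr-t5-base")
# T5 ties shared.weight to encoder.embed_tokens.weight, so the flat
# state dict contains two keys backed by the same storage:
save_file(model.state_dict(), "model.safetensors")  # raises the RuntimeError
```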

@tush05tgsingh

I am getting the same error and don't know how to solve it. @hongjin-su, I hope you can help me with this:

```
Traceback (most recent call last):
  File "/ClusterLLM/perspective/2_finetune/finetune.py", line 617, in <module>
    main()
  File "/ClusterLLM/perspective/2_finetune/finetune.py", line 598, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2029, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2423, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 2499, in _save_checkpoint
    self.save_model(staging_output_dir, _internal_call=True)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 3016, in save_model
    self._save(output_dir)
  File ".conda/envs/696ds/lib/python3.9/site-packages/transformers/trainer.py", line 3083, in _save
    safetensors.torch.save_file(
  File ".conda/envs/696ds/lib/python3.9/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File ".conda/envs/696ds/lib/python3.9/site-packages/safetensors/torch.py", line 477, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
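The failure happens when `Trainer._save` hands the wrapper's state dict to `safetensors.torch.save_file`. One workaround (a sketch, assuming transformers >= 4.35, where `save_safetensors` defaults to `True`; not a confirmed fix from the maintainers) is to fall back to `torch.save`-style checkpoints, which tolerate tied tensors:

```python
# With save_safetensors=False the Trainer writes pytorch_model.bin
# checkpoints instead of model.safetensors; torch.save accepts tied
# tensors like T5's shared/embed_tokens weights.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    save_safetensors=False,
)
```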

@deejayosamu

I have the same issue:
```
File "/home/gaya/group2/instructor-embedding-1/train.py", line 586, in main
trainer.train(resume_from_checkpoint=checkpoint)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3007, in _maybe_log_save_evaluate
self._save_checkpoint(model, trial, metrics=metrics)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3097, in _save_checkpoint
self.save_model(output_dir, _internal_call=True)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3730, in save_model
self._save(output_dir)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3828, in _save
safetensors.torch.save_file(
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 286, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [01:25<00:00, 1.26it/s]
(group2-re) (base) gaya@aicoss-PowerEdge-T640:~/group2/instructor-embedding-1$ CUDA_VISIBLE_DEVICES=0 python3 train.py --model_name_or_path sentence-transformers/gtr-t5-base --output_dir output --cache_dir cache --max_source_length 512 --num_train_epochs 1 --save_steps 500 --cl_temperature 0.1 --warmup_ratio 0.1 --learning_rate 2e-5 --overwrite_output_dir
one batch in task 303 is skipped
one batch in task 304 is skipped
one batch in task 305 is skipped
one batch in task 306 is skipped
one batch in task 307 is skipped
one batch in task 309 is skipped
one batch in task 310 is skipped
one batch in task 311 is skipped
one batch in task 312 is skipped
one batch in task 313 is skipped
one batch in task 314 is skipped
one batch in task 315 is skipped
one batch in task 316 is skipped
one batch in task 317 is skipped
one batch in task 318 is skipped
one batch in task 319 is skipped
one batch in task 320 is skipped
one batch in task 322 is skipped
one batch in task 323 is skipped
one batch in task 324 is skipped
one batch in task 325 is skipped
one batch in task 326 is skipped
one batch in task 328 is skipped
one batch in task 329 is skipped
one batch in task 1 is skipped
There are 856 pairs to train in total.
real_name_or_path:sentence-transformers/gtr-t5-base, model_args.cache_dir: cache
Using Hugging Face!!
Running tokenizer on train dataset: 100%|████████████████████████████████████████████████████████████████| 856/856 [00:02<00:00, 413.62 examples/s]
/home/gaya/group2/instructor-embedding-1/train.py:571: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for InstructorTrainer.__init__. Use processing_class instead.
trainer = InstructorTrainer(
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [01:18<00:00, 1.43it/s]
Traceback (most recent call last):
  File "/home/gaya/group2/instructor-embedding-1/train.py", line 602, in <module>
    main()
  File "/home/gaya/group2/instructor-embedding-1/train.py", line 586, in main
    trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3007, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3097, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3730, in save_model
    self._save(output_dir)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/transformers/trainer.py", line 3828, in _save
    safetensors.torch.save_file(
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/gaya/anaconda3/envs/group2-re/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
    raise RuntimeError(
RuntimeError:
Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'0.auto_model.shared.weight', '0.auto_model.encoder.embed_tokens.weight'}].
A potential way to correctly save your model is to use save_model.
More information at https://huggingface.co/docs/safetensors/torch_shared_tensors
```
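Another sketch of a possible workaround, if you'd rather keep safetensors checkpoints: break the tie named in the error before saving. The attribute path below comes straight from the error's tensor names (e.g. `0.auto_model.shared.weight`); whether untying is acceptable for your fine-tuning setup is an assumption to verify.

```python
import torch

# model is the SentenceTransformer being trained; model[0].auto_model is the
# underlying T5 model (path inferred from the error's tensor names). Giving
# the encoder embedding its own copy of the shared weight makes the two
# tensors distinct, so safetensors.torch.save_file no longer raises.
auto_model = model[0].auto_model
auto_model.encoder.embed_tokens.weight = torch.nn.Parameter(
    auto_model.shared.weight.detach().clone()
)
```

Note the trade-off: once untied, the two copies can drift apart during further training.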
