Hi all, I'm trying to run inference with the galactica-6.7B model, but errors keep popping up after a few examples and I'm not sure what to do. Can anyone take a look and tell me what's going wrong?
I have been using the following code; the error comes after it.
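A sketch of the inference loop, reconstructed from the frames visible in the traceback below: the variable names, the .to("cuda") call, max_new_tokens=128, the decode step, and the alpaca_finetuned_examples list all appear there, while the model ID, the loading calls, and the prompts list are assumptions.

```python
import torch
from transformers import AutoTokenizer, OPTForCausalLM

transformers_tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
transformers_model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-6.7b", torch_dtype=torch.float16
).to("cuda")

prompts = ["..."]  # placeholder: the actual evaluation prompts are not shown in the thread
alpaca_finetuned_examples = []

for prompt in prompts:
    input_text = prompt
    input_ids = transformers_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

    outputs = transformers_model.generate(input_ids, max_new_tokens=128)
    decoded_output = transformers_tokenizer.decode(outputs[0]).strip()

    alpaca_finetuned_examples.append(decoded_output)
```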
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[24], line 13
10 input_text = prompt
11 input_ids = transformers_tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
---> 13 outputs = transformers_model.generate(input_ids, max_new_tokens=128)
14 decoded_output = transformers_tokenizer.decode(outputs[0]).strip()
16 alpaca_finetuned_examples.append(decoded_output)
File ~/third/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/third/lib/python3.11/site-packages/transformers/generation/utils.py:1518, in GenerationMixin.generate(self, inputs, max_length, min_length, do_sample, early_stopping, num_beams, temperature, penalty_alpha, top_k, top_p, typical_p, repetition_penalty, bad_words_ids, force_words_ids, bos_token_id, pad_token_id, eos_token_id, length_penalty, no_repeat_ngram_size, encoder_no_repeat_ngram_size, num_return_sequences, max_time, max_new_tokens, decoder_start_token_id, use_cache, num_beam_groups, diversity_penalty, prefix_allowed_tokens_fn, logits_processor, renormalize_logits, stopping_criteria, constraints, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, forced_bos_token_id, forced_eos_token_id, remove_invalid_values, synced_gpus, exponential_decay_length_penalty, suppress_tokens, begin_suppress_tokens, forced_decoder_ids, **model_kwargs)
1513 raise ValueError(
1514 f"num_return_sequences has to be 1, but is {num_return_sequences} when doing greedy search."
1515 )
1517 # 10. run greedy search
-> 1518 return self.greedy_search(
1519 input_ids,
1520 logits_processor=logits_processor,
1521 stopping_criteria=stopping_criteria,
1522 pad_token_id=pad_token_id,
1523 eos_token_id=eos_token_id,
1524 output_scores=output_scores,
1525 return_dict_in_generate=return_dict_in_generate,
1526 synced_gpus=synced_gpus,
1527 **model_kwargs,
1528 )
1530 elif is_contrastive_search_gen_mode:
1532 if num_return_sequences > 1:
File ~/third/lib/python3.11/site-packages/transformers/generation/utils.py:2285, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, **model_kwargs)
2282 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2284 # forward pass to get next token
-> 2285 outputs = self(
2286 **model_inputs,
2287 return_dict=True,
2288 output_attentions=output_attentions,
2289 output_hidden_states=output_hidden_states,
2290 )
2292 if synced_gpus and this_peer_finished:
2293 continue # don't waste resources running the code we don't need
File ~/third/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/third/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py:934, in OPTForCausalLM.forward(self, input_ids, attention_mask, head_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
931 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
933 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 934 outputs = self.model.decoder(
935 input_ids=input_ids,
936 attention_mask=attention_mask,
937 head_mask=head_mask,
938 past_key_values=past_key_values,
939 inputs_embeds=inputs_embeds,
940 use_cache=use_cache,
941 output_attentions=output_attentions,
942 output_hidden_states=output_hidden_states,
943 return_dict=return_dict,
944 )
946 logits = self.lm_head(outputs[0]).contiguous()
948 loss = None
File ~/third/lib/python3.11/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/third/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py:640, in OPTDecoder.forward(self, input_ids, attention_mask, head_mask, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
637 attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.bool, device=inputs_embeds.device)
638 pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
--> 640 attention_mask = self._prepare_decoder_attention_mask(
641 attention_mask, input_shape, inputs_embeds, past_key_values_length
642 )
644 if self.project_in is not None:
645 inputs_embeds = self.project_in(inputs_embeds)
File ~/third/lib/python3.11/site-packages/transformers/models/opt/modeling_opt.py:539, in OPTDecoder._prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length)
535 combined_attention_mask = None
536 if input_shape[-1] > 1:
537 combined_attention_mask = _make_causal_mask(
538 input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length
--> 539 ).to(inputs_embeds.device)
541 if attention_mask is not None:
542 # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
543 expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
544 inputs_embeds.device
545 )
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
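Following the hint in the message above is probably the fastest way to pin this down. A minimal sketch, assuming the standard Hugging Face model ID for Galactica (the thread does not show the loading code): rerunning one failing prompt on CPU usually turns the opaque device-side assert into a plain IndexError that names the out-of-range index (for example, a token or position id past the embedding table).

```python
from transformers import AutoTokenizer, OPTForCausalLM

# Reproduce one failing prompt on CPU. After a device-side assert the CUDA context is
# unusable, so a fresh CPU copy (or a fresh process) is the safest place to debug.
cpu_tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-6.7b")
cpu_model = OPTForCausalLM.from_pretrained("facebook/galactica-6.7b")  # fp32, stays on CPU

failing_prompt = "..."  # placeholder: the prompt that triggered the assert
cpu_ids = cpu_tokenizer(failing_prompt, return_tensors="pt").input_ids
print("prompt tokens:", cpu_ids.shape[-1])

# An out-of-range input or position id raises a readable IndexError here instead of
# the asynchronous CUDA assert seen on the GPU.
outputs = cpu_model.generate(cpu_ids, max_new_tokens=128)
print(cpu_tokenizer.decode(outputs[0]).strip())

# For a precise GPU stack trace instead, start the script with synchronous launches:
#   CUDA_LAUNCH_BLOCKING=1 python your_script.py
```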
@mkardas I shortened the prompt this time; it was exceeding the limit before. It ran for a few more iterations and then stopped with the same error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Could there be a problem with the versions of PyTorch, CUDA, or Python?
Is it possible to tell me which versions of the dependencies were used?
You can run python -m torch.utils.collect_env as well as pip list. What's the prompt's length in tokens now? By "ran for a few more iterations" do you mean with the exact same prompt, or with the generations appended to the prompt?
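One way to answer the length question, and to guard against it, is a check along the lines of the sketch below, assuming the transformers_tokenizer and transformers_model objects from the snippet at the top of the thread. The OPT architecture that Galactica uses has a fixed context window (config.max_position_embeddings), and letting prompt plus generated tokens run past it can trigger this kind of device-side assert in the position-embedding lookup.

```python
max_new_tokens = 128
max_positions = transformers_model.config.max_position_embeddings  # read from the config rather than hard-coding
budget = max_positions - max_new_tokens  # leave room for the tokens to be generated

input_ids = transformers_tokenizer(prompt, return_tensors="pt").input_ids
print(f"prompt length: {input_ids.shape[-1]} tokens, budget: {budget}")

if input_ids.shape[-1] > budget:
    input_ids = input_ids[:, -budget:]  # keep only the most recent tokens

outputs = transformers_model.generate(input_ids.to("cuda"), max_new_tokens=max_new_tokens)
```

If the generations are being appended back into later prompts, the length can creep past this budget after a few iterations, which would be consistent with the failure showing up only after several examples.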