
Redo Bug fix for annotation/inference endpoint fails with long sequences as input #762

Open · wants to merge 16 commits into main

Conversation

@HareeshBahuleyan (Contributor) commented on Jan 28, 2025

What's changing

While testing with the Thunderbird dataset, @agpituk found that the BART model fails for long sequences with the error:

IndexError: index out of range in self

As pointed out by @dpoulopoulos, the error comes from a limit on the maximum number of positional embeddings the model can have (token embeddings will always be within range).

  • This could be fixed by setting the allowed model_max_length (the number of input tokens the model sees) for the respective model, e.g. 1024 for BART.
    Edit: Instead of hard-coding this in config_templates, we search for plausible params that are most likely to correspond to model_max_length whenever it is set to the default value of VERY_LARGE_INTEGER (int(1e30)).

  • Additionally, truncation = True needs to be set in the HF pipeline so that input sequences are truncated at this max length (see the sketch after this list).
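
As a sketch for reviewers, the intended behaviour of these two changes looks roughly like the following. This is illustrative only: the helper name resolve_model_max_length and the candidate parameter list are assumptions for this example, not the exact code in the PR.

from transformers import pipeline
from transformers.tokenization_utils_base import VERY_LARGE_INTEGER

# Config attributes that commonly hold the input-length limit; the list used in the PR may differ.
CANDIDATE_PARAMS = ("max_position_embeddings", "n_positions", "max_sequence_length")

def resolve_model_max_length(pipe):
    tokenizer = pipe.tokenizer
    if tokenizer.model_max_length != VERY_LARGE_INTEGER:
        return  # the tokenizer already ships a sensible limit, keep it
    for param in CANDIDATE_PARAMS:
        value = getattr(pipe.model.config, param, None)
        if isinstance(value, int) and value < VERY_LARGE_INTEGER:  # sanity check for reasonable values
            tokenizer.model_max_length = value  # e.g. 1024 for BART
            return
    # otherwise keep the HF default and warn the user

pipe = pipeline("summarization", model="facebook/bart-large-cnn")
resolve_model_max_length(pipe)
long_text = "A very long input document. " * 2000
print(pipe(long_text, truncation=True)[0]["summary_text"])  # truncation avoids the IndexError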

Closes #667

Additional notes for reviewers

See the Colab notebook for a demo showing that these two changes fix the issue for long text input.

This PR also removes max_length (which controls the number of tokens in the generated output) from the configs; it was being hard-coded in default_infer_template and bart_infer_template. After removal, we fall back to the HF defaults provided in the respective model configs.

How to test it

Steps to test the changes:

  1. Upload mock_long_sequences_no_gt.csv (synthetically generated data with no ground truth) as the dataset
  2. Run the annotate endpoint for this dataset (it uses BART by default) with the request body below; a programmatic example follows.
{
  "name": "default annotation with BART",
  "dataset": "<dataset_id>",
  "max_samples": -1,
  "task": "summarization",
  "store_to_dataset": true
}
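
For convenience, the same request can also be sent programmatically. The sketch below is illustrative only: the base URL and the annotate route are placeholders, not necessarily the project's actual paths.

import requests

BASE_URL = "http://localhost:8000"  # placeholder; point this at your deployment
ANNOTATE_ROUTE = "/annotate"        # placeholder; use the actual annotate endpoint path

payload = {
    "name": "default annotation with BART",
    "dataset": "<dataset_id>",  # replace with the id of the uploaded dataset
    "max_samples": -1,
    "task": "summarization",
    "store_to_dataset": True,
}

response = requests.post(f"{BASE_URL}{ANNOTATE_ROUTE}", json=payload, timeout=60)
response.raise_for_status()
print(response.json())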

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required

@HareeshBahuleyan HareeshBahuleyan marked this pull request as ready for review January 30, 2025 09:14
@@ -99,7 +99,7 @@
"use_fast": "{use_fast}",
"trust_remote_code": "{trust_remote_code}",
"torch_dtype": "{torch_dtype}",
"max_length": 500
"max_new_tokens": 500
Contributor:

Since we have this option in HfPipelineConfig, I would change it to:

Suggested change
"max_new_tokens": 500
"max_new_tokens": "{max_new_tokens}"

Contributor Author:

I tried that, but it throws an error since no default value is provided for max_new_tokens in inference_config (unlike use_fast, trust_remote_code, etc., which have default values).
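
To illustrate the point above: a "{max_new_tokens}" placeholder can only be rendered if the config supplies a value for it, which in practice means declaring a default. The snippet below is a hypothetical, pydantic-style stand-in for the real inference config, not the project's actual class.

from pydantic import BaseModel

class InferenceConfigSketch(BaseModel):
    # Hypothetical stand-in: fields with defaults can always be substituted into the template.
    use_fast: bool = True
    trust_remote_code: bool = False
    torch_dtype: str = "auto"
    max_new_tokens: int = 142  # without a default here, "{max_new_tokens}" cannot be filled

template = '"max_new_tokens": "{max_new_tokens}"'
print(template.format(**InferenceConfigSketch().model_dump()))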

"use_fast": "{use_fast}",
"trust_remote_code": "{trust_remote_code}",
"torch_dtype": "{torch_dtype}",
"max_new_tokens": 142
Contributor:

Same here:

Suggested change
"max_new_tokens": 142
"max_new_tokens": "{max_new_tokens}"

Checks various possible max_length parameters, which vary depending on the model architecture.
"""
config = self._pipeline.model.config
logger.info(f"Initial model_max_length {self._pipeline.tokenizer.model_max_length}")
Contributor:

In my opinion, don't say "initial ...". It kind of implies that you will definitely change it. You can also omit this log in favor of the other one, beneath.

Suggested change
logger.info(f"Initial model_max_length {self._pipeline.tokenizer.model_max_length}")
logger.info(f"The maximum number of input tokens is set to {self._pipeline.tokenizer.model_max_length}")

Contributor Author (@HareeshBahuleyan, Jan 31, 2025):

Keeping this message, but re-worded it.
Removing the one below, since both messages would provide the same info and the same value in the case where self._pipeline.tokenizer.model_max_length != VERY_LARGE_INTEGER.

logger.info(f"Initial model_max_length {self._pipeline.tokenizer.model_max_length}")
# If suitable model_max_length is already available, don't override it
if self._pipeline.tokenizer.model_max_length != VERY_LARGE_INTEGER:
logger.info(f"Using model_max_length = {self._pipeline.tokenizer.model_max_length} \
Contributor:

Make it a bit more human readable:

Suggested change
logger.info(f"Using model_max_length = {self._pipeline.tokenizer.model_max_length} \
logger.info(f"Setting the maximum length of input tokens to {self._pipeline.tokenizer.model_max_length} \

Contributor Author:

Removed this logging statement in favor of the above.

isinstance(value, int) and value < VERY_LARGE_INTEGER
): # Sanity check for reasonable values
self._pipeline.tokenizer.model_max_length = value
logger.info(f"Setting model_max_length to {value} based on config.{param}")
Contributor:

Suggested change
logger.info(f"Setting model_max_length to {value} based on config.{param}")
logger.info(f"Setting the maximum length of input tokens to {value} based on the config.{param} attribute.")

Contributor Author:

Updated as suggested

return

# If no suitable parameter is found, warn the user and continue with the HF default
logger.warning(
Contributor:

Inform the user what the default value is, if you have it available. I agree that this should be a warning; however, warnings are currently not included in job logs. I would still leave it as a warning and fix that later. Could you create an issue for it?

Contributor Author (@HareeshBahuleyan, Jan 31, 2025):

I see, I'll add a GitHub issue for that. I've also added the default value being used to the log message.
Edit: #785
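
For illustration only, the resulting warning might look roughly like this (the exact wording and variables in the PR may differ):

import logging

logger = logging.getLogger(__name__)
hf_default_max_length = 1024  # whatever the tokenizer reports as its default

logger.warning(
    "No suitable max-length parameter found in the model config; "
    f"continuing with the HF default model_max_length = {hf_default_max_length}."
)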

Contributor Author (@HareeshBahuleyan):

Hey @dpoulopoulos, I've made the necessary changes and added my responses. Please check and let me know if we are good to go.


Successfully merging this pull request may close these issues.

[BUG]: Annotation with large texts fails
3 participants