
Redo Bug fix for annotation/inference endpoint fails with long sequences as input #762

Open · wants to merge 16 commits into main

Conversation

@HareeshBahuleyan (Contributor) commented on Jan 28, 2025

What's changing

While testing with the Thunderbird dataset, @agpituk found that the BART model fails for long sequences with the error:

IndexError: index out of range in self

As pointed out by @dpoulopoulos, the error comes from a limit on the maximum number of positional embeddings the model can have (token embeddings will always be within range).

  • This could be fixed by setting the allowed model_max_length (the number of input tokens the model sees) for the respective model, e.g. 1024 for BART.
    Edit: Instead of hard-coding this in config_templates, we search for plausible params that are most likely to correspond to model_max_length whenever it is set to the default value of VERY_LARGE_INTEGER (int(1e30)).

  • Additionally, truncation = True needs to be set in the HF pipeline so that input sequences are truncated at this max length (see the sketch after this list).
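
As a sketch for reviewers, the intended behaviour of these two changes looks roughly like the following. This is illustrative only: the helper name resolve_model_max_length and the candidate parameter list are assumptions for this example, not the exact code in the PR.

from transformers import pipeline
from transformers.tokenization_utils_base import VERY_LARGE_INTEGER

# Config attributes that commonly hold the input-length limit; the list used in the PR may differ.
CANDIDATE_PARAMS = ("max_position_embeddings", "n_positions", "max_sequence_length")

def resolve_model_max_length(pipe):
    tokenizer = pipe.tokenizer
    if tokenizer.model_max_length != VERY_LARGE_INTEGER:
        return  # the tokenizer already ships a sensible limit, keep it
    for param in CANDIDATE_PARAMS:
        value = getattr(pipe.model.config, param, None)
        if isinstance(value, int) and value < VERY_LARGE_INTEGER:  # sanity check for reasonable values
            tokenizer.model_max_length = value  # e.g. 1024 for BART
            return
    # otherwise keep the HF default and warn the user

pipe = pipeline("summarization", model="facebook/bart-large-cnn")
resolve_model_max_length(pipe)
long_text = "A very long input document. " * 2000
print(pipe(long_text, truncation=True)[0]["summary_text"])  # truncation avoids the IndexError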

Closes #667

Additional notes for reviewers

See the Colab notebook for a demo showing that these two changes fix the issue for long text input.

This PR also removes max_length (which controls the number of tokens in the generated output) from the configs; it was being hard-coded in default_infer_template and bart_infer_template. After removal, we fall back to the HF defaults provided in the respective model configs.

How to test it

Steps to test the changes:

  1. Upload mock_long_sequences_no_gt.csv (synthetically generated data with no ground truth) as the dataset
  2. Run the annotate endpoint for this dataset (it uses BART by default) with the request body below; a programmatic example follows.
{
  "name": "default annotation with BART",
  "dataset": "<dataset_id>",
  "max_samples": -1,
  "task": "summarization",
  "store_to_dataset": true
}
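
For convenience, the same request can also be sent programmatically. The sketch below is illustrative only: the base URL and the annotate route are placeholders, not necessarily the project's actual paths.

import requests

BASE_URL = "http://localhost:8000"  # placeholder; point this at your deployment
ANNOTATE_ROUTE = "/annotate"        # placeholder; use the actual annotate endpoint path

payload = {
    "name": "default annotation with BART",
    "dataset": "<dataset_id>",  # replace with the id of the uploaded dataset
    "max_samples": -1,
    "task": "summarization",
    "store_to_dataset": True,
}

response = requests.post(f"{BASE_URL}{ANNOTATE_ROUTE}", json=payload, timeout=60)
response.raise_for_status()
print(response.json())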

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required

@HareeshBahuleyan HareeshBahuleyan marked this pull request as ready for review January 30, 2025 09:14
@@ -99,7 +99,7 @@
"use_fast": "{use_fast}",
"trust_remote_code": "{trust_remote_code}",
"torch_dtype": "{torch_dtype}",
"max_length": 500
"max_new_tokens": 500
Contributor:

Since we have this option in HfPipelineConfig, I would change it to:

Suggested change
"max_new_tokens": 500
"max_new_tokens": "{max_new_tokens}"

Contributor Author:

I tried that, but it throws an error since no default value is provided for max_new_tokens in inference_config (unlike use_fast, trust_remote_code, etc., which have default values).
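
To illustrate the point above: a "{max_new_tokens}" placeholder can only be rendered if the config supplies a value for it, which in practice means declaring a default. The snippet below is a hypothetical, pydantic-style stand-in for the real inference config, not the project's actual class.

from pydantic import BaseModel

class InferenceConfigSketch(BaseModel):
    # Hypothetical stand-in: fields with defaults can always be substituted into the template.
    use_fast: bool = True
    trust_remote_code: bool = False
    torch_dtype: str = "auto"
    max_new_tokens: int = 142  # without a default here, "{max_new_tokens}" cannot be filled

template = '"max_new_tokens": "{max_new_tokens}"'
print(template.format(**InferenceConfigSketch().model_dump()))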

"use_fast": "{use_fast}",
"trust_remote_code": "{trust_remote_code}",
"torch_dtype": "{torch_dtype}",
"max_new_tokens": 142
Contributor:

Same here:

Suggested change
"max_new_tokens": 142
"max_new_tokens": "{max_new_tokens}"

Checks various possible max_length parameters, which vary depending on the model architecture.
"""
config = self._pipeline.model.config
logger.info(f"Initial model_max_length {self._pipeline.tokenizer.model_max_length}")
Contributor:

In my opinion, don't say "initial ...". It kind of implies that you will definitely change it. You can also omit this log in favor of the other one, beneath.

Suggested change
logger.info(f"Initial model_max_length {self._pipeline.tokenizer.model_max_length}")
logger.info(f"The maximum number of input tokens is set to {self._pipeline.tokenizer.model_max_length}")

Contributor Author (@HareeshBahuleyan, Jan 31, 2025):

Keeping this message, but re-worded it.
Removing the one below, since both messages would provide the same info and the same value in the case where self._pipeline.tokenizer.model_max_length != VERY_LARGE_INTEGER.

logger.info(f"Initial model_max_length {self._pipeline.tokenizer.model_max_length}")
# If suitable model_max_length is already available, don't override it
if self._pipeline.tokenizer.model_max_length != VERY_LARGE_INTEGER:
logger.info(f"Using model_max_length = {self._pipeline.tokenizer.model_max_length} \
Contributor:

Make it a bit more human readable:

Suggested change
logger.info(f"Using model_max_length = {self._pipeline.tokenizer.model_max_length} \
logger.info(f"Setting the maximum length of input tokens to {self._pipeline.tokenizer.model_max_length} \

Contributor Author:

Removed this logging statement in favor of the above.

isinstance(value, int) and value < VERY_LARGE_INTEGER
): # Sanity check for reasonable values
self._pipeline.tokenizer.model_max_length = value
logger.info(f"Setting model_max_length to {value} based on config.{param}")
Contributor:

Suggested change
logger.info(f"Setting model_max_length to {value} based on config.{param}")
logger.info(f"Setting the maximum length of input tokens to {value} based on the config.{param} attribute.")

Contributor Author:

Updated as suggested

return

# If no suitable parameter is found, warn the user and continue with the HF default
logger.warning(
Contributor:

Inform the user what the default value is, if you have it available. I agree that this should be a warning; however, warnings are currently not included in job logs. I would still leave it as a warning and fix that later. Could you create an issue for it?

Contributor Author (@HareeshBahuleyan, Jan 31, 2025):

I see, I'll add a GitHub issue for that. I've also added the default value being used to the log message.
Edit: #785
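
For illustration only, the resulting warning might look roughly like this (the exact wording and variables in the PR may differ):

import logging

logger = logging.getLogger(__name__)
hf_default_max_length = 1024  # whatever the tokenizer reports as its default

logger.warning(
    "No suitable max-length parameter found in the model config; "
    f"continuing with the HF default model_max_length = {hf_default_max_length}."
)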

Contributor Author (@HareeshBahuleyan):

Hey @dpoulopoulos, I've made the necessary changes and added my responses. Please check and let me know if we are good to go.


Successfully merging this pull request may close these issues.

[BUG]: Annotation with large texts fails
3 participants