
Outlines Transformers requires a ton more VRAM #1392

Open
naston opened this issue Jan 23, 2025 · 4 comments
naston commented Jan 23, 2025

Describe the issue as clearly as possible:

I am running outlines with an ~8B parameter model on an A10 GPU with 24 GB of VRAM. When I run the model myself using transformers, it uses just under 15 GB of this memory. When I use outlines, the model requests over 27 GB, causing execution to fail.

Steps/code to reproduce the bug:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import outlines
from outlines.models import Transformers


model_config = {
    "temperature": 0.1,
    "top_p": 0.5,
    "model_name": "Open-Orca/Mistral-7B-OpenOrca"
}
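# matches zero or more "(relationship<|>...<|>...<|>...<|>1)" records, then "<|COMPLETE|>"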
regex_str = r"(?:\(relationship<\|>[^<|>]+<\|>[^<|>]+<\|>[^<|>]+<\|>1\)\n?)*<\|COMPLETE\|>"
device = "cuda" if torch.cuda.is_available() else "cpu"


text = """Bilbo Baggins celebrates his birthday and leaves the Ring to Frodo, his heir. 
Gandalf (a wizard) suspects it is a Ring of Power; seventeen years later, he confirms it was lost by the Dark Lord Sauron and counsels Frodo to take it away from the Shire. 
Gandalf leaves, promising to return, but fails to do so. Frodo sets out on foot with his cousin Pippin Took and gardener Sam Gamgee. 
They are pursued by Black Riders, but meet some Elves, whose singing to Elbereth wards off the Riders. 
The Hobbits take an evasive shortcut to Bucklebury Ferry, where they meet their friend Merry Brandybuck. 
Merry and Pippin reveal they know about the Ring and insist on joining Frodo on his journey. 
They try to shake off the Black Riders by cutting through the Old Forest. Merry and Pippin are trapped by the malign Old Man Willow, but are rescued by Tom Bombadil. 
Leaving Tom's house, they are caught by a barrow-wight. Frodo, awakening from the barrow-wight's spell, calls Tom Bombadil, who frees them and gives them ancient swords from the wight's hoard. 
The Hobbits reach the village of Bree, where they meet Strider, a Ranger. The innkeeper gives Frodo an old letter from Gandalf, which identifies Strider as a friend. 
Knowing the Black Riders will attempt to seize the Ring, Strider guides the group toward the Elvish sanctuary of Rivendell. 
At Weathertop, they are attacked by five Black Riders. Their leader wounds Frodo with a cursed blade. 
Strider fights them off and treats Frodo with the herb athelas. 
They are joined by the Elf Glorfindel, who rides with Frodo, now deathly ill, towards Rivendell. 
The Black Riders pursue Frodo into the Ford of Bruinen, where they are swept away by flood waters summoned by Elrond.

Frodo recovers in Rivendell under Elrond's care. 
Gandalf informs Frodo that the Black Riders are the Nazgûl, Men enslaved by Rings of Power to serve Sauron. 
The Council of Elrond discusses what to do with the Ring. 
Strider is revealed to be Aragorn, the heir of Isildur who had cut the Ring from Sauron's hand in the Second Age, but claimed it for himself. 
The Ring was lost when Isildur was killed; it passed to Gollum and then to Bilbo. 
Gandalf reports that the chief wizard, Saruman, is a traitor. 
The Council decides that the Ring must be destroyed in the fire of Mount Doom in Mordor, where it was forged. 
Frodo takes this task upon himself. Elrond chooses companions for him: Sam, Merry, and Pippin; Gandalf; the Men Aragorn and Boromir, son of the Steward of Gondor; the Elf Legolas; and the Dwarf Gimli, representing the Free Peoples of the West. 
After a failed attempt to cross the Misty Mountains, the Fellowship risk the path through the Mines of Moria. 
They learn that Balin and his Dwarves were killed by Orcs. They are attacked by Orcs and a Balrog, a fire demon. 
Gandalf confronts the Balrog: both fall into an abyss. The others escape to the Elvish forest of Lothlórien, where the Lady Galadriel tests their loyalty, and gives them magical gifts. 
She allows Frodo and Sam to look into her vision-giving fountain, the Mirror of Galadriel. 
Frodo offers her the Ring: she refuses, knowing that it would master her. 
Galadriel's husband Celeborn gives the Fellowship boats, cloaks, and waybread. 
They travel down the River Anduin. At Amon Hen, Boromir tries to take the Ring, but Frodo puts on the Ring and disappears. 
Frodo chooses to cross the river and go alone to Mordor, but Sam, guessing what he intends, intercepts him."""


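# Wrapper that drops the unsupported `tokenizer` kwarg before delegating to model.generate()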
class TestTransformer(Transformers):
    def _generate_output_seq(
        self, prompts, inputs, generation_config, **generation_kwargs
    ):
        generation_kwargs.pop('tokenizer', None)
        input_ids = inputs["input_ids"]
        output_ids = self.model.generate(
            **inputs, generation_config=generation_config, **generation_kwargs
        )

        # encoder-decoder returns output_ids only, decoder-only returns full seq ids
        if self.model.config.is_encoder_decoder:
            generated_ids = output_ids
        else:
            generated_ids = output_ids[:, input_ids.shape[1] :]

        # if batched list inputs AND multiple samples per input, reshape generated_ids to a 3D view
        num_samples = generation_config.num_return_sequences or 1

        if num_samples > 1 and isinstance(prompts, list):
            batch_size = input_ids.size(0)
            generated_ids = generated_ids.view(batch_size, num_samples, -1)

        return generated_ids


@outlines.prompt
def test_prompt(
    text, 
    entity_types="ORGANIZATION, PERSON, PROCEDURE STEP, LOCATION, ",
    tuple_delimiter="<|>", 
    record_delimiter="##",
    completion_delimiter="<|COMPLETE|>"):
    """-Goal-
    Given a text document that is potentially relevant to this activity and a list of entity types, extract all entities listed within the text.
    
    -Steps-
    1. From Entity_list, identify all entities that are *clearly described* in the text.
    For each entity, extract the following information:
    - entity_name: name of the entity, as given in the text
    - entity_description: explanation as to why you think the source entity and the target entity are related to each other, be as specific as possible
    Format each entity as ("entity"{{ tuple_delimiter }}<entity_name>{{ tuple_delimiter }}<entity_description>)
    
    2. Return output in English as a single list of all the entities identified in step 1. Use **{{ record_delimiter }}** as the list delimiter.
    
    3. When finished, output {{ completion_delimiter }}
    
    ######################
    -Examples-
    ######################
    Entity_types: ORGANIZATION,PERSON
    Text:
    The Verdantis's Central Institution is scheduled to meet on Monday and Thursday, with the institution planning to release its latest policy decision on Thursday at 1:30 p.m. PDT, followed by a press conference where Central Institution Chair Martin Smith will take questions. Investors expect the Market Strategy Committee to hold its benchmark interest rate steady in a range of 3.5%-3.75%.
    ######################
    Output:
    ("entity"{{ tuple_delimiter }}CENTRAL INSTITUTION{{ tuple_delimiter }}ORGANIZATION{{ tuple_delimiter }}The Central Institution is the Federal Reserve of Verdantis, which is setting interest rates on Monday and Thursday)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}MARTIN SMITH{{ tuple_delimiter }}PERSON{{ tuple_delimiter }}Martin Smith is the chair of the Central Institution)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}MARKET STRATEGY COMMITTEE{{ tuple_delimiter }}ORGANIZATION{{ tuple_delimiter }}The Central Institution committee makes key decisions about interest rates and the growth of Verdantis's money supply)
    {{ record_delimiter }}

    ######################
    Entity_types: ORGANIZATION
    Text:
    TechGlobal's (TG) stock skyrocketed in its opening day on the Global Exchange Thursday. But IPO experts warn that the semiconductor corporation's debut on the public markets isn't indicative of how other newly listed companies may perform.

    TechGlobal, a formerly public company, was taken private by Vision Holdings in 2014. The well-established chip designer says it powers 85% of premium smartphones.
    ######################
    Output:
    ("entity"{{ tuple_delimiter }}TECHGLOBAL{{ tuple_delimiter }}ORGANIZATION{{ tuple_delimiter }}TechGlobal is a stock now listed on the Global Exchange which powers 85% of premium smartphones)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}VISION HOLDINGS{{ tuple_delimiter }}ORGANIZATION{{ tuple_delimiter }}Vision Holdings is a firm that previously owned TechGlobal)
    {{ record_delimiter }}

    ######################
    Entity_types: ORGANIZATION,GEO,PERSON
    Text:
    Five Aurelians jailed for 8 years in Firuzabad and widely regarded as hostages are on their way home to Aurelia.

    The swap orchestrated by Quintara was finalized when $8bn of Firuzi funds were transferred to financial institutions in Krohaara, the capital of Quintara.

    The exchange initiated in Firuzabad's capital, Tiruzia, led to the four men and one woman, who are also Firuzi nationals, boarding a chartered flight to Krohaara.

    They were welcomed by senior Aurelian officials and are now on their way to Aurelia's capital, Cashion.

    The Aurelians include 39-year-old businessman Samuel Namara, who has been held in Tiruzia's Alhamia Prison, as well as journalist Durke Bataglani, 59, and environmentalist Meggie Tazbah, 53, who also holds Bratinas nationality.
    ######################
    Output:
    ("entity"{{ tuple_delimiter }}FIRUZABAD{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Firuzabad held Aurelians as hostages)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}AURELIA{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Country seeking to release hostages)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}QUINTARA{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Country that negotiated a swap of money in exchange for hostages)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}TIRUZIA{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Capital of Firuzabad where the Aurelians were being held)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}KROHAARA{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Capital city in Quintara)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}CASHION{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Capital city in Aurelia)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}SAMUEL NAMARA{{ tuple_delimiter }}PERSON{{ tuple_delimiter }}Aurelian who spent time in Tiruzia's Alhamia Prison)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}ALHAMIA PRISON{{ tuple_delimiter }}GEO{{ tuple_delimiter }}Prison in Tiruzia)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}DURKE BATAGLANI{{ tuple_delimiter }}PERSON{{ tuple_delimiter }}Aurelian journalist who was held hostage)
    {{ record_delimiter }}
    ("entity"{{ tuple_delimiter }}MEGGIE TAZBAH{{ tuple_delimiter }}PERSON{{ tuple_delimiter }}Bratinas national and environmentalist who was held hostage)
    {{ record_delimiter }}

    ######################
    -Real Data-
    ######################
    Entity_types: {{ entity_types }}
    Text: {{ text }}
    ######################
    Output:"""


def outlines_main():
    prompt = test_prompt(text)

    tokenizer = AutoTokenizer.from_pretrained(model_config['model_name'])
    model = AutoModelForCausalLM.from_pretrained(
                model_config['model_name'], torch_dtype=torch.bfloat16
            ).to(device)
    
    structured_model = TestTransformer(model, tokenizer)
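    # NB: the first positional argument of multinomial() is the number of samples
    # drawn per prompt (see the outlines sampling docs)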
    sampler = outlines.samplers.multinomial(
        1_000, 
        temperature=model_config['temperature'], 
        top_p=model_config['top_p']
    )
    generator = outlines.generate.regex(structured_model, regex_str, sampler)
    response = generator(prompt)


def transformers_main():
    prompt = test_prompt(text)

    tokenizer = AutoTokenizer.from_pretrained(model_config['model_name'])
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(
                model_config['model_name'], torch_dtype=torch.bfloat16
            ).to(device)
    
    encoded = tokenizer(
        [prompt], 
        padding=True, 
        return_attention_mask=True,
        return_tensors='pt', 
    )
    input_ids = encoded['input_ids'].to(device)
    attention_mask = encoded['attention_mask'].to(device)

    outputs = model.generate(
        input_ids,
        attention_mask=attention_mask,
        do_sample=True,
        max_new_tokens=1_000,
        top_p=model_config['top_p'],
        temperature=model_config['temperature'],
    )
    
    response = tokenizer.batch_decode(outputs)[0]

Expected result:

Calling `transformers_main` results in a successful generation. Calling `outlines_main` results in a CUDA OOM error.

Error message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 27.96 GiB. GPU

Outlines/Python version information:

python: '3.10'
transformers: '4.37.2'
outlines: '0.1.13'
torch: '2.3.1'

Context for the issue:

Being able to run the model is obviously a priority for me, but it is also unintuitive why outlines would increase the VRAM requirements. There are ways I can solve this with quantization and the like, but I would rather find a permanent solution than a workaround on my end.

naston added the bug label Jan 23, 2025

naston commented Jan 23, 2025

Also, I want to add that with transformers_main the peak VRAM usage stays under 16 GB. Let me know if any more information is needed to reproduce.

I also needed to create the TestTransformer class as a way to bypass an hf transformers issue with tokenizer being an invalid param for this specific transformer model.
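For reference, this is roughly how I measured peak usage (a minimal sketch using PyTorch's allocator stats, wrapped around the two entry points above):

torch.cuda.reset_peak_memory_stats(device)
transformers_main()  # or outlines_main()
print(f"peak VRAM: {torch.cuda.max_memory_allocated(device) / 1024**3:.2f} GiB")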


naston commented Jan 24, 2025

I have localized the issue to the sampler. When I use the greedy sampler (outlines.samplers.greedy()), the generation fits within the VRAM budget of the A10 GPU. This does not make much sense to me, since greedy sampling is not required when I use the transformers library directly.
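For example, the only change from outlines_main above is the sampler:

sampler = outlines.samplers.greedy()  # fits within 24 GB
generator = outlines.generate.regex(structured_model, regex_str, sampler)
response = generator(prompt)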


rlouf commented Jan 24, 2025

Thanks for reporting the issue! FYI transformers uses greedy sampling by default: https://huggingface.co/docs/transformers/en/main_classes/text_generation
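For instance, with the variables from transformers_main above:

# without do_sample=True this is greedy decoding; temperature/top_p are ignored
outputs = model.generate(input_ids, max_new_tokens=1_000)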


naston commented Jan 24, 2025

> Thanks for reporting the issue! FYI transformers uses greedy sampling by default: https://huggingface.co/docs/transformers/en/main_classes/text_generation

You are correct. From the outlines documentation, it seems as though the MultinomialSampler is equivalent to running HF generate with the parameters shown above:

do_sample=True,
max_new_tokens=1_000,
top_p=0.5,
temperature=0.1,

If I am simply misreading the documentation of one or both libraries, that would be good to know, but I find it surprising that the MultinomialSampler requires a substantially larger VRAM budget for the same sequence, given that there is no beam search involved.
