github-actions[bot] edited this page Jan 22, 2025 · 51 revisions

Text Generation DataPreProcess

text_generation_datapreprocess

Overview

Component to preprocess data for the text generation task.

Version: 0.0.67

View in Studio: https://ml.azure.com/registries/azureml/components/text_generation_datapreprocess/version/0.0.67

Inputs

Text Generation task arguments

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| text_key | Key for the text in an example. Format your data keeping in mind that text is concatenated with ground_truth during finetuning, in the form text + ground_truth. For example, "text"="knock knock\n", "ground_truth"="who's there" will be treated as "knock knock\nwho's there". | string | | False | |
| ground_truth_key | Key for the ground_truth in an example. A separate column is used for ground_truth to enable use cases such as summarization, translation and question answering, which can be recast as text generation where both text and ground_truth are needed. This separation is also useful for calculating metrics. For example, "text"="Summarize this dialog:\n{input_dialogue}\nSummary:\n", "ground_truth"="{summary of the dialogue}". | string | | True | |
| batch_size | Number of examples to batch before calling the tokenization function. | integer | 1000 | True | |
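The concatenation described for text_key and ground_truth_key, and the role of batch_size, can be illustrated with a small sketch. The helper names `build_training_text` and `batched` are hypothetical and the code is not taken from the component; it only mirrors the documented `text + ground_truth` behaviour and assumes both keys are present in every example.

```python
from typing import Dict, Iterator, List


def build_training_text(example: Dict[str, str], text_key: str, ground_truth_key: str) -> str:
    """Concatenate text and ground truth with no separator (hypothetical helper)."""
    return example[text_key] + example[ground_truth_key]


def batched(examples: List[Dict[str, str]], batch_size: int = 1000) -> Iterator[List[Dict[str, str]]]:
    """Yield consecutive slices of at most batch_size examples (hypothetical helper)."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Sample record taken from the text_key description.
example = {"text": "knock knock\n", "ground_truth": "who's there"}
combined = build_training_text(example, text_key="text", ground_truth_key="ground_truth")
print(repr(combined))
```

Because nothing is inserted between the two fields, any separator you need (such as the trailing newline here) has to be part of the text value itself.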

Tokenization params

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| pad_to_max_length | If set to true, the returned sequences are padded according to the model's padding side and padding index, up to max_seq_length. If no max_seq_length is specified, padding is done up to the model's max length. | string | false | True | ['true', 'false'] |
| max_seq_length | Length to which sequences are padded. The default of -1 means padding is done up to the model's max length; any other value pads to max_seq_length. | integer | -1 | True | |
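How the two tokenization parameters interact can be sketched as follows. This is an illustration of the rules stated in the table, not the component's implementation: `resolve_pad_length` and `pad_sequence` are hypothetical helpers, and `model_max_length`, `pad_id` and `padding_side` stand in for values that would come from the model's tokenizer.

```python
from typing import List, Optional


def resolve_pad_length(pad_to_max_length: str, max_seq_length: int, model_max_length: int) -> Optional[int]:
    """Return the target padded length, or None when pad_to_max_length is not 'true'."""
    if pad_to_max_length.lower() != "true":
        return None
    # -1 is the documented default: fall back to the model's max length.
    if max_seq_length == -1:
        return model_max_length
    return max_seq_length


def pad_sequence(token_ids: List[int], target_length: Optional[int], pad_id: int,
                 padding_side: str = "right") -> List[int]:
    """Pad token_ids with pad_id on the given side; padding_side is an assumed tokenizer setting."""
    if target_length is None or len(token_ids) >= target_length:
        return list(token_ids)
    padding = [pad_id] * (target_length - len(token_ids))
    return padding + list(token_ids) if padding_side == "left" else list(token_ids) + padding


target = resolve_pad_length("true", max_seq_length=-1, model_max_length=2048)
padded = pad_sequence([101, 102, 103], target_length=5, pad_id=0)
```

Note that pad_to_max_length is a string enum ('true' or 'false'), not a boolean, which is why the sketch compares against the string value.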

Inputs

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| train_file_path | Path to the registered training data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
| validation_file_path | Path to the registered validation data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
| test_file_path | Path to the registered test data asset. The supported data formats are jsonl, json, csv, tsv and parquet. | uri_file | | True | |
| train_mltable_path | Path to the registered training data asset in mltable format. | mltable | | True | |
| validation_mltable_path | Path to the registered validation data asset in mltable format. | mltable | | True | |
| test_mltable_path | Path to the registered test data asset in mltable format. | mltable | | True | |
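As an illustration of one of the supported file formats, the sketch below writes a tiny jsonl file whose records use "text" and "ground_truth" as field names, matching the examples in the task arguments. The helper names, the temporary file location and the second record are made up for the example; the real field names are whatever you pass as text_key and ground_truth_key.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def write_jsonl(records: List[Dict[str, str]], path: Path) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, str]]:
    """Read the file back, skipping blank lines (hypothetical helper)."""
    with path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


records = [
    {"text": "knock knock\n", "ground_truth": "who's there"},
    # Illustrative record only; not taken from any real dataset.
    {"text": "Summarize this dialog:\nA: hi\nB: hello\nSummary:\n",
     "ground_truth": "Two people greet each other."},
]

train_file = Path(tempfile.mkdtemp()) / "train.jsonl"
write_jsonl(records, train_file)
round_trip = read_jsonl(train_file)
```

Newlines inside a field are escaped by `json.dumps`, so each record still occupies exactly one line of the file.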

Dataset parameters

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| model_selector_output | Output folder of the model selector, containing model metadata such as config, checkpoints and tokenizer config. | uri_folder | | False | |

Validation parameters

| Name | Description | Type | Default | Optional | Enum |
| --- | --- | --- | --- | --- | --- |
| system_properties | Validation parameters propagated from pipeline. | string | | True | |

Outputs

| Name | Description | Type |
| --- | --- | --- |
| output_dir | Folder containing the tokenized output of the train, validation and test data, along with the tokenizer files used to tokenize the data. | uri_folder |
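The inputs and output listed on this page could be wired into a pipeline roughly as sketched below, assuming the Azure ML Python SDK v2 (`azure-ai-ml`) and `azure-identity` are installed and that you have credentials for the `azureml` registry. The function names `load_preprocess_component` and `preprocess_settings` are hypothetical; the parameter names and defaults are the ones documented in the tables.

```python
from typing import Any, Dict


def load_preprocess_component(version: str = "0.0.67") -> Any:
    """Fetch the component from the azureml registry (requires Azure credentials)."""
    # Imported lazily so the rest of this sketch runs without the Azure SDK installed.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    registry_client = MLClient(credential=DefaultAzureCredential(), registry_name="azureml")
    return registry_client.components.get(name="text_generation_datapreprocess", version=version)


def preprocess_settings() -> Dict[str, Any]:
    """Non-data inputs, using the names and defaults documented on this page."""
    return {
        "text_key": "text",                  # required; "text" is an example value
        "ground_truth_key": "ground_truth",  # optional; example value
        "batch_size": 1000,
        "pad_to_max_length": "false",        # string enum, not a boolean
        "max_seq_length": -1,
    }


settings = preprocess_settings()
```

Inside a pipeline definition, the loaded component would then be called with the data inputs (for example train_file_path as a `uri_file` input), the required model_selector_output from the upstream model selector step, and `**settings`; downstream steps read the tokenized data from its output_dir output.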

Environment

azureml://registries/azureml/environments/acft-hf-nlp-gpu/versions/80
