components llm_rag_generate_embeddings
github-actions[bot] edited this page Dec 31, 2024
·
69 revisions
Generates embedding vectors for data chunks read from `chunks_source`.
`chunks_source` is expected to contain CSV files with two columns:
- "Chunk" - Chunk of text to be embedded
- "Metadata" - JSON object containing metadata for the chunk
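
As an illustration of the two-column layout described above, the following minimal sketch writes and reads a chunks CSV using only the Python standard library. The helper names `write_chunks_csv` and `read_chunks_csv` and the `source` metadata key are hypothetical and not part of the component; the only details taken from this page are the "Chunk" and "Metadata" column names and the fact that the metadata is a JSON object.

```python
import csv
import io
import json


def write_chunks_csv(rows, fileobj):
    """Write (chunk_text, metadata_dict) pairs as a two-column CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=["Chunk", "Metadata"])
    writer.writeheader()
    for text, metadata in rows:
        # The metadata dict is serialized to a JSON string in the second column.
        writer.writerow({"Chunk": text, "Metadata": json.dumps(metadata)})


def read_chunks_csv(fileobj):
    """Read the CSV back into (chunk_text, metadata_dict) pairs."""
    reader = csv.DictReader(fileobj)
    return [(row["Chunk"], json.loads(row["Metadata"])) for row in reader]


# In-memory round trip; a real run would write files into the chunks_source folder.
buffer = io.StringIO()
write_chunks_csv([("Hello, world", {"source": "a.md"})], buffer)
buffer.seek(0)
chunks = read_chunks_csv(buffer)
```

The `csv` module quotes fields that contain commas or double quotes, so chunk text and JSON metadata survive the round trip unchanged.
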
If `embeddings_container` is supplied, input chunks are compared to existing chunks in the Embeddings Container; only changed or new chunks are embedded, and existing chunks are reused.
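
This page does not say how the component decides that a chunk is unchanged. The sketch below shows one possible way to implement that kind of reuse, assuming chunks are identified by a SHA-256 hash of their text and metadata. The functions `chunk_key` and `split_new_and_reused` are hypothetical and are not the component's actual implementation.

```python
import hashlib


def chunk_key(text, metadata_json):
    """Derive a stable key from chunk text and its metadata JSON string."""
    payload = (text + "\n" + metadata_json).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def split_new_and_reused(chunks, existing):
    """Partition chunks into reusable vectors and chunks that need embedding.

    chunks:   list of (text, metadata_json) pairs from the input
    existing: dict mapping chunk_key -> previously computed vector
    """
    reused = {}
    to_embed = []
    for text, metadata_json in chunks:
        key = chunk_key(text, metadata_json)
        if key in existing:
            reused[key] = existing[key]
        else:
            to_embed.append((text, metadata_json))
    return reused, to_embed
```

Only the chunks returned in `to_embed` would need to be sent to the embedding model, which is the saving that supplying `embeddings_container` is meant to provide.
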
Version: 0.0.71
Preview
View in Studio: https://ml.azure.com/registries/azureml/components/llm_rag_generate_embeddings/version/0.0.71
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
chunks_source | Folder containing chunks to be embedded. | uri_folder | | | |
If adding to previously generated Embeddings
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
embeddings_container | Folder containing previously generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Input data is compared to existing embeddings and only changed/new data is embedded, reusing existing chunks. | uri_folder | | True | |
Embeddings settings
Name | Description | Type | Default | Optional | Enum |
---|---|---|---|---|---|
embeddings_model | The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}' | string | | True | |
batch_size | Batch size to use when embedding data | integer | 100 | | |
num_workers | Number of workers to use when embedding data. -1 means use half of all available CPUs | integer | -1 | | |
deployment_validation | URI file indicating whether the Azure OpenAI deployments, if used, have been validated | uri_file | | True | |
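
To make the two `embeddings_model` URI shapes and the `batch_size` setting concrete, here is a minimal sketch that splits such a URI into its parts and groups chunks into batches. Both `parse_embeddings_model` and `batched` are hypothetical helpers written for this page, not functions of the component, and the deployment name used in the usage lines is made up.

```python
def parse_embeddings_model(uri):
    """Split an embeddings_model URI of one of the two documented shapes."""
    kind, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"Missing '://' in embeddings_model: {uri!r}")
    parts = rest.split("/")
    if kind == "azure_open_ai":
        # Expected shape: deployment/{deployment_name}/model/{model_name}
        if len(parts) != 4 or parts[0] != "deployment" or parts[2] != "model":
            raise ValueError(f"Unexpected azure_open_ai URI: {uri!r}")
        return {"kind": kind, "deployment": parts[1], "model": parts[3]}
    if kind == "hugging_face":
        # Expected shape: model/{organization}/{model_name}
        if len(parts) < 2 or parts[0] != "model":
            raise ValueError(f"Unexpected hugging_face URI: {uri!r}")
        return {"kind": kind, "model": "/".join(parts[1:])}
    raise ValueError(f"Unsupported model kind: {kind!r}")


def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


hf = parse_embeddings_model(
    "hugging_face://model/sentence-transformers/all-mpnet-base-v2"
)
# "my-deployment" is a made-up deployment name used only for illustration.
aoai = parse_embeddings_model(
    "azure_open_ai://deployment/my-deployment/model/text-embedding-ada-002"
)
batch_sizes = [len(batch) for batch in batched(list(range(250)), batch_size=100)]
```

The URI is split with `str.partition` rather than `urllib.parse.urlparse` because the scheme names contain underscores, which `urlparse` does not accept as scheme characters.
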
Name | Description | Type |
---|---|---|
embeddings | Where to save data with embeddings. If previous embeddings are supplied, this should be a subfolder of that folder, typically named using '${name}', e.g. /my/prev/embeddings/${name} | uri_folder |
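
The relationship between the `embeddings` output and `embeddings_container` can be checked with a small path helper. This is only a sketch: `embeddings_output_path` and `is_inside_container` are hypothetical names, and `run_2` stands in for whatever the '${name}' placeholder resolves to in a real job.

```python
from pathlib import PurePosixPath


def embeddings_output_path(container, name):
    """Build an output path as a direct subfolder of the container."""
    return str(PurePosixPath(container) / name)


def is_inside_container(output, container):
    """Return True if output lies somewhere below the container folder."""
    return PurePosixPath(container) in PurePosixPath(output).parents


output = embeddings_output_path("/my/prev/embeddings", "run_2")
inside = is_inside_container(output, "/my/prev/embeddings")
```
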
Environment: azureml:llm-rag-embeddings@latest