
docs: [RFC] RAG document preprocessor endpoint #1290

Open · wants to merge 1 commit into main
Conversation

ilya-kolchinsky

What does this PR do?

Introduces an RFC for preprocessing functionality in RAG and beyond for the community review.

(Closes #1232)

Test Plan

Docs change only; no tests were run.

## Endpoints
### Preprocessing endpoint
URL: `POST /v1/preprocess`
Request Body:
Contributor
I am not sure it is totally clear from this text exactly what this request returns, but I get the impression that this is a synchronous call that returns all the preprocessor outputs in one big JSON object. That seems OK for small scale but problematic for medium or larger scale. For medium scale, I think it would be better to have this be an asynchronous call that returns a stream of preprocessor outputs to the client. For very large scale, I think we would want to avoid returning preprocessor outputs to the client at all -- what I'd prefer in that case is for the preprocess command to return a stream ID that I can then pass along to the next stage of processing (e.g., vector DB ingestion) -- that next stage would then use this stream ID to pull the preprocessor outputs and use them.

If we want to support all three scales (small, medium, and large), then probably the best way to do that is to start with a design for large scale and then add convenience methods for smaller scales if needed. I think that would look like this (a rough sketch follows the list):

  1. Update preprocess to return a stream ID.
  2. Update vector DB ingestion to optionally take a stream ID for a stream of documents as input instead of the actual document objects (and then it would use that stream ID to fetch the documents).
  3. (Maybe) add one or more endpoints for fetching preprocessor outputs given a stream ID, for those cases where the clients really do want to get all the preprocessor outputs. Such endpoints could cover the small-scale case, where you have a blocking/synchronous API call that fetches the entire stream as one big JSON blob, and/or the medium-scale case, where you have a streaming API or maybe just an API that you call many times to fetch preprocessor outputs one at a time.
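To make the suggestion concrete, here is a rough sketch of what such a stream-oriented flow could look like from a client's point of view. Every endpoint path, field name, and parameter below is hypothetical and only illustrates the shape of the API, not an agreed design:

```
import requests

BASE_URL = "http://localhost:8321"  # hypothetical Llama Stack server address

# 1. Kick off preprocessing; only a stream ID comes back (hypothetical response shape).
resp = requests.post(f"{BASE_URL}/v1/preprocess", json={
    "documents": ["https://example.com/report.pdf"],
    "preprocessing_chain": ["inline_httpx_fetcher", "inline_pypdf_converter"],
})
stream_id = resp.json()["stream_id"]

# 2. Large scale: hand the stream ID to the ingestion stage, which pulls the
#    preprocessor outputs itself instead of receiving them from the client.
requests.post(f"{BASE_URL}/v1/vector-dbs/my_vector_db/insert", json={"stream_id": stream_id})

# 3. Small/medium scale convenience: fetch the outputs directly, page by page.
page = requests.get(f"{BASE_URL}/v1/preprocess/{stream_id}/outputs", params={"limit": 100}).json()
```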

Contributor

Also, we should probably plan for the case where the Llama Stack server that manages the preprocessing is different from the Llama Stack server that manages the next step (e.g., vector DB ingestion or synthetic data generation). One way to do that would be to make the stream ID a callback URL, so that whatever receives the stream ID knows how to contact the Llama Stack server producing the stream.
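One hypothetical way to express that: have the preprocess response carry a fully qualified callback URL in addition to (or instead of) an opaque ID, so a consumer running on a different Llama Stack server knows which server to pull the stream from. The field names below are illustrative only:

```
# Hypothetical response from POST /v1/preprocess when the consumer may live on
# a different Llama Stack server: the stream is addressed by a callback URL
# rather than a server-local ID.
preprocess_response = {
    "stream_id": "preproc-stream-1234",
    "stream_url": "https://llama-stack-a.internal:8321/v1/preprocess/preproc-stream-1234/outputs",
    "status": "running",
}
```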

Author

@jwm4 I agree that the output of the preprocessing endpoint should be better defined.

That being said, I believe that the issue you're raising is bigger than the preprocessing endpoint proposed here and is also related to large-scale inference in LLS / to any endpoint that can potentially grow to huge bandwidth. Should Llama Stack really stream and route all the outputs in the system? Shouldn't it be more like the Kubernetes control plane, managing and orchestrating the workflows and not directly operating on the I/O? Wouldn't it solve both the scalability issue and the problem of multiple Llama Stack servers that you mention in your second comment?

As of now, the design of Llama Stack makes it challenging to directly stream data in large-scale deployment scenarios. Whether the community believes it should directly support massive data transfer (in which case a major redesign is required) or not (in which case we have to agree on an alternative way to proceed), this warrants a separate conversation. Would you like me to open a discussion thread on this?

Back to the preprocessing endpoint: my assumption for now is that for inline providers (i.e., small scale) the outputs will be returned directly, but for remote providers (i.e., medium scale and beyond) only the status will be returned, and an external preprocessing tool will take care of storing the results somewhere according to the input parameters and/or configuration settings. The current preprocessing API specification includes a parameter named 'options' intended, among other uses, to specify where and how to store the output.
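For illustration only, a call to a remote preprocessing provider might then look roughly like the sketch below. Here `client` stands for a Llama Stack client as in the RFC examples, and the provider id and the keys inside 'options' (output destination and format) are my assumptions about how that parameter could be used, not part of the current spec:

```
# Hypothetical use of the 'options' parameter with a remote provider: the
# external tool stores the results itself and only a status comes back.
response = client.preprocess(
    documents=documents,
    preprocessing_chain=[
        Preprocessor(id="remote_docling"),  # hypothetical remote provider id
    ],
    options={
        # Assumed keys: where and in what form the tool should store the outputs.
        "output_destination": "s3://my-bucket/preprocessed/",
        "output_format": "chunked_json",
    },
)
print(response.status)  # e.g. "accepted"; the outputs themselves are not returned inline
```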

```
remote_dirs = [ ... ]
document_dir_paths = [
    DocumentDirPath(document_path_id=f"path_{i}",
```
Contributor

1/ What is a DocumentDirPath?

2/ How do you imagine the endpoint working together with the /files API (https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/files/files.py) for uploading arbitrary files?

Author

  1. The name is probably not the best. DocumentDirPath is an abstraction for an identifier of a collection of files. It could be, for example, a URI of a local/remote folder to be traversed; a path containing a regex covering multiple files; a link to a directory service to fetch the list of actual files from, and so on. The motivation is that in large-scale scenarios, listing the files individually does not work (a rough sketch follows after this list).
  2. This is a great question - continuing the previous answer, the implementation should also support passing links to /files/{bucket} to the preprocessing endpoint (via DocumentDirPath).
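A minimal sketch of what such an abstraction could look like; all names and fields below are hypothetical and only illustrate the idea of one identifier standing in for many files:

```
from dataclasses import dataclass
from enum import Enum


class DocumentCollectionType(Enum):
    LOCAL_DIR = "local_dir"                  # a local folder to be traversed
    REMOTE_DIR = "remote_dir"                # e.g. an S3 prefix or remote folder URI
    GLOB = "glob"                            # a path with a pattern covering multiple files
    FILES_BUCKET = "files_bucket"            # a reference to /files/{bucket}
    DIRECTORY_SERVICE = "directory_service"  # a service that returns the actual file list


@dataclass
class DocumentDirPath:
    document_path_id: str          # caller-chosen identifier, e.g. "path_0"
    type: DocumentCollectionType   # how the uri below should be interpreted
    uri: str                       # folder URI, glob pattern, bucket link, etc.
```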

```
    documents=documents,
    vector_db_id="my_vector_id",
    preprocessing_chain=[
        Preprocessor(id="inline_httpx_fetcher"),
```
Contributor

@yanxi0830 (Feb 28, 2025)

How do you want to generalize and standardize input & output types of each processor in the chain?

I imagine here:

  • inline_httpx_fetcher would take a URL -> file_id
  • inline_pypdf_converter would take a PDF file -> convert it into raw str text
  • inline_overlapping_chunks_chunker would take the raw str text -> list of str text

How can we guide the user to build the chain such that it doesn't connect the wrong processors?

Author

First, I don't think this should be "coded" in the processor names. Otherwise we will end up categorizing all implementations into fetchers, converters and chunkers, which is exactly what the proposal tries to avoid. Having a hard division into distinct processor types is not ideal since new types may be introduced in the future and, more importantly, existing tools may support some combination / subset of the above.
Instead, each processor implementation can define SUPPORTED_INPUT_TYPES and SUPPORTED_OUTPUT_TYPES constants as static fields.
In this example, inline_pypdf_converter could be defined like this:

```
class InlinePyPdfConverterImpl(Preprocessor):
    ...
    SUPPORTED_INPUT_TYPES = ['pdf']
    SUPPORTED_OUTPUT_TYPES = ['raw_text']
    ...
```

To prevent the implementations from using arbitrary strings, we should define an enum for the valid input/output types.
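As a sketch of that idea (the enum members and the validation helper below are hypothetical, not part of the proposal), the chain could be checked up front by walking adjacent processors and verifying that their declared types overlap:

```
from enum import Enum


class PreprocessingDataType(Enum):
    URL = "url"
    PDF = "pdf"
    RAW_TEXT = "raw_text"
    CHUNKS = "chunks"


def validate_chain(processors) -> None:
    """Fail early if adjacent processors in the chain cannot be connected.

    Each processor is expected to declare SUPPORTED_INPUT_TYPES and
    SUPPORTED_OUTPUT_TYPES as lists of PreprocessingDataType members.
    """
    for prev, nxt in zip(processors, processors[1:]):
        if not set(prev.SUPPORTED_OUTPUT_TYPES) & set(nxt.SUPPORTED_INPUT_TYPES):
            raise ValueError(
                f"{type(prev).__name__} produces {prev.SUPPORTED_OUTPUT_TYPES} "
                f"but {type(nxt).__name__} accepts {nxt.SUPPORTED_INPUT_TYPES}"
            )
```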

Linked issue: [RFC] Preprocessing endpoint for RAG and other uses