
docs: [RFC] RAG document preprocessor endpoint #1290

Open · wants to merge 1 commit into main
Conversation

ilya-kolchinsky

What does this PR do?

Introduces an RFC for preprocessing functionality in RAG and beyond for the community review.

(Closes #1232)

Test Plan

Docs change only; no tests were run.

## Endpoints
### Preprocessing endpoint
URL: `POST /v1/preprocess`
Request Body:
Contributor
I am not sure it is totally clear from this text exactly what this request returns, but I get the impression that this is a synchronous call that returns all the preprocessor outputs in one big JSON object. That seems OK for small scale but problematic for medium or larger scale. For medium scale, I think it would be better to have this be an asynchronous call that returns a stream of preprocessor outputs to the client. For very large scale, I think we would want to avoid returning preprocessor outputs to the client at all -- what I'd prefer in that case is for the preprocess command to return a stream ID that I can then pass along to the next stage of processing (e.g., vector DB ingestion) -- that next stage would then use this stream ID to pull the preprocessor outputs and use them.

If we want to support all three scales (small, medium, and large), then probably the best way to do that is to start with a design for large scale and then add convenience methods for smaller scales if needed. I think that would look like this (a rough sketch follows the list):

  1. Update preprocess to return a stream ID.
  2. Update vector DB ingestion to optionally take a stream ID for a stream of documents as input instead of the actual document objects (and then it would use that stream ID to fetch the documents).
  3. (Maybe) add one or more endpoints for fetching preprocessor outputs given a stream ID, for those cases where the clients really do want to get all the preprocessor outputs. Such endpoints could cover the small-scale case, where you have a blocking/synchronous API call that fetches the entire stream as one big JSON blob, and/or the medium-scale case, where you have a streaming API or maybe just an API that you call many times to fetch preprocessor outputs one at a time.
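To make the suggestion concrete, here is a rough sketch of what such a stream-oriented flow could look like from a client's point of view. Every endpoint path, field name, and parameter below is hypothetical and only illustrates the shape of the API, not an agreed design:

```
import requests

BASE_URL = "http://localhost:8321"  # hypothetical Llama Stack server address

# 1. Kick off preprocessing; only a stream ID comes back (hypothetical response shape).
resp = requests.post(f"{BASE_URL}/v1/preprocess", json={
    "documents": ["https://example.com/report.pdf"],
    "preprocessing_chain": ["inline_httpx_fetcher", "inline_pypdf_converter"],
})
stream_id = resp.json()["stream_id"]

# 2. Large scale: hand the stream ID to the ingestion stage, which pulls the
#    preprocessor outputs itself instead of receiving them from the client.
requests.post(f"{BASE_URL}/v1/vector-dbs/my_vector_db/insert", json={"stream_id": stream_id})

# 3. Small/medium scale convenience: fetch the outputs directly, page by page.
page = requests.get(f"{BASE_URL}/v1/preprocess/{stream_id}/outputs", params={"limit": 100}).json()
```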

Contributor

Also, we should probably plan for the case where the Llama Stack server that manages the preprocessing is different from the Llama Stack server that manages the next step (e.g., vector DB ingestion or synthetic data generation). One way to do that would be to make the stream ID a callback URL, so that whatever receives the stream ID knows how to contact the Llama Stack server producing the stream.
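One hypothetical way to express that: have the preprocess response carry a fully qualified callback URL in addition to (or instead of) an opaque ID, so a consumer running on a different Llama Stack server knows which server to pull the stream from. The field names below are illustrative only:

```
# Hypothetical response from POST /v1/preprocess when the consumer may live on
# a different Llama Stack server: the stream is addressed by a callback URL
# rather than a server-local ID.
preprocess_response = {
    "stream_id": "preproc-stream-1234",
    "stream_url": "https://llama-stack-a.internal:8321/v1/preprocess/preproc-stream-1234/outputs",
    "status": "running",
}
```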

Author

@jwm4 I agree that the output of the preprocessing endpoint should be better defined.

That being said, I believe that the issue you're raising is bigger than the preprocessing endpoint proposed here and is also related to large-scale inference in LLS / to any endpoint that can potentially grow to huge bandwidth. Should Llama Stack really stream and route all the outputs in the system? Shouldn't it be more like the Kubernetes control plane, managing and orchestrating the workflows and not directly operating on the I/O? Wouldn't it solve both the scalability issue and the problem of multiple Llama Stack servers that you mention in your second comment?

As of now, the design of Llama Stack makes it challenging to directly stream data in large-scale deployment scenarios. Whether the community believes it should directly support massive data transfer (in which case a major redesign is required) or not (in which case we have to agree on an alternative way to proceed), this warrants a separate conversation. Would you like me to open a discussion thread on this?

Back to the preprocessing endpoint: my assumption for now is that for inline providers (i.e., small scale) the outputs will be returned directly, but for remote providers (i.e., medium scale and beyond) only the status will be returned, and an external preprocessing tool will take care of storing the results somewhere according to the input parameters and/or configuration settings. The current preprocessing API specification includes a parameter named 'options' intended, among other uses, to specify where and how to store the output.
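For illustration only, a call to a remote preprocessing provider might then look roughly like the sketch below. Here `client` stands for a Llama Stack client as in the RFC examples, and the provider id and the keys inside 'options' (output destination and format) are my assumptions about how that parameter could be used, not part of the current spec:

```
# Hypothetical use of the 'options' parameter with a remote provider: the
# external tool stores the results itself and only a status comes back.
response = client.preprocess(
    documents=documents,
    preprocessing_chain=[
        Preprocessor(id="remote_docling"),  # hypothetical remote provider id
    ],
    options={
        # Assumed keys: where and in what form the tool should store the outputs.
        "output_destination": "s3://my-bucket/preprocessed/",
        "output_format": "chunked_json",
    },
)
print(response.status)  # e.g. "accepted"; the outputs themselves are not returned inline
```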

```
remote_dirs = [ ... ]
document_dir_paths = [
    DocumentDirPath(document_path_id=f"path_{i}",
```
Contributor

1/ What is a DocumentDirPath?

2/ How do you imagine the endpoint working together with the /files API (https://github.com/meta-llama/llama-stack/blob/main/llama_stack/apis/files/files.py) for uploading arbitrary files?

Author

  1. The name is probably not the best. DocumentDirPath is an abstraction for an identifier of a collection of files. It could be, for example, a URI of a local/remote folder to be traversed; a path containing a regex covering multiple files; a link to a directory service to fetch the list of actual files from, and so on. The motivation is that in large-scale scenarios, listing the files individually does not work (a rough sketch follows after this list).
  2. This is a great question - continuing the previous answer, the implementation should also support passing links to /files/{bucket} to the preprocessing endpoint (via DocumentDirPath).
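A minimal sketch of what such an abstraction could look like; all names and fields below are hypothetical and only illustrate the idea of one identifier standing in for many files:

```
from dataclasses import dataclass
from enum import Enum


class DocumentCollectionType(Enum):
    LOCAL_DIR = "local_dir"                  # a local folder to be traversed
    REMOTE_DIR = "remote_dir"                # e.g. an S3 prefix or remote folder URI
    GLOB = "glob"                            # a path with a pattern covering multiple files
    FILES_BUCKET = "files_bucket"            # a reference to /files/{bucket}
    DIRECTORY_SERVICE = "directory_service"  # a service that returns the actual file list


@dataclass
class DocumentDirPath:
    document_path_id: str          # caller-chosen identifier, e.g. "path_0"
    type: DocumentCollectionType   # how the uri below should be interpreted
    uri: str                       # folder URI, glob pattern, bucket link, etc.
```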

```
    documents=documents,
    vector_db_id="my_vector_id",
    preprocessing_chain=[
        Preprocessor(id="inline_httpx_fetcher"),
```
Contributor

@yanxi0830 (Feb 28, 2025)

How do you want to generalize and standardize input & output types of each processor in the chain?

I imagine here:

  • inline_httpx_fetcher would take a URL -> file_id
  • inline_pypdf_converter would take a PDF file -> convert it into raw str text
  • inline_overlapping_chunks_chunker would take the raw str text -> list of str text

How can we guide the user to build the chain such that it doesn't connect the wrong processors?

Author

First, I don't think this should be "coded" in the processor names. Otherwise we will end up categorizing all implementations into fetchers, converters and chunkers, which is exactly what the proposal tries to avoid. Having a hard division into distinct processor types is not ideal since new types may be introduced in the future and, more importantly, existing tools may support some combination / subset of the above.
Instead, each processor implementation can define SUPPORTED_INPUT_TYPES and SUPPORTED_OUTPUT_TYPES constants as static fields.
In this example, inline_pypdf_converter could be defined like this:

```
class InlinePyPdfConverterImpl(Preprocessor):
    ...
    SUPPORTED_INPUT_TYPES = ['pdf']
    SUPPORTED_OUTPUT_TYPES = ['raw_text']
    ...
```

To prevent the implementations from using arbitrary strings, we should define an enum for the valid input/output types.
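As a sketch of that idea (the enum members and the validation helper below are hypothetical, not part of the proposal), the chain could be checked up front by walking adjacent processors and verifying that their declared types overlap:

```
from enum import Enum


class PreprocessingDataType(Enum):
    URL = "url"
    PDF = "pdf"
    RAW_TEXT = "raw_text"
    CHUNKS = "chunks"


def validate_chain(processors) -> None:
    """Fail early if adjacent processors in the chain cannot be connected.

    Each processor is expected to declare SUPPORTED_INPUT_TYPES and
    SUPPORTED_OUTPUT_TYPES as lists of PreprocessingDataType members.
    """
    for prev, nxt in zip(processors, processors[1:]):
        if not set(prev.SUPPORTED_OUTPUT_TYPES) & set(nxt.SUPPORTED_INPUT_TYPES):
            raise ValueError(
                f"{type(prev).__name__} produces {prev.SUPPORTED_OUTPUT_TYPES} "
                f"but {type(nxt).__name__} accepts {nxt.SUPPORTED_INPUT_TYPES}"
            )
```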

Linked issue: [RFC] Preprocessing endpoint for RAG and other uses