feat(api): vector store endpoints (#468)
* Adds Vector Stores routes to API
* Adds Vector Stores migration for database
* Adds query/indexing operations that use LangChain to process files
* Adds MIME type checking to the Files endpoint, rejecting uploads of unsupported file types.
* Adds a new leapfrogai/rag route for obtaining RAG results before the LLM has processed them.
* Adds some integration tests for Vector Stores. Embeddings is mocked rather than run for real; a future PR should split these into integration and unit tests.
* Supports auth implementation on all new endpoints
gphorvath authored Jun 10, 2024
1 parent a634a59 commit 2cc0737
Showing 34 changed files with 1,721 additions and 158 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -192,7 +192,7 @@ To run the LeapfrogAI API locally (starting from the root directory of the repos
python -m pip install src/leapfrogai_sdk
cd src/leapfrogai_api
python -m pip install .
- uvicorn leapfrogai_api.main:app --port 3000 --reload
+ uvicorn leapfrogai_api.main:app --port 3000 --log-level debug --reload
```

#### Repeater
2 changes: 1 addition & 1 deletion packages/api/Dockerfile
@@ -21,4 +21,4 @@ COPY --from=builder /home/nonroot/.local/bin/uvicorn /home/nonroot/.local/bin/uv

EXPOSE 8080

- ENTRYPOINT ["/home/nonroot/.local/bin/uvicorn", "leapfrogai_api.main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080"]
+ ENTRYPOINT ["/home/nonroot/.local/bin/uvicorn", "leapfrogai_api.main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080", "--log-level", "debug"]
2 changes: 2 additions & 0 deletions packages/api/chart/templates/api/deployment.yaml
@@ -43,6 +43,8 @@ spec:
value: /config/
- name: LFAI_CONFIG_FILENAME
value: "*.toml"
- name: DEFAULT_EMBEDDINGS_MODEL
value: "{{ .Values.api.defaultEmbeddingsModel }}"
- name: PORT
value: "{{ .Values.api.port }}"
- name: SUPABASE_URL
1 change: 1 addition & 0 deletions packages/api/chart/values.yaml
@@ -11,6 +11,7 @@ api:
replicas: 1
port: 8080
exposeOpenAPISchema: false
defaultEmbeddingsModel: "###ZARF_VAR_DEFAULT_EMBEDDINGS_MODEL###"

package:
host: leapfrogai-api
2 changes: 2 additions & 0 deletions packages/api/zarf.yaml
@@ -20,6 +20,8 @@ variables:
description: "Flag to expose the OpenAPI schema for debugging."
- name: HOSTED_DOMAIN
default: "uds.dev"
- name: DEFAULT_EMBEDDINGS_MODEL
default: "text-embeddings"

components:
- name: leapfrogai
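The three hunks above pass one setting through three layers: the Zarf variable `DEFAULT_EMBEDDINGS_MODEL` (default `text-embeddings`) fills the Helm value `api.defaultEmbeddingsModel`, which the deployment exposes to the container as the environment variable `DEFAULT_EMBEDDINGS_MODEL`. How the API reads that variable is not shown in this excerpt. The sketch below is only one plausible way to consume it; the function name `resolve_embeddings_model` and the in-code fallback are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Name of the environment variable set by the deployment template above.
ENV_VAR = "DEFAULT_EMBEDDINGS_MODEL"

# Assumption: mirror the Zarf default so local runs without the chart still work.
FALLBACK_MODEL = "text-embeddings"


def resolve_embeddings_model(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured embeddings model name (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Treat an unset or empty variable the same way and use the fallback.
    return env.get(ENV_VAR) or FALLBACK_MODEL
```

Passing a mapping explicitly, for example `resolve_embeddings_model({"DEFAULT_EMBEDDINGS_MODEL": "my-model"})`, keeps the helper easy to test without touching the real process environment.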
4 changes: 1 addition & 3 deletions src/leapfrogai_api/README.md
@@ -4,8 +4,6 @@ A mostly OpenAI compliant API surface.

## Local Development Setup

### Requirements

1. Install dependencies
```bash
make install
@@ -38,7 +36,7 @@ A mostly OpenAI compliant API surface.
This will copy the JWT token to your clipboard.


5. Make calls to the api swagger endpoint at `http://localhost:8080/docs` using your JWT token as the `HTTPBearer` token.
5. Make calls to the api swagger endpoint at `http://localhost:8080/docs` using your JWT token as the `HTTPBearer` token.
* Hit `Authorize` on the swagger page to enter your JWT token

## Integration Tests
Empty file.
71 changes: 71 additions & 0 deletions src/leapfrogai_api/backend/rag/document_loader.py
@@ -0,0 +1,71 @@
"""Load a file and split it into chunks."""

# This import is required for "magic" to work, see https://github.com/ahupp/python-magic/issues/233
# may not be needed after https://github.com/ahupp/python-magic/pull/294 is merged
import pylibmagic # noqa: F401 # pylint: disable=unused-import
import magic
from langchain_community.document_loaders import (
    CSVLoader,
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
)
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

HANDLERS = {
    "application/pdf": PyPDFLoader,
    "text/plain": TextLoader,
    "text/html": UnstructuredHTMLLoader,
    "text/csv": CSVLoader,
    "text/markdown": UnstructuredMarkdownLoader,
    "application/msword": Docx2txtLoader,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": Docx2txtLoader,
}


def is_supported_mime_type(mime_type: str) -> bool:
    """Validate the mime type of a file."""
    return mime_type in HANDLERS


async def load_file(file_path: str) -> list[Document]:
    """Load a file and return a list of documents."""

    mime_type = magic.from_file(file_path, mime=True)

    loader = HANDLERS.get(mime_type)

    if loader:
        return await loader(file_path).aload()
    raise ValueError(f"Unsupported file type: {mime_type}")


async def split(docs: list[Document]) -> list[Document]:
    """Split a document into chunks."""
    separators = [
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Full width comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Full width full stop
        "\u3002",  # Ideographic full stop
        "",
    ]

    text_splitter = RecursiveCharacterTextSplitter(
        # TODO: These parameters might need to be tuned and/or exposed for configuration
        chunk_size=500,
        chunk_overlap=50,
        length_function=len,
        is_separator_regex=False,
        separators=separators,
    )

    return await text_splitter.atransform_documents(docs)
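
The file above combines two ideas: a lookup table that maps a MIME type to a loader, and chunking with a fixed size and overlap (`chunk_size=500`, `chunk_overlap=50`). The following is a minimal, stdlib-only sketch of both, not the commit's implementation. `read_text`, `LOADERS`, `select_loader`, and `chunk_text` are hypothetical names, and `chunk_text` uses a plain sliding window rather than the separator-aware recursion of `RecursiveCharacterTextSplitter`, so its chunk boundaries will differ from the real splitter's.

```python
from typing import Callable


def read_text(path: str) -> str:
    """Hypothetical stand-in for a LangChain loader: read a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Table-driven dispatch in the same shape as HANDLERS (only two entries here).
LOADERS: dict[str, Callable[[str], str]] = {
    "text/plain": read_text,
    "text/markdown": read_text,
}


def select_loader(mime_type: str) -> Callable[[str], str]:
    """Return the loader registered for a MIME type, or raise if unsupported."""
    loader = LOADERS.get(mime_type)
    if loader is None:
        raise ValueError(f"Unsupported file type: {mime_type}")
    return loader


def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 450 characters, so the last 50 characters of one chunk reappear at the start of the next. The overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary.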