feat(api): vector store endpoints (#468)

* Adds Vector Stores routes to API
* Adds Vector Stores migration for database
* Adds query/indexing operations that use Langchain to process files
* Adds mimetype checking to Files endpoint, rejecting all unsupported filetypes from upload.
* Adds a new leapfrogai/rag route for obtaining rag results before the LLM has processed them.
* Adds some integration tests for Vector Stores, although Embeddings is mocked instead of required to be run - future PR should split these out into integration/unit tests.
* Supports auth implementation on all new endpoints
defenseunicorns · Jun 10, 2024 · 2cc0737
1 parent a634a59
Showing 34 changed files with 1,721 additions and 158 deletions.
diff --git a/README.md b/README.md
@@ -192,7 +192,7 @@ To run the LeapfrogAI API locally (starting from the root directory of the repos
 python -m pip install src/leapfrogai_sdk
 cd src/leapfrogai_api
 python -m pip install .
-uvicorn leapfrogai_api.main:app --port 3000 --reload
+uvicorn leapfrogai_api.main:app --port 3000 --log-level debug --reload
 ```
 
 #### Repeater

diff --git a/packages/api/Dockerfile b/packages/api/Dockerfile
@@ -21,4 +21,4 @@ COPY --from=builder /home/nonroot/.local/bin/uvicorn /home/nonroot/.local/bin/uv
 
 EXPOSE 8080
 
-ENTRYPOINT ["/home/nonroot/.local/bin/uvicorn", "leapfrogai_api.main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080"]
+ENTRYPOINT ["/home/nonroot/.local/bin/uvicorn", "leapfrogai_api.main:app", "--proxy-headers", "--host", "0.0.0.0", "--port", "8080", "--log-level", "debug"]
diff --git a/packages/api/chart/templates/api/deployment.yaml b/packages/api/chart/templates/api/deployment.yaml
@@ -43,6 +43,8 @@ spec:
             value: /config/
           - name: LFAI_CONFIG_FILENAME
             value: "*.toml"
+          - name: DEFAULT_EMBEDDINGS_MODEL
+            value: "{{ .Values.api.defaultEmbeddingsModel }}"
           - name: PORT
             value: "{{ .Values.api.port }}"
           - name: SUPABASE_URL

diff --git a/packages/api/chart/values.yaml b/packages/api/chart/values.yaml
@@ -11,6 +11,7 @@ api:
   replicas: 1
   port: 8080
   exposeOpenAPISchema: false
+  defaultEmbeddingsModel: "###ZARF_VAR_DEFAULT_EMBEDDINGS_MODEL###"
 
 package:
   host: leapfrogai-api
diff --git a/packages/api/zarf.yaml b/packages/api/zarf.yaml
@@ -20,6 +20,8 @@ variables:
     description: "Flag to expose the OpenAPI schema for debugging."
   - name: HOSTED_DOMAIN
     default: "uds.dev"
+  - name: DEFAULT_EMBEDDINGS_MODEL
+    default: "text-embeddings"
 
 components:
   - name: leapfrogai
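
The three hunks above pass `DEFAULT_EMBEDDINGS_MODEL` from a Zarf variable, through the Helm value `api.defaultEmbeddingsModel`, into the container environment. How the API consumes that variable is not shown in this part of the diff, so the following is only a minimal sketch under stated assumptions: the helper name `resolve_embeddings_model`, its `requested` override, and the fallback constant are hypothetical and do not come from the commit.

```python
import os
from typing import Mapping, Optional

# Assumption: mirrors the Zarf variable default ("text-embeddings") shown above.
FALLBACK_MODEL = "text-embeddings"


def resolve_embeddings_model(
    requested: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick an embeddings model name (hypothetical helper, not from the commit).

    Order of precedence: an explicit per-request name, then the
    DEFAULT_EMBEDDINGS_MODEL environment variable, then the fallback.
    """
    env = os.environ if env is None else env
    if requested:
        return requested
    # An unset or empty variable falls through to the fallback.
    return env.get("DEFAULT_EMBEDDINGS_MODEL") or FALLBACK_MODEL
```

Accepting `env` as a parameter keeps the lookup testable without mutating the process environment.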

diff --git a/src/leapfrogai_api/README.md b/src/leapfrogai_api/README.md
@@ -4,8 +4,6 @@ A mostly OpenAI compliant API surface.
 
 ## Local Development Setup
 
-### Requirements
-
 1. Install dependencies
     ```bash
     make install
@@ -38,7 +36,7 @@ A mostly OpenAI compliant API surface.
     This will copy the JWT token to your clipboard.
 
 
-5. Make calls to the api swagger endpoint at `http://localhost:8080/docs` using your JWT token as the `HTTPBearer` token. 
+5. Make calls to the api swagger endpoint at `http://localhost:8080/docs` using your JWT token as the `HTTPBearer` token.
    * Hit `Authorize` on the swagger page to enter your JWT token
 
 ## Integration Tests

diff --git a/src/leapfrogai_api/backend/rag/__init__.py b/src/leapfrogai_api/backend/rag/__init__.py
diff --git a/src/leapfrogai_api/backend/rag/document_loader.py b/src/leapfrogai_api/backend/rag/document_loader.py
@@ -0,0 +1,71 @@
+"""Load a file and split it into chunks."""
+
+# This import is required for "magic" to work, see https://github.com/ahupp/python-magic/issues/233
+# may not be needed after https://github.com/ahupp/python-magic/pull/294 is merged
+import pylibmagic  # noqa: F401 # pylint: disable=unused-import
+import magic
+from langchain_community.document_loaders import (
+    CSVLoader,
+    Docx2txtLoader,
+    PyPDFLoader,
+    TextLoader,
+    UnstructuredHTMLLoader,
+    UnstructuredMarkdownLoader,
+)
+from langchain_core.documents import Document
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+
+HANDLERS = {
+    "application/pdf": PyPDFLoader,
+    "text/plain": TextLoader,
+    "text/html": UnstructuredHTMLLoader,
+    "text/csv": CSVLoader,
+    "text/markdown": UnstructuredMarkdownLoader,
+    "application/msword": Docx2txtLoader,
+    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": Docx2txtLoader,
+}
+
+
+def is_supported_mime_type(mime_type: str) -> bool:
+    """Validate the mime type of a file."""
+    return mime_type in HANDLERS
+
+
+async def load_file(file_path: str) -> list[Document]:
+    """Load a file and return a list of documents."""
+
+    mime_type = magic.from_file(file_path, mime=True)
+
+    loader = HANDLERS.get(mime_type)
+
+    if loader:
+        return await loader(file_path).aload()
+    raise ValueError(f"Unsupported file type: {mime_type}")
+
+
+async def split(docs: list[Document]) -> list[Document]:
+    """Split a document into chunks."""
+    separators = [
+        "\n\n",
+        "\n",
+        " ",
+        ".",
+        ",",
+        "\u200b",  # Zero-width space
+        "\uff0c",  # Full width comma
+        "\u3001",  # Ideographic comma
+        "\uff0e",  # Full width full stop
+        "\u3002",  # Ideographic full stop
+        "",
+    ]
+
+    text_splitter = RecursiveCharacterTextSplitter(
+        # TODO: These parameters might need to be tuned and/or exposed for configuration
+        chunk_size=500,
+        chunk_overlap=50,
+        length_function=len,
+        is_separator_regex=False,
+        separators=separators,
+    )
+
+    return await text_splitter.atransform_documents(docs)
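
To illustrate what `chunk_size=500` and `chunk_overlap=50` mean in `split` above, here is a simplified, dependency-free sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses: that splitter prefers to break on the listed separators, whereas this hypothetical `chunk_with_overlap` helper slides a fixed-width window over the text. It only shows how consecutive chunks share their boundary characters.

```python
def chunk_with_overlap(
    text: str, chunk_size: int = 500, chunk_overlap: int = 50
) -> list[str]:
    """Fixed-width sliding window (illustration only, not LangChain's logic)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts (chunk_size - chunk_overlap) characters after
    # the previous one, so neighbours share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(1200))
    pieces = chunk_with_overlap(sample)
    print([len(p) for p in pieces])
```

With the defaults, a 1,200-character input yields windows starting at offsets 0, 450, and 900, and the last 50 characters of each chunk reappear at the start of the next. The overlap helps a retrieval query that matches text near a chunk boundary still find enough surrounding context.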