Docsite rag #176

hanna-paasivirta · 2025-02-21T19:07:24Z

Short Description

Replace the Search service with new embed_docsite and search_docsite services.

Fixes #172

Implementation Details

The embed_docsite service downloads, chunks, processes metadata and indexes OpenFn documentation. The service uses Pinecone as a vector database and OpenAI for text embeddings.

The search_docsite service searches the documentation through the vector database using an input query.
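To make the chunking step more concrete, here is a minimal sketch of splitting a markdown document on its headings and attaching metadata to each chunk. It is not the code in this PR: `Chunk`, `split_by_headers`, the `max_chars` cap and the metadata keys are all hypothetical names chosen for illustration, and the real DocsiteProcessor may split and overlap chunks differently.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def split_by_headers(markdown: str, doc_type: str, max_chars: int = 200) -> list[Chunk]:
    """Split markdown on headings, then cap each section at max_chars (hypothetical helper)."""
    # Zero-width split before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Use the first line (minus the leading '#'s) as the section title.
        title = section.splitlines()[0].lstrip("#").strip()
        # Naive fixed-size windows with no overlap; a real splitter would be smarter.
        for start in range(0, len(section), max_chars):
            chunks.append(
                Chunk(
                    text=section[start:start + max_chars],
                    metadata={"doc_type": doc_type, "section": title},
                )
            )
    return chunks


doc = "# Intro\nHello\n## Setup\nInstall it\n"
for chunk in split_by_headers(doc, "general_docs"):
    print(chunk.metadata["section"], "->", repr(chunk.text))
```

Keeping the section title and doc type in the metadata is what would later allow search results to be filtered or traced back to their source.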

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

You can read more details in our Responsible AI Policy

hanna-paasivirta · 2025-02-21T19:10:20Z

Todo: test on full docs; finish DocsiteSearch according to index metadata structure

josephjclark · 2025-03-03T16:49:09Z

Thank you @hanna-paasivirta! I've gotten tied up with a production issue, but we'll get this reviewed and merged tomorrow :)

josephjclark

This is really nice and clean, thank you! Even I have a fighting chance of understanding it.

I'd like to test before merging but it looks fantastic. Left a couple of questions in the meantime.

josephjclark · 2025-03-03T18:32:48Z

services/embed_docsite/README.md

+## Implementation
+The service uses the DocsiteProcessor to download the documentation and chunk it into smaller parts. The DocsiteIndexer formats metadata, creates a new collection, embeds the chunked texts (OpenAI) and uploads them into the vector database (Pinecone).
+
+The chunked texts can be viewed in `tmp/split_sections`.


Ooh nice - this is still true? I'd love to run a test tomorrow and have a look, just to see what's going on

josephjclark · 2025-03-03T18:34:23Z

services/embed_docsite/embed_docsite.py

+        documents, metadata_dict = docsite_processor.get_preprocessed_docs()
+
+        # Upload with metadata
+        idx = docsite_indexer.insert_documents(documents, metadata_dict)


So this doesn't clear the target collection first? It just adds new content?

Maybe we can add like a purge option or something? I'm gonna guess the default value is true but we can address that later

hanna-paasivirta · 2025-03-04

It adds different doc types separately into the same collection (they can still be filtered by doc type), so it needs to not delete the contents. I think we might want to fill in a new collection for each version, then switch to that in production, then delete the old collection, rather than delete_collection and then refill/recreate it.
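As an illustration of filtering by doc type within one shared collection, here is a small in-memory sketch. It does not call Pinecone or OpenAI: the two-dimensional vectors are toy values, and `cosine`, `search` and the record layout are assumptions made for the example rather than the structures used in this PR.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_vector, doc_type=None, top_k=2):
    """Rank records by similarity, optionally restricted to one doc type."""
    candidates = [
        r for r in records
        if doc_type is None or r["metadata"]["doc_type"] == doc_type
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r["vector"], query_vector), reverse=True)
    return [r["id"] for r in ranked[:top_k]]


# Toy records: different doc types live side by side in the same collection.
records = [
    {"id": "a1", "vector": [1.0, 0.0], "metadata": {"doc_type": "adaptor_docs"}},
    {"id": "g1", "vector": [0.9, 0.1], "metadata": {"doc_type": "general_docs"}},
    {"id": "g2", "vector": [0.0, 1.0], "metadata": {"doc_type": "general_docs"}},
]

print(search(records, [1.0, 0.0]))
print(search(records, [1.0, 0.0], doc_type="general_docs"))
```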

josephjclark · 2025-03-05

Right, we add docsite docs and adaptor docs separately to the same collection.

But the issue of knowing when to remove the old collection remains.

I like the idea of version switching. Incoming requests use the latest version, and once the version is updated, we can remove the old version (unless perhaps there are any outstanding requests, but at this stage the odds on a client making a call during an update feel pretty slim).

I think in this PR I'd like to:

- Have some kind of version tracking in the db (maybe we need a meta table for this)
- Whenever there's an update, look up the current version and increment it
- Insert all new docs into the current version
- Update the meta table
- Drop the old version

We can also lookup the version number when the server starts up, store it in-memory for incoming requests, and update it in-memory whenever there's an update.
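The update flow described above could be sketched roughly as follows. This is an in-memory stand-in only: `VersionedStore`, the `docsite-v{n}` naming and the `meta` dict are hypothetical, and a real implementation would talk to the vector database and a persistent meta table instead.

```python
class VersionedStore:
    """Toy model of version switching between collections (hypothetical)."""

    def __init__(self):
        self.collections = {}                # collection name -> list of docs
        self.meta = {"current_version": 0}   # stand-in for a meta table
        self.active_version = 0              # in-memory copy used by incoming requests

    @staticmethod
    def _name(version):
        return f"docsite-v{version}"

    def update(self, docs):
        """Fill a new collection, switch to it, then drop the old one."""
        old = self.meta["current_version"]
        new = old + 1
        self.collections[self._name(new)] = list(docs)  # insert all new docs
        self.meta["current_version"] = new               # update the meta table
        self.active_version = new                        # update the in-memory version
        self.collections.pop(self._name(old), None)      # drop the old version
        return new

    def search_collection(self):
        """Return the docs that incoming requests would currently search."""
        return self.collections.get(self._name(self.active_version), [])


store = VersionedStore()
store.update(["first upload"])
store.update(["second upload", "more docs"])
print(list(store.collections), store.search_collection())
```

Because the old collection is only dropped after the switch, requests always have a complete collection to read from, which is the main appeal over deleting and refilling in place.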

I suggest we open up a new issue for this and implement a solution on a new branch. It's not crazy complicated, but probably risky enough that we should make the change in isolation. I can see a few places where it might go wrong 😅

hanna-paasivirta · 2025-03-06

Yep, continued in #173

services/embed_docsite/github_utils.py

josephjclark · 2025-03-05T15:22:19Z

Is embed_docsite/search.py still used? That looks like the old stuff?

josephjclark · 2025-03-05T15:32:43Z

services/embed_docsite/README.md

+
+```js
+{
+    "docs_to_upload": ["adaptor_docs", "general_docs", "adaptor_functions"], // Select 1-3 types of documentation to upload


I think we should be able to default all this

We should generate the collection name according to the versioning strategy (see my earlier comment). To be fair, versioning by date is also a great solution and I'm quite happy to roll with it.

By default we should upload all docs, but users should be able to specify fewer if they want.

It would be more conventional to ask the user to pass an API key and Pinecone URL. Maybe Apollo should have its own credentials for this, but also maybe not? I think we should give this more thought.

But users must be able to pass credentials, even if Apollo provides a default.
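One way the defaulting suggested here might look is sketched below. Only `docs_to_upload` and its three values come from the README excerpt; `resolve_settings`, `collection_name`, `pinecone_api_key`, the `PINECONE_API_KEY` variable and the date-based collection name are assumptions for illustration, not the service's actual payload.

```python
import os
from datetime import date

ALL_DOC_TYPES = ["adaptor_docs", "general_docs", "adaptor_functions"]


def resolve_settings(payload, env=None, today=None):
    """Fill in defaults for anything the caller did not pass (hypothetical helper)."""
    env = os.environ if env is None else env
    today = today or date.today()

    # Default to uploading every doc type; callers may request fewer.
    docs = payload.get("docs_to_upload") or ALL_DOC_TYPES
    unknown = [d for d in docs if d not in ALL_DOC_TYPES]
    if unknown:
        raise ValueError(f"Unknown doc types: {unknown}")

    return {
        "docs_to_upload": list(docs),
        # Version the collection by date unless a name is given explicitly.
        "collection_name": payload.get("collection_name") or f"docsite-{today:%Y%m%d}",
        # Credentials in the payload win over a server-side default.
        "pinecone_api_key": payload.get("pinecone_api_key") or env.get("PINECONE_API_KEY"),
    }


print(resolve_settings({}, env={"PINECONE_API_KEY": "server-key"}, today=date(2025, 3, 5)))
```

With this shape, an empty payload is valid, while a caller who passes their own key or a subset of doc types overrides the defaults.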

josephjclark · 2025-03-05T15:49:41Z

@hanna-paasivirta what do I need to set up on the Pinecone side to get this working? I'm getting an error because I don't have a docsite index. When I go to create an index, I get all sorts of complicated questions about how I should configure it 😬

hanna-paasivirta · 2025-03-06T17:21:04Z

Index creation should work properly now.

Next up: figuring out how to actually make use of the RAG (#180).

hanna-paasivirta added 8 commits February 19, 2025 09:59

add new docsite rag file structure

37b92a0

refactor docsite processing

b92e782

simplify adaptor data processing to get docs

1938de6

remove empty file

751fa40

add splitting by headers

0d8e72a

add github download for docs

43c75d3

simplify chunking for all doc types

b2598be

add indexing for all three types of docs

92d3c1e

hanna-paasivirta added 6 commits February 24, 2025 19:09

fix chunk overlap for all docs

9e787cb

add search filtering

df3ecb1

fix index initialisation

d619f20

add database upload check

4f1164c

adjust metadata fields

4dc1e03

Tidy and add docstrings

2d6e9a5

hanna-paasivirta marked this pull request as ready for review March 3, 2025 13:37

hanna-paasivirta assigned hanna-paasivirta and josephjclark and unassigned hanna-paasivirta Mar 3, 2025

changeset

2c51eb8

josephjclark reviewed Mar 3, 2025

add GitHub API limits doc link

867e797

josephjclark reviewed Mar 5, 2025

update dependencies

2ca7663

hanna-paasivirta mentioned this pull request Mar 6, 2025

Docsite search: Set update mechanism #173

Open

hanna-paasivirta added 2 commits March 6, 2025 17:16

fix new index creation

f900c59

Merge branch 'docsite-rag' of github.com:OpenFn/apollo into docsite-rag

5ef6fad
