Docsite rag #176

Open
wants to merge 19 commits into
base: main

Conversation

hanna-paasivirta
Contributor

@hanna-paasivirta hanna-paasivirta commented Feb 21, 2025

Short Description

Replace the Search service with new embed_docsite and search_docsite services.

Fixes #172

Implementation Details

The embed_docsite service downloads, chunks, processes metadata and indexes OpenFn documentation. The service uses Pinecone as a vector database and OpenAI for text embeddings.

The search_docsite service searches the documentation through the vector database using an input query.
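As a rough illustration of the chunk, embed, index and search flow described above, here is a minimal in-memory sketch. It is not the code in this PR: `chunk_text`, `embed` and `ToyIndex` are hypothetical names, the bag-of-words `embed` stands in for OpenAI embeddings, the list-backed `ToyIndex` stands in for a Pinecone index, and the `doc_type` metadata key is an assumption.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into chunks of at most max_words words (hypothetical chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyIndex:
    """Stand-in for a vector database collection (assumption, not Pinecone)."""

    def __init__(self):
        self.records = []

    def insert_documents(self, chunks, metadata):
        """Embed each chunk and store it with its metadata; existing records are kept."""
        for chunk, meta in zip(chunks, metadata):
            self.records.append({"text": chunk, "vector": embed(chunk), "metadata": meta})
        return len(self.records)

    def search(self, query, top_k=3, doc_type=None):
        """Return the top_k most similar records, optionally filtered by doc type."""
        q = embed(query)
        candidates = [
            r for r in self.records
            if doc_type is None or r["metadata"].get("doc_type") == doc_type
        ]
        ranked = sorted(candidates, key=lambda r: cosine(q, r["vector"]), reverse=True)
        return ranked[:top_k]
```

Calling `insert_documents` more than once appends to the same index, so different doc types can share one collection and still be filtered at query time.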

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

@hanna-paasivirta
Contributor Author

Todo: test on full docs; finish DocsiteSearch according to index metadata structure

@hanna-paasivirta hanna-paasivirta marked this pull request as ready for review March 3, 2025 13:37
@josephjclark
Collaborator

Thank you @hanna-paasivirta! I've gotten tied up with a production issue but we'll get this reviewed and merged tomorrow :)

Collaborator

@josephjclark josephjclark left a comment

This is really nice and clean, thank you! Even I have a fighting chance of understanding it.

I'd like to test before merging but it looks fantastic. Left a couple of questions in the meantime.

## Implementation
The service uses the DocsiteProcessor to download the documentation and chunk it into smaller parts. The DocsiteIndexer formats metadata, creates a new collection, embeds the chunked texts (OpenAI) and uploads them into the vector database (Pinecone).

The chunked texts can be viewed in `tmp/split_sections`.
Collaborator

Ooh nice - this is still true? I'd love to run a test tomorrow and have a look, just to see what's going on

documents, metadata_dict = docsite_processor.get_preprocessed_docs()

# Upload with metadata
idx = docsite_indexer.insert_documents(documents, metadata_dict)
Collaborator

So this doesn't clear the target collection first? It just adds new content?

Maybe we can add like a purge option or something? I'm gonna guess the default value is true but we can address that later

Contributor Author

It adds different doc types separately into the same collection (they can still be filtered by doc type), so it needs to not delete the contents. I think we might want to fill in a new collection for each version, then switch to that in production, then delete the old collection, rather than delete_collection and then refill/recreate it.

Collaborator

Right, we add docsite docs and adaptor docs separately to the same collection.

But the issue of knowing when to remove the old collection remains.

I like the idea of version switching. Incoming requests use the latest version, and once the version is updated, we can remove the old version (unless perhaps there are any outstanding requests, but at this stage the odds on a client making a call during an update feel pretty slim).

I think I'd like to:

  • have some kind of version tracking in the db (maybe we need a meta table for this)
  • Whenever there's an update, look up the current version and increment it
  • Insert all new docs into the current version
  • Update the meta table
  • Drop the old version

We can also look up the version number when the server starts up, store it in-memory for incoming requests, and update it in-memory whenever there's an update.

I suggest we open up a new issue for this and implement a solution on a new branch. It's not crazy complicated, but probably risky enough that we should make the change in isolation. I can see a few places where it might go wrong 😅
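A minimal sketch of the version-switching steps listed above, assuming an in-memory stand-in for both the vector store and the meta table. `VersionedCollections`, the `docsite-v{n}` naming scheme and the `meta` dict are all hypothetical; nothing here is the implementation from this PR or the follow-up issue.

```python
class VersionedCollections:
    """In-memory stand-in for a vector store plus a small meta table."""

    def __init__(self, base_name="docsite"):
        self.base_name = base_name
        self.collections = {}        # collection name -> list of documents
        self.meta = {"version": 0}   # stand-in for the meta table

    def _name(self, version):
        return f"{self.base_name}-v{version}"

    @property
    def current(self):
        """Name of the collection that incoming requests should query."""
        return self._name(self.meta["version"])

    def update(self, doc_batches):
        """Fill a new collection, switch readers to it, then drop the old one."""
        old_version = self.meta["version"]       # look up the current version
        new_version = old_version + 1            # ...and increment it
        new_name = self._name(new_version)
        self.collections[new_name] = []
        try:
            # Each batch is added separately, e.g. docsite docs, then adaptor docs
            for batch in doc_batches:
                self.collections[new_name].extend(batch)
        except Exception:
            # Leave the old version live if the upload fails part-way
            self.collections.pop(new_name, None)
            raise
        self.meta["version"] = new_version                     # update the meta table
        self.collections.pop(self._name(old_version), None)    # drop the old version
        return new_name
```

Because the meta table is only updated after every batch has been inserted, a failed upload leaves requests pointing at the old collection. The sketch does not handle requests that are still in flight against the old version when it is dropped.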

Contributor Author

Yep, continued on here #173

@josephjclark
Collaborator

Is embed_docsite/search.py still used? That looks like the old stuff?


```js
{
"docs_to_upload": ["adaptor_docs", "general_docs", "adaptor_functions"], // Select 1-3 types of documentation to upload
```
Collaborator

I think we should be able to default all this

We should generate the collection name according to the versioning strategy (see my earlier comment). To be fair, versioning by date is also a great solution and I'm quite happy to roll with it.

By default we should upload all docs, but users should be able to specify fewer if they want.

It would be more conventional to ask the user to pass an api key and pinecone URL. Maybe apollo should have its own credentials for this.. but also maybe not? I think we should give this more thought.

But users must be able to pass credentials, even if apollo provides a default
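One way the defaulting could look, as a sketch only: the three doc types come from the README payload, but `resolve_settings`, the `collection_name` and `api_key` payload keys, the `PINECONE_API_KEY` environment variable and the date-based `docsite-YYYYMMDD` name are all assumptions for illustration.

```python
import os
from datetime import date

ALL_DOC_TYPES = ["adaptor_docs", "general_docs", "adaptor_functions"]


def resolve_settings(payload, env=None, today=None):
    """Fill in defaults for an embed request (hypothetical helper)."""
    env = os.environ if env is None else env
    today = date.today() if today is None else today

    # Default to uploading every doc type; callers may pass a subset
    docs = payload.get("docs_to_upload") or list(ALL_DOC_TYPES)
    unknown = [d for d in docs if d not in ALL_DOC_TYPES]
    if unknown:
        raise ValueError(f"Unknown doc types: {unknown}")

    # Version the collection by date unless a name is given (assumed scheme)
    collection_name = payload.get("collection_name") or f"docsite-{today:%Y%m%d}"

    # Credentials passed by the caller win over the server-side default
    api_key = payload.get("api_key") or env.get("PINECONE_API_KEY")
    if not api_key:
        raise ValueError("No Pinecone API key in the payload or the environment")

    return {
        "docs_to_upload": list(docs),
        "collection_name": collection_name,
        "api_key": api_key,
    }
```

With this shape an empty payload uploads everything under a dated collection name using the server's key, while a caller can still override any single field.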

@josephjclark
Collaborator

@hanna-paasivirta what do I need to set up on the Pinecone side to get this working? I'm getting an error because I don't have a docsite index. When I go to create an index, I get all sorts of complicated questions about how I should configure it 😬

@hanna-paasivirta
Contributor Author

Index creation should work properly now.

Next up figuring out how to actually make use of the RAG #180

Development

Successfully merging this pull request may close these issues.

Docsite search: Add a new docsite search RAG
2 participants