Docsite rag #176
base: main
Conversation
Todo: test on full docs; finish DocsiteSearch according to index metadata structure
Thank you @hanna-paasivirta! I've gotten tied up with a production issue, but we'll get this reviewed and merged tomorrow :)
This is really nice and clean, thank you! Even I have a fighting chance of understanding it.
I'd like to test before merging but it looks fantastic. Left a couple of questions in the meantime.
## Implementation

The service uses the DocsiteProcessor to download the documentation and chunk it into smaller parts. The DocsiteIndexer formats metadata, creates a new collection, embeds the chunked texts (OpenAI) and uploads them into the vector database (Pinecone).

The chunked texts can be viewed in `tmp/split_sections`.
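As a rough illustration of that pipeline, here is a minimal sketch; the OpenAI and Pinecone client calls are real SDK calls, but the embedding model, the index name, and the internal shape of the DocsiteIndexer are assumptions rather than the code in this PR:

```python
# Illustrative sketch only; not the actual DocsiteIndexer implementation.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                        # expects OPENAI_API_KEY in the environment
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # credential handling is assumed
index = pc.Index("docsite-2024-06-01")          # collection/index name is illustrative

def embed_chunks(chunks):
    # Embed each chunked text with OpenAI (model choice is an assumption).
    response = openai_client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]

def upload_chunks(chunks, metadata_dict):
    # Upsert one vector per chunk, carrying the formatted metadata alongside it.
    vectors = [
        {"id": f"chunk-{i}", "values": emb, "metadata": metadata_dict.get(f"chunk-{i}", {})}
        for i, emb in enumerate(embed_chunks(chunks))
    ]
    index.upsert(vectors=vectors)
```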
Ooh nice - this is still true? I'd love to run a test tomorrow and have a look, just to see what's going on
documents, metadata_dict = docsite_processor.get_preprocessed_docs()

# Upload with metadata
idx = docsite_indexer.insert_documents(documents, metadata_dict)
So this doesn't clear the target collection first? It just adds new content?
Maybe we can add like a `purge` option or something? I'm gonna guess the default value is `true`, but we can address that later.
It adds different doc types separately into the same collection (they can still be filtered by doc type), so it needs to not delete the contents. I think we might want to fill in a new collection for each version, then switch to that in production, then delete the old collection, rather than `delete_collection` and then refill/recreate it.
Right, we add docsite docs and adaptor docs separately to the same collection.
But the issue of knowing when to remove the old collection remains.
I like the idea of version switching. Incoming requests use the latest version, and once the version is updated, we can remove the old version (unless perhaps there are any outstanding requests, but at this stage the odds on a client making a call during an update feel pretty slim).
I think in this PR I'd like to:
- have some kind of version tracking in the db (maybe we need a meta table for this)
- whenever there's an update, look up the current version and increment it
- insert all new docs into the current version
- update the meta table
- drop the old version

We can also look up the version number when the server starts up, store it in memory for incoming requests, and update it in memory whenever there's an update.
I suggest we open up a new issue for this and implement a solution on a new branch. It's not crazy complicated, but probably risky enough that we should make the change in isolation. I can see a few places where it might go wrong 😅
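For illustration only, here is a minimal sketch of that flow; the meta-store interface, the collection naming, and the `collection_name` argument on `insert_documents` are all assumptions, not decisions:

```python
# Hypothetical version-switching flow; names and interfaces are illustrative only.
def rebuild_docsite_collection(meta_store, docsite_processor, docsite_indexer):
    current = meta_store.get("docsite_version", 0)       # look up the current version
    new_version = current + 1                            # increment it for this update

    documents, metadata_dict = docsite_processor.get_preprocessed_docs()
    docsite_indexer.insert_documents(
        documents,
        metadata_dict,
        collection_name=f"docsite_v{new_version}",        # fill a fresh collection
    )

    meta_store.set("docsite_version", new_version)        # switch readers to the new version
    docsite_indexer.delete_collection(f"docsite_v{current}")  # then drop the old one
```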
Yep, continued on here #173
```js
{
  "docs_to_upload": ["adaptor_docs", "general_docs", "adaptor_functions"], // Select 1-3 types of documentation to upload
  // ...
}
```
I think we should be able to default all this
We should generate the collection name according to the versioning strategy (see my earlier comment). To be fair, versioning by date is also a great solution and I'm quite happy to roll with it.
By default we should upload all docs, but users should be able to specify fewer if they want.
It would be more conventional to ask the user to pass an API key and Pinecone URL. Maybe apollo should have its own credentials for this... but also maybe not? I think we should give this more thought.
But users must be able to pass credentials, even if apollo provides a default.
@hanna-paasivirta what do I need to set up on the Pinecone side to get this working? I'm getting an error because I don't have a …
Index creation should work properly now. Next up: figuring out how to actually make use of the RAG #180
Short Description
Replace the Search service with new embed_docsite and search_docsite services.
Fixes #172
Implementation Details
The embed_docsite service downloads, chunks, processes metadata and indexes OpenFn documentation. The service uses Pinecone as a vector database and OpenAI for text embeddings.
The search_docsite service searches the documentation through the vector database using an input query.
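For a sense of how the two services might be called end to end, here is a hedged sketch; the base URL, the endpoint paths, and the `query` field are assumptions for illustration, and only `docs_to_upload` appears in this PR:

```python
import requests

APOLLO_URL = "http://localhost:3000"  # assumed local apollo server; adjust to your deployment

# Download, chunk, embed (OpenAI) and index (Pinecone) the documentation.
requests.post(f"{APOLLO_URL}/services/embed_docsite", json={
    "docs_to_upload": ["adaptor_docs", "general_docs", "adaptor_functions"],
})

# Query the indexed documentation.
response = requests.post(f"{APOLLO_URL}/services/search_docsite", json={
    "query": "How do I transform records in a job expression?",
})
print(response.json())
```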
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy