Docsite search: Improving vanilla RAG #180

Open
hanna-paasivirta opened this issue Mar 6, 2025 · 0 comments

hanna-paasivirta commented Mar 6, 2025

We've built a database upload and semantic search for OpenFn documentation RAG. The first goal is to improve the AI Assistant's answers by giving it access to relevant parts of the documentation.

A key challenge with semantic search is that the quality of the results can be terrible. The premise of semantic search, that the question will look like the answer, is often simply wrong. For example, we might fetch a chunk of the documentation that introduces the problem but not the solution. Sometimes it's just hard to understand why something scored high. Sometimes the user question is not well formulated for a direct search (e.g. it contains lots of code for context). At other times, however, it works really well. There are ways to tap into this potential better.

We need a more flexible system to fetch relevant documentation for the assistant. Two main directions below:

1) Improve the search input and logic (focusing on this first)
Leverage LLM calls for a more flexible search process (more agentic). Sometimes it's best to fetch by semantic search, sometimes it's best to fetch a specific section by title, and sometimes we need both. Often, it's useful to reformulate the user question into a query.

Step one will be a lightweight LLM call that decides whether we need to fetch documentation at all. Then, potential decisions that can be combined into one or more steps include (a rough sketch follows this list):

  • Do we do semantic search?
  • Which query/queries?
  • Which filter?
  • Do we just fetch an entire section by title (especially adaptor docs)? Split into 2 (adaptor/general), then:
    • option a) ask which one of these titles [list] to fetch
    • option b) search titles by exact match, falling back to semantic search if needed
  • How many search queries to run vs how many results to return per query
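
To make the routing step concrete, here's a minimal sketch, assuming an OpenAI-style chat client; the model name, prompt wording and decision fields are placeholders rather than a settled design.

```python
# Minimal sketch of the routing call. Assumptions: the openai Python client,
# a cheap placeholder model, and a JSON "decision" shape we'd still need to design.
import json
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = """You route documentation lookups for the OpenFn AI Assistant.
Reply with JSON: {"fetch": true/false, "strategy": "semantic" | "title" | "both",
"queries": [reformulated search queries], "title": section title or null}."""

def route(user_message: str) -> dict:
    """Lightweight LLM call deciding whether and how to fetch documentation."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any small, fast model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. route("How do I upsert records with the salesforce adaptor?") might return
# something like {"fetch": true, "strategy": "both", "queries": ["salesforce upsert"], "title": "salesforce"}
```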

v1: run only on the first user input. v2: if that works well, try running it more often, or on every conversation turn.

The highest-quality, most comprehensive way of doing this might not be ideal if we aim for it to run several times per conversation. A good goal might be to stay within +30% of the current average initial prompt token length while significantly improving its relevance. This also risks becoming increasingly complex, so we need to focus on the simplest possible approach that yields somewhat improved results.
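
One way to keep that budget honest is a simple token-count guard; a sketch, assuming tiktoken for counting and a hypothetical baseline measured from current production prompts:

```python
# Rough budget guard. BASELINE_PROMPT_TOKENS is a hypothetical figure we'd
# measure from current prompts; tiktoken is just one tokenizer option.
import tiktoken

BASELINE_PROMPT_TOKENS = 2000                     # placeholder: measured current average
TOKEN_BUDGET = int(BASELINE_PROMPT_TOKENS * 1.3)  # the +30% goal
_enc = tiktoken.get_encoding("cl100k_base")

def within_budget(prompt: str) -> bool:
    return len(_enc.encode(prompt)) <= TOKEN_BUDGET
```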

2) Improve the database/search algorithm

  • Use an LLM to summarise each chunk for an LLM-based search (step 1: select section, step 2: select from these titles, step 3: select from these summaries)
  • Try different chunk sizes (larger)
  • Try different embeddings (large version, then other types)
  • Use an LLM to contextualise each chunk before vectorising it
  • Try different similarity calculations (e.g. fetch a variety of docs)
  • Hybrid search – add an efficient keyword search to use alongside semantic search (this worked well in the vocab mapper, but isn't implemented efficiently there). A rough sketch follows this list.
    • Pinecone has its own brand of hybrid search, but I think it uses semantic search as a first layer, so it could have the same issues. It's also only available on a paid, non-serverless tier, so it's better to avoid it.
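
As an illustration of the hybrid option (independent of Pinecone), a BM25 keyword ranking could be fused with the existing semantic ranking via reciprocal rank fusion; rank_bm25 and the input shapes below are assumptions, not the current implementation.

```python
# One possible shape for hybrid search: BM25 keyword ranking fused with the
# existing semantic ranking via reciprocal rank fusion (RRF).
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[tuple[str, str]],
                  semantic_hits: list[str], k: int = 60, top_n: int = 5) -> list[str]:
    """chunks: (doc_id, text) pairs; semantic_hits: doc_ids already ranked by
    the vector search. Returns doc_ids ranked by the fused RRF score."""
    bm25 = BM25Okapi([text.lower().split() for _, text in chunks])
    scores = bm25.get_scores(query.lower().split())
    keyword_hits = [doc_id for (doc_id, _), _ in
                    sorted(zip(chunks, scores), key=lambda pair: -pair[1])]

    # RRF: each ranking contributes 1 / (k + rank) for every doc it returns.
    fused: dict[str, float] = {}
    for ranking in (keyword_hits, semantic_hits):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)[:top_n]
```

RRF avoids having to normalise BM25 and cosine scores onto a common scale, which is one reason it's a common default for fusing keyword and semantic rankings.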
hanna-paasivirta self-assigned this Mar 6, 2025
hanna-paasivirta mentioned this issue Mar 6, 2025