Docsite search: Improving vanilla RAG #180

Open
hanna-paasivirta opened this issue Mar 6, 2025 · 0 comments

hanna-paasivirta commented Mar 6, 2025

We've built a database upload and semantic search for OpenFn documentation RAG. The first goal is to improve the AI Assistant's answers by giving it access to relevant parts of the documentation.

A key challenge with semantic search is that the quality of the results can be terrible. The premise of semantic search, that the question will look like the answer, is often simply wrong. For example, we might fetch a chunk of the documentation that introduces the problem but not the solution. Sometimes it's just hard to understand why something scored high. Sometimes the user question is not well formulated for a direct search (e.g. it contains lots of code for context). At other times, however, it works really well. There are ways to tap into this potential better.

We need a more flexible system to fetch relevant documentation for the assistant. Two main directions below:

1) Improve the search input and logic (focusing on this first)
Leverage LLM calls for a more flexible search process (more agentic). Sometimes it's best to fetch by semantic search, sometimes it's best to fetch a specific section by title, and sometimes we need both. Often, it's useful to reformulate the user question into a query.

Step one will be a lightweight LLM call that decides whether we need to fetch documentation at all. Then, potential decisions that can be combined into one or more steps include (a rough sketch follows this list):

  • Do we do semantic search?
  • Which query/queries?
  • Which filter?
  • Do we just fetch an entire section by title (especially adaptor docs)? Split into 2 (adaptor/general), then:
    • option a) ask which one of these titles [list] to fetch
    • option b) search titles by exact match, falling back to semantic search if needed
  • How many search queries to run vs how many results to return per query
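
To make the routing step concrete, here's a minimal sketch, assuming an OpenAI-style chat client; the model name, prompt wording and decision fields are placeholders rather than a settled design.

```python
# Minimal sketch of the routing call. Assumptions: the openai Python client,
# a cheap placeholder model, and a JSON "decision" shape we'd still need to design.
import json
from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = """You route documentation lookups for the OpenFn AI Assistant.
Reply with JSON: {"fetch": true/false, "strategy": "semantic" | "title" | "both",
"queries": [reformulated search queries], "title": section title or null}."""

def route(user_message: str) -> dict:
    """Lightweight LLM call deciding whether and how to fetch documentation."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any small, fast model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": ROUTER_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. route("How do I upsert records with the salesforce adaptor?") might return
# something like {"fetch": true, "strategy": "both", "queries": ["salesforce upsert"], "title": "salesforce"}
```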

v1: run only on the first user input. v2: if that works well, try running it more often, or on every conversation turn.

The highest-quality, most comprehensive way of doing this might not be ideal if we aim for it to run several times per conversation. A good goal might be to stay within +30% of the current average initial prompt token length while significantly improving its relevance. This also risks becoming increasingly complex, so we need to focus on the simplest possible approach that yields somewhat improved results.
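
One way to keep that budget honest is a simple token-count guard; a sketch, assuming tiktoken for counting and a hypothetical baseline measured from current production prompts:

```python
# Rough budget guard. BASELINE_PROMPT_TOKENS is a hypothetical figure we'd
# measure from current prompts; tiktoken is just one tokenizer option.
import tiktoken

BASELINE_PROMPT_TOKENS = 2000                     # placeholder: measured current average
TOKEN_BUDGET = int(BASELINE_PROMPT_TOKENS * 1.3)  # the +30% goal
_enc = tiktoken.get_encoding("cl100k_base")

def within_budget(prompt: str) -> bool:
    return len(_enc.encode(prompt)) <= TOKEN_BUDGET
```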

2) Improve the database/search algorithm

  • Use an LLM to summarise each chunk for an LLM-based search (step 1: select section, step 2: select from these titles, step 3: select from these summaries)
  • Try different chunk sizes (larger)
  • Try different embeddings (large version, then other types)
  • Use an LLM to contextualise each chunk before vectorising it
  • Try different similarity calculations (e.g. fetch a variety of docs)
  • Hybrid search – add an efficient keyword search to use alongside semantic search (this worked well in the vocab mapper, but isn't implemented efficiently there). A rough sketch follows this list.
    • Pinecone has its own brand of hybrid search, but I think it uses semantic search as a first layer, so it could have the same issues. It's also only available on a paid, non-serverless tier, so it's better to avoid it.
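
As an illustration of the hybrid option (independent of Pinecone), a BM25 keyword ranking could be fused with the existing semantic ranking via reciprocal rank fusion; rank_bm25 and the input shapes below are assumptions, not the current implementation.

```python
# One possible shape for hybrid search: BM25 keyword ranking fused with the
# existing semantic ranking via reciprocal rank fusion (RRF).
from rank_bm25 import BM25Okapi

def hybrid_search(query: str, chunks: list[tuple[str, str]],
                  semantic_hits: list[str], k: int = 60, top_n: int = 5) -> list[str]:
    """chunks: (doc_id, text) pairs; semantic_hits: doc_ids already ranked by
    the vector search. Returns doc_ids ranked by the fused RRF score."""
    bm25 = BM25Okapi([text.lower().split() for _, text in chunks])
    scores = bm25.get_scores(query.lower().split())
    keyword_hits = [doc_id for (doc_id, _), _ in
                    sorted(zip(chunks, scores), key=lambda pair: -pair[1])]

    # RRF: each ranking contributes 1 / (k + rank) for every doc it returns.
    fused: dict[str, float] = {}
    for ranking in (keyword_hits, semantic_hits):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)[:top_n]
```

RRF avoids having to normalise BM25 and cosine scores onto a common scale, which is one reason it's a common default for fusing keyword and semantic rankings.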
hanna-paasivirta self-assigned this Mar 6, 2025
hanna-paasivirta mentioned this issue Mar 6, 2025