Create Script to Pull, Clean, Embed, and Store Kyma Documentation in Hana Vector Database #199

Closed
6 tasks done
Teneroy opened this issue Oct 2, 2024 · 3 comments · Fixed by #239, #241, #244, #247 or #248

Teneroy commented Oct 2, 2024

Description

The goal of this task is to develop a script that automates the process of pulling Kyma BTP and Kyma Open-Source documentation (in .md format), filtering for relevant documents, embedding them using a suitable model, and storing the resulting embeddings in the Hana Vector Database. The embedding model used should be carefully selected, with a suggestion to start by exploring OpenAI models, given their success in previous PoC experiments. An appropriate chunking strategy for breaking down the documentation into manageable parts must also be implemented. A plan to trigger this script will be discussed with the team for follow-up tasks.

This task can be parallelized: two people can work on it and split the subtasks however they decide. (Strongly recommended.)

Subtasks

  1. Pull Kyma Documentation:

    • Write a script to pull Kyma BTP and Kyma Open-Source documentation in .md format from their respective sources.
    • Ensure that the script covers all relevant documents for both BTP and Open-Source versions.
  2. Filter Relevant Documentation Files:

    • Implement logic to keep only the relevant documentation files for embedding, based on predefined criteria.
    • Define what constitutes "relevant" documents in the context of Kyma Companion’s needs (e.g., technical reference docs, API documentation, core concepts, etc.).
    • Ensure non-relevant files (e.g., examples, license files, or changelogs) are excluded from processing.
  3. Choose an Embedding Model:

    • Research and select an appropriate embedding model for converting the cleaned documentation into vector embeddings.
    • Start by evaluating OpenAI’s embedding models (used previously in PoC) and explore other alternatives if necessary.
  4. Implement Chunking Strategy:

    • Define an initial strategy for breaking down the documentation into smaller chunks to ensure effective and meaningful embeddings.
    • Test chunking strategies for both large and small documentation files to strike a balance between chunk size and relevance.
    • Use PoC experiments as a reference to guide the chunking implementation.
  5. Store Embeddings in Hana Vector Database:

    • Once the documentation is embedded, develop the logic to store the resulting embeddings in the Hana Vector Database.
    • Ensure that all relevant metadata (document title, section, source, etc.) is stored along with the embeddings for easy retrieval.
  6. Propose Triggering Mechanism:

    • As part of this task, propose an efficient method for triggering the script (e.g., manual trigger, automated based on repository changes).
    • Discuss this triggering method with the team to gather input for a follow-up task.
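As a starting point for subtask 2, the filtering step could look roughly like the sketch below. The exclusion rules are assumptions derived only from the examples named above (examples, license files, changelogs); `EXCLUDED_DIR_NAMES`, `EXCLUDED_FILE_STEMS`, `is_relevant`, and `filter_relevant` are hypothetical names, and the real criteria still need to be defined.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion rules: the issue only names examples,
# license files, and changelogs as non-relevant.
EXCLUDED_DIR_NAMES = {"examples"}
EXCLUDED_FILE_STEMS = {"license", "changelog"}


def is_relevant(path: str) -> bool:
    """Return True if a repository-relative path looks like a doc file to embed."""
    p = PurePosixPath(path)
    # Only Markdown files are candidates (case-insensitive, so *.MD also matches).
    if p.suffix.lower() != ".md":
        return False
    # Drop license and changelog files by file name.
    if p.stem.lower() in EXCLUDED_FILE_STEMS:
        return False
    # Drop anything located under an excluded directory.
    return not any(part.lower() in EXCLUDED_DIR_NAMES for part in p.parts[:-1])


def filter_relevant(paths: list[str]) -> list[str]:
    """Keep only the paths that pass is_relevant, preserving order."""
    return [p for p in paths if is_relevant(p)]
```

Keeping the rules in two small sets makes it easy to extend them once the team agrees on what "relevant" means for Kyma Companion.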

Subtasks

  • Prepare Kyma documents. Filter and clean up the *.MD files automatically. - @mfaizanse
  • Choose an embedding model - Mansur
  • Implement chunking and store it to the Vector DB - Mansur
  • Come up with an automatic indexing mechanism

Acceptance Criteria

  • The script successfully pulls Kyma BTP and Kyma Open-Source documentation in .md format. (Assignee: @mfaizanse )
  • Non-relevant files are excluded, and only relevant documentation files are processed. (Assignee: @mfaizanse )
  • An appropriate embedding model is selected and used to generate vector embeddings for the documentation.
  • Documentation is chunked effectively, ensuring relevant embeddings are created.
  • Embeddings and related metadata are stored in the Hana Vector Database.
  • A method for triggering the script is proposed and discussed with the team.
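To make the chunking and metadata criteria more concrete, here is a minimal sketch that splits a Markdown document at its headings and keeps title, section, and source with each chunk. Heading-based splitting is only one possible strategy and is an assumption here, not a decision from the PoC; `Chunk` and `chunk_markdown` are hypothetical names, and embedding and writing to the Hana Vector Database are intentionally left out.

```python
import re
from dataclasses import dataclass

# ATX-style Markdown heading: one to six '#' characters, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    source: str   # path or URL of the document
    title: str    # first level-1 heading of the document
    section: str  # heading the chunk text belongs to
    text: str     # chunk body that would be embedded


def chunk_markdown(source: str, markdown: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per heading section."""
    chunks: list[Chunk] = []
    title = ""
    section = ""
    buffer: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the buffered lines under the section that was active so far.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(source, title, section, text))
        buffer.clear()

    for line in markdown.splitlines():
        # Track fenced code blocks so '#' comments in code are not headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            section = match.group(2)
            if len(match.group(1)) == 1 and not title:
                title = section
        else:
            buffer.append(line)
    flush()
    return chunks
```

Very long sections would still need a size-based split on top of this, which is where the balance between chunk size and relevance would have to be tested.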
@muralov muralov self-assigned this Oct 17, 2024
@mfaizanse mfaizanse self-assigned this Oct 24, 2024
@muralov muralov linked a pull request Oct 29, 2024 that will close this issue
mfaizanse commented Oct 29, 2024

Todo(s):

  • Rate limit and retries
  • More logs
  • Default table name
  • Clean up table in tests
  • Tests in fetcher (Assignee: @mfaizanse )
  • Update documentation for fetcher (Assignee: @mfaizanse )
  • GitHub Actions for lint and tests (optional) (Assignee: @mfaizanse )
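For the "Rate limit and retries" item, one common approach is exponential backoff around the embedding call. The sketch below is an assumption about how that could look, not the implemented solution; `RateLimitError` is a hypothetical stand-in for whatever exception the chosen embedding client raises when throttled, and `with_retries` is a hypothetical helper name.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on RateLimitError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimitError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.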

@mfaizanse

Check the doc main/docs/rag/indexing.md in the a-force repo.
