Create Script to Pull, Clean, Embed, and Store Kyma Documentation in Hana Vector Database #199

Teneroy · 2024-10-02T02:09:32Z

Description

The goal of this task is to develop a script that automates the process of pulling Kyma BTP and Kyma Open-Source documentation (in .md format), filtering for relevant documents, embedding them using a suitable model, and storing the resulting embeddings in the Hana Vector Database. The embedding model used should be carefully selected, with a suggestion to start by exploring OpenAI models, given their success in previous PoC experiments. An appropriate chunking strategy for breaking down the documentation into manageable parts must also be implemented. A plan to trigger this script will be discussed with the team for follow-up tasks.

This task can be parallelized: two people can work on it and split the subtasks however they decide. (Strongly recommended)

Subtasks

Pull Kyma Documentation:
- Write a script to pull Kyma BTP and Kyma Open-Source documentation in .md format from their respective sources.
- Ensure that the script covers all relevant documents for both BTP and Open-Source versions.
Filter Relevant Documentation Files:
- Implement logic to keep only the relevant documentation files for embedding, based on predefined criteria.
- Define what constitutes "relevant" documents in the context of Kyma Companion’s needs (e.g., technical reference docs, API documentation, core concepts, etc.).
- Ensure non-relevant files (e.g., examples, license files, or changelogs) are excluded from processing.
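
A minimal sketch of what such a filter could look like, assuming the pulled files are available as repository-relative paths. The exclusion sets and the names `EXCLUDED_DIRS`, `EXCLUDED_NAMES`, `is_relevant`, and `filter_docs` are hypothetical placeholders; the actual relevance criteria are still to be defined as part of this subtask.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion rules, for illustration only.
EXCLUDED_DIRS = {"examples", "node_modules"}
EXCLUDED_NAMES = {"license.md", "changelog.md", "code_of_conduct.md"}


def is_relevant(path: str) -> bool:
    """Return True if a repository-relative path looks like a doc worth embedding."""
    p = PurePosixPath(path)
    # Only Markdown files are candidates.
    if p.suffix.lower() != ".md":
        return False
    # Drop well-known non-documentation files by name.
    if p.name.lower() in EXCLUDED_NAMES:
        return False
    # Drop anything that lives under an excluded directory.
    return not any(part.lower() in EXCLUDED_DIRS for part in p.parts[:-1])


def filter_docs(paths: list[str]) -> list[str]:
    """Keep only the paths that pass the relevance check, preserving order."""
    return [p for p in paths if is_relevant(p)]
```

Keeping the rules in plain sets makes them easy to review and extend once the team agrees on what counts as relevant.
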
Choose an Embedding Model:
- Research and select an appropriate embedding model for converting the cleaned documentation into vector embeddings.
- Start by evaluating OpenAI’s embedding models (used previously in PoC) and explore other alternatives if necessary.
Implement Chunking Strategy:
- Define an initial strategy for breaking down the documentation into smaller chunks to ensure effective and meaningful embeddings.
- Test chunking strategies for both large and small documentation files to strike a balance between chunk size and relevance.
- Use PoC experiments as a reference to guide the chunking implementation.
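
One possible starting point, sketched under assumptions: split each Markdown file at its headings, then cut any oversized section into fixed-size windows. The function name `chunk_markdown`, the `max_chars` default, and the chunk dictionary layout are hypothetical; the PoC experiments may point to a different strategy.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, max_chars: int = 1000) -> list[dict]:
    """Split Markdown into chunks, one or more per heading-delimited section."""
    chunks: list[dict] = []
    title = ""
    buf: list[str] = []

    def flush() -> None:
        body = "\n".join(buf).strip()
        # Sections longer than max_chars are cut into fixed-size windows.
        for start in range(0, len(body), max_chars):
            chunks.append({"section": title, "text": body[start:start + max_chars]})
        buf.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            title = match.group(2).strip()
        else:
            buf.append(line)
    flush()  # emit the last section
    return chunks
```

This sketch ignores two things a real implementation would need to handle: lines starting with `#` inside fenced code blocks are treated as headings, and the fixed-size windows can cut mid-sentence. Carrying the section title on each chunk gives the storage step a piece of metadata to persist with the embedding.
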
Store Embeddings in Hana Vector Database:
- Once the documentation is embedded, develop the logic to store the resulting embeddings in the Hana Vector Database.
- Ensure that all relevant metadata (document title, section, source, etc.) is stored along with the embeddings for easy retrieval.
Propose Triggering Mechanism:
- As part of this task, propose an efficient method for triggering the script (e.g., manual trigger, automated based on repository changes).
- Discuss this triggering method with the team to gather input for a follow-up task.

Subtasks

Prepare Kyma documents. Filter and clean up the *.MD files automatically. - @mfaizanse
Choose an embedding model - Mansur
Implement chunking and store it to the Vector DB - Mansur
Come up with an automatic indexing mechanism

Acceptance Criteria

The script successfully pulls Kyma BTP and Kyma Open-Source documentation in .md format. (Assignee: @mfaizanse )
Non-relevant files are excluded, and only relevant documentation files are processed. (Assignee: @mfaizanse )
An appropriate embedding model is selected and used to generate vector embeddings for the documentation.
Documentation is chunked effectively, ensuring relevant embeddings are created.
Embeddings and related metadata are stored in the Hana Vector Database.
A method for triggering the script is proposed and discussed with the team.


mfaizanse · 2024-10-29T14:07:55Z

Follow-up tickets:

mfaizanse · 2024-10-29T14:12:16Z

Todo(s):

Rate limit and retries
More logs
Default table name
Clean up table in tests
Tests in fetcher (Assignee: @mfaizanse )
Update documentation for fetcher (Assignee: @mfaizanse )
GitHub Actions for lint and tests (optional) (Assignee: @mfaizanse )
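
For the "Rate limit and retries" item, a minimal sketch of retrying with exponential backoff, assuming the embedding call is wrapped in a zero-argument callable. The helper name `with_retries` and its parameters are hypothetical and are not taken from the indexer code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], object] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A real version would likely catch only the rate-limit error of the client in use instead of every exception.
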

mfaizanse · 2024-11-06T09:02:57Z

Check the doc main/docs/rag/indexing.md in the a-force repo.

Teneroy added this to the Develop core agent functionality milestone Oct 2, 2024

muralov self-assigned this Oct 17, 2024

mfaizanse self-assigned this Oct 24, 2024

muralov mentioned this issue Oct 24, 2024

feat: implement indexing #239

Merged

muralov linked a pull request Oct 29, 2024 that will close this issue

feat: implement indexing #239

Merged

mfaizanse mentioned this issue Oct 29, 2024

feat: added script to pull documents for RAG #241

Merged

mfaizanse mentioned this issue Oct 30, 2024

chore: updated documentation for doc_indexer #244

Merged

This was linked to pull requests Oct 30, 2024

feat: added script to pull documents for RAG #241

Merged

chore: updated documentation for doc_indexer #244

Merged

muralov mentioned this issue Oct 30, 2024

chore: improve the Indexer #245

Merged

kyma-bot closed this as completed in #241 Oct 31, 2024

mfaizanse reopened this Oct 31, 2024

mfaizanse mentioned this issue Nov 4, 2024

chore: added github actions for tests of doc-indexer #247

Merged

mfaizanse linked a pull request Nov 4, 2024 that will close this issue

chore: added github actions for tests of doc-indexer #247

Merged

kyma-bot closed this as completed in #247 Nov 4, 2024

mfaizanse reopened this Nov 4, 2024

This was linked to pull requests Nov 4, 2024

fix: fix doc_indexer actions #248

Merged

fix: fix tests github action of doc_indexer #249

Merged

chore: improve the Indexer #245

Merged

mfaizanse closed this as completed Nov 4, 2024
