chore: improve the Indexer #245

Merged: 2 commits, Oct 30, 2024

@muralov (Collaborator) commented Oct 30, 2024

Description

Changes proposed in this pull request:

  • Index with batches to prevent rate limiting
  • Clean-up table after test run
  • Add more logs
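
The first bullet can be illustrated with a minimal sketch of batched indexing. This is not the code from the PR: `iter_batches`, `index_in_batches`, `store_batch`, and `pause_seconds` are hypothetical names, and the pause between batches is only one assumed way to stay under a rate limit.

```python
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(
    chunks: Sequence[T],
    store_batch: Callable[[Sequence[T]], None],
    batch_size: int = 100,
    pause_seconds: float = 0.0,
) -> int:
    """Store chunks one batch at a time and return the number of batches.

    store_batch is a hypothetical callback standing in for whatever call
    embeds and stores a batch; it is not taken from the PR.
    """
    batches = 0
    for batch in iter_batches(chunks, batch_size):
        if batches > 0 and pause_seconds > 0:
            # Assumed mitigation: wait between requests to avoid rate limiting.
            time.sleep(pause_seconds)
        store_batch(batch)
        batches += 1
    return batches
```

Each call to `store_batch` receives at most `batch_size` chunks, so the number of requests grows with the number of batches rather than with the number of chunks.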

Related issue(s)
#199

@muralov muralov requested a review from a team as a code owner October 30, 2024 11:36

Note(s) for PR Author:

  • The integration test will be skipped for the PR. You can trigger it manually after adding the label: run-integration-test.
  • The evaluation test will be skipped for the PR. You can trigger it manually after adding the label: evaluation requested.
  • If any changes are made to the evaluation tests data, make sure that the integration tests are working as expected.
  • If any changes are made to how to run the unit tests, make sure to update the steps for unit-tests in the create-release.yml workflow as well.

Note(s) for PR Reviewer(s):

  • Make sure that the integration and evaluation tests are working as expected.

@muralov muralov changed the title from "Improve the Indexer" to "chore: improve the Indexer" Oct 30, 2024
@muralov muralov force-pushed the improve-doc-indexer branch from 4986911 to 6408e80 Compare October 30, 2024 11:41
@muralov muralov removed the request for review from friedrichwilken October 30, 2024 11:41
doc_indexer/src/utils/models.py (outdated)
@@ -5,6 +5,8 @@
from decouple import Config, RepositoryEnv, config
from dotenv import find_dotenv, load_dotenv

# TODO: re-use the settings parent project
Member

Is this TODO supposed to be completed as part of this PR?

Collaborator Author

It should be part of this issue. Do you want me to add the issue link to the comment too?

Collaborator Author

I've added the issue link too.

Member

Please remove the TODO comment from the code then. Having it in the issue is enough.

DOCS_PATH = config("DOCS_PATH", default=None)
DOCS_PATH = config("DOCS_PATH", default="data/output")
DOCS_TABLE_NAME = config("DOCS_TABLE_NAME", default="kyma_docs")
CHUNKS_BATCH_SIZE = config("CHUNKS_BATCH_SIZE", cast=int, default=100)
Member

Is 100 the best possible chunk size?

Collaborator Author

@muralov muralov Oct 30, 2024

The number of chunks is 474, which means there will be 5 batches at a batch size of 100. I also tested with a batch size of 200. It succeeded multiple times, so I increased the value to 200. We can reconfigure it later if we want to increase it further.
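
The batch counts mentioned here follow from a ceiling division, which can be checked with a small sketch. `batch_count` is a hypothetical helper for illustration, not a function from the repository.

```python
import math


def batch_count(total_chunks: int, batch_size: int) -> int:
    """Number of batches needed when the last batch may be partially filled."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(total_chunks / batch_size)


print(batch_count(474, 100))  # 5
print(batch_count(474, 200))  # 3
```

With 474 chunks, raising the batch size from 100 to 200 reduces the number of requests from 5 to 3.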

@muralov muralov force-pushed the improve-doc-indexer branch from 9ae17e9 to 263a387 Compare October 30, 2024 14:37
@muralov muralov force-pushed the improve-doc-indexer branch from 263a387 to 956a493 Compare October 30, 2024 14:50
@kyma-bot kyma-bot added the lgtm label Oct 30, 2024
@kyma-bot kyma-bot merged commit 43b326c into kyma-project:main Oct 30, 2024
14 checks passed
Development

Successfully merging this pull request may close these issues.

Create Script to Pull, Clean, Embed, and Store Kyma Documentation in Hana Vector Database
3 participants