Not able to load the existing index #78

Open
llm-finetune opened this issue Jan 16, 2025 · 3 comments

@llm-finetune

Hi, I am working on a RAG application and trying to implement document indexing with the PyLate library. Below is the code I use to create the index:

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
)

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

After running the above code, the index is initialized.

documents_embeddings = model.encode(
    documents,
    batch_size=1,
    is_query=False,
    show_progress_bar=True,
)

After running the above code, the embeddings are stored in the index.

However, when I try to load the index with the code below, I get an error. I have tried several things but could not find a solution.

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

Note: I am working on Windows.

Any solution or guidance would be appreciated.

Thanks,

Error:
RuntimeError: Tried to read 18648 bytes from stream, but only received 974 bytes!

Error Trace
Traceback (most recent call last):
File "C:\Users\khand\AppData\Local\Programs\Python\Python311\Lib\runpy.py", line 198, in _run_module_as_main
return _run_code(code, main_globals, None,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\khand\AppData\Local\Programs\Python\Python311\Lib\runpy.py", line 88, in run_code
exec(code, run_globals)
File "c:\Users\khand.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy_main.py", line 39, in
cli.main()
File "c:\Users\khand.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 430, in main
run()
File "c:\Users\khand.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy\adapter/../..\debugpy\launcher/../..\debugpy/..\debugpy\server\cli.py", line 284, in run_file
runpy.run_path(target, run_name="main")
File "c:\Users\khand.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy_vendored\pydevd_pydevd_bundle\pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "c:\Users\khand.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy_vendored\pydevd_pydevd_bundle\pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "c:\Users\khand.cursor\extensions\ms-python.debugpy-2024.6.0-win32-x64\bundled\libs\debugpy_vendored\pydevd_pydevd_bundle\pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "D:\Ankit\MyWork\TestColbert\Test.py", line 11, in
index = indexes.Voyager("pylate-index",
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "d:\Ankit\MyWork\TestColbert.venv\Lib\site-packages\pylate\indexes\voyager.py", line 122, in init
self.index = self._create_collection(
^^^^^^^^^^^^^^^^^^^^^^^^
File "d:\Ankit\MyWork\TestColbert.venv\Lib\site-packages\pylate\indexes\voyager.py", line 163, in _create_collection
return Index.load(index_path)
^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Tried to read 18648 bytes from stream, but only received 974 bytes!

@NohTow
Collaborator

NohTow commented Jan 17, 2025

Hello,

I tried with this snippet:
from pylate import indexes, models

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
)

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

documents = ["document 1", " document 2"]
documents_embeddings = model.encode(
    documents,
    batch_size=1,
    is_query=False,
    show_progress_bar=True,
)
index.add_documents(documents_embeddings=documents_embeddings)

And then

from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
)

index = indexes.Voyager(
    index_folder="pylate-index",
    index_name="test",
)

queries = ["hello", "how are you"]
queries_embeddings = model.encode(
    queries,
    batch_size=1,
    is_query=True,
    show_progress_bar=True,
)
retriever = retrieve.ColBERT(index=index)
print(retriever.retrieve(queries_embeddings))

And this works fine.
I believe this kind of error message arises when the index is somehow corrupted. Could you try removing the files from the pylate-index folder (or initializing the index with override=True before adding documents the first time, which should be equivalent) and try again?
If you are able to reproduce the index corruption, maybe I can add some guardrails to prevent it.
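
The cleanup suggested above can be sketched with the Python standard library alone. This is a minimal sketch, not part of PyLate: `reset_index_folder` is a hypothetical helper name, and it assumes that everything inside the index folder belongs to the index and can safely be discarded.

```python
import shutil
from pathlib import Path


def reset_index_folder(index_folder: str) -> bool:
    """Delete the index folder (hypothetical helper, not a PyLate API).

    Returns True if a folder was removed, False if there was nothing to remove.
    Assumption: the folder only contains index files that can be rebuilt.
    """
    path = Path(index_folder)
    if not path.exists():
        return False
    # Remove the folder and everything in it so the index is rebuilt from scratch.
    shutil.rmtree(path)
    return True
```

After calling `reset_index_folder("pylate-index")`, the index can be recreated and the documents added again.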

@llm-finetune
Author

llm-finetune commented Jan 18, 2025

Thanks @NohTow, for looking into this.

I tried the code snippet you provided above. I get the error below at the index.add_documents() statement:

TypeError: Voyager.add_documents() missing 1 required positional argument: 'documents_ids'

My versions: voyager 2.1.0, pylate 1.1.4, Python 3.11.

I need to store the document IDs along with the document embeddings, but when I pass the document IDs, the index somehow gets corrupted.
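
One thing that may be worth ruling out before adding to the index is a mismatch between the IDs and the embeddings. The sketch below is only an illustration: `check_document_ids` is a hypothetical helper, and the thread does not confirm that duplicate or mismatched IDs are what corrupts the index. It assumes string IDs, one per embedding, as in the `["1", "2"]` example used elsewhere in this thread.

```python
def check_document_ids(documents_ids, documents_embeddings):
    """Sanity-check IDs before indexing (hypothetical helper, not a PyLate API).

    Assumptions: one ID per document embedding, IDs unique, IDs passed as strings.
    Returns the IDs converted to strings, or raises ValueError.
    """
    ids = [str(doc_id) for doc_id in documents_ids]
    if len(ids) != len(documents_embeddings):
        raise ValueError(
            f"got {len(ids)} ids for {len(documents_embeddings)} embeddings"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("document ids must be unique")
    return ids
```

The returned list could then be passed as `documents_ids` to `index.add_documents(...)`.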

@NohTow
Collaborator

NohTow commented Jan 21, 2025

Yeah, I messed up when copying the boilerplate. You need to pass the documents_ids when adding to the index:
index.add_documents(documents_embeddings=documents_embeddings, documents_ids=["1", "2"])

Refer to the documentation for more examples. Apart from the corruption that might have happened at first, if you clean everything and run the boilerplate, it should work fine.
