Headline Splitter Exceptions problems #1598

joaorura
Contributor

When running with a Portuguese prompt on documents loaded through LlamaIndex, I noticed that several generated nodes came out empty, which raised an exception when they reached the embedding step.
To address this, I added extra handling so that the text search avoids the problematic comparisons.
I was using PDF documents whose layout has well-distributed text blocks, so the reader picked up stray \n characters and extra spaces, which caused failures in the splitter.
GPT-4o mini would often imitate the example in the prompt and add index numbers to headlines that did not have them, since the example prompt uses indexed headlines. Many texts have headlines without such indexing, so the indexing generated by the LLM did not match the text.
I added code to handle these cases.
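
To illustrate the kind of handling described above, here is a minimal sketch, not the actual code in this PR or in ragas. It assumes two hypothetical helpers, `normalize` and `strip_index`, and a hypothetical `split_by_headlines` function: whitespace is collapsed before comparing, a headline is retried without its leading index when the exact form is not found, and empty chunks are dropped so they never reach the embedding step.

```python
import re

# Matches a leading index such as "1.", "2.3" or "1.2)" followed by whitespace.
_INDEX_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+")


def normalize(text: str) -> str:
    """Collapse every whitespace run (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def strip_index(headline: str) -> str:
    """Remove a leading numeric index that the LLM may have invented."""
    return _INDEX_PREFIX.sub("", headline)


def find_headline(normalized_text: str, headline: str) -> int:
    """Return the offset of the headline in the normalized text, or -1."""
    # Try the headline as given first, then without its index.
    for candidate in (normalize(headline), normalize(strip_index(headline))):
        if candidate:
            pos = normalized_text.find(candidate)
            if pos != -1:
                return pos
    return -1


def split_by_headlines(text: str, headlines: list[str]) -> list[str]:
    """Split text at every headline that can be located; skip the rest."""
    normalized = normalize(text)
    offsets = set()
    for headline in headlines:
        pos = find_headline(normalized, headline)
        if pos > 0:  # -1 means not found; 0 is already the start of the text
            offsets.add(pos)
    bounds = [0, *sorted(offsets), len(normalized)]
    chunks = [normalized[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    # Drop empty chunks so no empty node is produced.
    return [chunk for chunk in chunks if chunk]
```

With this sketch, a headline the LLM returned as "2. Métodos" still matches a plain "Métodos" line in the text, and a headline that does not occur at all is skipped instead of producing an empty node. The index fallback is deliberately only a second attempt, because it would also strip a genuine leading number from a headline.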

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 29, 2024
@jjmachan jjmachan requested a review from shahules786 November 8, 2024 07:05
@jjmachan
Member

jjmachan commented Nov 8, 2024

@shahules786 should we merge these changes? I don't know if @joaorura will be working on this anymore

@joaorura
Contributor Author

joaorura commented Nov 8, 2024

I will take a look at the issues and fix the PR.

@shahules786
Member

Hey @joaorura, I have made several changes to the headline splitter and extraction in the last few weeks. Do any of those fix your issue? Additionally, can you explain the issue with an example?

@joaorura
Contributor Author

@shahules786

I'll take a look at the changes you mentioned and see if they solve the problem.

I need some time to run the project again and collect some examples. Unfortunately, I made the mistake of not saving them to a file to present here.

@jjmachan
Member

jjmachan commented Jan 9, 2025

hey @joaorura are you still interested in completing this or should we close this?

Member

@shahules786 shahules786 left a comment

I think some of the concerns in the issue were addressed in a few PRs made to improve test generation.

@jjmachan
Member

jjmachan commented Jan 10, 2025

Okay, closing this in that case. @joaorura could you check out the latest version and see whether it has improved for your use cases? If not, feel free to let us know and we'll look into it.

@jjmachan jjmachan closed this Jan 10, 2025
@joaorura
Contributor Author

joaorura commented Jan 10, 2025

> Okay, closing this in that case. @joaorura could you check out the latest version and see whether it has improved for your use cases? If not, feel free to let us know and we'll look into it.

Portuguese is not yet supported in ragas.

image

Furthermore, I run into problems when using a Portuguese dataset to generate the synthetic test set.

image

image

Attached are the code and an example file I used.

files.zip

Output:

unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: Node 81cbb917-0b1a-4e10-aa3e-40ecca643230 has no summary_embedding
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 testset = generator.generate_with_llamaindex_docs(documents, AMOUNT_TESTS,
      2                                                   with_debugging_logs=True,
      3                                                   run_config=run_config,
      4                                                   transforms_llm=llm,
      5                                                   transforms_embedding_model=embedding)

File c:\Users\jmess\miniconda3\envs\rag\Lib\site-packages\ragas\testset\synthesizers\generate.py:265, in TestsetGenerator.generate_with_llamaindex_docs(self, documents, testset_size, transforms, transforms_llm, transforms_embedding_model, query_distribution, run_config, callbacks, with_debugging_logs, raise_exceptions)
    262 apply_transforms(kg, transforms, run_config)
    263 self.knowledge_graph = kg
--> 265 return self.generate(
    266     testset_size=testset_size,
    267     query_distribution=query_distribution,
    268     run_config=run_config,
    269     callbacks=callbacks,
    270     with_debugging_logs=with_debugging_logs,
    271     raise_exceptions=raise_exceptions,
    272 )

File c:\Users\jmess\miniconda3\envs\rag\Lib\site-packages\ragas\testset\synthesizers\generate.py:369, in TestsetGenerator.generate(self, testset_size, query_distribution, num_personas, run_config, batch_size, callbacks, token_usage_parser, with_debugging_logs, raise_exceptions)
    366     patch_logger("ragas.experimental.testset.transforms", logging.DEBUG)
    368 if self.persona_list is None:
--> 369     self.persona_list = generate_personas_from_kg(
...
     97     )
     99 summaries = [node.properties.get("summary") for node in nodes]
    100 summaries = [summary for summary in summaries if isinstance(summary, str)]

ValueError: No nodes that satisfied the given filer. Try changing the filter.     
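
The final ValueError appears to follow from the earlier failed transformations: if no node ends up with a string `summary`, the filter shown at persona.py lines 99-100 leaves nothing to work with. A minimal sketch for checking this before generating, assuming nodes are represented as plain property dicts like the `node.properties` mapping visible in the traceback (the function names below are hypothetical and not part of ragas):

```python
from typing import Any, Iterable, Mapping, Sequence


def property_coverage(
    nodes: Iterable[Mapping[str, Any]],
    required: Sequence[str] = ("headlines", "summary", "summary_embedding"),
) -> dict[str, int]:
    """Count how many nodes carry a non-None value for each required property."""
    counts = {name: 0 for name in required}
    for props in nodes:
        for name in required:
            if props.get(name) is not None:
                counts[name] += 1
    return counts


def usable_summaries(nodes: Iterable[Mapping[str, Any]]) -> list[str]:
    """Keep only string summaries, mirroring the filter seen in the traceback."""
    summaries = [props.get("summary") for props in nodes]
    return [summary for summary in summaries if isinstance(summary, str)]
```

If `usable_summaries` returns an empty list, persona generation has no input, so the transformation errors above (for example the missing `acomplete` attribute) need to be resolved first.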

@joaorura
Contributor Author

@jjmachan

This problem still happens when loading prompt adaptations. I fixed it in an adaptation implementation for Portuguese.

image

@joaorura
Contributor Author

Note that these tests were run on ragas 0.2.10.

@jjmachan
Member

got it @joaorura, will take a look at this
