Headline Splitter Exceptions problems #1598

joaorura · 2024-10-29T22:57:46Z

When running with a Portuguese prompt on documents loaded through LlamaIndex, I noticed that several generated nodes came out empty, which raised an exception when they were passed to the embedding step.
To work around this, I added extra handling to the headline text search so that it avoids the comparisons that were failing.
I was using PDF documents whose text is laid out in well-separated blocks, so the reader produced stray \n characters and extra spaces, which made the splitter fail.
GPT-4o mini would often imitate the prompt and prefix headlines with index numbers that do not exist in the text, because the example in the prompt uses numbered headlines. Many texts have headlines without such numbering, so the numbering generated by the LLM does not match the text.
I added code to deal with these cases.
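
A minimal sketch of the kind of tolerant matching described above, for readers who want to reproduce the idea outside the library. It is not the code from this PR or from ragas: the helper names (`strip_index`, `find_headline`, `split_by_headlines`) are hypothetical, and the numeric-prefix pattern is an assumption about what LLM-added numbering looks like.

```python
import re
from typing import List, Optional

# Assumed shape of an LLM-added index prefix: "1. ", "2) ", "1.2 ", "3.4.1. "
_INDEX_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+")


def strip_index(headline: str) -> str:
    """Remove a leading numeric index such as '1.' or '2.3' from a headline."""
    return _INDEX_PREFIX.sub("", headline, count=1)


def find_headline(text: str, headline: str) -> Optional[int]:
    """Return the offset of the headline in text, or None if it is absent.

    Words are matched with any run of whitespace between them, so stray
    newlines and extra spaces from PDF extraction do not break the match.
    The headline is tried as given first, then with its index removed.
    """
    for candidate in (headline, strip_index(headline)):
        words = candidate.split()
        if not words:
            continue
        pattern = r"\s+".join(re.escape(word) for word in words)
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.start()
    return None


def split_by_headlines(text: str, headlines: List[str]) -> List[str]:
    """Split text at every headline that can be located; drop empty chunks."""
    found = set()
    for headline in headlines:
        position = find_headline(text, headline)
        if position is not None:
            found.add(position)
    offsets = sorted(found)
    if not offsets or offsets[0] != 0:
        offsets = [0] + offsets  # keep any text that precedes the first headline
    offsets.append(len(text))
    chunks = [text[start:end].strip() for start, end in zip(offsets, offsets[1:])]
    # Empty chunks are what later fail in the embedding step, so filter them out.
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    sample = "Introdução\n  geral\nTexto da introdução.\n\nMétodos\nTexto dos métodos."
    for chunk in split_by_headlines(sample, ["1. Introdução geral", "2. Métodos", "3. Conclusão"]):
        print(repr(chunk))
```

Headlines that cannot be located are skipped rather than raising, and a headline that legitimately starts with a number (a year, for example) would also lose that prefix on the second attempt, so this trades some precision for robustness.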

jjmachan · 2024-11-08T07:06:29Z

@shahules786 should we merge these changes? I don't know if @joaorura will be working on this anymore

joaorura · 2024-11-08T12:27:53Z

I will take a look at the issues and fix the PR.

shahules786 · 2024-11-15T04:24:42Z

Hey @joaorura, I have made several changes to the headline splitter and extraction in the last few weeks. Do any of those fix your issue? Additionally, can you explain the issue with an example?

joaorura · 2024-11-15T19:51:50Z

@shahules786

I'll take a look at the changes you mentioned and see if they solve the problem.

I need some time to run the project again and collect some examples. Unfortunately, I made the mistake of not saving the earlier ones to a file to present here.

jjmachan · 2025-01-09T17:41:35Z

hey @joaorura are you still interested in completing this or should we close this?

shahules786

I think some of the concerns in the issue were addressed in a few PRs made to improve test generation.

jjmachan · 2025-01-10T07:18:17Z

Okay, closing this in that case. @joaorura, could you check out the latest version and check if it's improved for your use cases? If not, feel free to let us know and we'll look into it.

joaorura · 2025-01-10T21:07:50Z

> Okay, closing this in that case. @joaorura, could you check out the latest version and check if it's improved for your use cases? If not, feel free to let us know and we'll look into it.

Portuguese is not yet supported in ragas.

Furthermore, I run into problems when using a dataset in Portuguese to generate the synthetic test set.

Attached is the code I used and an example file I used.

files.zip

Output:

unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: 'LlamaIndexLLMWrapper' object has no attribute 'acomplete'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: node.property('summary') must be a string, found '<class 'NoneType'>'
unable to apply transformation: Node 81cbb917-0b1a-4e10-aa3e-40ecca643230 has no summary_embedding
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 testset = generator.generate_with_llamaindex_docs(documents, AMOUNT_TESTS,
      2                                                   with_debugging_logs=True,
      3                                                   run_config=run_config,
      4                                                   transforms_llm=llm,
      5                                                   transforms_embedding_model=embedding)

File c:\Users\jmess\miniconda3\envs\rag\Lib\site-packages\ragas\testset\synthesizers\generate.py:265, in TestsetGenerator.generate_with_llamaindex_docs(self, documents, testset_size, transforms, transforms_llm, transforms_embedding_model, query_distribution, run_config, callbacks, with_debugging_logs, raise_exceptions)
    262 apply_transforms(kg, transforms, run_config)
    263 self.knowledge_graph = kg
--> 265 return self.generate(
    266     testset_size=testset_size,
    267     query_distribution=query_distribution,
    268     run_config=run_config,
    269     callbacks=callbacks,
    270     with_debugging_logs=with_debugging_logs,
    271     raise_exceptions=raise_exceptions,
    272 )

File c:\Users\jmess\miniconda3\envs\rag\Lib\site-packages\ragas\testset\synthesizers\generate.py:369, in TestsetGenerator.generate(self, testset_size, query_distribution, num_personas, run_config, batch_size, callbacks, token_usage_parser, with_debugging_logs, raise_exceptions)
    366     patch_logger("ragas.experimental.testset.transforms", logging.DEBUG)
    368 if self.persona_list is None:
--> 369     self.persona_list = generate_personas_from_kg(
...
     97     )
     99 summaries = [node.properties.get("summary") for node in nodes]
    100 summaries = [summary for summary in summaries if isinstance(summary, str)]

ValueError: No nodes that satisfied the given filer. Try changing the filter.
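
The log lines above suggest that the `summary` and `summary_embedding` properties were never populated, which would leave nothing for persona generation to work with. A rough sketch of a pre-check that reports which nodes lack those properties before generation is attempted follows. The `Node` class is a hypothetical stand-in with a `properties` dict, not the ragas class, and `missing_properties` is an illustrative helper rather than a library function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Node:
    """Hypothetical stand-in for a knowledge-graph node; not the ragas class."""
    id: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Property names taken from the error messages in the output above.
REQUIRED = ("summary", "summary_embedding")


def missing_properties(
    nodes: List[Node], required: Sequence[str] = REQUIRED
) -> Dict[str, List[str]]:
    """Map node id -> required properties that are absent or None."""
    report: Dict[str, List[str]] = {}
    for node in nodes:
        missing = [name for name in required if node.properties.get(name) is None]
        if missing:
            report[node.id] = missing
    return report


if __name__ == "__main__":
    nodes = [
        Node("a", {"summary": "ok", "summary_embedding": [0.1, 0.2]}),
        Node("b", {"summary": None}),
    ]
    print(missing_properties(nodes))
```

If every node shows up in the report, the failure is upstream in the transforms (here, the `acomplete` errors), not in the final generation step.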

joaorura · 2025-01-10T21:41:07Z

@jjmachan

This problem still happens when loading prompt adaptations. I fixed it in an adaptation implementation for Portuguese.

joaorura · 2025-01-11T22:21:12Z

It is worth remembering that these tests were done on ragas 0.2.10.

Update headline.py

60cd938

dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Oct 29, 2024

jjmachan requested a review from shahules786 November 8, 2024 07:05

shahules786 reviewed Jan 9, 2025

jjmachan closed this Jan 10, 2025

joaorura mentioned this pull request Jan 11, 2025

Implemented Support Another Languages - Portugue tested | Add Google Translation #1596

Open
