Headline Splitter Exceptions problems #1598
@shahules786 should we merge these changes? I don't know if @joaorura will be working on this anymore.
I will take a look at the issues and fix the PR.
Hey @joaorura, I have made several changes to the headline splitter and extraction in the last few weeks. Do any of those fix your issue? Additionally, can you explain the issue with an example?
I'll take a look at the changes you mentioned and see if they solve the problem. I need some time to run the project again and collect some examples. Unfortunately, I made the mistake of not saving the output to a file to present here.
Hey @joaorura, are you still interested in completing this, or should we close it?
I think some of the concerns in the issue were addressed in a few PRs made to improve test gen.
Okay, closing this in that case. @joaorura could you check out the latest version and see whether it has improved for your use cases? If not, feel free to let us know and we'll look into it.
Portuguese is not yet supported in ragas. Furthermore, I run into problems when using a dataset in Portuguese to generate the synthetic test set. Attached are the code and an example file I used. Output:
This problem still happens when loading prompt adaptations. I fixed it in an adaptation implementation for Portuguese.
Note that these tests were run on ragas 0.2.10.
When running with a prompt in Portuguese on documents read with LlamaIndex, I noticed that several generated nodes came out empty, which raised an exception when they were processed by the embedding step.
Because of this, I added extra handling so that the text search avoids problematic comparisons.
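To illustrate the kind of guard described above, here is a minimal sketch that drops empty chunks before they reach an embedding step. It assumes chunks are plain strings; `drop_empty_chunks` is a hypothetical helper name and is not the code from this PR or part of the ragas API.

```python
def drop_empty_chunks(chunks):
    """Return only the chunks that contain non-whitespace text.

    Hypothetical helper: assumes each chunk is a str (or None).
    """
    return [c for c in chunks if c and c.strip()]


# Chunks that are None, empty, or whitespace-only are filtered out
# so the embedding model never receives an empty input.
chunks = ["Introdução", "", "  \n", None, "Conclusão"]
non_empty = drop_empty_chunks(chunks)
```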
I was using PDF documents whose text is laid out in well-distributed blocks, so the reader picked up \n characters and stray spaces, which caused failures in the splitter.
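One way to make the headline lookup tolerant of these line breaks and extra spaces is to match on words and allow any run of whitespace between them. This is only a sketch under that assumption; `find_headline` is a hypothetical name, not the function changed in this PR.

```python
import re


def find_headline(text, headline):
    """Find `headline` in `text`, treating any run of whitespace as equivalent.

    Returns the start offset in the original text, or -1 if not found.
    Hypothetical helper for illustration only.
    """
    words = headline.split()
    if not words:
        return -1
    # Escape each word literally and allow spaces, tabs, or newlines between them.
    pattern = r"\s+".join(re.escape(w) for w in words)
    match = re.search(pattern, text)
    return match.start() if match else -1


# A PDF reader may split a headline across lines or insert double spaces.
pdf_text = "Intro\nVisão  geral\n do sistema\nBody"
offset = find_headline(pdf_text, "Visão geral do sistema")
```

An exact `str.find` on the same input fails because the extracted text contains a double space and a newline inside the headline.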
GPT-4o mini would often take inspiration from the prompt and add index numbers to headlines that did not have them, since the example prompt uses indexing. But texts often have headlines without this indexing, so the LLM-generated indexing made the headlines incompatible with the text.
I added code to deal with these details.
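As an illustration of handling invented indexes, the sketch below first looks for the headline exactly as the LLM returned it and, only if that fails, retries without a leading index such as `2.` or `1.1`. The regular expression and the name `locate_headline` are assumptions made for this example and are not taken from the PR.

```python
import re

# Matches a leading index such as "1.", "2.3", "1.1.1." or "3)" followed by whitespace.
INDEX_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+")


def locate_headline(text, headline):
    """Look for the headline as given; if absent, retry without a leading index.

    Returns the start offset in `text`, or -1 if neither form is found.
    Hypothetical helper for illustration only.
    """
    pos = text.find(headline)
    if pos != -1:
        return pos
    stripped = INDEX_PREFIX.sub("", headline, count=1)
    # Only retry when stripping actually changed the headline.
    if stripped and stripped != headline:
        return text.find(stripped)
    return -1


# The document has no numbering, but the LLM returned "2. Metodologia".
doc = "Resumo\nMetodologia\nDetalhes"
offset = locate_headline(doc, "2. Metodologia")
```

Because the exact match is tried first, documents that really do number their headlines are unaffected; the stripped form is only a fallback.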