Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sentence splitting implementation #26

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

doobidoo
Copy link

To implement sentence splitting for sentences that are too long, we need to break them down into smaller parts while ensuring that the resulting chunks do not exceed the specified chunk size. Here's an updated version of the splitIntoChunks method that includes sentence splitting for overly long sentences:

Copy link
Member

@splitbrain splitbrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for approaching this, but I think it needs some clean up

Embeddings.php Outdated Show resolved Hide resolved
Embeddings.php Outdated Show resolved Hide resolved
Embeddings.php Outdated Show resolved Hide resolved
Embeddings.php Outdated Show resolved Hide resolved
@doobidoo doobidoo requested a review from splitbrain January 2, 2025 20:40
Copy link
Member

@splitbrain splitbrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my previous concerns have not been addressed yet.

@doobidoo
Copy link
Author

doobidoo commented Jan 9, 2025

Sorry, for not getting back earlier. Regarding the sentence splitting. What is the max chunk size respectively where is it settled.
If that is taken into the equation the sentence splitter could basically have a treshold on where to split if no separator can be found.

@doobidoo
Copy link
Author

doobidoo commented Jan 9, 2025

I have made some reasonable adjustments to tackle the large bodies. It is certainly not perfect, however it will not break the code and the embedding service will complain anyway if the payload is to big. Here is an excerpt of my test:

⚠ Sentence too long, splitting it into smaller parts
✗ Failed to get embedding for chunk of page intern:editors:bpmn: Gemini API error: Request payload size exceeds the limit: 10000 bytes.
✓ intern:editors:bpmn split into 1 chunks
ℹ Reusing chunks for faq:sql:memory
⚠ Sentence too long, splitting it into smaller parts
✗ Failed to get embedding for chunk of page intern:editors:loremipsumlongtext: Gemini API error: Request payload size exceeds the limit: 10000 bytes.
✓ intern:editors:loremipsumlongtext split into 1 chunks
ℹ Assigning clusters to chunks...
✓ Chunk 11300 assigned to cluster 14

@doobidoo doobidoo requested a review from splitbrain February 3, 2025 19:33
Push split sentences to the front of the queue with array_unshift($sentences, ...$this->splitLongSentence($sentence, $tiktok));
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants