Refactor Document Chunker to always use docling #430
E2E (NVIDIA L40S x4) workflow launched on this PR.
E2E workflow succeeded on this PR.
I took a first pass through this and have a few questions / comments. Nothing big, but just to clarify some comments in the code as well as some intended behavior. Thanks for the new tests here!
Signed-off-by: Khaled Sulayman <[email protected]>
Signed-off-by: Aakanksha Duggal <[email protected]>
Tests pass and the requested changes have been made. Looks good to me. Thanks!
The old DocumentChunker was a factory class that called the text-splitter on Markdown files and docling on PDFs. What we actually want is to call docling first and then use the text-splitter, for all document types. This change refactors the DocumentChunker class to always call docling, as long as the provided documents are supported file types.
Resolves: #334
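For readers skimming the description, here is a minimal sketch of the "always convert, then split" flow it describes. It is not the code from this PR: the `convert` hook (where the docling call would go), `split_text`, `SUPPORTED_SUFFIXES`, `chunk_documents`, and the choice to raise on unsupported files are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Iterable

# Assumed set of supported file types; the real list lives in the PR's code.
SUPPORTED_SUFFIXES = {".md", ".pdf"}


def split_text(text: str, chunk_size: int) -> list[str]:
    """Stand-in for the text-splitter: fixed-size character windows."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


class DocumentChunker:
    """Single code path: convert every supported document, then split it."""

    def __init__(self, convert: Callable[[Path], str], chunk_size: int = 1000):
        # `convert` is a hypothetical hook where the docling call would go.
        self.convert = convert
        self.chunk_size = chunk_size

    def chunk_documents(self, paths: Iterable[Path]) -> list[str]:
        chunks: list[str] = []
        for path in paths:
            # Assumption: unsupported file types are rejected outright.
            if path.suffix.lower() not in SUPPORTED_SUFFIXES:
                raise ValueError(f"Unsupported file type: {path.suffix}")
            # No per-filetype branching: every document is converted first.
            text = self.convert(path)
            chunks.extend(split_text(text, self.chunk_size))
        return chunks
```

With a fake converter such as `lambda p: "x" * 25` and `chunk_size=10`, a Markdown file and a PDF each go through the same two steps and yield three chunks apiece, which is the point of dropping the factory.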