-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Contextual Retrieval #17367
Contextual Retrieval #17367
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
docs/docs/examples/metadata_extraction/DocumentContextExtractor.ipynb
Outdated
Show resolved
Hide resolved
llama-index-core/llama_index/core/extractors/document_context.py
Outdated
Show resolved
Hide resolved
llama-index-core/llama_index/core/extractors/document_context.py
Outdated
Show resolved
Hide resolved
continue | ||
|
||
# Count as failure for all other exceptions | ||
self.failed_nodes.add(node.node_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By storing this on self
, we make the extractor stateful -- It might be better to avoid this? Or find a way to make multiple runs on the same extractor object possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah true. im not sure how else to do this though. ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass some list between functions instead of having it on self
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I thought about this. The failed nodes just seemed like a useful convenience. The function aextract truly is stateless: it doesn't use the failed_nodes list in any way. its sort of just a logging thing I guess?
I could remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we pass some list between functions instead of having it on self ?
yeah we could but then how do we give it back to the user? Although im also not sure how useful this even is as a feature. I might just kill it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but I do use it to check if it successfully generated context for all nodes. but there's probably other ways to do that too. but I'd still need to store state! I guess the user could also just write a function to iterate over all their nodes and check if they all have a 'context' metadata key... hmm.
My notebook does use the extractor.is_job_complete() function which depends on self._failed_nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on this one @logan-markewich ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its absolutely needed, there should be a way to store the state without being on the class. Or if it is on the class, associating it with some run ID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hate to be a stickler on this, but it really should be possible to both make this stateless and also maintain this list for any internal purpose
…/llama_index into contextual_retrieval
unrelated tests failing (vllm is known, deeplake is flakey) |
Description
Document Context Retrieval
This feature adds a llama_index implementation of "contextual retrieval"
It adds a new Extractor class, DocumentContextExtractor, which can be used in a pipeline. It requires a Document Store and an LLM to provide the context. It also requires you keep the documentstore up to date.
motivation
Contextual retrieval is a cool technique to boost RAG accuracy, but no production-ready code exists that handles all the gnarly nasty edge cases.
Anthropic made a 'cookbook' notebook. llama_index also made a demo of it here
Both are cool, neither scale to 100s of documents because:
rate limits
cost
cant use in pipeline
documents too large for context window
prompt caching doesn't work via llama_index interface (this may have been very recently fixed)
error handling and warnings
chunk + context can be too big for the embedding model so you need some control over that
and more!
Usage
custom keys and prompts
by default, the extractor adds a key called "context" to each node, using a reasonable default prompt taken from the blog post cookbook, but you can pass in a list of keys and prompts like so:
Type of Change
Please delete options that are not relevant.
How Has This Been Tested?
Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.
core/tests/extractors/test_document_context.py
Suggested Checklist:
make format; make lint
to appease the lint gods