[Feature Request] Build a tutorial for using Agent and retrieval for extracting insight from large data corpus #1430

Open
2 tasks done
codingjaguar opened this issue Jan 10, 2025 · 0 comments

Required prerequisites

Motivation

Large public corpora like Common Crawl or Wikipedia contain vast amounts of information, but extracting actionable insights from them can be challenging. For instance, analyzing public sentiment toward a political figure requires an iterative exploration process. One possible workflow:
1. Embedding and Storage: Process the corpus to generate embeddings for all documents and store them in a high-performance vector database like Milvus to enable efficient search.
2. Initial Retrieval: Retrieve potentially relevant articles from billions of documents using similarity-based search.
3. Article Sampling and Filtering: Sample articles related to the political figure and analyze them to identify patterns. Use a small LLM to refine the pool by excluding documents that mention the figure but are not useful for sentiment analysis.
4. Refined Analysis: Use a larger LLM to perform in-depth analysis on the refined set of documents. Employ a multi-agent approach, where agents in different roles (e.g., journalist, analyst, fact-checker) collaborate to provide a nuanced sentiment analysis.
5. Sentiment Evaluation: Synthesize the insights from the refined documents to derive meaningful conclusions about public sentiment.
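
As a rough illustration of steps 1 to 3, here is a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words function standing in for a real embedding model, `InMemoryIndex` stands in for a vector database such as Milvus, `small_llm_filter` is a keyword check standing in for a small LLM, and "Senator Doe" is a made-up figure.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption: a real pipeline calls an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector database such as Milvus."""

    def __init__(self):
        self.rows = []  # list of (vector, document) pairs

    def insert(self, doc):
        self.rows.append((embed(doc), doc))

    def search(self, query, limit):
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [doc for _, doc in ranked[:limit]]


def small_llm_filter(doc):
    """Placeholder for a small-LLM check: keep only documents that express an opinion."""
    opinion_words = {"praised", "criticized", "supports", "opposes"}
    return any(word in opinion_words for word in doc.lower().split())


# Step 1: embed and store a (tiny, made-up) corpus.
corpus = [
    "Voters praised Senator Doe for the new housing bill",
    "Senator Doe was criticized over the budget delay",
    "Senator Doe attended a ribbon cutting on Tuesday",
    "Local bakery wins award for sourdough",
]
index = InMemoryIndex()
for doc in corpus:
    index.insert(doc)

# Step 2: similarity-based retrieval of candidate articles.
candidates = index.search("Senator Doe", limit=3)

# Step 3: drop articles that mention the figure but carry no sentiment signal.
refined = [doc for doc in candidates if small_llm_filter(doc)]
print(refined)
```

In a real tutorial the index and the filter would be replaced by an actual vector store and model calls; the sketch only shows how the stages hand data to each other.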

A reference implementation like this would demonstrate how to combine vector search, LLM capabilities, and multi-agent collaboration to extract valuable insights from massive datasets.
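
To make the multi-agent part of the workflow (steps 4 and 5) more concrete, here is a hedged sketch in plain Python. The role functions `analyst`, `fact_checker`, and `journalist` are hypothetical placeholders: in the requested tutorial each would be an LLM-backed agent, whereas here they are simple rules so that the control flow is visible. The documents and the `trusted` set are made up for illustration.

```python
from dataclasses import dataclass

# Assumption: tiny keyword lexicons stand in for a large LLM's judgement.
POSITIVE = {"praised", "supports"}
NEGATIVE = {"criticized", "opposes"}


@dataclass
class Note:
    doc: str
    score: int  # +1 positive, -1 negative, 0 neutral or mixed


def analyst(doc):
    """Analyst role: assign a coarse sentiment score to one document."""
    words = set(doc.lower().split())
    score = (1 if words & POSITIVE else 0) - (1 if words & NEGATIVE else 0)
    return Note(doc, score)


def fact_checker(note, trusted):
    """Fact-checker role: keep only notes whose source is in a trusted set (placeholder rule)."""
    return note.doc in trusted


def journalist(notes):
    """Journalist role: synthesize the surviving notes into one overall label."""
    if not notes:
        return "no evidence"
    mean = sum(n.score for n in notes) / len(notes)
    if mean > 0:
        return "positive"
    if mean < 0:
        return "negative"
    return "mixed"


refined_docs = [
    "Voters praised Senator Doe for the new housing bill",
    "Senator Doe was criticized over the budget delay",
    "Unions praised Senator Doe and one group supports the bill",
    "Anonymous post says everyone opposes Senator Doe",
]
trusted = set(refined_docs[:3])  # the anonymous post is treated as unverified

notes = [analyst(doc) for doc in refined_docs]               # step 4: per-document analysis
verified = [n for n in notes if fact_checker(n, trusted)]    # step 4: cross-check by another role
verdict = journalist(verified)                               # step 5: synthesis
print(verdict)
```

Keeping each role as a separate function with a narrow input and output is one possible design; it makes it straightforward to swap any single role for a model-backed agent later.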

Solution

No response

Alternatives

No response

Additional context

No response

@codingjaguar codingjaguar added the enhancement New feature or request label Jan 10, 2025
@Wendong-Fan Wendong-Fan added use case and removed enhancement New feature or request labels Jan 12, 2025
@Wendong-Fan Wendong-Fan added P1 Task with middle level priority and removed P1 Task with middle level priority labels Jan 12, 2025