-
Notifications
You must be signed in to change notification settings - Fork 606
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
4 changed files
with
43 additions
and
25 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
Embedder | ||
======== | ||
|
||
.. tip:: | ||
|
||
If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`! | ||
|
||
Embedder, or embedding model, bi-encoder, is a model designed to convert data, usually text, codes, or images, into sparse or dense numerical vectors (embeddings) in a high dimensional vector space. | ||
These embeddings capture the semantic meaning or key features of the input, which enable efficient comparison and analysis. | ||
|
||
A very famous demonstration is the example from `word2vec <https://arxiv.org/abs/1301.3781>`_. It shows how word embeddings capture semantic relationships through vector arithmetic: | ||
|
||
.. image:: ../_static/img/word2vec.png | ||
:width: 500 | ||
:align: center | ||
|
||
Nowadays, embedders are capable of mapping sentences and even passages into vector space. | ||
They are widely used in real world tasks such as retrieval, clustering, etc. | ||
In the era of LLMs, embedding models play a pivot role in RAG, enables LLMs to access and integrate relevant context from vast external datasets. | ||
|
||
|
||
Sparse Vector | ||
------------- | ||
|
||
Sparse vectors usually have structure of high dimensionality with only a few non-zero values, which usually effective for tasks like keyword matching. | ||
Typically, though not always, the number of dimensions in sparse vectors corresponds to the different tokens present in the language. | ||
Each dimension is assigned a value representing the token's relative importance within the document. | ||
Some well known algorithms for sparse vector embedding includes `bag-of-words <https://en.wikipedia.org/wiki/Bag-of-words_model>`_, `TF-IDF <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, `BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_, etc. | ||
Sparse vector embeddings have great ability to extract the information of key terms and their corresponding importance within documents. | ||
|
||
Dense Vector | ||
------------ | ||
|
||
Dense vectors typically use neural networks to map words, sentences, and passages into a fixed dimension latent vector space. | ||
Then we can compare the similarity between two objects using certain metrics like Euclidean distance or Cos similarity. | ||
Those vectors can represent deeper meaning of the sentences. | ||
Thus we can distinguish sentences using similar words but actually have different meaning. | ||
And also understand different ways in speaking and writing that express the same thing. | ||
Dense vector embeddings, instead of keywords counting and matching, directly capture the semantics. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,5 +25,6 @@ Quickly get started with: | |
:caption: Concept | ||
|
||
IR | ||
model | ||
embedder | ||
reranker | ||
retrieval_demo |
23 changes: 1 addition & 22 deletions
23
docs/source/Introduction/model.rst → docs/source/Introduction/reranker.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters