update docs
ZiyiXia committed Jan 17, 2025
1 parent 70c3d32 commit db83452
Showing 4 changed files with 43 additions and 25 deletions.
3 changes: 1 addition & 2 deletions docs/source/FAQ/index.rst
@@ -47,8 +47,7 @@ Below are some commonly asked questions.
:animate: fade-in-slide-down

- Dense retrieval: map the text into a single embedding, e.g., `DPR <https://arxiv.org/abs/2004.04906>`_, `BGE-v1.5 <../bge/bge_v1_v1.5>`_
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text.
e.g., BM25, `unicoil <https://arxiv.org/pdf/2106.14807>`_, and `splade <https://arxiv.org/abs/2107.05720>`_
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, `unicoil <https://arxiv.org/pdf/2106.14807>`_, and `splade <https://arxiv.org/abs/2107.05720>`_
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., `ColBERT <https://arxiv.org/abs/2004.12832>`_.

.. dropdown:: Recommended vector database?
39 changes: 39 additions & 0 deletions docs/source/Introduction/embedder.rst
@@ -0,0 +1,39 @@
Embedder
========

.. tip::

If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!

An embedder, also called an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.

A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:

.. image:: ../_static/img/word2vec.png
:width: 500
:align: center
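
A minimal sketch of this idea in Python (the 3-dimensional vectors below are made up purely for illustration; real word2vec embeddings are learned and have hundreds of dimensions):

.. code-block:: python

    import numpy as np

    # Toy hand-picked 3-d "word vectors", for illustration only
    king  = np.array([0.9, 0.8, 0.1])
    man   = np.array([0.5, 0.1, 0.1])
    woman = np.array([0.5, 0.1, 0.9])
    queen = np.array([0.9, 0.8, 0.9])

    # king - man + woman lands close to queen in the vector space
    result = king - man + woman
    cosine = result @ queen / (np.linalg.norm(result) * np.linalg.norm(queen))
    print(cosine)  # 1.0 for these toy vectors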

Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG (Retrieval-Augmented Generation), enabling LLMs to access and integrate relevant context from vast external datasets.


Sparse Vector
-------------

Sparse vectors usually have high dimensionality with only a few non-zero values, which makes them effective for tasks like keyword matching.
Typically, though not always, the number of dimensions in a sparse vector corresponds to the number of distinct tokens in the language.
Each dimension is assigned a value representing the token's relative importance within the document.
Some well-known algorithms for sparse vector embedding include `bag-of-words <https://en.wikipedia.org/wiki/Bag-of-words_model>`_, `TF-IDF <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, and `BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_.
Sparse vector embeddings are good at extracting key terms and their corresponding importance within documents.
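
As an illustrative sketch, here is one common way to produce such vectors with scikit-learn's ``TfidfVectorizer`` (used here purely for demonstration; it is not part of this library):

.. code-block:: python

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "sparse vectors are effective for keyword matching",
        "dense vectors capture semantic meaning",
    ]

    # Each document becomes a vector with one dimension per vocabulary
    # token; only tokens present in the document get a non-zero weight.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the vocabulary
    print(tfidf.toarray())                     # mostly zeros per row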

Dense Vector
------------

Dense vectors are typically produced by neural networks that map words, sentences, and passages into a fixed-dimension latent vector space.
We can then compare two objects using metrics such as Euclidean distance or cosine similarity.
These vectors capture the deeper meaning of sentences, so we can distinguish sentences that use similar words but mean different things, and recognize different ways of speaking or writing that express the same thing.
Instead of counting and matching keywords, dense vector embeddings directly capture the semantics.
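
For example, a minimal sketch using a BGE embedder (assuming FlagEmbedding is installed and the ``BAAI/bge-base-en-v1.5`` checkpoint is used):

.. code-block:: python

    import numpy as np
    from FlagEmbedding import FlagModel

    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]

    # Encode each sentence into a fixed-dimension dense vector
    model = FlagModel('BAAI/bge-base-en-v1.5')
    embeddings = model.encode(sentences)

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Semantically close sentences should score higher
    print(cos_sim(embeddings[0], embeddings[1]))  # similar meaning
    print(cos_sim(embeddings[0], embeddings[2]))  # unrelated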
3 changes: 2 additions & 1 deletion docs/source/Introduction/index.rst
@@ -25,5 +25,6 @@ Quickly get started with:
:caption: Concept

IR
model
embedder
reranker
retrieval_demo
@@ -1,26 +1,5 @@
Model
=====

If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!

Embedder
--------

An embedder, also called an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.

A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:

.. image:: ../_static/img/word2vec.png
:width: 500
:align: center

Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG (Retrieval-Augmented Generation), enabling LLMs to access and integrate relevant context from vast external datasets.

Reranker
--------
========

A reranker, or cross-encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
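
For instance, a brief sketch using a BGE reranker (assuming FlagEmbedding is installed and the ``BAAI/bge-reranker-base`` checkpoint is used):

.. code-block:: python

    from FlagEmbedding import FlagReranker

    # The reranker jointly encodes each (query, passage) pair and
    # outputs a relevance score; higher means more relevant.
    reranker = FlagReranker('BAAI/bge-reranker-base')

    scores = reranker.compute_score([
        ["what is a panda?", "The giant panda is a bear species endemic to China."],
        ["what is a panda?", "Paris is the capital of France."],
    ])
    print(scores)  # the first pair should score higher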

