update docs
ZiyiXia committed Jan 17, 2025
1 parent 70c3d32 commit db83452
Showing 4 changed files with 43 additions and 25 deletions.
3 changes: 1 addition & 2 deletions docs/source/FAQ/index.rst
@@ -47,8 +47,7 @@ Below are some commonly asked questions.
:animate: fade-in-slide-down

- Dense retrieval: map the text into a single embedding, e.g., `DPR <https://arxiv.org/abs/2004.04906>`_, `BGE-v1.5 <../bge/bge_v1_v1.5>`_
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text.
e.g., BM25, `unicoil <https://arxiv.org/pdf/2106.14807>`_, and `splade <https://arxiv.org/abs/2107.05720>`_
- Sparse retrieval (lexical matching): a vector of size equal to the vocabulary, with the majority of positions set to zero, calculating a weight only for tokens present in the text, e.g., BM25, `unicoil <https://arxiv.org/pdf/2106.14807>`_, and `splade <https://arxiv.org/abs/2107.05720>`_
- Multi-vector retrieval: use multiple vectors to represent a text, e.g., `ColBERT <https://arxiv.org/abs/2004.12832>`_.

.. dropdown:: Recommended vector database?
39 changes: 39 additions & 0 deletions docs/source/Introduction/embedder.rst
@@ -0,0 +1,39 @@
Embedder
========

.. tip::

If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!

An embedder, also called an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.

A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:

.. image:: ../_static/img/word2vec.png
:width: 500
:align: center
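
A minimal sketch of this idea in Python (the 3-dimensional vectors below are made up purely for illustration; real word2vec embeddings are learned and have hundreds of dimensions):

.. code-block:: python

    import numpy as np

    # Toy hand-picked 3-d "word vectors", for illustration only
    king  = np.array([0.9, 0.8, 0.1])
    man   = np.array([0.5, 0.1, 0.1])
    woman = np.array([0.5, 0.1, 0.9])
    queen = np.array([0.9, 0.8, 0.9])

    # king - man + woman lands close to queen in the vector space
    result = king - man + woman
    cosine = result @ queen / (np.linalg.norm(result) * np.linalg.norm(queen))
    print(cosine)  # 1.0 for these toy vectors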

Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG (Retrieval-Augmented Generation), enabling LLMs to access and integrate relevant context from vast external datasets.


Sparse Vector
-------------

Sparse vectors usually have high dimensionality with only a few non-zero values, which makes them effective for tasks like keyword matching.
Typically, though not always, the number of dimensions in a sparse vector corresponds to the number of distinct tokens in the language.
Each dimension is assigned a value representing the token's relative importance within the document.
Some well-known algorithms for sparse vector embedding include `bag-of-words <https://en.wikipedia.org/wiki/Bag-of-words_model>`_, `TF-IDF <https://en.wikipedia.org/wiki/Tf%E2%80%93idf>`_, and `BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>`_.
Sparse vector embeddings are good at extracting key terms and their corresponding importance within documents.
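
As an illustrative sketch, here is one common way to produce such vectors with scikit-learn's ``TfidfVectorizer`` (used here purely for demonstration; it is not part of this library):

.. code-block:: python

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "sparse vectors are effective for keyword matching",
        "dense vectors capture semantic meaning",
    ]

    # Each document becomes a vector with one dimension per vocabulary
    # token; only tokens present in the document get a non-zero weight.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # the vocabulary
    print(tfidf.toarray())                     # mostly zeros per row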

Dense Vector
------------

Dense vectors are typically produced by neural networks that map words, sentences, and passages into a fixed-dimension latent vector space.
We can then compare two objects using metrics such as Euclidean distance or cosine similarity.
These vectors capture the deeper meaning of sentences, so we can distinguish sentences that use similar words but mean different things, and recognize different ways of speaking or writing that express the same thing.
Instead of counting and matching keywords, dense vector embeddings directly capture the semantics.
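
For example, a minimal sketch using a BGE embedder (assuming FlagEmbedding is installed and the ``BAAI/bge-base-en-v1.5`` checkpoint is used):

.. code-block:: python

    import numpy as np
    from FlagEmbedding import FlagModel

    sentences = [
        "The weather is lovely today.",
        "It's so sunny outside!",
        "He drove to the stadium.",
    ]

    # Encode each sentence into a fixed-dimension dense vector
    model = FlagModel('BAAI/bge-base-en-v1.5')
    embeddings = model.encode(sentences)

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Semantically close sentences should score higher
    print(cos_sim(embeddings[0], embeddings[1]))  # similar meaning
    print(cos_sim(embeddings[0], embeddings[2]))  # unrelated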
3 changes: 2 additions & 1 deletion docs/source/Introduction/index.rst
@@ -25,5 +25,6 @@ Quickly get started with:
:caption: Concept

IR
model
embedder
reranker
retrieval_demo
@@ -1,26 +1,5 @@
Model
=====

If you are already familiar with the concepts, take a look at the :doc:`BGE models <../bge/index>`!

Embedder
--------

An embedder, also called an embedding model or bi-encoder, is a model designed to convert data, usually text, code, or images, into sparse or dense numerical vectors (embeddings) in a high-dimensional vector space.
These embeddings capture the semantic meaning or key features of the input, enabling efficient comparison and analysis.

A famous demonstration comes from `word2vec <https://arxiv.org/abs/1301.3781>`_, which shows how word embeddings capture semantic relationships through vector arithmetic:

.. image:: ../_static/img/word2vec.png
:width: 500
:align: center

Nowadays, embedders are capable of mapping sentences and even passages into vector space.
They are widely used in real-world tasks such as retrieval and clustering.
In the era of LLMs, embedding models play a pivotal role in RAG (Retrieval-Augmented Generation), enabling LLMs to access and integrate relevant context from vast external datasets.

Reranker
--------
========

A reranker, or cross-encoder, is a model that refines the ranking of candidate pairs (e.g., query-document pairs) by jointly encoding and scoring them.
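
For instance, a brief sketch using a BGE reranker (assuming FlagEmbedding is installed and the ``BAAI/bge-reranker-base`` checkpoint is used):

.. code-block:: python

    from FlagEmbedding import FlagReranker

    # The reranker jointly encodes each (query, passage) pair and
    # outputs a relevance score; higher means more relevant.
    reranker = FlagReranker('BAAI/bge-reranker-base')

    scores = reranker.compute_score([
        ["what is a panda?", "The giant panda is a bear species endemic to China."],
        ["what is a panda?", "Paris is the capital of France."],
    ])
    print(scores)  # the first pair should score higher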

