Commit
Deploying to master from @ jina-ai/website@f536c01 🚀
hanxiao committed Aug 27, 2024
1 parent a9ab965 commit 43e72f5
Showing 280 changed files with 1,478 additions and 1,478 deletions.
2 changes: 1 addition & 1 deletion about-us/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion assets/data.json
@@ -5,7 +5,7 @@
 "finetuner": "1.5K",
 "dalle-flow": "2.8K",
 "discoart": "3.8K",
-"clip-as-service": "12.3K",
+"clip-as-service": "12.4K",
 "jcloud": "297",
 "langchain-serve": "1.6K",
 "thinkgpt": "1.5K",
4 changes: 2 additions & 2 deletions contact-sales/index.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions embeddings/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion feed.rss
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jina AI]]></title><description><![CDATA[The official newsroom of Jina AI]]></description><link>https://jina.ai/news</link><image><url>https://jina.ai/favicon.ico</url><title>Jina AI</title><link>https://jina.ai/news</link></image><generator>Ghost 5.90</generator><lastBuildDate>Tue, 27 Aug 2024 07:14:26 GMT</lastBuildDate><atom:link href="https://jina.ai/feed.rss" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The What and Why of Text-Image Modality Gap in CLIP Models]]></title><description><![CDATA[You can't just use a CLIP model to retrieve text and images and sort the results by score. Why? Because of the modality gap. What is it, and where does it come from?]]></description><link>https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/</link><guid isPermaLink="false">66c8431bda9a33000146d97d</guid><category><![CDATA[Tech Blog]]></category><dc:creator><![CDATA[Bo Wang]]></dc:creator><pubDate>Mon, 26 Aug 2024 13:56:36 GMT</pubDate><media:content url="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner-2.png" medium="image"/><content:encoded><![CDATA[<img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner-2.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><p><a href="https://jina.ai/news/embeddings-the-swiss-army-knife-of-ai?ref=jina-ai-gmbh.ghost.io">Semantic embeddings</a> are the core of modern AI models, even chatbots and AI art models. They&#x2019;re sometimes hidden from users, but they&#x2019;re still there, lurking just under the surface.</p><p>The theory of embeddings has only two parts:</p><ol><li>Things &#x2014; things outside of an AI model, like texts and images &#x2014; are represented by vectors created by AI models from data about those things.</li><li>Relationships between things outside of an AI model are represented by spatial relations between those vectors. We train AI models specifically to create vectors that work that way.</li></ol><p>When we make an image-text multimodal model, we train the model so that embeddings of pictures and embeddings of texts describing or related to those pictures are relatively close together. 
The semantic similarities between the things those two vectors represent &#x2014; an image and a text &#x2014; are reflected in the spatial relationship between the two vectors.</p><p>For example, we might reasonably expect the embedding vectors for an image of an orange and the text &#x201C;a fresh orange&#x201D; to be closer together than the same image and the text &#x201C;a fresh apple.&#x201D;</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare_2.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>That&#x2019;s the purpose of an embedding model: To generate representations where the characteristics we care about &#x2014; like what kind of fruit is depicted in an image or named in a text &#x2014; are preserved in the distance between them.</p><p>But multimodality introduces something else. We might find that a picture of an orange is closer to a picture of an apple than it is to the text &#x201C;a fresh orange&#x201D;, and that the text &#x201C;a fresh apple&#x201D; is closer to another text than to an image of an apple.</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>It turns out this is exactly what happens with multimodal models, including Jina AI&#x2019;s own <a href="https://jina.ai/news/jina-clip-v1-a-truly-multimodal-embeddings-model-for-text-and-image?ref=jina-ai-gmbh.ghost.io">Jina CLIP model</a> (<code>jina-clip-v1</code>).</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://arxiv.org/abs/2405.20204?ref=jina-ai-gmbh.ghost.io"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Jina CLIP: Your CLIP Model Is Also Your Text Retriever</div><div class="kg-bookmark-description">Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. 
We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://arxiv.org/static/browse/0.3.4/images/icons/apple-touch-icon.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><span class="kg-bookmark-author">arXiv.org</span><span class="kg-bookmark-publisher">Andreas Koukounas</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"></div></a></figure><p>To test this, we sampled 1,000 text-image pairs from the <a href="https://www.kaggle.com/datasets/adityajn105/flickr8k?ref=jina-ai-gmbh.ghost.io">Flickr8k test set</a>. Each pair contains five caption texts (so technically not a pair), and a single image, with all five texts describing the same image.</p><p>For example, the following image (<code>1245022983_fb329886dd.jpg</code> in the Flickr8k dataset):</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/1245022983_fb329886dd.jpg" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="334" height="500"></figure><p>Its five captions:</p><pre><code class="language-Text">A child in all pink is posing nearby a stroller with buildings in the distance.
<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jina AI]]></title><description><![CDATA[The official newsroom of Jina AI]]></description><link>https://jina.ai/news</link><image><url>https://jina.ai/favicon.ico</url><title>Jina AI</title><link>https://jina.ai/news</link></image><generator>Ghost 5.90</generator><lastBuildDate>Tue, 27 Aug 2024 12:27:25 GMT</lastBuildDate><atom:link href="https://jina.ai/feed.rss" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The What and Why of Text-Image Modality Gap in CLIP Models]]></title><description><![CDATA[You can't just use a CLIP model to retrieve text and images and sort the results by score. Why? Because of the modality gap. What is it, and where does it come from?]]></description><link>https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/</link><guid isPermaLink="false">66c8431bda9a33000146d97d</guid><category><![CDATA[Tech Blog]]></category><dc:creator><![CDATA[Bo Wang]]></dc:creator><pubDate>Mon, 26 Aug 2024 13:56:36 GMT</pubDate><media:content url="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner.jpg" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><p><a href="https://jina.ai/news/embeddings-the-swiss-army-knife-of-ai?ref=jina-ai-gmbh.ghost.io">Semantic embeddings</a> are the core of modern AI models, even chatbots and AI art models. They&#x2019;re sometimes hidden from users, but they&#x2019;re still there, lurking just under the surface.</p><p>The theory of embeddings has only two parts:</p><ol><li>Things &#x2014; things outside of an AI model, like texts and images &#x2014; are represented by vectors created by AI models from data about those things.</li><li>Relationships between things outside of an AI model are represented by spatial relations between those vectors. We train AI models specifically to create vectors that work that way.</li></ol><p>When we make an image-text multimodal model, we train the model so that embeddings of pictures and embeddings of texts describing or related to those pictures are relatively close together. 
The semantic similarities between the things those two vectors represent &#x2014; an image and a text &#x2014; are reflected in the spatial relationship between the two vectors.</p><p>For example, we might reasonably expect the embedding vectors for an image of an orange and the text &#x201C;a fresh orange&#x201D; to be closer together than the same image and the text &#x201C;a fresh apple.&#x201D;</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare_2.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>That&#x2019;s the purpose of an embedding model: To generate representations where the characteristics we care about &#x2014; like what kind of fruit is depicted in an image or named in a text &#x2014; are preserved in the distance between them.</p><p>But multimodality introduces something else. We might find that a picture of an orange is closer to a picture of an apple than it is to the text &#x201C;a fresh orange&#x201D;, and that the text &#x201C;a fresh apple&#x201D; is closer to another text than to an image of an apple.</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>It turns out this is exactly what happens with multimodal models, including Jina AI&#x2019;s own <a href="https://jina.ai/news/jina-clip-v1-a-truly-multimodal-embeddings-model-for-text-and-image?ref=jina-ai-gmbh.ghost.io">Jina CLIP model</a> (<code>jina-clip-v1</code>).</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://arxiv.org/abs/2405.20204?ref=jina-ai-gmbh.ghost.io"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Jina CLIP: Your CLIP Model Is Also Your Text Retriever</div><div class="kg-bookmark-description">Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. 
We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://arxiv.org/static/browse/0.3.4/images/icons/apple-touch-icon.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><span class="kg-bookmark-author">arXiv.org</span><span class="kg-bookmark-publisher">Andreas Koukounas</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"></div></a></figure><p>To test this, we sampled 1,000 text-image pairs from the <a href="https://www.kaggle.com/datasets/adityajn105/flickr8k?ref=jina-ai-gmbh.ghost.io">Flickr8k test set</a>. Each pair contains five caption texts (so technically not a pair), and a single image, with all five texts describing the same image.</p><p>For example, the following image (<code>1245022983_fb329886dd.jpg</code> in the Flickr8k dataset):</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/1245022983_fb329886dd.jpg" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="334" height="500"></figure><p>Its five captions:</p><pre><code class="language-Text">A child in all pink is posing nearby a stroller with buildings in the distance.
A little girl in pink dances with her hands on her hips.
A small girl wearing pink dances on the sidewalk.
The girl in a bright pink skirt dances near a stroller.
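The updated feed.rss entry above walks through comparing image and caption embeddings from jina-clip-v1 (sampling Flickr8k image-caption pairs and measuring their similarities). As a rough illustration, not part of this commit, here is a minimal sketch of that kind of comparison, assuming the transformers usage shown on the jina-clip-v1 model card; the image path and captions are placeholder examples.

```python
# Minimal sketch: embed one image and two candidate captions, then compare
# cosine similarities. Image path and captions are illustrative placeholders.
import numpy as np
from transformers import AutoModel

# trust_remote_code is needed because jina-clip-v1 ships custom modeling code
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

captions = ["a fresh orange", "a fresh apple"]
image = "orange.jpg"  # placeholder: local path or URL to an image of an orange

text_embs = np.asarray(model.encode_text(captions))     # one vector per caption
image_emb = np.asarray(model.encode_image([image]))[0]  # one vector for the image

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for caption, emb in zip(captions, text_embs):
    print(f"{caption!r} vs image: {cosine(emb, image_emb):.3f}")
```

If the model behaves as the post expects, the matching caption scores higher than the mismatched one; the modality gap the post describes shows up when image-image similarities sit systematically above even well-matched image-text similarities.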
2 changes: 1 addition & 1 deletion fine-tuning/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion internship/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion legal/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions news/a-deep-dive-into-tokenization/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions news/a-tale-of-two-worlds-emnlp-2023-at-sentosa/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion news/asset_ovi/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion news/berlin-tech-job-fair/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions news/binary-embeddings-all-the-ai-3125-of-the-fat/index.html

Large diffs are not rendered by default.
