Commit
Deploying to master from @ jina-ai/website@f536c01 🚀
hanxiao committed Aug 27, 2024
1 parent a9ab965 commit 43e72f5
Showing 280 changed files with 1,478 additions and 1,478 deletions.
2 changes: 1 addition & 1 deletion about-us/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion assets/data.json
@@ -5,7 +5,7 @@
 "finetuner": "1.5K",
 "dalle-flow": "2.8K",
 "discoart": "3.8K",
-"clip-as-service": "12.3K",
+"clip-as-service": "12.4K",
 "jcloud": "297",
 "langchain-serve": "1.6K",
 "thinkgpt": "1.5K",
4 changes: 2 additions & 2 deletions contact-sales/index.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions embeddings/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion feed.rss
@@ -1,4 +1,4 @@
<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jina AI]]></title><description><![CDATA[The official newsroom of Jina AI]]></description><link>https://jina.ai/news</link><image><url>https://jina.ai/favicon.ico</url><title>Jina AI</title><link>https://jina.ai/news</link></image><generator>Ghost 5.90</generator><lastBuildDate>Tue, 27 Aug 2024 07:14:26 GMT</lastBuildDate><atom:link href="https://jina.ai/feed.rss" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The What and Why of Text-Image Modality Gap in CLIP Models]]></title><description><![CDATA[You can't just use a CLIP model to retrieve text and images and sort the results by score. Why? Because of the modality gap. What is it, and where does it come from?]]></description><link>https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/</link><guid isPermaLink="false">66c8431bda9a33000146d97d</guid><category><![CDATA[Tech Blog]]></category><dc:creator><![CDATA[Bo Wang]]></dc:creator><pubDate>Mon, 26 Aug 2024 13:56:36 GMT</pubDate><media:content url="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner-2.png" medium="image"/><content:encoded><![CDATA[<img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner-2.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><p><a href="https://jina.ai/news/embeddings-the-swiss-army-knife-of-ai?ref=jina-ai-gmbh.ghost.io">Semantic embeddings</a> are the core of modern AI models, even chatbots and AI art models. They&#x2019;re sometimes hidden from users, but they&#x2019;re still there, lurking just under the surface.</p><p>The theory of embeddings has only two parts:</p><ol><li>Things &#x2014; things outside of an AI model, like texts and images &#x2014; are represented by vectors created by AI models from data about those things.</li><li>Relationships between things outside of an AI model are represented by spatial relations between those vectors. We train AI models specifically to create vectors that work that way.</li></ol><p>When we make an image-text multimodal model, we train the model so that embeddings of pictures and embeddings of texts describing or related to those pictures are relatively close together. 
The semantic similarities between the things those two vectors represent &#x2014; an image and a text &#x2014; are reflected in the spatial relationship between the two vectors.</p><p>For example, we might reasonably expect the embedding vectors for an image of an orange and the text &#x201C;a fresh orange&#x201D; to be closer together than the same image and the text &#x201C;a fresh apple.&#x201D;</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare_2.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>That&#x2019;s the purpose of an embedding model: To generate representations where the characteristics we care about &#x2014; like what kind of fruit is depicted in an image or named in a text &#x2014; are preserved in the distance between them.</p><p>But multimodality introduces something else. We might find that a picture of an orange is closer to a picture of an apple than it is to the text &#x201C;a fresh orange&#x201D;, and that the text &#x201C;a fresh apple&#x201D; is closer to another text than to an image of an apple.</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>It turns out this is exactly what happens with multimodal models, including Jina AI&#x2019;s own <a href="https://jina.ai/news/jina-clip-v1-a-truly-multimodal-embeddings-model-for-text-and-image?ref=jina-ai-gmbh.ghost.io">Jina CLIP model</a> (<code>jina-clip-v1</code>).</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://arxiv.org/abs/2405.20204?ref=jina-ai-gmbh.ghost.io"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Jina CLIP: Your CLIP Model Is Also Your Text Retriever</div><div class="kg-bookmark-description">Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. 
We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://arxiv.org/static/browse/0.3.4/images/icons/apple-touch-icon.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><span class="kg-bookmark-author">arXiv.org</span><span class="kg-bookmark-publisher">Andreas Koukounas</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"></div></a></figure><p>To test this, we sampled 1,000 text-image pairs from the <a href="https://www.kaggle.com/datasets/adityajn105/flickr8k?ref=jina-ai-gmbh.ghost.io">Flickr8k test set</a>. Each pair contains five caption texts (so technically not a pair), and a single image, with all five texts describing the same image.</p><p>For example, the following image (<code>1245022983_fb329886dd.jpg</code> in the Flickr8k dataset):</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/1245022983_fb329886dd.jpg" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="334" height="500"></figure><p>Its five captions:</p><pre><code class="language-Text">A child in all pink is posing nearby a stroller with buildings in the distance.
<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Jina AI]]></title><description><![CDATA[The official newsroom of Jina AI]]></description><link>https://jina.ai/news</link><image><url>https://jina.ai/favicon.ico</url><title>Jina AI</title><link>https://jina.ai/news</link></image><generator>Ghost 5.90</generator><lastBuildDate>Tue, 27 Aug 2024 12:27:25 GMT</lastBuildDate><atom:link href="https://jina.ai/feed.rss" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The What and Why of Text-Image Modality Gap in CLIP Models]]></title><description><![CDATA[You can't just use a CLIP model to retrieve text and images and sort the results by score. Why? Because of the modality gap. What is it, and where does it come from?]]></description><link>https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/</link><guid isPermaLink="false">66c8431bda9a33000146d97d</guid><category><![CDATA[Tech Blog]]></category><dc:creator><![CDATA[Bo Wang]]></dc:creator><pubDate>Mon, 26 Aug 2024 13:56:36 GMT</pubDate><media:content url="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/modality-gap-banner.jpg" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><p><a href="https://jina.ai/news/embeddings-the-swiss-army-knife-of-ai?ref=jina-ai-gmbh.ghost.io">Semantic embeddings</a> are the core of modern AI models, even chatbots and AI art models. They&#x2019;re sometimes hidden from users, but they&#x2019;re still there, lurking just under the surface.</p><p>The theory of embeddings has only two parts:</p><ol><li>Things &#x2014; things outside of an AI model, like texts and images &#x2014; are represented by vectors created by AI models from data about those things.</li><li>Relationships between things outside of an AI model are represented by spatial relations between those vectors. We train AI models specifically to create vectors that work that way.</li></ol><p>When we make an image-text multimodal model, we train the model so that embeddings of pictures and embeddings of texts describing or related to those pictures are relatively close together. 
The semantic similarities between the things those two vectors represent &#x2014; an image and a text &#x2014; are reflected in the spatial relationship between the two vectors.</p><p>For example, we might reasonably expect the embedding vectors for an image of an orange and the text &#x201C;a fresh orange&#x201D; to be closer together than the same image and the text &#x201C;a fresh apple.&#x201D;</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare_2.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare_2.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>That&#x2019;s the purpose of an embedding model: To generate representations where the characteristics we care about &#x2014; like what kind of fruit is depicted in an image or named in a text &#x2014; are preserved in the distance between them.</p><p>But multimodality introduces something else. We might find that a picture of an orange is closer to a picture of an apple than it is to the text &#x201C;a fresh orange&#x201D;, and that the text &#x201C;a fresh apple&#x201D; is closer to another text than to an image of an apple.</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="1000" height="500" srcset="https://jina-ai-gmbh.ghost.io/content/images/size/w600/2024/08/apple-orange-compare.png 600w, https://jina-ai-gmbh.ghost.io/content/images/2024/08/apple-orange-compare.png 1000w" sizes="(min-width: 720px) 720px"></figure><p>It turns out this is exactly what happens with multimodal models, including Jina AI&#x2019;s own <a href="https://jina.ai/news/jina-clip-v1-a-truly-multimodal-embeddings-model-for-text-and-image?ref=jina-ai-gmbh.ghost.io">Jina CLIP model</a> (<code>jina-clip-v1</code>).</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://arxiv.org/abs/2405.20204?ref=jina-ai-gmbh.ghost.io"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Jina CLIP: Your CLIP Model Is Also Your Text Retriever</div><div class="kg-bookmark-description">Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. 
We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://arxiv.org/static/browse/0.3.4/images/icons/apple-touch-icon.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"><span class="kg-bookmark-author">arXiv.org</span><span class="kg-bookmark-publisher">Andreas Koukounas</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" alt="The What and Why of Text-Image Modality Gap in CLIP Models"></div></a></figure><p>To test this, we sampled 1,000 text-image pairs from the <a href="https://www.kaggle.com/datasets/adityajn105/flickr8k?ref=jina-ai-gmbh.ghost.io">Flickr8k test set</a>. Each pair contains five caption texts (so technically not a pair), and a single image, with all five texts describing the same image.</p><p>For example, the following image (<code>1245022983_fb329886dd.jpg</code> in the Flickr8k dataset):</p><figure class="kg-card kg-image-card"><img src="https://jina-ai-gmbh.ghost.io/content/images/2024/08/1245022983_fb329886dd.jpg" class="kg-image" alt="The What and Why of Text-Image Modality Gap in CLIP Models" loading="lazy" width="334" height="500"></figure><p>Its five captions:</p><pre><code class="language-Text">A child in all pink is posing nearby a stroller with buildings in the distance.
A little girl in pink dances with her hands on her hips.
A small girl wearing pink dances on the sidewalk.
The girl in a bright pink skirt dances near a stroller.
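The updated feed.rss entry above walks through comparing image and caption embeddings from jina-clip-v1 (sampling Flickr8k image-caption pairs and measuring their similarities). As a rough illustration, not part of this commit, here is a minimal sketch of that kind of comparison, assuming the transformers usage shown on the jina-clip-v1 model card; the image path and captions are placeholder examples.

```python
# Minimal sketch: embed one image and two candidate captions, then compare
# cosine similarities. Image path and captions are illustrative placeholders.
import numpy as np
from transformers import AutoModel

# trust_remote_code is needed because jina-clip-v1 ships custom modeling code
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

captions = ["a fresh orange", "a fresh apple"]
image = "orange.jpg"  # placeholder: local path or URL to an image of an orange

text_embs = np.asarray(model.encode_text(captions))     # one vector per caption
image_emb = np.asarray(model.encode_image([image]))[0]  # one vector for the image

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for caption, emb in zip(captions, text_embs):
    print(f"{caption!r} vs image: {cosine(emb, image_emb):.3f}")
```

If the model behaves as the post expects, the matching caption scores higher than the mismatched one; the modality gap the post describes shows up when image-image similarities sit systematically above even well-matched image-text similarities.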
2 changes: 1 addition & 1 deletion fine-tuning/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion internship/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion legal/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions news/a-deep-dive-into-tokenization/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions news/a-tale-of-two-worlds-emnlp-2023-at-sentosa/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion news/asset_ovi/index.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion news/berlin-tech-job-fair/index.html

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions news/binary-embeddings-all-the-ai-3125-of-the-fat/index.html

Large diffs are not rendered by default.
