feat: Integrating ChemTEB #1708

Merged
82 commits merged on Jan 25, 2025
Changes from 78 commits
Commits (82)
dfa6f84
Add SMILES, AI Paraphrase and Inter-Source Paragraphs PairClassificat…
HSILA Aug 10, 2024
d0d94db
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 10, 2024
1190c02
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 12, 2024
b56e017
Add chemical subsets of NQ and HotpotQA datasets as Retrieval tasks
HSILA Aug 12, 2024
678dbc9
Add PubChem Synonyms PairClassification task
HSILA Aug 12, 2024
9c8f7f5
Update task __init__ for previously added tasks
HSILA Aug 12, 2024
5e31208
Add nomic-bert loader
HSILA Aug 12, 2024
20f69a2
Merge branch 'chemteb' of https://github.com/basf/chemteb into chemteb
HSILA Aug 12, 2024
9806073
Add a script to run the evaluation pipeline for chemical-related tasks
HSILA Aug 12, 2024
09d2fee
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 17, 2024
947e07a
Add 15 Wikipedia article classification tasks
HSILA Aug 17, 2024
842af8e
Merge branch 'chemteb' of https://github.com/basf/chemteb into chemteb
HSILA Aug 17, 2024
47b550f
Add PairClassification and BitextMining tasks for Coconut SMILES
HSILA Aug 17, 2024
79d9111
Fix naming of some Classification and PairClassification tasks
HSILA Aug 18, 2024
17f8be1
Fix some classification tasks naming issues
HSILA Aug 18, 2024
bb77955
Integrate WANDB with benchmarking script
HSILA Aug 19, 2024
d287801
Update .gitignore
HSILA Aug 19, 2024
b3a4f72
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 19, 2024
0ec882b
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 20, 2024
107fba5
Fix `nomic_models.py` issue with retrieval tasks, similar to issue #1…
HSILA Aug 20, 2024
82aa559
Add one chemical model and some SentenceTransformer models
HSILA Aug 21, 2024
90c5ecb
Fix a naming issue for SentenceTransformer models
HSILA Aug 21, 2024
0078b2d
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 25, 2024
7fcd3ea
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Aug 27, 2024
85b6ba5
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Sep 5, 2024
6bc75fc
Merge branch 'chemteb' of https://github.com/basf/chemteb into chemteb
HSILA Sep 5, 2024
0c2deda
Add OpenAI, bge-m3 and matscibert models
HSILA Sep 5, 2024
4e9f309
Add PubChem SMILES Bitext Mining tasks
HSILA Sep 8, 2024
52d1831
Change metric namings to be more descriptive
HSILA Sep 8, 2024
5c4e501
Add English e5 and bge v1 models, all the sizes
HSILA Sep 8, 2024
30996cd
Add two Wikipedia Clustering tasks
HSILA Sep 8, 2024
c9198b0
Add a try-except in evaluation script to skip faulty models during th…
HSILA Sep 8, 2024
043e6fd
Add bge v1.5 models and clustering score extraction to json parser
HSILA Sep 8, 2024
1be7bf2
Add Amazon Titan embedding models
HSILA Sep 9, 2024
655f699
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Sep 9, 2024
69adb9c
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Sep 11, 2024
1e950e8
Add Cohere Bedrock models
HSILA Sep 11, 2024
2116f8c
Add two SDS Classification tasks
HSILA Sep 11, 2024
0c415f1
Add SDS Classification tasks to classification init and chem_eval
HSILA Sep 12, 2024
63fcb6a
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Sep 17, 2024
0fe4eb0
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Sep 19, 2024
b73613b
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Sep 20, 2024
737b554
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Oct 1, 2024
31404f9
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Oct 3, 2024
63b416f
Add a retrieval dataset, update dataset names and revisions
HSILA Oct 3, 2024
95fd764
Merge branch 'chemteb' of https://github.com/basf/chemteb into chemteb
HSILA Oct 3, 2024
f1a36e7
Update revision for the CoconutRetrieval dataset: handle duplicate SM…
HSILA Oct 3, 2024
16ce7b9
Update `CoconutSMILES2FormulaPC` task
HSILA Oct 4, 2024
0ecb130
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Oct 7, 2024
fbd11ba
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Oct 10, 2024
7c10fa9
Change CoconutRetrieval dataset to a smaller one
HSILA Oct 10, 2024
5fc8100
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Oct 15, 2024
4906ca2
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Oct 17, 2024
10dae45
Merge remote-tracking branch 'upstream/main' into chemteb
HSILA Oct 21, 2024
84d7ae7
Merge remote-tracking branch 'upstream/main' into chemteb
HSILA Dec 13, 2024
b5596a9
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Dec 13, 2024
1d6b06a
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Dec 16, 2024
a51aec8
Merge remote-tracking branch 'upstream/main' into chemteb
HSILA Dec 25, 2024
9572ded
Merge branch 'chemteb' of https://github.com/basf/chemteb into chemteb
HSILA Dec 25, 2024
8721a65
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Jan 2, 2025
4e92589
Update some models
HSILA Jan 2, 2025
cea1063
Fix a typo
HSILA Jan 2, 2025
47668ac
Update ChemTEB tasks
HSILA Jan 4, 2025
4883f42
Merge branch 'embeddings-benchmark:main' into chemteb
HSILA Jan 4, 2025
4864dc0
Remove unnecessary files and tasks for MTEB
HSILA Jan 4, 2025
c68d696
Update some ChemTEB tasks
HSILA Jan 5, 2025
20637f6
Create ChemTEB benchmark
HSILA Jan 5, 2025
5c04d06
Remove `CoconutRetrieval`
HSILA Jan 5, 2025
c947012
Update tasks and benchmarks tables with ChemTEB
HSILA Jan 5, 2025
add930c
Mention ChemTEB in readme
HSILA Jan 5, 2025
ea75d39
Fix some issues, update task metadata, lint
HSILA Jan 5, 2025
256d6d8
Remove `nomic_bert_model.py` as it is now compatible with SentenceTra…
HSILA Jan 5, 2025
3475017
Remove `WikipediaAIParagraphsParaphrasePC` task due to being trivial.
HSILA Jan 5, 2025
da4ef35
Merge `amazon_models` and `cohere_bedrock_models.py` into `bedrock_mo…
HSILA Jan 5, 2025
f50cd66
Remove unnecessary `load_data` for some tasks.
HSILA Jan 5, 2025
1dd051c
Update `bedrock_models.py`, `openai_models.py` and two dataset revisions
HSILA Jan 8, 2025
7b93330
Add a layer of dynamic truncation for amazon models in `bedrock_model…
HSILA Jan 11, 2025
ead1026
Merge branch 'embeddings-benchmark:main' into mteb
HSILA Jan 11, 2025
064d053
Replace `metadata_dict` with `self.metadata` in `PubChemSMILESPC.py`
HSILA Jan 12, 2025
62f99b0
Merge remote-tracking branch 'upstream/main' into mteb
HSILA Jan 23, 2025
0156440
fix model meta for bedrock models
HSILA Jan 23, 2025
c630053
Add reference comment to original Cohere API implementation
HSILA Jan 24, 2025
1 change: 1 addition & 0 deletions README.md
@@ -517,5 +517,6 @@ You may also want to read and cite the amazing work that has extended MTEB & int
- Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini. "[FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions](https://arxiv.org/abs/2403.15246)" arXiv 2024
- Dawei Zhu, Liang Wang, Nan Yang, Yifan Song, Wenhao Wu, Furu Wei, Sujian Li. "[LongEmbed: Extending Embedding Models for Long Context Retrieval](https://arxiv.org/abs/2404.12096)" arXiv 2024
- Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo. "[The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding](https://arxiv.org/abs/2406.02396)" arXiv 2024
- Ali Shiraee Kasmaee, Mohammad Khodadad, Mohammad Arshi Saloot, Nick Sherck, Stephen Dokas, Hamidreza Mahyar, Soheila Samiee. "[ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance & Efficiency on a Specific Domain](https://arxiv.org/abs/2412.00532)" arXiv 2024

For works that have used MTEB for benchmarking, you can find them on the [leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
33 changes: 22 additions & 11 deletions docs/benchmarks.md

Large diffs are not rendered by default.

55 changes: 41 additions & 14 deletions docs/tasks.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions mteb/abstasks/TaskMetadata.py
@@ -70,6 +70,7 @@
"Web",
"Written",
"Programming",
"Chemistry",
]

SAMPLE_CREATION_METHOD = Literal[
44 changes: 44 additions & 0 deletions mteb/benchmarks/benchmarks.py
@@ -1037,3 +1037,47 @@ def load_results(
reference="https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6",
citation=None,
)

CHEMTEB = Benchmark(
name="ChemTEB",
tasks=get_tasks(
tasks=[
"PubChemSMILESBitextMining",
"SDSEyeProtectionClassification",
"SDSGlovesClassification",
"WikipediaBioMetChemClassification",
"WikipediaGreenhouseEnantiopureClassification",
"WikipediaSolidStateColloidalClassification",
"WikipediaOrganicInorganicClassification",
"WikipediaCryobiologySeparationClassification",
"WikipediaChemistryTopicsClassification",
"WikipediaTheoreticalAppliedClassification",
"WikipediaChemFieldsClassification",
"WikipediaLuminescenceClassification",
"WikipediaIsotopesFissionClassification",
"WikipediaSaltsSemiconductorsClassification",
"WikipediaBiolumNeurochemClassification",
"WikipediaCrystallographyAnalyticalClassification",
"WikipediaCompChemSpectroscopyClassification",
"WikipediaChemEngSpecialtiesClassification",
"WikipediaChemistryTopicsClustering",
"WikipediaSpecialtiesInChemistryClustering",
"PubChemAISentenceParaphrasePC",
"PubChemSMILESPC",
"PubChemSynonymPC",
"PubChemWikiParagraphsPC",
"PubChemWikiPairClassification",
"ChemNQRetrieval",
"ChemHotpotQARetrieval",
],
),
description="ChemTEB evaluates the performance of text embedding models on chemical domain data.",
reference="https://arxiv.org/abs/2412.00532",
citation="""
@article{kasmaee2024chemteb,
title={ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
author={Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
journal={arXiv preprint arXiv:2412.00532},
year={2024}
}""",
)
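
For context, a minimal sketch of how the new benchmark can be run once this change is installed. The model name below is only an example and is not part of this PR; `mteb.get_benchmark`, `mteb.MTEB`, and `mteb.get_model` are the existing public entry points.

import mteb

benchmark = mteb.get_benchmark("ChemTEB")  # resolves the Benchmark object defined above
evaluation = mteb.MTEB(tasks=benchmark)  # a Benchmark can be passed directly as the task selection
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")  # example model only
results = evaluation.run(model, output_folder="results/chemteb")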
254 changes: 254 additions & 0 deletions mteb/models/bedrock_models.py
@@ -0,0 +1,254 @@
from __future__ import annotations

import json
import logging
import re
from functools import partial
from typing import Any

import numpy as np
import tqdm

from mteb.encoder_interface import PromptType
from mteb.model_meta import ModelMeta
from mteb.models.cohere_models import model_prompts as cohere_model_prompts
from mteb.models.cohere_models import supported_languages as cohere_supported_languages
from mteb.requires_package import requires_package

from .wrapper import Wrapper

logger = logging.getLogger(__name__)


class BedrockWrapper(Wrapper):
def __init__(
self,
model_id: str,
provider: str,
max_tokens: int,
model_prompts: dict[str, str] | None = None,
**kwargs,
) -> None:
requires_package(self, "boto3", "The AWS SDK for Python")
import boto3

boto3_session = boto3.session.Session()
region_name = boto3_session.region_name
self._client = boto3.client("bedrock-runtime", region_name)

self._model_id = model_id
self._provider = provider.lower()

if self._provider == "cohere":
self.model_prompts = (
self.validate_task_to_prompt_name(model_prompts)
if model_prompts
else None
)
self._max_batch_size = 96
self._max_sequence_length = max_tokens * 4
else:
self._max_tokens = max_tokens

def encode(
self,
sentences: list[str],
*,
task_name: str | None = None,
prompt_type: PromptType | None = None,
**kwargs: Any,
) -> np.ndarray:
requires_package(self, "boto3", "Amazon Bedrock")
show_progress_bar = (
False
if "show_progress_bar" not in kwargs
else kwargs.pop("show_progress_bar")
)
if self._provider == "amazon":
return self._encode_amazon(sentences, show_progress_bar)
elif self._provider == "cohere":
prompt_name = self.get_prompt_name(
self.model_prompts, task_name, prompt_type
)
cohere_task_type = self.model_prompts.get(prompt_name, "search_document")
return self._encode_cohere(sentences, cohere_task_type, show_progress_bar)
else:
raise ValueError(
f"Unknown provider '{self._provider}'. Must be 'amazon' or 'cohere'."
)

def _encode_amazon(
self, sentences: list[str], show_progress_bar: bool = False
) -> np.ndarray:
from botocore.exceptions import ValidationError

all_embeddings = []
# https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html
max_sequence_length = int(self._max_tokens * 4.5)

for sentence in tqdm.tqdm(
sentences, leave=False, disable=not show_progress_bar
):
if len(sentence) > max_sequence_length:
truncated_sentence = sentence[:max_sequence_length]
else:
truncated_sentence = sentence

try:
embedding = self._embed_amazon(truncated_sentence)
all_embeddings.append(embedding)

except ValidationError as e:
error_str = str(e)
pattern = r"request input token count:\s*(\d+)"
match = re.search(pattern, error_str)
if match:
num_tokens = int(match.group(1))

ratio = 0.9 * (self._max_tokens / num_tokens)
dynamic_cutoff = int(len(truncated_sentence) * ratio)

embedding = self._embed_amazon(truncated_sentence[:dynamic_cutoff])
all_embeddings.append(embedding)
else:
raise e

return np.array(all_embeddings)

def _encode_cohere(
self,
sentences: list[str],
cohere_task_type: str,
show_progress_bar: bool = False,
) -> np.ndarray:
batches = [
sentences[i : i + self._max_batch_size]
for i in range(0, len(sentences), self._max_batch_size)
]

all_embeddings = []

for batch in tqdm.tqdm(batches, leave=False, disable=not show_progress_bar):
response = self._client.invoke_model(
body=json.dumps(
{
"texts": [sent[: self._max_sequence_length] for sent in batch],
"input_type": cohere_task_type,
}
),
modelId=self._model_id,
accept="*/*",
contentType="application/json",
)
all_embeddings.extend(self._to_numpy(response))

return np.array(all_embeddings)

def _embed_amazon(self, sentence: str) -> np.ndarray:
response = self._client.invoke_model(
body=json.dumps({"inputText": sentence}),
modelId=self._model_id,
accept="application/json",
contentType="application/json",
)
return self._to_numpy(response)

def _to_numpy(self, embedding_response) -> np.ndarray:
response = json.loads(embedding_response.get("body").read())
key = "embedding" if self._provider == "amazon" else "embeddings"
return np.array(response[key])


amazon_titan_embed_text_v1 = ModelMeta(
name="bedrock/amazon-titan-embed-text-v1",
revision="1",
release_date="2023-09-27",
languages=None, # not specified
loader=partial(
BedrockWrapper,
model_id="amazon.titan-embed-text-v1",
provider="amazon",
max_tokens=8192,
),
max_tokens=8192,
embed_dim=1536,
open_weights=False,
n_parameters=None,
memory_usage=None,
license=None,
reference="https://aws.amazon.com/about-aws/whats-new/2023/09/amazon-titan-embeddings-generally-available/",
similarity_fn_name="cosine",
framework=["API"],
use_instructions=False,
)

amazon_titan_embed_text_v2 = ModelMeta(
name="bedrock/amazon-titan-embed-text-v2",
revision="1",
release_date="2024-04-30",
languages=None, # not specified
loader=partial(
BedrockWrapper,
model_id="amazon.titan-embed-text-v2:0",
provider="amazon",
max_tokens=8192,
),
max_tokens=8192,
embed_dim=1024,
open_weights=False,
n_parameters=None,
memory_usage=None,
license=None,
reference="https://aws.amazon.com/about-aws/whats-new/2024/04/amazon-titan-text-embeddings-v2-amazon-bedrock/",
similarity_fn_name="cosine",
framework=["API"],
use_instructions=False,
)

cohere_embed_english_v3 = ModelMeta(
loader=partial(
BedrockWrapper,
model_id="cohere.embed-english-v3",
provider="cohere",
max_tokens=512,
model_prompts=cohere_model_prompts,
),
name="bedrock/cohere-embed-english-v3",
languages=["eng-Latn"],
open_weights=False,
reference="https://cohere.com/blog/introducing-embed-v3",
revision="1",
release_date="2023-11-02",
n_parameters=None,
memory_usage=None,
max_tokens=512,
embed_dim=1024,
license=None,
similarity_fn_name="cosine",
framework=["API"],
use_instructions=True,
)

cohere_embed_multilingual_v3 = ModelMeta(
loader=partial(
BedrockWrapper,
model_id="cohere.embed-multilingual-v3",
provider="cohere",
max_tokens=512,
model_prompts=cohere_model_prompts,
),
name="bedrock/cohere-embed-multilingual-v3",
languages=cohere_supported_languages,
open_weights=False,
reference="https://cohere.com/blog/introducing-embed-v3",
revision="1",
release_date="2023-11-02",
n_parameters=None,
memory_usage=None,
max_tokens=512,
embed_dim=1024,
license=None,
similarity_fn_name="cosine",
framework=["API"],
use_instructions=True,
)
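
The wrapper sends Amazon Titan inputs one request at a time, pre-truncating by a characters-per-token heuristic and retrying once with a tighter cutoff when the service reports the actual input token count, while Cohere inputs are sent in batches of up to 96. A minimal usage sketch, assuming valid AWS credentials and a default region are configured for boto3; the task name below is only an example and is not part of this PR:

import mteb

model = mteb.get_model("bedrock/amazon-titan-embed-text-v2")  # resolved via the ModelMeta above
tasks = mteb.get_tasks(tasks=["PubChemSMILESBitextMining"])  # example ChemTEB task
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/bedrock")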
2 changes: 2 additions & 0 deletions mteb/models/overview.py
@@ -13,6 +13,7 @@
from mteb.model_meta import ModelMeta
from mteb.models import (
arctic_models,
bedrock_models,
bge_models,
bm25,
cohere_models,
@@ -86,6 +87,7 @@
jasper_models,
uae_models,
stella_models,
bedrock_models,
uae_models,
voyage_models,
]
1 change: 1 addition & 0 deletions mteb/tasks/BitextMining/__init__.py
@@ -1,6 +1,7 @@
from __future__ import annotations

from .dan.BornholmskBitextMining import *
from .eng.PubChemSMILESBitextMining import *
from .kat.TbilisiCityHallBitextMining import *
from .multilingual.BibleNLPBitextMining import *
from .multilingual.BUCCBitextMining import *