
Add model jxm/cde-small-v1 #1521

Open · wants to merge 29 commits into main (changes shown from 12 commits)

Commits
d894bbc
Fix verbosity handling in MTEB.py for consistent logging
YashDThapliyal Nov 4, 2024
0dc2a8a
updates
YashDThapliyal Nov 4, 2024
00032aa
update docstrings
YashDThapliyal Nov 4, 2024
7ae0583
linting code
YashDThapliyal Nov 4, 2024
838253d
Create cde-small-v1_model.py
YashDThapliyal Nov 27, 2024
eb04a87
update code for cde-small-v1 model
YashDThapliyal Nov 28, 2024
be6790e
Merge branch 'main' of https://github.com/embeddings-benchmark/mteb
YashDThapliyal Nov 28, 2024
5dc100f
Merge branch 'embeddings-benchmark:main' into main
YashDThapliyal Nov 28, 2024
c95b672
make lint and make test
YashDThapliyal Nov 28, 2024
3a692ec
Merge branch 'main' of https://github.com/YashDThapliyal/mteb
YashDThapliyal Nov 28, 2024
6ab8ffb
Update cde-small-v1_model.py
YashDThapliyal Nov 28, 2024
36d6702
Update cde-small-v1_model.py
YashDThapliyal Nov 28, 2024
200029d
create a test
YashDThapliyal Dec 23, 2024
5f48218
Merge branch 'embeddings-benchmark:main' into main
YashDThapliyal Dec 23, 2024
64b1c41
add model meta data card
YashDThapliyal Dec 23, 2024
93b5794
Merge branch 'main' of https://github.com/YashDThapliyal/mteb
YashDThapliyal Dec 24, 2024
9a25f97
remove zero_shot_benchmark as discussed on PR
YashDThapliyal Dec 24, 2024
06f5f01
clean up comments/add liscense
YashDThapliyal Dec 24, 2024
3a2d2e3
begin implementing cde
YashDThapliyal Dec 25, 2024
9222601
add corpus for the model to use
YashDThapliyal Dec 25, 2024
d5413fd
add model implementation via following HF refrence
YashDThapliyal Dec 25, 2024
e269da6
syntax error fix (delete ';' )
YashDThapliyal Dec 25, 2024
8af60b6
Update cde-small-v1_model.py
YashDThapliyal Dec 25, 2024
919c2b1
update implementation code
YashDThapliyal Dec 25, 2024
5be4ff5
Update cde-small-v1_model.py
YashDThapliyal Dec 25, 2024
5a05876
add results folder
YashDThapliyal Dec 25, 2024
449111e
Delete mteb/models/results directory
YashDThapliyal Dec 25, 2024
9815f36
results directory
YashDThapliyal Dec 25, 2024
d2ab89d
Merge remote-tracking branch 'upstream/main'
YashDThapliyal Jan 11, 2025
2 changes: 1 addition & 1 deletion mteb/leaderboard/table.py
@@ -101,7 +101,7 @@ def get_means_per_types(df: pd.DataFrame) -> pd.DataFrame:
 def failsafe_get_model_meta(model_name):
     try:
         return get_model_meta(model_name)
-    except Exception as e:
+    except Exception:
Collaborator:
Since your PR is not concerned with the leaderboard, you probably shouldn't put changes in it related to that.

Contributor Author:
Yes, I believe that was a result of running make lint (the unused `e` binding gets flagged); however, I can leave that out.

         return None
34 changes: 34 additions & 0 deletions mteb/models/cde-small-v1_model.py
@@ -0,0 +1,34 @@
from __future__ import annotations

import mteb

model = mteb.get_model(
Collaborator:

I'm not sure if I understand this correctly, but it seems like you did not add a model implementation or model metadata for CDE. I'm also unsure whether this would work or not. Their official guide on how to use CDE is more involved than this: all of their examples have a first and a second stage, where they first produce a corpus embedding and then pass it along to the model when embedding new documents.
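(For context, a minimal sketch of that two-stage flow, following the Sentence Transformers example on the jxm/cde-small-v1 model card; the prompt names, the dataset_embeddings keyword, and the sample corpus below come from or are assumed after that card, not from this PR:)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

# Stage 1: embed a small sample of the target corpus; the model card
# samples on the order of 512 documents for this step.
minicorpus = ["first sampled document ...", "second sampled document ..."]
dataset_embeddings = model.encode(
    minicorpus, prompt_name="document", convert_to_tensor=True
)

# Stage 2: embed documents and queries, conditioning on the corpus embeddings.
doc_embeddings = model.encode(
    ["a new document to index"],
    prompt_name="document",
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)
query_embeddings = model.encode(
    ["a search query"],
    prompt_name="query",
    dataset_embeddings=dataset_embeddings,
    convert_to_tensor=True,
)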

Collaborator:

I see, but I still think it's better not to implement the model incorrectly here; maybe just add metadata for it, then ask the CDE team to upload their results to the results repository.
I don't see much value in adding a script here that does not use CDEs as they are supposed to be used.

Collaborator:

I agree with you. I added the evaluation script just for information, to show the authors' implementation.

Contributor Author:

@x-tabdeveloping

I didn't explicitly define the model metadata because when I ran the mteb.get_model_meta command, the output seemed correct. However, I may have misunderstood and overlooked the need to explicitly define the model metadata.

I also have the results repository from when I ran the script. Should I disregard that?

I'm a bit unsure about the next steps I should take. I would appreciate your guidance—thank you!
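(For anyone following along: explicitly defining metadata means adding a ModelMeta entry under mteb/models. A rough sketch is below; the field set has changed across mteb versions, so treat the fields and values here as assumptions to check against the current ModelMeta definition, not as a working entry:)

from mteb.model_meta import ModelMeta

cde_small_v1 = ModelMeta(
    name="jxm/cde-small-v1",    # HF model id
    revision=None,              # a real entry pins a commit hash
    release_date="2024-10-01",  # approximate; check the HF repo
    languages=["eng-Latn"],
    # ... plus the remaining fields (license, dimensions, context length, ...)
    # required by the ModelMeta definition in the installed mteb version.
)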

Contributor Author:

@jxmorris12 Awesome, I look forward to working with you in the new year!

Contributor Author:

Hi @jxmorris12,

Happy New Year! I hope you’re doing well. I wanted to follow up and see if you’ve had a chance to take a deeper look at my implementation of your model. I’d greatly appreciate any feedback or suggestions for improvement to ensure we can properly integrate CDE into MTEB.

Looking forward to hearing your thoughts—thank you in advance!

jxmorris12 · Jan 13, 2025:

Hi. I'm currently in the process of uploading cde-small-v2, which should happen this week. Once that is finished we can update this PR since it should be easier to use. Should be available within a few days.

jxmorris12:

Hi @YashDThapliyal – can you (1) update your code to use cde-small-v2 (https://huggingface.co/jxm/cde-small-v2) and (2) update the code to actually grab contextual documents from each corpus?

I've actually done the work for you of figuring out how to get documents from each dataset type, so you can essentially copy the approach in the CDE repo: https://github.com/jxmorris12/cde/blob/main/evaluate_mteb.py

Let us know once you've done that and we can all look over the results.
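(To make the request concrete: a rough sketch of what sampling contextual documents from a retrieval-style corpus could look like. The helper name, the corpus dict shape, and the 512-document sample size are illustrative assumptions; the actual per-task-type logic lives in the evaluate_mteb.py linked above:)

import random

def sample_context_docs(corpus: dict[str, dict[str, str]], k: int = 512) -> list[str]:
    # Hypothetical helper: draw k documents from a BEIR-style corpus
    # ({doc_id: {"title": ..., "text": ...}}) to build the first-stage
    # dataset embeddings before running the task.
    doc_ids = random.sample(sorted(corpus), min(k, len(corpus)))
    return [
        (corpus[i].get("title", "") + " " + corpus[i].get("text", "")).strip()
        for i in doc_ids
    ]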

Contributor Author:

@jxmorris12 sure, I will get on that

"jxm/cde-small-v1",
trust_remote_code=True,
model_prompts={"query": "search_query: ", "passage": "search_document: "},
)
tasks = mteb.get_tasks(
    tasks=[
        # classification
        "AmazonCounterfactualClassification",
        # clustering
        "RedditClustering",
        # pair classification
        "TwitterSemEval2015",
        # reranking
        "AskUbuntuDupQuestions",
        # retrieval
        "SCIDOCS",
        # sts
        "STS22",
        # summarization
        "SummEval",
    ]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder="results",
    extra_kwargs={"batch_size": 8},
    overwrite_results=True,
)