fix: Add datasets in CodeRAG-Bench #1595

Open · wants to merge 8 commits into main

Conversation

@hepengfe commented Dec 15, 2024

This PR addresses #1151.
It is currently blocked: the dataset cannot be downloaded, as reported in code-rag-bench/code-rag-bench#5.

Update on 1/3/2025: the dataset server has recovered.

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command; see the sketch after this checklist.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models score close to perfect) nor random (both models score close to random).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
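
As referenced in the checklist above, here is a minimal sketch of the baseline runs using mteb's Python API (equivalent to the mteb -m {model_name} -t {task_name} CLI command). The task name is a placeholder; substitute the task names actually added in this PR.

```python
import mteb

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = mteb.get_model(model_name)
    # "CodeRAGProgrammingSolutions" is a placeholder task name.
    tasks = mteb.get_tasks(tasks=["CodeRAGProgrammingSolutions"])
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder="results")
```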

@hepengfe changed the title add three out of four datasets in CodeRAG-Bench → add datasets in CodeRAG-Bench Dec 15, 2024
@hepengfe (Author) commented Jan 6, 2025

@KennethEnevoldsen @isaac-chung Hi, could I get approval for the workflow to run?

@hepengfe marked this pull request as ready for review January 6, 2025 07:22
@isaac-chung (Collaborator) left a comment

Thanks! Good first try. Added a few suggestions. And since the dataset_transform and split_by_first_newline are mostly repeated, let's put these classes in the same file. That way the functions can be written once and reused.

Review threads: mteb/abstasks/AbsTaskRetrieval.py (outdated, resolved); mteb/languages.py (outdated, resolved)
@hepengfe (Author) commented Jan 7, 2025

> Thanks! Good first try. Added a few suggestions. And since the dataset_transform and split_by_first_newline are mostly repeated, let's put these classes in the same file. That way the functions can be written once and reused.

I noticed that mteb/tasks/Retrieval/code/CodeSearchNetCCRetrieval.py and mteb/tasks/Retrieval/code/COIRCodeSearchNetRetrieval.py have the same function _load_code_search_code_retrieval. Also, any suggestions on where to put such helper functions? Should I just create a new file such as mteb/task_helper_function.py and write them there?

@isaac-chung (Collaborator) commented

> I noticed that mteb/tasks/Retrieval/code/CodeSearchNetCCRetrieval.py and mteb/tasks/Retrieval/code/COIRCodeSearchNetRetrieval.py have the same function _load_code_search_code_retrieval. Also, any suggestions on where to put such helper functions? Should I just create a new file such as mteb/task_helper_function.py and write them there?

That can be one option. I'd prefer to limit this PR to CodeRAG-Bench and avoid refactoring other files.
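
For illustration only, such a shared helper could live in a small module that both task files import, e.g. a hypothetical mteb/tasks/Retrieval/code/_shared.py. The module name and the exact behavior of split_by_first_newline are assumptions here, not what was merged.

```python
# Hypothetical shared module (name and behavior are assumptions for
# illustration): define the duplicated helper once and import it from both
# CodeSearchNetCCRetrieval.py and COIRCodeSearchNetRetrieval.py.
def split_by_first_newline(text: str) -> tuple[str, str]:
    """Split text into (first_line, remainder) at the first newline."""
    first, _, rest = text.partition("\n")
    return first, rest
```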

@hepengfe (Author) commented Jan 9, 2025

I also noticed that the evaluation scores for nauc_ndcg_at_*_std are negative. Is that expected? If not, any pointers on how to resolve it?

@isaac-chung (Collaborator) commented Jan 14, 2025

> I also noticed that the evaluation scores for nauc_ndcg_at_*_std are negative. Is that expected? If not, any pointers on how to resolve it?

I've not paid attention to that before. Maybe @Samoed or @orionw might? Could be related to these being reranking tasks?

@Samoed (Collaborator) commented Jan 14, 2025

This is common on many tasks, e.g. the voyage results on MIRACL.
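
As an editorial aside (this is not mteb's actual nAUC implementation, which is not shown in this thread), here is a toy sketch of why a normalized association statistic can legitimately be negative: it happens whenever the per-query metric falls as the stratifying variable rises.

```python
import numpy as np

# Toy data: ndcg decreases as a stratifying variable ("difficulty") increases.
rng = np.random.default_rng(0)
difficulty = np.linspace(0.0, 1.0, 100)
ndcg = np.clip(0.8 - 0.5 * difficulty + rng.normal(0.0, 0.05, 100), 0.0, 1.0)

# A normalized association statistic in [-1, 1] goes negative here because
# the metric is anti-correlated with the stratifying variable.
print(np.corrcoef(difficulty, ndcg)[0, 1])  # roughly -0.9
```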

@isaac-chung changed the title add datasets in CodeRAG-Bench → dix: Add datasets in CodeRAG-Bench Jan 15, 2025
@isaac-chung changed the title dix: Add datasets in CodeRAG-Bench → fix: Add datasets in CodeRAG-Bench Jan 15, 2025
@isaac-chung (Collaborator) left a comment

Nice. I see two main items from the docs before we merge:

  1. Add dataset metrics: "Add metadata to the task (run task.calculate_metadata_metrics())"
  2. Add a benchmark entry in https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks/benchmarks.py to reference these datasets.

That will complete the PR. Let us know if you have any questions. Thanks again for iterating!
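
For item 1, a usage sketch (the task name is a placeholder for the tasks added in this PR):

```python
import mteb

# Item 1: compute descriptive statistics for each new task.
# "CodeRAGProgrammingSolutions" is a placeholder task name.
task = mteb.get_task("CodeRAGProgrammingSolutions")
task.calculate_metadata_metrics()
```

For item 2, a hedged sketch of what the entry in mteb/benchmarks/benchmarks.py might look like; the benchmark name, task list, and field values are placeholders modeled on existing entries in that file, not the entry that was merged.

```python
from mteb.benchmarks.benchmarks import Benchmark
from mteb.overview import get_tasks

CODE_RAG_BENCH = Benchmark(
    name="CodeRAG-Bench",  # placeholder benchmark name
    tasks=get_tasks(tasks=["CodeRAGProgrammingSolutions"]),  # placeholder task list
    description="Retrieval tasks from CodeRAG-Bench.",
    reference="https://github.com/code-rag-bench/code-rag-bench",
    citation=None,
)
```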

@hepengfe (Author) commented Jan 25, 2025

Any way to run task.calculate_metadata_metrics() in a low-memory setting? I am trying it out, but stackoverflow-posts is too large (2.65 GB) and it results in OOM even with 128 GB of RAM. @Samoed @isaac-chung

@isaac-chung (Collaborator) commented Jan 25, 2025

@hepengfe hmm, I don't have a good suggestion. In light of that, I'd say adding a benchmark entry (the more important item) plus whatever dataset metrics you already have should complete the PR. That way we can look at the descriptive stats issue separately.
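
As an editorial aside, one possible low-memory workaround (not something the maintainers prescribed here): stream the oversized corpus with Hugging Face datasets so it never fully materializes in RAM, and accumulate descriptive statistics incrementally. The dataset path and the "text" field name below are assumptions.

```python
from datasets import load_dataset

# Stream instead of loading the full 2.65 GB file into memory.
# Dataset path and field name are illustrative assumptions.
ds = load_dataset("code-rag-bench/stackoverflow-posts", split="train", streaming=True)

n_docs, total_chars = 0, 0
for row in ds:
    n_docs += 1
    total_chars += len(row["text"])  # assumed field name

print(n_docs, total_chars / max(n_docs, 1))  # document count and mean length
```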
