fix: Add datasets in CodeRAG-Bench #1595

Open · wants to merge 8 commits into main

Conversation

@hepengfe commented Dec 15, 2024

This PR addresses #1151.
It is currently blocked: the dataset cannot be downloaded, as reported in code-rag-bench/code-rag-bench#5.

Update on 1/3/2025: the dataset server has recovered.

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command; see the sketch after this checklist.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models score close to perfect) nor random (both models score close to random).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform().
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
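
As referenced in the checklist above, here is a minimal sketch of the baseline runs using mteb's Python API (equivalent to the mteb -m {model_name} -t {task_name} CLI command). The task name is a placeholder; substitute the task names actually added in this PR.

```python
import mteb

for model_name in [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]:
    model = mteb.get_model(model_name)
    # "CodeRAGProgrammingSolutions" is a placeholder task name.
    tasks = mteb.get_tasks(tasks=["CodeRAGProgrammingSolutions"])
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder="results")
```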

@hepengfe changed the title add three out of four datasets in CodeRAG-Bench → add datasets in CodeRAG-Bench Dec 15, 2024
@hepengfe (Author) commented Jan 6, 2025

@KennethEnevoldsen @isaac-chung Hi, could I get approval for the workflow to run?

@hepengfe marked this pull request as ready for review January 6, 2025 07:22
@isaac-chung (Collaborator) left a comment

Thanks! Good first try. Added a few suggestions. And since the dataset_transform and split_by_first_newline are mostly repeated, let's put these classes in the same file. That way the functions can be written once and reused.

Review threads: mteb/abstasks/AbsTaskRetrieval.py (outdated, resolved); mteb/languages.py (outdated, resolved)
@hepengfe (Author) commented Jan 7, 2025

> Thanks! Good first try. Added a few suggestions. And since the dataset_transform and split_by_first_newline are mostly repeated, let's put these classes in the same file. That way the functions can be written once and reused.

I noticed that mteb/tasks/Retrieval/code/CodeSearchNetCCRetrieval.py and mteb/tasks/Retrieval/code/COIRCodeSearchNetRetrieval.py have the same function _load_code_search_code_retrieval. Also, any suggestions on where to put such helper functions? Should I just create a new file such as mteb/task_helper_function.py and write them there?

@isaac-chung (Collaborator) commented

> I noticed that mteb/tasks/Retrieval/code/CodeSearchNetCCRetrieval.py and mteb/tasks/Retrieval/code/COIRCodeSearchNetRetrieval.py have the same function _load_code_search_code_retrieval. Also, any suggestions on where to put such helper functions? Should I just create a new file such as mteb/task_helper_function.py and write them there?

That can be one option. I'd prefer to limit this PR to CodeRAG-Bench and avoid refactoring other files.
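
For illustration only, such a shared helper could live in a small module that both task files import, e.g. a hypothetical mteb/tasks/Retrieval/code/_shared.py. The module name and the exact behavior of split_by_first_newline are assumptions here, not what was merged.

```python
# Hypothetical shared module (name and behavior are assumptions for
# illustration): define the duplicated helper once and import it from both
# CodeSearchNetCCRetrieval.py and COIRCodeSearchNetRetrieval.py.
def split_by_first_newline(text: str) -> tuple[str, str]:
    """Split text into (first_line, remainder) at the first newline."""
    first, _, rest = text.partition("\n")
    return first, rest
```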

@hepengfe (Author) commented Jan 9, 2025

I also noticed that the evaluation scores for nauc_ndcg_at_*_std are negative. Is that expected? If not, any pointers on how to resolve it?

@isaac-chung (Collaborator) commented Jan 14, 2025

> I also noticed that the evaluation scores for nauc_ndcg_at_*_std are negative. Is that expected? If not, any pointers on how to resolve it?

I've not paid attention to that before. Maybe @Samoed or @orionw might? Could be related to these being reranking tasks?

@Samoed (Collaborator) commented Jan 14, 2025

This is common on many tasks, e.g. the voyage results on MIRACL.
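
As an editorial aside (this is not mteb's actual nAUC implementation, which is not shown in this thread), here is a toy sketch of why a normalized association statistic can legitimately be negative: it happens whenever the per-query metric falls as the stratifying variable rises.

```python
import numpy as np

# Toy data: ndcg decreases as a stratifying variable ("difficulty") increases.
rng = np.random.default_rng(0)
difficulty = np.linspace(0.0, 1.0, 100)
ndcg = np.clip(0.8 - 0.5 * difficulty + rng.normal(0.0, 0.05, 100), 0.0, 1.0)

# A normalized association statistic in [-1, 1] goes negative here because
# the metric is anti-correlated with the stratifying variable.
print(np.corrcoef(difficulty, ndcg)[0, 1])  # roughly -0.9
```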

@isaac-chung changed the title add datasets in CodeRAG-Bench → dix: Add datasets in CodeRAG-Bench Jan 15, 2025
@isaac-chung changed the title dix: Add datasets in CodeRAG-Bench → fix: Add datasets in CodeRAG-Bench Jan 15, 2025
@isaac-chung (Collaborator) left a comment

Nice. I see two main items from the docs before we merge:

  1. Add dataset metrics: "Add metadata to the task (run task.calculate_metadata_metrics())"
  2. Add a benchmark entry in https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks/benchmarks.py to reference these datasets.

That will complete the PR. Let us know if you have any questions. Thanks again for iterating!
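
For item 1, a usage sketch (the task name is a placeholder for the tasks added in this PR):

```python
import mteb

# Item 1: compute descriptive statistics for each new task.
# "CodeRAGProgrammingSolutions" is a placeholder task name.
task = mteb.get_task("CodeRAGProgrammingSolutions")
task.calculate_metadata_metrics()
```

For item 2, a hedged sketch of what the entry in mteb/benchmarks/benchmarks.py might look like; the benchmark name, task list, and field values are placeholders modeled on existing entries in that file, not the entry that was merged.

```python
from mteb.benchmarks.benchmarks import Benchmark
from mteb.overview import get_tasks

CODE_RAG_BENCH = Benchmark(
    name="CodeRAG-Bench",  # placeholder benchmark name
    tasks=get_tasks(tasks=["CodeRAGProgrammingSolutions"]),  # placeholder task list
    description="Retrieval tasks from CodeRAG-Bench.",
    reference="https://github.com/code-rag-bench/code-rag-bench",
    citation=None,
)
```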

@hepengfe (Author) commented Jan 25, 2025

Any way to run task.calculate_metadata_metrics() in a low-memory setting? I am trying it out, but stackoverflow-posts is too large (2.65 GB) and it results in OOM even with 128 GB of RAM. @Samoed @isaac-chung

@isaac-chung (Collaborator) commented Jan 25, 2025

@hepengfe hmm, I don't have a good suggestion. In light of that, I'd say adding a benchmark entry (the more important item) plus whatever dataset metrics you already have should complete the PR. That way we can look at the descriptive stats issue separately.
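
As an editorial aside, one possible low-memory workaround (not something the maintainers prescribed here): stream the oversized corpus with Hugging Face datasets so it never fully materializes in RAM, and accumulate descriptive statistics incrementally. The dataset path and the "text" field name below are assumptions.

```python
from datasets import load_dataset

# Stream instead of loading the full 2.65 GB file into memory.
# Dataset path and field name are illustrative assumptions.
ds = load_dataset("code-rag-bench/stackoverflow-posts", split="train", streaming=True)

n_docs, total_chars = 0, 0
for row in ds:
    n_docs += 1
    total_chars += len(row["text"])  # assumed field name

print(n_docs, total_chars / max(n_docs, 1))  # document count and mean length
```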
