Commit

fix translation error (#562)
* 1. refactor doc for RecipeGallery;
2. improve the doc for developer guide
3. some typo fix, and suitable overview fig size;
4. add link to the added data resplit tool

* add use cases for DJ related competitions

* in use case, add agentscope

* remove []

* unify commas

* fix TOC rendering error

* fix spaces and en version

* fix bad link

* suitable overview fig size in homepage

* fix translation error
yxdyc authored Jan 22, 2025
1 parent dbf880c commit 7ca6ba6
Showing 2 changed files with 16 additions and 16 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -151,12 +151,12 @@ Table of Contents
- [Data Recipe Gallery](docs/RecipeGallery.md)
- Data-Juicer Minimal Example Recipe
- Reproducing Open Source Text Datasets
- - Improving Open Source Text Pre-training Datasets
- - Improving Open Source Text Post-processing Datasets
+ - Improving Open Source Pre-training Text Datasets
+ - Improving Open Source Post-tuning Text Datasets
- Synthetic Contrastive Learning Image-text Datasets
- Improving Open Source Image-text Datasets
- Basic Example Recipes for Video Data
- - Synthesizing Human-centered Video Evaluation Sets
+ - Synthesizing Human-centric Video Benchmarks
- Improving Existing Open Source Video Datasets
- Data-Juicer related Competitions
- [Better Synth](https://tianchi.aliyun.com/competition/entrance/532251), explore the impact of large model synthetic data on image understanding ability with DJ-Sandbox Lab and multimodal large models
26 changes: 13 additions & 13 deletions docs/RecipeGallery.md
@@ -6,14 +6,14 @@
Table of Contents
- [1. Data-Juicer Minimal Example Recipe](#1-data-juicer-minimal-example-recipe)
- [2. Reproduce Open Source Text Datasets](#2-reproduce-open-source-text-datasets)
- - [3. Improved Open Source Text Pre-training Datasets](#3-improved-open-source-text-pre-training-datasets)
- - [4. Improved open source text post-processing dataset](#4-improved-open-source-text-post-processing-dataset)
- - [5. Synthetic contrastive learning image and text datasets](#5-synthetic-contrastive-learning-image-and-text-datasets)
- - [6. Improved open source image and text datasets](#6-improved-open-source-image-and-text-datasets)
+ - [3. Improved Open Source Pre-training Text Datasets](#3-improved-open-source-pre-training-text-datasets)
+ - [4. Improved Open Source Post-tuning Text Dataset](#4-improved-open-source-post-tuning-text-dataset)
+ - [5. Synthetic Contrastive Learning Image-text datasets](#5-synthetic-contrastive-learning-image-text-datasets)
+ - [6. Improved Open Source Image-text datasets](#6-improved-open-source-image-text-datasets)
- [6.1. Evaluation and Verification](#61-evaluation-and-verification)
- [7. Basic Example Recipes for Video Data](#7-basic-example-recipes-for-video-data)
- - [8. Synthesize a human-centric video review set](#8-synthesize-a-human-centric-video-review-set)
- - [9. Improve existing open source video datasets](#9-improve-existing-open-source-video-datasets)
+ - [8. Synthesize Human-centric Video Benchmarks](#8-synthesize-human-centric-video-benchmarks)
+ - [9. Improve Existing Open Source Video Datasets](#9-improve-existing-open-source-video-datasets)
- [9.1. Evaluation and Verification](#91-evaluation-and-verification)


@@ -24,7 +24,7 @@ Some basic configuration files are placed in the [Demo](../configs/demo/) folder
- We reproduced the processing flow of part of the Redpajama dataset. Please refer to the [reproduced_redpajama](../configs/reproduced_redpajama) folder for detailed description.
- We reproduced the processing flow of part of the BLOOM dataset. Please refer to the [reproduced_bloom](../configs/reproduced_bloom) folder for detailed description.

- ## 3. Improved Open Source Text Pre-training Datasets
+ ## 3. Improved Open Source Pre-training Text Datasets

We found that there are still some "bad" data samples in the existing processed datasets (such as Redpajama, The Pile, etc.). So we use our Data-Juicer to refine these datasets and try to feed them to LLM to get better performance.

@@ -53,18 +53,18 @@ We use a simple 3-σ rule to set the hyperparameters of the operators in each da
| USPTO | 5,883,024 | 4,516,283 | 76.77% | [pile-uspto-refine.yaml](../configs/data_juicer_recipes/pile-uspto-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/pretraining/the-pile-uspto-refine-result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/the-pile-uspto-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/the-pile-uspto-refined-by-data-juicer) | The Pile |


- ## 4. Improved open source text post-processing dataset
+ ## 4. Improved Open Source Post-tuning Text Dataset
Take the Alpaca-CoT dataset as an example:

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|-------------------|:------------------------:|:----------------------------------:|:---------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------|
| Alpaca-Cot EN | 136,219,879 | 72,855,345 | 54.48% | [alpaca-cot-en-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-en-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-en-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-en-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-en-refined-by-data-juicer) | [39 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |
| Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | [alpaca-cot-zh-refine.yaml](../configs/data_juicer_recipes/alpaca_cot/alpaca-cot-zh-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/LLM_data/our_refined_datasets/CFT/alpaca-cot-zh-refine_result.jsonl) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/alpaca-cot-zh-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/alpaca-cot-zh-refined-by-data-juicer) | [28 subsets from Alpaca-CoT](../configs/data_juicer_recipes/alpaca_cot/README.md) |

- ## 5. Synthetic contrastive learning image and text datasets
+ ## 5. Synthetic Contrastive Learning Image-text datasets
Data-Juicer has built-in rich operators to support image multimodal data synthesis, such as the Img-Diff dataset. This synthetic data brings a 12-point performance improvement on the MMVP benchmark. For more details, see the Img-Diff [paper](https://arxiv.org/abs/2408.04594), and the corresponding recipe implementation can refer to [ImgDiff-Dev](https://github.com/modelscope/data-juicer/tree/ImgDiff).

- ## 6. Improved open source image and text datasets
+ ## 6. Improved Open Source Image-text datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
@@ -102,10 +102,10 @@ We provide users with a video dataset processing recipe sample to help better us
- Text-Video: Improve the dataset quality based on the alignment between text and video
Users can start their video dataset processing workflow based on this recipe.

- ## 8. Synthesize a human-centric video review set
- Data-Juicer can also support video review set synthesis, such as [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench).
+ ## 8. Synthesize Human-centric Video Benchmarks
+ Data-Juicer can also support video benchmark synthesis, such as [HumanVBench](https://arxiv.org/abs/2412.17574), which converts in-the-wild videos into human-centric video benchmarks. The corresponding data recipes and construction process can be found in [HumanVBench-dev](https://github.com/modelscope/data-juicer/tree/HumanVBench).

- ## 9. Improve existing open source video datasets
+ ## 9. Improve Existing Open Source Video Datasets

| Data subset | Number of samples before improvement | Number of samples after improvement | Sample retention rate | Configuration link | Data link | Source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|