[LLaVA-OV] Support LLaVA-OneVision eval loss and data filtering #305

Caozhou1995 · 2025-01-03T04:29:39Z

This PR adds three important features:

In eval mode, the loss for each sample is output so that data filtering can be performed afterwards. To enable this feature, set skip_train: True; all other settings are the same as in train mode.
Data filtering is provided and supports custom filtering rules. It reads each sample's loss from the log and stores the IDs of the samples to retain in a JSON file. Usage: python filter_to_json.py --input_dir outputs --output sample_ids.json
When processing data, a filter_json input is added; any sample whose ID is not in this JSON file is skipped. Usage: bash make_llava_ov_wds.sh $DATA_PATH $EXPNAME_PATH $HOSTFILE $FILTER_JSON
If you do not want to build the data from scratch and only want to filter already-built tar files according to filter_json, use: python filter_by_json.py --input_dir raw_dir --output_dir output_dir --json_file filter_json. You can use mpirun to launch this on multiple nodes.
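
The filtering flow described above can be illustrated with a minimal sketch. This is not the code in filter_to_json.py. It assumes the per-sample losses have already been parsed from the log into a dict, uses a simple loss threshold as a stand-in for a custom rule, and assumes the JSON file is a flat list of sample IDs. The names select_sample_ids, write_filter_json, and should_skip are hypothetical.

```python
import json
from pathlib import Path


def select_sample_ids(losses, max_loss):
    """Return the sorted IDs whose loss is at most max_loss.

    The threshold is only a stand-in for a custom filtering rule;
    the rules actually supported by the PR's script may differ.
    """
    return sorted(sid for sid, loss in losses.items() if loss <= max_loss)


def write_filter_json(sample_ids, path):
    """Store the retained sample IDs as a JSON list (assumed layout)."""
    Path(path).write_text(json.dumps(list(sample_ids), indent=2), encoding="utf-8")


def should_skip(sample_id, keep_ids):
    """A sample is skipped when its ID is not among the retained IDs."""
    return sample_id not in keep_ids


# Hypothetical per-sample losses, as if already extracted from the eval log.
example_losses = {"sample_a": 0.4, "sample_b": 3.2, "sample_c": 1.1}
kept = select_sample_ids(example_losses, max_loss=2.0)
print(kept)
```

In this sketch, should_skip("sample_b", set(kept)) is True, which mirrors how a sample missing from filter_json would be dropped during data processing.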

Caozhou1995 requested a review from a team as a code owner January 3, 2025 04:29

llava ov eval and filter

3bff380

Caozhou1995 force-pushed the eval_llava_ov branch from 35afa95 to 3bff380 Compare January 6, 2025 12:22
