[LLaVA-OV] Support LLaVA-OneVision eval loss and data filtering #305

Open
wants to merge 1 commit into
base: main
from

Caozhou1995
Collaborator

@Caozhou1995 Caozhou1995 commented Jan 3, 2025

This PR adds the following features:

  • In eval mode, the loss for each sample is output so that data filtering can be performed afterwards. To enable this feature, set skip_train: True; all other settings are the same as in train mode.
  • Data filtering is provided with support for custom filtering rules. The script reads each sample's loss from the log and stores the IDs of the samples to retain in a JSON file. Usage: python filter_to_json.py --input_dir outputs --output sample_ids.json
  • During data processing, a filter_json input is added; any sample whose ID is not in this JSON file is skipped. Usage: bash make_llava_ov_wds.sh $DATA_PATH $EXPNAME_PATH $HOSTFILE $FILTER_JSON
  • If you do not want to rebuild the dataset from scratch and only want to filter already-built tar files according to filter_json, use: python filter_by_json.py --input_dir raw_dir --output_dir output_dir --json_file filter_json. You can use mpirun to launch multiple nodes for processing.
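
For reviewers who want a picture of the flow described above, here is a minimal sketch of loss-based filtering. It is not the code in this PR: the log line format (`sample_id=... loss=...`), the function names, and the threshold rule are all assumptions made for illustration, and the real scripts may parse and store things differently.

```python
import json
import re

# ASSUMPTION: a hypothetical log line format such as
#   "... sample_id=abc123 loss=0.4213"
# The actual format written in eval mode is not shown in this PR description.
LOSS_LINE = re.compile(r"sample_id=(?P<id>\S+)\s+loss=(?P<loss>[-+0-9.eE]+)")


def read_losses(lines):
    """Collect {sample_id: loss} from an iterable of log lines."""
    losses = {}
    for line in lines:
        match = LOSS_LINE.search(line)
        if match:
            losses[match.group("id")] = float(match.group("loss"))
    return losses


def select_ids(losses, keep):
    """Return the sorted IDs whose loss satisfies the custom rule `keep`."""
    return sorted(sid for sid, loss in losses.items() if keep(loss))


def filter_samples(samples, kept_ids):
    """Keep only samples whose 'id' field appears in kept_ids."""
    kept = set(kept_ids)
    return [sample for sample in samples if sample["id"] in kept]


if __name__ == "__main__":
    log_lines = [
        "iter 1 sample_id=a loss=0.5",
        "iter 1 sample_id=b loss=3.2",
        "unrelated log line",
    ]
    losses = read_losses(log_lines)
    # Hypothetical rule: retain samples with loss below 1.0.
    ids = select_ids(losses, lambda loss: loss < 1.0)
    print(json.dumps(ids))
    print(filter_samples([{"id": "a"}, {"id": "b"}], ids))
```

Passing the rule as a callable keeps the selection logic separate from log parsing, which is one way to support the "custom filtering rules" mentioned above.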

@Caozhou1995 Caozhou1995 requested a review from a team as a code owner January 3, 2025 04:29