Release v1.1.0

@BeachWang BeachWang released this 17 Jan 09:46
· 7 commits to main since this release
030e786

Major Updates

  • 🧪 Users can now run Ray-based distributed data processing by following the newly added docs. #523
  • 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
  • 💥 Switch Ray deduplication from Task mode to Actor mode, so users can run these operators without installing Redis. #526
  • 🚀 Append a summary of warnings and errors to the log at the end of each run, so that they remain noticeable under the sample-level fault tolerance mechanism. #534
  • 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
  • 🛝 Add usability tags for OPs:
    • alpha tag for OPs in which only the basic OP implementations are finished;
    • beta tag for OPs in which unit tests are added based on the alpha version;
    • stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.
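
The end-of-run summary of warnings and errors mentioned above can be pictured with a small sketch. It uses Python's standard `logging` module and hypothetical names (`SummaryHandler`, `run_with_summary`); it is an illustration of the idea under those assumptions, not Data-Juicer's actual implementation.

```python
import logging
from collections import Counter


class SummaryHandler(logging.Handler):
    """Hypothetical handler: collects WARNING and ERROR records for a final summary."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.counts = Counter()
        self.messages = []

    def emit(self, record):
        self.counts[record.levelname] += 1
        self.messages.append(f"{record.levelname}: {record.getMessage()}")

    def summary(self):
        # Per-level counts first, then the individual messages in order.
        header = [f"{level}: {n}" for level, n in sorted(self.counts.items())]
        return "\n".join(header + self.messages)


def run_with_summary(samples, op, logger):
    """Apply `op` to each sample, skipping failures instead of aborting the run."""
    handler = SummaryHandler()
    logger.addHandler(handler)
    results = []
    try:
        for sample in samples:
            try:
                results.append(op(sample))
            except Exception as exc:
                # Fault tolerance: log the failure and continue with the next sample.
                logger.error("skipped sample %r: %s", sample, exc)
    finally:
        logger.removeHandler(handler)
    return results, handler
```

Calling `handler.summary()` after the run gives one block of text with every skipped sample, which is easier to spot than errors scattered through a long log.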

New OPs

  • image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
  • mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
  • sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
  • sentence_augmentation_mapper: Augment sentences using LLMs. #550
  • text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550
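
To illustrate what filtering by text-pair similarity means, here is a minimal sketch. It swaps in a bag-of-words cosine similarity as a stand-in for whatever model the real OP uses, and the field names (`text`, `text_pair`) and parameters (`min_score`, `max_score`) are assumptions made for this example rather than the OP's documented interface.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two texts using word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va)
    sq_a = sum(v * v for v in va.values())
    sq_b = sum(v * v for v in vb.values())
    if sq_a == 0 or sq_b == 0:
        return 0.0
    # Clamp to guard against floating-point results slightly above 1.0.
    return min(1.0, dot / math.sqrt(sq_a * sq_b))


def filter_pairs(samples, min_score=0.5, max_score=1.0):
    """Keep samples whose pair similarity falls inside [min_score, max_score]."""
    return [
        s for s in samples
        if min_score <= cosine_similarity(s["text"], s["text_pair"]) <= max_score
    ]
```

For example, a sample whose two texts are identical is kept with the default thresholds, while a sample whose texts share no words scores 0.0 and is dropped.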

Bug Fixes

  • Add a global skip_op_error param that enables fault tolerance when running the Data-Juicer analyzer and executor, while keeping fault tolerance disabled for unit tests. #528
  • Fix model force download bug. #529
  • Fix an IndexError raised when saving a dataset to disk if the result dataset has fewer samples than the number of workers. #536
  • Fix the missing meta field tag in Ray mode. #538
  • Update max_tokens or max_new_tokens for vLLM-based OPs to avoid overly short generations. #544
  • Fix bug in the role playing data generation demo. #545
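
The IndexError fix above concerns datasets with fewer samples than workers. One common way to avoid this class of error is to cap the worker count at the number of samples before sharding. The sketch below shows that idea with hypothetical helper names; it is not taken from the actual patch in #536.

```python
def effective_num_workers(num_samples, num_workers):
    """Never use more workers than samples, and always use at least one."""
    return max(1, min(num_workers, num_samples))


def split_into_shards(samples, num_workers):
    """Split samples round-robin into one shard per effective worker."""
    n = effective_num_workers(len(samples), num_workers)
    return [samples[i::n] for i in range(n)]
```

With three samples and eight requested workers, this produces three single-sample shards instead of trying to fill eight.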

Others

  • Enhance unit test for API calling OPs. #528
  • Remove sandbox requirements installation from Dockerfile. #530
  • Update the datasource related APIs to be compatible with the latest version of Ray. #532
  • Limit the number of QA pairs generated for each text in generate_qa_from_text_mapper. #541
  • Update docs for preparing DJ2.0 release. #542
  • Switch the architecture figure to a faster CDN link. #543
  • Add a video demo for role playing data generation. #545
  • Optimize the OP docs for global textual search. #552
  • Use a faster and more stable translator than Google Translate for automatic OP doc building. #554

Acknowledgement

  • @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550