mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Anwen Hu, Haiyang Xu†, Liang Zhang, Jiabo Ye, Ming Yan†, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou

† Corresponding Author

Data: MP-DocStruct1M 🤗 MP-DocReason51K 🤗 DocDownstream 2.0 🤗 DocGenome12K 🤗
Models: DocOwl2-stage1 🤗 DocOwl2-stage2 🤗 DocOwl2 🤗

Spotlights

  • Support Multi-page Text Lookup and Multi-page Text Parsing.

  • Support Multi-page Question Answering using simple phrases or detailed explanations with evidence pages.

  • Support Text-rich Video Understanding.

  • Open Source

    • ✅ Training Data: MP-DocStruct1M, MP-DocReason51K, DocDownstream-2.0, DocGenome12K
    • ✅ Model: DocOwl2
    • ✅ Source code of model inference and evaluation.
    • Model: DocOwl2-stage1 and DocOwl2-stage2.
    • Online Demo on ModelScope and HuggingFace.
    • Source code of launching a local demo.
    • Training code.

Training and Evaluation Datasets

Datasets and download links:

  • MP-DocStruct1M
    • HuggingFace: mPLUG/MP-DocStruct1M
    • ModelScope: iic/MP-DocStruct1M
  • DocDownstream-2.0
    • HuggingFace: mPLUG/DocDownstream-2.0
    • ModelScope: iic/DocDownstream-2.0
  • MP-DocReason51K
    • HuggingFace: mPLUG/MP-DocReason51K
    • ModelScope: iic/MP-DocReason51K
  • DocGenome12K
    • HuggingFace: mPLUG/DocGenome12K
    • ModelScope: iic/DocGenome12K

Models

Model Card

  • DocOwl2
    • Download Link:
      • 🤗 mPLUG/DocOwl2
      • ModelScope: iic/DocOwl2
    • Abilities:
      • Multi-page VQA with detailed explanations
      • Multi-page VQA with concise answers

Model Inference

    import torch
    from transformers import AutoTokenizer, AutoModel

    class DocOwlInfer():
        def __init__(self, ckpt_path):
            # Load the slow (non-fast) tokenizer shipped with the checkpoint.
            self.tokenizer = AutoTokenizer.from_pretrained(ckpt_path, use_fast=False)
            # trust_remote_code=True loads the model class defined in the checkpoint repository;
            # the weights are loaded in float16 and placed automatically on the available devices.
            self.model = AutoModel.from_pretrained(ckpt_path, trust_remote_code=True, low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map='auto')
            # Configure image preprocessing before the first call to chat().
            self.model.init_processor(tokenizer=self.tokenizer, basic_image_size=504, crop_anchors='grid_12')

        def inference(self, images, query):
            # Prepend one <|image|> placeholder per page image to the query text.
            messages = [{'role': 'USER', 'content': '<|image|>'*len(images)+query}]
            answer = self.model.chat(messages=messages, images=images, tokenizer=self.tokenizer)
            return answer
    
    
    docowl = DocOwlInfer(ckpt_path='mPLUG/DocOwl2')
    
    images = [
            './examples/docowl2_page0.png',
            './examples/docowl2_page1.png',
            './examples/docowl2_page2.png',
            './examples/docowl2_page3.png',
            './examples/docowl2_page4.png',
            './examples/docowl2_page5.png',
        ]
    
    answer = docowl.inference(images, query='what is this paper about? provide detailed information.')
    
    answer = docowl.inference(images, query='what is the third page about? provide detailed information.')
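
The inference example builds its prompt by repeating the `<|image|>` placeholder once per page and then appending the question. The following is a minimal sketch of that step, plus a way to keep the answers to several questions instead of overwriting one variable. The helpers `build_messages` and `ask_all` are hypothetical names introduced here for illustration and are not part of the DocOwl2 code; `infer` stands for any callable with the same signature as `DocOwlInfer.inference`.

```python
from typing import Callable, Dict, List

# Placeholder token taken from the inference example above.
IMAGE_PLACEHOLDER = "<|image|>"


def build_messages(images: List[str], query: str) -> List[Dict[str, str]]:
    """Build the single-turn message list with one placeholder per page image."""
    if not images:
        raise ValueError("at least one page image is required")
    return [{"role": "USER", "content": IMAGE_PLACEHOLDER * len(images) + query}]


def ask_all(
    infer: Callable[[List[str], str], str],
    images: List[str],
    queries: List[str],
) -> Dict[str, str]:
    """Run every query over the same pages and return the answers keyed by query."""
    return {query: infer(images, query) for query in queries}
```

Under these assumptions, `ask_all(docowl.inference, images, ['what is this paper about?', 'what is the third page about?'])` would return both answers in one dictionary.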

Model Evaluation

Prepare the environment for evaluation as follows:

    pip install textdistance
    pip install editdistance
    pip install pycocoevalcap
    

Evaluate DocOwl2 on 10 single-image tasks, 2 multi-page tasks, and 1 video task:

    python docowl_benchmark_evaluate.py --model_path $MODEL_PATH --dataset $DATASET --downstream_dir $DOWNSTREAM_DIR_PATH --save_dir $SAVE_DIR --split $split
    

Note: For single-image evaluation, choose $DATASET from [DocVQA, InfographicsVQA, WikiTableQuestions, DeepForm, KleisterCharity, TabFact, ChartQA, TextVQA, TextCaps, VisualMRC], set $DOWNSTREAM_DIR_PATH to the local path of mPLUG/DocDownstream-1.0, and set $split to test.

For multi-page and video evaluation, choose $DATASET from [MP-DocVQA, DUDE, NewsVideoQA], set $DOWNSTREAM_DIR_PATH to the local path of mPLUG/DocDownstream-2.0, and set $split to val. You can also set $split to test and submit the output file whose name ends with _submission.json to the official evaluation website.
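
The dataset, downstream directory, and split rules in the two notes above are easy to mix up when evaluating many benchmarks in a row. The sketch below encodes them in one place and builds the argument list for `docowl_benchmark_evaluate.py`. The function `build_eval_command` and its parameter names are hypothetical and introduced only for illustration; the command-line flags are the ones shown in the evaluation command above. The sketch only builds the command and does not run it.

```python
import shlex
from typing import List, Optional

# Dataset groups as listed in the notes above.
SINGLE_IMAGE = [
    "DocVQA", "InfographicsVQA", "WikiTableQuestions", "DeepForm", "KleisterCharity",
    "TabFact", "ChartQA", "TextVQA", "TextCaps", "VisualMRC",
]
MULTI_PAGE_AND_VIDEO = ["MP-DocVQA", "DUDE", "NewsVideoQA"]


def build_eval_command(
    dataset: str,
    model_path: str,
    downstream_v1_dir: str,  # local path of mPLUG/DocDownstream-1.0
    downstream_v2_dir: str,  # local path of mPLUG/DocDownstream-2.0
    save_dir: str,
    split: Optional[str] = None,
) -> List[str]:
    """Return the evaluation command for one dataset as an argument list."""
    if dataset in SINGLE_IMAGE:
        downstream_dir, default_split, allowed = downstream_v1_dir, "test", {"test"}
    elif dataset in MULTI_PAGE_AND_VIDEO:
        downstream_dir, default_split, allowed = downstream_v2_dir, "val", {"val", "test"}
    else:
        raise ValueError(f"unknown dataset: {dataset}")
    split = split or default_split
    if split not in allowed:
        raise ValueError(f"split {split!r} is not valid for {dataset}")
    return [
        "python", "docowl_benchmark_evaluate.py",
        "--model_path", model_path,
        "--dataset", dataset,
        "--downstream_dir", downstream_dir,
        "--save_dir", save_dir,
        "--split", split,
    ]


if __name__ == "__main__":
    # Print one shell-ready command per benchmark; the paths here are placeholders.
    for name in SINGLE_IMAGE + MULTI_PAGE_AND_VIDEO:
        cmd = build_eval_command(name, "mPLUG/DocOwl2", "./DocDownstream-1.0",
                                 "./DocDownstream-2.0", "./eval_results")
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the evaluation script.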