Prompt-Can-Anything

English | 中文

This is a Gradio library and research repository that combines SOTA AI applications. It aims to let you accomplish a wide range of AI tasks: all you need to do is provide a prompt and click once, and the SOTA models do the rest. You don't have to install every feature; install only the ones you want to use.

Motivation

The backend of the “Anything” AI agent is being built up for both engineering and research. This calls for more multi-modal tasks and zero-shot models, not only to provide a multi-modal AI processing web UI, but also to gradually enrich its functionality.

You can accomplish anything through this project! Read on for its development progress and plan. The final, complete intelligent agent, combined with a local GPT repository, is intended to help you call any AI task. Questions, stars, and forks are welcome, and you can also become a developer.

Feature

  1. (YOCO) It is not just a tool that can prompt anything

    🔥 Data Engine:

    YOCO relies on integrated multimodal models and auxiliary generators such as ChatGPT; of course, it is not omnipotent. By combining effective, fully automatic annotation with Stable Diffusion series methods to produce and control data that meets the requirements, we complete the “data engine” and generate customized label formats that facilitate the training of conventional models. Video, audio, and 3D annotations will be introduced in the future.

    🔥 Model Training:

    For each model, we not only use it, but also read its paper, study its fine-tuning methods, and communicate with the original authors to try development work that improves the model and its training. We use fine-tuned large models and the customized label formats generated by YOCO to train conventional models more efficiently.

structure

  1. 🚀 Interactive content creation and visual GPT

Integrate diverse GPT models, mainly through the ChatGPT API, and use the open-source Tsinghua VisualGLM to deploy and fine-tune a local GPT, as well as try to improve the model structure. Through multimodal application tools, we can conduct dialogues and create content.

Simple example (ASR -> LLM -> TTS -> a2f app)

a2f_gpt_demo.mp4
  1. ⭐ 3D && 2D Avatar (coming soon)

Complete a role design interaction through a 3D Engine combined with multimodal tasks such as GPT;

Complete a role design interaction through the Sadtalker open source project and multimodal tasks such as GPT.

  1. 🔥🔥🚀 Unlimited potential “Anything”

Through continuous creativity and accumulation, we will integrate and learn from SOTA AI. We will record each integrated model and provide a detailed explanation and summary in an article. The author will distill all of this AI-related knowledge and engineering experience into the local large model (this is the final feature to be developed and is still planned).

structure

⭐ Research 🚀 Project 🔥 Inspiration (in preparation)
  At the research level, zero-shot contrastive learning is a research trend. We hope to understand the model design details of the projects we apply as fully as possible, so that we can combine text, images, and audio to design a strongly aligned backbone.
  At the project level, TensorRT acceleration of the base models improves efficiency.

🔥 [August: update plan preview, forks welcome]

  • 🔥 Add the gpt_academic repo "crazy functions" and add langchain/agent (coming soon)

  • Fix speech issues and optimize the code logic, add Gilgen

  • 🔥 Integration test of the latest official Tag2text version 2 model in early June, add RAM (Done)

  • One-click fine-tuning button, adding: VisualGLM (Done)

  • Link voice and text processing to GPT, connecting ChatGLM with the a2f app (Done)

⭐[News list]

-【2023/8/7】   Fix LLM bugs (ChatGLM2 and GPT-3.5 loading) and improve the Gradio UI

-【2023/7/21】  Update Tag2Text and RAM with the official repo

-【2023/6/7】   v1.15: add SadTalker submodule, update UI

-【2023/6/6】   v1.15: fix environment installation problems and add supplementary instructions; special models are called independently, so their dependencies do not need to be installed; add one-click fine-tuning of VisualGLM (use with caution, taking machine configuration and video memory into account)

-【2023/6/5】   v1.15: add a video demo and plan, fix ASR bug, ChatGPT with ASR and TTS

-【2023/5/31】  Fix existing issues, add TTS demo; all open features have been tested on the Linux platform

-【2023/5/23】  Add web demo: add VisualGLM and ChatGPT from [Academic-gpt](https://github.com/binary-husky/gpt_academic)

-【2023/5/7】   Add web demo: text generation, detection, and segmentation of images or image folders have been tested and work on the web UI. The program does not need to be restarted, and the last model loading configuration is remembered. Optimization will continue.

-【2023/5/4】   Add semantic segmentation labels, add args (--color-flag, --save-mask)

-【2023/4/26】  YOCO, automatic annotation tools: commit preliminary code. For an input image or folder, you can obtain detection, segmentation, and text annotation results; the ChatGPT API is optional.

Preliminary-Works

  • VisualGLM-6B : Visual ChatGlm(6B)

  • Segment Anything : Strong segmentation model. But it needs prompts (like boxes/points/text) to generate masks.

  • Grounding DINO : Strong zero-shot detector capable of generating high-quality boxes and labels from free-form text.

  • Stable-Diffusion : Amazingly strong text-to-image diffusion model.

  • Tag2text : Efficient and controllable vision-language model which can simultaneously output superior image captioning and image tagging.

  • SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

  • lama : Resolution-robust large mask Inpainting with Fourier Convolutions

  • gpt_academic : LLM tools.

    🛠️ YOCO: Quick Start

First, make sure you have a basic GPU deep learning environment.

(Linux is recommended; Windows may have problems compiling the Grounding DINO deformable-transformer operator, see Grounding DINO)

git clone https://github.com/positive666/Prompt-Can-Anything
cd Prompt-Can-Anything

**Install environment**

Installation of basic environment

pip install -r requirements.txt
or
pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt

Installation of Ground detector (compiling)

cd model_cards
pip install -e .

Installation of Tsinghua VisualGLM (optional; Linux is recommended, and the installation steps will be updated after testing on Windows)

git submodule update --init --recursive
cd VisualGLM_6B && pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt

Install SadTalker (optional )

git clone https://github.com/Winfredy/SadTalker.git
cd SadTalker && pip install -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt

Tips: Create two directories, checkpoints and gfpgan, in the root directory. Download the weights from the official website, extract them, and put them into these two folders.

Installation of LAMA model (optional, not yet released):

This environment has relatively strict Python version requirements; you may need to manually override installed packages with the versions specified in the requirements file below:

pip install -r model_cards/lama/requirements.txt

Installation of diffuser (optional):

pip install --upgrade diffusers[torch]

For anything else, check the requirements and run “pip install <your missing packages>”. If you hit a version issue, look carefully at the versions specified in the requirements.

Linux environment issue:

  1. for pyaudio

Method 1:

pip install may fail on Linux. In that case, go to the pyaudio-wheels page on PyPI, select the wheel matching your Python version, download it, and pip install the .whl file. Detailed instructions will be provided in the future.

Method 2:

sudo apt-get install portaudio19-dev
sudo apt-get install python3-all-dev
pip install pyaudio
  1. QLoRA fine-tuning issue

    pip install bitsandbytes -i https://mirrors.aliyun.com/pypi/simple
    

Windows installation issue

Same as Linux.

As noted above, install any remaining missing packages with “pip install <your missing packages>”, and check the versions in the requirements carefully if you hit a version issue.
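To see which optional dependencies are still missing before launching anything, a small check along these lines may help. It is only a sketch: the list of import names (`pyaudio`, `bitsandbytes`, `diffusers`) is taken from the packages mentioned in this README, and the helper name `missing_packages` is hypothetical, not part of this repository.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of optional dependencies mentioned in this README (assumed list).
    optional = ["pyaudio", "bitsandbytes", "diffusers"]
    for name in missing_packages(optional):
        print(f"missing: {name}")
```

Anything reported as missing can then be installed with pip as described above.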

Run

  1. Download the model weights

    | | name | backbone | Data | Checkpoint | model-config |
    |---|---|---|---|---|---|
    | 1 | Tag2Text-Swin | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Download link | |
    | 2 | Segment-anything | vit | | Download link / Download link / Download link | |
    | 3 | Lama | FFC | | Download link | |
    | 4 | GroundingDINO-T | Swin-T | O365, GoldG, Cap4M | Github link / HF link | link |
    | 5 | GroundingDINO-B | Swin-B | COCO, O365, GoldG, Cap4M, OpenImage, ODinW-35, RefCOCO | Github link / HF link | link |
  2. Configure private settings and parameters in config_private.py. After downloading the models, set each path in the corresponding “MODEL_xxxx_PATH” variable. If you use ChatGPT, configure its proxy and API key. (If other services such as TTS have network problems in the web UI, turn off the VPN first and only turn it on when using ChatGPT.)
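Before starting the web UI, it can be useful to confirm that every configured weight path actually exists. The sketch below assumes only the “MODEL_xxxx_PATH” naming pattern described above; the function name `find_missing_model_paths` and the example variable names are hypothetical and do not come from this repository.

```python
import os


def find_missing_model_paths(settings):
    """Return {name: path} for MODEL_*_PATH entries whose path does not exist."""
    return {
        name: path
        for name, path in settings.items()
        if name.startswith("MODEL_")
        and name.endswith("_PATH")
        and not os.path.exists(str(path))
    }


if __name__ == "__main__":
    # Hypothetical settings for illustration; real names live in config_private.py.
    example = {
        "MODEL_DEMO_PATH": os.getcwd(),
        "MODEL_OTHER_PATH": os.path.join(os.getcwd(), "no_such_weights.pth"),
    }
    for name, path in find_missing_model_paths(example).items():
        print(f"{name} points to a missing path: {path}")
```

Assuming config_private.py is importable from the working directory, the same check could be run on the real settings with `find_missing_model_paths(vars(config_private))`.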

🏃Demo

[Video demo 1 on Baidu cloud](https://pan.baidu.com/s/1AllUjuOVhzJh7abe71iCxg?pwd=c6v6)
[Video demo 2](https://pan.baidu.com/s/1jdP9mgUhyfLh_hz1W3pkeQ?pwd=c6v6)

  1. Auto-label
"--input_prompt": You can manually input a prompt. For example, if you only want to detect the target categories that interest you, you can pass the prompt directly to the grounded detection model, or to the Tag2Text model.
"--color-flag": Use the box tags to distinguish between category and instance segmentation; the category colors for semantic segmentation are assigned from the box tags.
python auto_lable_demo.py --source <data path> --save-txt --save-mask --save-xml --save_caption
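To label several folders in one go, the command above can be assembled programmatically. This is a rough sketch, not part of the repository: the helper names are hypothetical, and it assumes that `--input_prompt` takes a string value and that `--color-flag` is a plain on/off switch, which this README does not state explicitly.

```python
import subprocess
import sys


def build_auto_label_command(source, input_prompt=None, color_flag=False):
    """Build the argument list for one auto-label run on `source`."""
    cmd = [
        sys.executable, "auto_lable_demo.py",
        "--source", str(source),
        "--save-txt", "--save-mask", "--save-xml", "--save_caption",
    ]
    if input_prompt:
        # Assumption: the prompt is passed as a single string value.
        cmd += ["--input_prompt", input_prompt]
    if color_flag:
        # Assumption: --color-flag is a switch that takes no value.
        cmd.append("--color-flag")
    return cmd


def label_folders(folders, input_prompt=None):
    """Run the auto-label script once per folder, stopping on the first failure."""
    for folder in folders:
        subprocess.run(build_auto_label_command(folder, input_prompt), check=True)
```

For example, `label_folders(["data/batch1", "data/batch2"], input_prompt="dog")` would run the script twice from the repository root, once per folder.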

Example:

Multiple tasks are supported.

The default tasks include image understanding, detection, and instance segmentation (methods for image generation and inpainting are being added).

The prompt controls the models' output, for example:

image-20230427093103453

  1. webui(all)
		python app.py

image-20230508075845259

image-20230527022556630

2.1 audio2face with an LLM (Beta)

ASR, TTS, and the LLM are all freely replaceable.

This is a simple example that supports ChatGLM and ChatGPT (you can use any LLM, but you need to adapt it yourself).

Start ASR and TTS with audio2face

You need to install Audio2Face in the Omniverse app, see

https://www.nvidia.cn/omniverse/

Step 1. In Audio2Face, open a demo and choose a Player; the TensorRT engine is built automatically (GTX 10xx GPUs are not supported). The latest version supports Chinese!

Get the model Prim path.

image-20230725122731372

image-20230331372

image-20230725133326397

Step 2. In the web UI, set your Prim path in "Avatar_instance_A" in config_private.py, then click "start system" and "Speech_system".
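The idea that the ASR, LLM, and TTS stages are interchangeable can be sketched as a pipeline of plain callables. Everything below is hypothetical and for illustration only: `SpeechPipeline` and the stand-in stage functions are not names from this repository, and real stages would wrap an actual ASR engine, ChatGLM or ChatGPT, and a TTS engine whose audio is then sent to Audio2Face.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechPipeline:
    """Chain ASR -> LLM -> TTS; each stage is any callable with the right shape."""
    asr: Callable[[bytes], str]   # audio in, transcript out
    llm: Callable[[str], str]     # prompt in, reply text out
    tts: Callable[[str], bytes]   # reply text in, synthesized audio out

    def respond(self, audio: bytes) -> bytes:
        text = self.asr(audio)
        reply = self.llm(text)
        return self.tts(reply)


# Stand-in stages so the sketch runs without any model installed.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"echo: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechPipeline(asr=fake_asr, llm=fake_llm, tts=fake_tts)
print(pipeline.respond(b"hello"))  # b'echo: hello'
```

Swapping a stage only means passing a different callable, for example `SpeechPipeline(asr=fake_asr, llm=str.upper, tts=fake_tts)`.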

🔨To Do List

  • Release demo and code.
  • web ui demo
  • Support ChatGPT/VISUALGLM/ASR/TTS
  • YOCO labeling fine-tuning of VISUALGLM demo[next week]
  • 3D && 2D avatar
  • Complete the planned AI combination “Anything”
  • Fine-tune the segmentation and ground detectors of SAM, and expand the input control of SAM
  • Release training methods
  • Knowledge cloning

💘 Acknowledgements