Source code for a Chinese image captioning method based on deep multimodal semantic fusion. The code runs on both GPU and CPU.
This code is released under the MIT License (refer to the LICENSE file for details).
The model is trained with TensorFlow, a popular Python framework for training deep neural networks. To install TensorFlow, please refer to Installing Tensorflow.
The code is written in Python; you also need to install the following Python dependencies:
- bottle==0.12.13
- ipdb==0.10.3
- matplotlib==2.1.0
- numpy==1.13.3
- Pillow==4.3.0
- scikit-image==0.13.1
- scipy==1.0.0
- jieba==0.38
For convenience, you can also use requirements.txt to install the Python dependencies:
pip install -r requirements.txt
To use the evaluation script: see coco-caption for the requirements.
Though the code can run on a CPU, we highly recommend using a GPU. To force CPU-only execution, run:
export CUDA_VISIBLE_DEVICES=""
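Equivalently, a script can hide the GPU from inside Python (a minimal sketch; the environment variable must be set before TensorFlow is imported, since devices are enumerated at import time):

# Hide all GPUs so TensorFlow falls back to CPU; set this before the import.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import tensorflow as tf  # TensorFlow now sees no GPU devices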
To generate training data for Flickr8k-CN, use the build_flickr8k_data.py script:
python build_flickr8k_data.py
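To illustrate what this preprocessing step involves (the file names and JSON fields below are hypothetical, not the script's actual interface), Chinese captions are segmented into words with jieba, which is among the dependencies:

# Illustrative sketch of caption preprocessing (paths and field names are
# hypothetical): segment each Chinese caption into words with jieba so the
# captioning model can operate on a word-level vocabulary.
import json
import jieba

with open("flickr8kcn_captions.json") as f:   # hypothetical input file
    annotations = json.load(f)

tokenized = []
for entry in annotations:
    words = list(jieba.cut(entry["caption"]))  # Chinese word segmentation
    tokenized.append({"image": entry["image"], "words": words})

with open("flickr8kcn_tokenized.json", "w") as f:
    json.dump(tokenized, f, ensure_ascii=False)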
We use Google Inception V3 as the single-label visual encoding network; see Inception for the instructions.
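For illustration, a pooled Inception V3 feature vector can be extracted as sketched below, assuming a TensorFlow build that ships tf.keras.applications (the repo itself follows the linked Inception instructions instead):

# Sketch of visual encoding with Inception V3: a 2048-d pooled feature
# vector per image. The image file name is a hypothetical example.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("example.jpg", target_size=(299, 299))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = model.predict(x)   # shape (1, 2048)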
To train the keyword prediction network, please run train_keyword.py on a GPU:
CUDA_VISIBLE_DEVICES=0 python train_keyword.py
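At its core, the keyword network is a multi-label classifier over a keyword vocabulary. The sketch below shows the general idea only; the layer sizes, vocabulary size, and loss are assumptions, not the repo's exact architecture:

# Minimal sketch of a keyword-prediction head: multi-label classification
# over a keyword vocabulary with a sigmoid cross-entropy loss, on top of
# 2048-d Inception features. All sizes here are assumptions.
import tensorflow as tf

NUM_KEYWORDS = 1000   # hypothetical keyword vocabulary size

inputs = tf.keras.Input(shape=(2048,))            # Inception V3 feature
hidden = tf.keras.layers.Dense(512, activation="relu")(inputs)
probs = tf.keras.layers.Dense(NUM_KEYWORDS, activation="sigmoid")(hidden)

model = tf.keras.Model(inputs, probs)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(features, keyword_labels, ...)  # keyword_labels are multi-hot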
To train the multimodal caption generation network, use train.py:
python train.py
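As a rough schematic of such a network (a generic sketch under assumed hyperparameters, not the exact fusion architecture of this repo): the image features and predicted keywords are fused and condition an LSTM decoder over caption words.

# Generic sketch of a multimodal caption decoder. All sizes are assumed;
# image features and keyword probabilities are fused into one vector that
# initializes an LSTM emitting a distribution over caption words.
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 10000, 20, 512   # assumed hyperparameters

img_feat = tf.keras.Input(shape=(2048,))          # Inception V3 feature
kw_vec = tf.keras.Input(shape=(1000,))            # keyword probabilities
words = tf.keras.Input(shape=(MAX_LEN,))          # caption word ids

fused = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(
    tf.keras.layers.Concatenate()([img_feat, kw_vec]))   # multimodal fusion

emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(words)
h = tf.keras.layers.LSTM(EMBED_DIM, return_sequences=True)(
    emb, initial_state=[fused, fused])            # condition on fused vector
out = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(h)

model = tf.keras.Model([img_feat, kw_vec, words], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")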
Use server.py to load the trained models, and use client.py to request caption generation:
python server.py
python client.py
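The server/client pair communicates over a simple HTTP protocol. The sketch below shows one plausible shape of it using bottle, which is in requirements.txt (the /caption route and field names are assumptions; see server.py and client.py for the real interface):

# Server side (hypothetical route and response fields):
from bottle import post, request, run

@post("/caption")
def caption():
    image_bytes = request.body.read()          # raw image from the client
    # ... run the loaded captioning model on image_bytes ...
    return {"caption": "a placeholder caption"}  # bottle serializes dicts to JSON

run(host="localhost", port=8080)

# Client side, sending an image and printing the generated caption:
import json
import urllib.request

with open("example.jpg", "rb") as f:           # hypothetical test image
    req = urllib.request.Request("http://localhost:8080/caption", data=f.read())
print(json.loads(urllib.request.urlopen(req).read())["caption"])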
To monitor the training process with TensorBoard:
tensorboard --logdir="MODEL_PATH"