Using AI, machine learning, and NLP to generate memes.
This project is still under active development and should be considered to be in beta. A preliminary official website demonstrating this project in action can be found at aimemes.fun.
The following steps should help you set up your environment:
This repository uses Git Large File Storage (Git LFS), which should be downloaded and installed before cloning our repository. In particular, make sure to run the following command first:
git lfs install
git clone https://github.com/gbotev1/cgmfpim.git
Our code is tested using Python 3.9. The provided requirements.txt file lists everything needed to run any script in this repository; if you plan on using only our pre-computed archives, not all of these packages are necessary. Some scripts require a GPU, for which an appropriate version of CUDA must be installed. You should also make sure to install the FAISS library on your machine. We used the pre-compiled Linux version from Anaconda with CUDA Toolkit 10.2 for our experiments.
pip3 install -U -r requirements.txt
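If you use Anaconda, the GPU build of FAISS can be installed from the pytorch channel, for example as follows (adjust the CUDA Toolkit version to match your setup):

conda install -c pytorch faiss-gpu cudatoolkit=10.2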
Notes:
The pytorch-lightning['extra'] PyPI package does not appear to correctly install the extra dependencies, so we have modified our requirements.txt file to install the FairScale PyPI package manually. To install Horovod properly for your machine, you should follow the official instructions here.
If you would like to run on an NVIDIA A100 GPU, which also requires CUDA 11, you might have to run
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
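As a rough illustration only (consult the official Horovod instructions for the build flags appropriate to your hardware and frameworks), a PyTorch-enabled Horovod build can be requested at install time:

HOROVOD_WITH_PYTORCH=1 pip3 install horovod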
The following sh script is provided for convenience. It extracts meme_data.tsv and meme_data_top.tsv from this file of captions scraped from Imgflip on November 25, 2020, as well as a meme_templates directory of meme image templates, into the data directory. The meme_data_top.tsv file is a filtered version of the full meme_data.tsv in which at most the top 100 meme captions by number of upvotes are kept. The script also extracts our custom Google's Conceptual Captions (GCC) dataset: the gcc_full.tsv file we provide is simply a concatenation of the train and validation files available for download from the official linked dataset page, with each caption run through NLTK's Penn Treebank detokenizer. For those curious, this logic is defined in prepare_gcc.py, and a small sketch of the detokenization step is shown after the command below.
sh inflate_archives.sh
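The following sketch illustrates the detokenization idea only; the repository's actual logic lives in prepare_gcc.py, and the example caption here is made up:

```python
# Illustrative only: rejoin space-separated caption tokens into natural text
# using NLTK's Penn Treebank detokenizer.
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenizer = TreebankWordDetokenizer()
raw_caption = "a dog sitting on a couch , looking at the camera ."
print(detokenizer.detokenize(raw_caption.split()))
# -> a dog sitting on a couch, looking at the camera.
```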
Along with the scripts that we used to generate these embeddings, we also provide a ready-to-use download of 2,841,059 2,048-dimensional embeddings, one for every image we could access from the training and validation splits of Google's Conceptual Captions (GCC) dataset. These embeddings were obtained from the output of the avgpool layer of a Wide ResNet-101-2 pre-trained on the ImageNet dataset.
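For reference, the following sketch shows how such embeddings could be pulled from the avgpool layer of torchvision's pre-trained Wide ResNet-101-2; the image file name is hypothetical, and our actual extraction scripts are the ones provided in this repository:

```python
# Sketch: extract a 2,048-dimensional embedding from the avgpool layer of a
# Wide ResNet-101-2 pre-trained on ImageNet, using a forward hook.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.wide_resnet101_2(pretrained=True).eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

features = {}
model.avgpool.register_forward_hook(
    lambda module, inputs, output: features.update(avgpool=torch.flatten(output, 1))
)

with torch.no_grad():
    image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    model(image)

print(features["avgpool"].shape)  # torch.Size([1, 2048])
```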
Contributions are the essence of the open-source community and are what keep projects alive and useful to the people who use them. We wholeheartedly welcome any and all contributions.
- Fork the project
- Create your feature branch (git checkout -b feature/DankFeature)
- Commit your changes (git commit -m 'Made memes more dank')
- Push to the branch (git push origin feature/DankFeature)
- Open a pull request
Distributed under the GNU Affero General Public License v3.0. See LICENSE for more information.
Listed in alphabetical order by last name:
- Georgie Botev - [email protected]
- Pursuing a Master's in Computer Science at Johns Hopkins University
- Peter Ge - [email protected]
- Pursuing a PhD in Biomedical Engineering at Johns Hopkins University
- Samantha Zarate - [email protected]
- Pursuing a PhD in Computer Science at Johns Hopkins University
- Huggingface's 🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
@inproceedings{wolf-etal-2020-transformers, title = "Transformers: State-of-the-Art Natural Language Processing", author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush", booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations", month = oct, year = "2020", address = "Online", publisher = "Association for Computational Linguistics", url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6", pages = "38--45" }
- PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.
@article{falcon2019pytorch, title={PyTorch Lightning}, author={Falcon, WA}, journal={GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning}, volume={3}, year={2019} }
- FAISS: A library for efficient similarity search and clustering of dense vectors.
@article{JDH17, title={Billion-scale similarity search with GPUs}, author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}}, journal={arXiv preprint arXiv:1702.08734}, year={2017} }
- FairScale: PyTorch extensions for high performance and large scale training.
@misc{kim2020torchgpipe, title={torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models}, author={Chiheon Kim and Heungsub Lee and Myungryong Jeong and Woonhyuk Baek and Boogeon Yoon and Ildoo Kim and Sungbin Lim and Sungwoong Kim}, year={2020}, eprint={2004.09910}, archivePrefix={arXiv}, primaryClass={cs.DC} }
@misc{rajbhandari2020zero, title={ZeRO: Memory Optimizations Toward Training Trillion Parameter Models}, author={Samyam Rajbhandari and Jeff Rasley and Olatunji Ruwase and Yuxiong He}, year={2020}, eprint={1910.02054}, archivePrefix={arXiv}, primaryClass={cs.LG} }
@misc{shoeybi2020megatronlm, title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism}, author={Mohammad Shoeybi and Mostofa Patwary and Raul Puri and Patrick LeGresley and Jared Casper and Bryan Catanzaro}, year={2020}, eprint={1909.08053}, archivePrefix={arXiv}, primaryClass={cs.CL} }
@misc{johnson2020adascale, title={AdaScale SGD: A User-Friendly Algorithm for Distributed Training}, author={Tyler B. Johnson and Pulkit Agrawal and Haijie Gu and Carlos Guestrin}, year={2020}, eprint={2007.05105}, archivePrefix={arXiv}, primaryClass={cs.LG} }
- Horovod: Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
@article{sergeev2018horovod, Author = {Alexander Sergeev and Mike Del Balso}, Journal = {arXiv preprint arXiv:1802.05799}, Title = {Horovod: fast and easy distributed deep learning in {TensorFlow}}, Year = {2018} }
- Dank Learning: Generating Memes Using Deep Neural Networks
@misc{peirson2018dank, title={Dank Learning: Generating Memes Using Deep Neural Networks}, author={Abel L Peirson V and E Meltem Tolunay}, year={2018}, eprint={1806.04510}, archivePrefix={arXiv}, primaryClass={cs.CL} }
- Language Models are Unsupervised Multitask Learners
@article{radford2019language, title={Language Models are Unsupervised Multitask Learners}, author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya}, year={2019} }