Zhuoyan Luo*, Fengyuan Shi*, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
ARC Lab, Tencent PCG, Tsinghua University, Nanjing University
- 🚀 Super-large Codebook: Re-implements the advanced Lookup-Free Quantizer proposed by MAGVITv2 and achieves a super-large codebook (i.e., 2^18) with strong performance (1.17 rFID); see the sketch after this list.
- 💡 Auto-Regressive Innovation: Introduces asymmetric token factorization and the next sub-token prediction paradigm, enabling efficient generation with a super-large vocabulary and enhanced sub-token interactions.
- 🚀 Scalability: Validates the scalability of plain auto-regressive models across various parameter sizes (300M to 1.5B).
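For readers unfamiliar with lookup-free quantization, here is a minimal sketch of the idea behind the super-large codebook (the tensor shapes, function name, and straight-through trick are illustrative assumptions, not this repository's actual implementation): each latent channel is binarized by its sign, and the resulting 18-bit code directly serves as the token index, so a 2^18 codebook needs no explicit embedding lookup.

```python
import torch

def lookup_free_quantize(z: torch.Tensor):
    """Illustrative sketch of lookup-free quantization (LFQ).

    Assumes z has shape (B, 18, H, W): 18 binary latent channels give an
    implicit codebook of size 2**18 = 262144. Each channel is quantized to
    {-1, +1} by its sign, and the binary code doubles as the token index.
    """
    codes = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    bits = (codes > 0).long()                                  # {-1, +1} -> {0, 1}
    weights = 2 ** torch.arange(bits.shape[1], device=z.device)
    indices = (bits * weights.view(1, -1, 1, 1)).sum(dim=1)    # token index in [0, 2**18)
    codes = z + (codes - z).detach()                           # straight-through gradient
    return codes, indices

# Toy usage: a 16 x 16 latent grid -> 256 tokens drawn from a 262144-way codebook.
quantized, tokens = lookup_free_quantize(torch.randn(1, 18, 16, 16))
print(quantized.shape, tokens.shape, int(tokens.max()))
```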
This repository provides the scripts and checkpoints to replicate our results.
- A series of image tokenizers for class-conditional image generation (8$\times$ and 16$\times$ downsampling rates with a 2^18 codebook size) and text-conditional image generation (2^14 and 2^18 codebook sizes with a 16$\times$ downsampling rate).
- A family of autoregressive models ranging from 300M to 1.5B parameters for class-conditional image generation.
🤗 Open-MAGVIT2 is still under active development. Stay tuned for updates!
- $128\times 128$ Tokenizer Training

```bash
bash scripts/train_tokenizer/Open-MAGVIT2/run_128_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
```
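The three positional arguments are the usual distributed-launch rendezvous settings. For a single-node run, an invocation along the lines of `bash scripts/train_tokenizer/Open-MAGVIT2/run_128_L.sh 127.0.0.1 29500 0` (local address, a free port, node rank 0) should work, though the exact values depend on your cluster setup.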
- $256\times 256$ Tokenizer Training

```bash
bash scripts/train_tokenizer/Open-MAGVIT2/run_256_L.sh MASTER_ADDR MASTER_PORT NODE_RANK
```
- $128\times 128$ Tokenizer Evaluation

```bash
bash scripts/evaluation/evaluation_128.sh
```
- $256\times 256$ Tokenizer Evaluation

```bash
bash scripts/evaluation/evaluation_256.sh
```
Method | Token Type | #Tokens | Train Data | Codebook Size | rFID | PSNR | Codebook Utilization | Checkpoint |
---|---|---|---|---|---|---|---|---|
Open-MAGVIT2-20240617 | 2D | 16 $\times$ 16 | 256 $\times$ 256 ImageNet | 262144 | 1.53 | 21.53 | 100% | - |
Open-MAGVIT2-20240617 | 2D | 16 $\times$ 16 | 128 $\times$ 128 ImageNet | 262144 | 1.56 | 24.45 | 100% | - |
Open-MAGVIT2 | 2D | 16 $\times$ 16 | 256 $\times$ 256 ImageNet | 262144 | 1.17 | 21.90 | 100% | IN256_Large |
Open-MAGVIT2 | 2D | 16 $\times$ 16 | 128 $\times$ 128 ImageNet | 262144 | 1.18 | 25.08 | 100% | IN128_Large |
Open-MAGVIT2* | 2D | 32 $\times$ 32 | 128 $\times$ 128 ImageNet | 262144 | 0.34 | 26.19 | 100% | above |
(*) denotes that the results are from direct inference using the model trained at $128\times 128$ resolution, without further fine-tuning.
Please see scripts/train_autogressive/run.sh for the different model configurations.
```bash
bash scripts/train_autogressive/run.sh MASTER_ADDR MASTER_PORT NODE_RANK
```
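To make the asymmetric token factorization from the feature list above concrete, here is a minimal sketch under an assumed 6-bit/12-bit split (the actual split and configuration live in run.sh and may differ): each 2^18 token index is decomposed into two sub-tokens from much smaller vocabularies, which the model predicts as a sequence of sub-tokens instead of through a single 262144-way output layer.

```python
# Illustrative sketch of asymmetric token factorization; the 6/12-bit split is
# an assumption for demonstration, not necessarily the configuration in run.sh.
LOW_BITS, HIGH_BITS = 6, 12            # 6 + 12 = 18 bits, i.e. a 2**18 vocabulary

def factorize(index: int):
    """Split an 18-bit token index into (sub_low, sub_high) sub-tokens."""
    sub_low = index >> HIGH_BITS               # top 6 bits  -> vocabulary of 2**6  = 64
    sub_high = index & ((1 << HIGH_BITS) - 1)  # low 12 bits -> vocabulary of 2**12 = 4096
    return sub_low, sub_high

def defactorize(sub_low: int, sub_high: int) -> int:
    """Recombine the two sub-tokens into the original token index."""
    return (sub_low << HIGH_BITS) | sub_high

idx = 200_000                                  # an arbitrary token from the 262144-way codebook
assert defactorize(*factorize(idx)) == idx
print(factorize(idx))                          # (48, 3392)
```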
Please see scripts/train_autogressive/run.sh for the sampling hyper-parameters used at different model scales.
```bash
bash scripts/evaluation/sample_npu.sh Your_Total_Rank
# or
bash scripts/evaluation/sample_gpu.sh Your_Total_Rank
```
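Here `Your_Total_Rank` denotes the total number of devices used for sampling; for example, on a single machine with 8 GPUs, `bash scripts/evaluation/sample_gpu.sh 8` would be the expected call, assuming one process per device.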
Method | Params | #Tokens | FID | IS | Checkpoint |
---|---|---|---|---|---|
Open-MAGVIT2 | 343M | 16 $\times$ 16 | 3.08 | 258.26 | AR_256_B |
Open-MAGVIT2 | 804M | 16 $\times$ 16 | 2.51 | 271.70 | AR_256_L |
Open-MAGVIT2 | 1.5B | 16 $\times$ 16 | 2.33 | 271.77 | AR_256_XL |
We use LAION-COCO, CC12M, CC3M, LAION-HD, LAION-Aesthetic-umap, LAION-Aesthetic-v2, and JourneyDB for pretraining. We recommend organizing the data in the following webdataset tar format.
```
data
├── LAION_COCO/
│   └── webdataset/
│       ├── 1.tar
│       ├── 2.tar
│       ├── 3.tar
│       └── ...
└── CC12M/
    └── webdataset/
        ├── 1.tar
        ├── 2.tar
        ├── 3.tar
        └── ...
```
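As a quick sanity check that shards in this layout are readable, a minimal loop with the `webdataset` package might look like the sketch below; the `jpg`/`txt` keys and the brace-expanded shard pattern are assumptions about the shard contents, not a description of this repository's data loaders.

```python
import webdataset as wds

# Hypothetical shard pattern following the layout above; adjust the range
# and keys to match how your tars were actually written.
shards = "data/LAION_COCO/webdataset/{1..3}.tar"

dataset = (
    wds.WebDataset(shards)
    .decode("pil")              # decode images into PIL.Image objects
    .to_tuple("jpg", "txt")     # yield (image, caption) pairs
)

for image, caption in dataset:
    print(image.size, caption[:50])
    break
```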
Before pretraining, the sample.json and filter_keys.json of each dataset should be prepared; please refer to src/Open_MAGVIT2/data/prepare_pretrain.py.
```bash
bash scripts/train_tokenizer/Open-MAGVIT2/pretrain_256.sh MASTER_ADDR MASTER_PORT NODE_RANK
```
- $256\times 256$ Tokenizer Evaluation

```bash
bash scripts/evaluation/evaluation_256.sh
```

- Original-Resolution Tokenizer Evaluation

```bash
bash scripts/evaluation/evaluation_original.sh
```
Method | Quantizer Type | Training Data | Ratio | Resolution | Codebook Size | Checkpoint | rFID(COCO) | PSNR(COCO) | SSIM(COCO) | rFID(In1k) | PSNR(In1k) | SSIM(In1k) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LlamaGen | VQ | 70M | 16 | 256 $\times$ 256 | 16384 | - | 8.40 | 20.28 | 0.55 | 2.47 | 20.65 | 0.54 |
Show-o | LFQ | 35M | 16 | 256 $\times$ 256 | 8192 | - | 9.26 | 20.90 | 0.59 | 3.50 | 21.34 | 0.59 |
Cosmos | FSQ | - | 16 | 256 $\times$ 256 | 64000 | - | 11.97 | 19.22 | 0.48 | 4.57 | 19.93 | 0.49 |
Open-MAGVIT2 | LFQ | 100M | 16 | 256 $\times$ 256 | 16384 | Pretrain_256_16384 | 7.93 | 22.21 | 0.62 | 2.55 | 22.21 | 0.62 |
Open-MAGVIT2 | LFQ | 100M | 16 | 256 $\times$ 256 | 262144 | Pretrain_256_262144 | 6.76 | 22.31 | 0.65 | 1.67 | 22.70 | 0.64 |
Cosmos | FSQ | - | 16 | Original | 64000 | - | 7.51 | 20.45 | 0.52 | 1.93 | 20.56 | 0.51 |
Open-MAGVIT2 | LFQ | 100M | 16 | Original | 16384 | Pretrain_256_16384 | 6.65 | 21.61 | 0.57 | 1.39 | 21.74 | 0.56 |
Open-MAGVIT2 | LFQ | 100M | 16 | Original | 262144 | Pretrain_256_262144 | 5.10 | 22.18 | 0.60 | 0.78 | 22.24 | 0.59 |