🐺 COYO-Labeled-300M: Image-Multi-Label Pair Dataset

COYO-Labeled-300M

COYO-Labeled-300M is a dataset of 300M machine-labeled image-multi-label pairs. We labeled a subset of COYO-700M with a large model (EfficientNetV2-XL) trained on ImageNet-21K, following the same evaluation pipeline as EfficientNetV2. For each image, the labels are the 50 most likely of the 21,841 ImageNet-21K classes. The label probabilities are provided alongside the labels, so users can apply a threshold of their choice for multi-label classification or take the top-1 class for single-label classification.
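As a minimal sketch of the two uses described above, the following Python shows thresholding and top-1 selection on a single row. The function names and the sample values are hypothetical, and the inclusive comparison (`>=`) is an assumption, since the README does not say whether the threshold is inclusive.

```python
from typing import List, Tuple


def select_labels(
    labels: List[int], label_probs: List[float], threshold: float
) -> List[Tuple[int, float]]:
    """Keep (label, probability) pairs whose probability is at least `threshold`.

    Assumption: the comparison is inclusive; the README does not specify this.
    """
    return [(l, p) for l, p in zip(labels, label_probs) if p >= threshold]


def top1_label(labels: List[int], label_probs: List[float]) -> int:
    """Return the label with the highest probability.

    Uses an explicit argmax rather than assuming the lists are pre-sorted.
    """
    best = max(range(len(label_probs)), key=label_probs.__getitem__)
    return labels[best]


# Hypothetical row (real rows carry 50 labels and 50 probabilities).
labels = [101, 202, 303, 404]
label_probs = [0.60, 0.25, 0.10, 0.05]

multi = select_labels(labels, label_probs, threshold=0.2)
single = top1_label(labels, label_probs)
print(multi)   # [(101, 0.6), (202, 0.25)]
print(single)  # 101
```

A higher threshold keeps fewer labels per image and can leave some images with no label at all, which is the trade-off summarized in the Statistics section.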

In other words, COYO-Labeled-300M is an ImageNet-like dataset: instead of 1.25 million human-labeled samples, it has 300 million machine-labeled samples. It is similar to JFT-300M, which has not been released to the public.

We found that our ViT implementation trained on COYO-Labeled-300M performs similarly to the results reported in the ViT paper for models trained on JFT-300M.

We also provide weights for the ViT model pretrained on COYO-Labeled-300M, as well as its training and fine-tuning code.

The basic instructions, licenses, and contributors are the same as for COYO-700M.

Dataset Preview

id url imagehash labels label_probs width height
315 daf5a50aae4aa54a [8087, 11054, 808... [0.4453125, 0.304... 240 240
463 8f3fd06161c41e3e [12213, 12215, 12... [0.71728516, 0.38... 219 230
601 8e70855e6ba3f825 [4945, 4953, 4942... [0.8496094, 0.069... 320 240
1105 ea90d5978594cbc6 [12352, 12658, 12... [0.2763672, 0.267... 310 310
2308 c0cf983c7c7074f4 [6728, 14101, 158... [0.32128906, 0.02... 702 336
3710 8f9694594b26371b [15721, 15722, 49... [0.4152832, 0.175... 600 467

Meta-Attributes

Attributes

name type description
id long Unique 64-bit integer ID generated by monotonically_increasing_id(); the values match the IDs used in the existing COYO-700M.
url string The image URL extracted from the src attribute of the <img> tag
imagehash string The perceptual hash(pHash) of the image
labels sequence[integer] Inference results of EfficientNetV2-XL model trained on ImageNet-21K dataset (Top 50 indices among 21,841 classes)
label_probs sequence[float] Inference results of EfficientNetV2-XL model trained on ImageNet-21K dataset (Probabilities of the top 50 classes among 21,841 classes)
width integer The width of the image
height integer The height of the image
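
The attribute table can be mirrored as a typed record, sketched below. The class name `LabeledRecord`, the `validate` method, and the sample values are hypothetical. The zero-based range check on label indices and the [0, 1] range for probabilities are assumptions not stated in the README.

```python
from dataclasses import dataclass
from typing import List

NUM_CLASSES = 21_841  # ImageNet-21K classes, as stated in the README
TOP_K = 50            # labels kept per image, as stated in the README


@dataclass
class LabeledRecord:
    id: int                   # 64-bit ID shared with COYO-700M
    url: str                  # image URL from the <img> src attribute
    imagehash: str            # perceptual hash (pHash) of the image
    labels: List[int]         # top-K class indices
    label_probs: List[float]  # probabilities for those classes
    width: int
    height: int

    def validate(self) -> None:
        """Raise ValueError if the record violates the documented shape."""
        if len(self.labels) != len(self.label_probs):
            raise ValueError("labels and label_probs must have equal length")
        if len(self.labels) > TOP_K:
            raise ValueError(f"expected at most {TOP_K} labels")
        # Assumption: class indices are zero-based.
        if any(not 0 <= l < NUM_CLASSES for l in self.labels):
            raise ValueError("label index out of range")
        # Assumption: probabilities lie in [0, 1].
        if any(not 0.0 <= p <= 1.0 for p in self.label_probs):
            raise ValueError("probability out of range")


# Hypothetical record, shortened to three labels for readability.
rec = LabeledRecord(
    id=1,
    url="https://example.com/image.jpg",
    imagehash="0123456789abcdef",
    labels=[101, 202, 303],
    label_probs=[0.6, 0.25, 0.1],
    width=240,
    height=240,
)
rec.validate()  # passes silently for a well-formed record
```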

Statistics

  • Statistics for threshold-based label distribution
Threshold Labels per Image Unique Labels Sampling Ratio
0.00 50.00 21,841 100.00%
0.05 3.16 18,922 96.69%
0.10 1.98 18,471 85.83%
0.15 1.58 18,122 74.49%
0.20 1.37 17,781 64.30%
0.25 1.26 17,479 55.60%
  • Image Size (figure: img.png)
  • Top-1 Label Distribution (figure: img.png)
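
The threshold table above could be reproduced on a sample of rows along the following lines. This is a sketch under assumptions, not the authors' procedure: it takes "Sampling Ratio" to be the fraction of images that keep at least one label at the threshold, and "Labels per Image" to be the mean over those retained images. The function name and the toy rows are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[Sequence[int], Sequence[float]]  # (labels, label_probs)


def threshold_stats(rows: List[Row], threshold: float) -> Dict[str, float]:
    """Compute per-threshold statistics over (labels, label_probs) rows.

    Assumptions: an image is "sampled" if at least one label has
    probability >= threshold, and labels-per-image is averaged over
    sampled images only.
    """
    kept_counts = []      # number of surviving labels per sampled image
    unique_labels = set() # distinct labels that survive the threshold
    for labels, probs in rows:
        kept = [l for l, p in zip(labels, probs) if p >= threshold]
        if kept:
            kept_counts.append(len(kept))
            unique_labels.update(kept)
    n_kept = len(kept_counts)
    return {
        "labels_per_image": sum(kept_counts) / n_kept if n_kept else 0.0,
        "unique_labels": len(unique_labels),
        "sampling_ratio": n_kept / len(rows) if rows else 0.0,
    }


# Toy rows with made-up values (real rows carry 50 entries each).
rows = [
    ([1, 2, 3], [0.6, 0.3, 0.1]),
    ([2, 4], [0.15, 0.05]),
    ([5, 1], [0.9, 0.25]),
    ([6, 7], [0.1, 0.1]),
]
stats = threshold_stats(rows, threshold=0.2)
print(stats)
# {'labels_per_image': 2.0, 'unique_labels': 3, 'sampling_ratio': 0.5}
```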

Getting Started

Download

Experiments

  • We validated the quality of the COYO-Labeled-300M dataset by re-implementing a popular model, ViT.
  • We also provide pretraining and fine-tuning code and weight files for ViT.
  • We report here the performance of ViT trained on COYO-Labeled-300M.