docs: update docs about downloading images

Add reference to AWS dataset
openfoodfacts · May 15, 2023 · c207278
1 parent 3c945d8
commit c207278
Showing 2 changed files with 99 additions and 12 deletions.
diff --git a/docs/api/aws-images-dataset.md b/docs/api/aws-images-dataset.md
@@ -0,0 +1,44 @@
+# Open Food Facts AWS images dataset
+
+The Open Food Facts images dataset contains all images uploaded to Open Food
+Facts and the OCR results on these images obtained using Google Cloud Vision.
+
+The dataset is stored in the `openfoodfacts-images` bucket hosted in the
+`eu-west-3` region. All data is stored in a single `/data` folder.
+
+Data is synchronized monthly between the Open Food Facts server and the S3
+bucket, so some recent images are likely to be missing. You should not assume
+that all images are present in the S3 bucket.
+
+To find the bucket key associated with an image of the product with barcode
+'4012359114303', you should first split the barcode in the following way:
+`/401/235/911/4303`.
+
+This splitting process is only relevant for EAN13 (barcodes with 13 digits).
+For barcodes with fewer digits (like EAN8), the directory path is not split:
+`/20065034`.
+
+To get the raw image '1' for barcode '4012359114303', simply add the image ID:
+`/401/235/911/4303/1.jpg`. Here, you will get the "raw" image, as sent by the
+contributor. If you don't need the full resolution image, a 400px resized
+version is also available, by adding the `.400` suffix after the image ID:
+`/401/235/911/4303/1.400.jpg`.
+
+The OCR result of the image is a gzipped JSON file. It has the same file name
+as the raw image, but with the `.json.gz` extension: `/401/235/911/4303/1.json.gz`.
+
+To download images, you can either use the AWS CLI or perform an HTTP request
+directly:
+
+`wget https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/401/235/911/4303/1.jpg`
+
+You can list all existing objects (images, OCR results) in the bucket by
+downloading the gzipped text file `s3://openfoodfacts-images/data/data_keys.gz`:
+
+`wget https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/data_keys.gz`
+
+Then you can easily filter the files you want using `grep` (raw images, OCR
+JSON) before downloading them. For example, to keep only 400px versions of all
+images:
+
+`zcat data_keys.gz | grep -F '.400.jpg'`
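
The key layout described in `aws-images-dataset.md` can be illustrated with a minimal Python sketch. The function names are hypothetical and not part of any Open Food Facts library. The base URL is the one used in the `wget` example. How barcodes of 9 to 12 digits are stored is not stated for the bucket, so the sketch assumes the rule given for the Open Food Facts server (8 digits or fewer: not split, otherwise split with the regex).

```python
import re

# Base URL copied from the wget example in the documentation.
BUCKET_URL = "https://openfoodfacts-images.s3.eu-west-3.amazonaws.com"

# Pattern documented for the Open Food Facts server; assumed to apply here too.
BARCODE_SPLIT = re.compile(r"^(...)(...)(...)(.*)$")


def product_dir(barcode: str) -> str:
    """Return the directory path of a product, e.g. '/401/235/911/4303'."""
    if len(barcode) <= 8:
        # Short barcodes (like EAN8) are not split.
        return f"/{barcode}"
    match = BARCODE_SPLIT.match(barcode)
    if match is None:
        raise ValueError(f"unexpected barcode: {barcode!r}")
    return "/" + "/".join(match.groups())


def s3_image_url(barcode: str, image_id: int, resized: bool = False) -> str:
    """URL of a raw image, or of its 400px version when resized=True."""
    suffix = ".400" if resized else ""
    return f"{BUCKET_URL}/data{product_dir(barcode)}/{image_id}{suffix}.jpg"


def s3_ocr_url(barcode: str, image_id: int) -> str:
    """URL of the gzipped OCR JSON associated with a raw image."""
    return f"{BUCKET_URL}/data{product_dir(barcode)}/{image_id}.json.gz"


if __name__ == "__main__":
    print(s3_image_url("4012359114303", 1))
    print(s3_image_url("4012359114303", 1, resized=True))
    print(s3_ocr_url("4012359114303", 1))
```

Because the bucket is synchronized only monthly, a URL built this way may not exist yet. Checking the key against `data_keys.gz` first avoids unnecessary requests.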
diff --git a/docs/api/how-to-download-images.md b/docs/api/how-to-download-images.md
@@ -1,11 +1,44 @@
 # How to download product images
 
+The preferred method of downloading Open Food Facts images depends on what you
+wish to achieve.
 
-All images can be found on [https://images.openfoodfacts.org/images/products/](https://static.openfoodfacts.org/images/products/). Images of a product are stored in a single directory. The path of this directory can be inferred easily from the product barcode. If the product barcode length is lower or equal to 8 (ex: \"22222222\"), the directory path is simply the barcode: all images can be found on `https://images.openfoodfacts.org/images/products/{barcode}`.
-Otherwise, the following regex is used to split the barcode into subfolders: `r"^(...)(...)(...)(.*)$"`. For example, the barcode 3435660768163 is split as follows: 343/566/076/8163, and all images of the products can be found on
+If you want to download a limited number of images, especially if these images
+have been uploaded recently, you should [download the images from the Open Food
+Facts
+server](./how-to-download-images.md#download-from-open-food-facts-server).
+
+If you plan to download a large number of images, you should instead
+[use the Open Food Facts images dataset hosted on
+AWS](./how-to-download-images.md#download-from-aws).
+
+## Download from AWS
+
+If you want to download a large number of images, this is the recommended
+option: AWS S3 is faster and allows concurrent downloads, unlike the Open Food
+Facts server, where you should preferably download images one at a time. See
+[AWS Images dataset](./aws-images-dataset.md) for more information about how
+to download images from the AWS dataset.
+
+## Download from Open Food Facts server
+
+All images can be found at
+[https://images.openfoodfacts.org/images/products/](https://static.openfoodfacts.org/images/products/).
+Images of a product are stored in a single directory. The path of this
+directory can be inferred easily from the product barcode. If the product
+barcode length is less than or equal to 8 (e.g. "22222222"), the directory path
+is simply the barcode: all images can be found at
+`https://images.openfoodfacts.org/images/products/{barcode}`.
+
+Otherwise, the following regex is used to split the barcode into subfolders:
+`r"^(...)(...)(...)(.*)$"`. For example, the barcode `3435660768163` is split as
+follows: `343/566/076/8163`, and all images of the product can be found at
 [https://images.openfoodfacts.org/images/products/343/566/076/8163](https://images.openfoodfacts.org/images/products/343/566/076/8163).
 
-To get the image file names, we have to use the database dump or the API. All images information are stored in the `images` field. For product [3168930010883](https://world.openfoodfacts.org/api/v0/product/3168930010883.json), we have:
+To get the image file names, we have to use the database dump or the API. All
+image information is stored in the `images` field. For product
+[3168930010883](https://world.openfoodfacts.org/api/v0/product/3168930010883.json),
+we have:
 
 ```json
     {
@@ -180,21 +213,31 @@ To get the image file names, we have to use the database dump or the API. All im
 
 The keys of the map are the keys of the images. These keys can be:
 
--   digits: the image is the raw image sent by the contributor.
--   selected images: `front_{lang}`, `nutrition_{lang}` and `ingredients_{lang}`, selected as front, nutrition and ingredients
-    images respectively for `lang`. Here, `lang` is a 2-letter ISO 639-1 language code (fr, en, es,\...).
+-   digits: the image is the raw image sent by the contributor (full resolution).
+-   selected images: `front_{lang}`, `nutrition_{lang}` and
+    `ingredients_{lang}`, selected as front, nutrition and ingredients images
+    respectively for `lang`. Here, `lang` is a 2-letter ISO 639-1 language code
+    (fr, en, es,\...).
 
-Each image is available in different resolutions: \"100\", \"200\", \"400\" or \"full\", each corresponding to image height (\"full\" means
-not resized). The available resolutions can be found in the `sizes` subfield.
+Each image is available in different resolutions: `100`, `200`, `400` or
+`full`, each corresponding to image height (`full` means not resized). The
+available resolutions can be found in the `sizes` subfield.
 
 Selected images have additional fields:
 
--   `rev` (as revision) indicates the revision number of the image to use (each time a new image is selected, cropped or rotated, a new image with an incremented rev is generated).
+-   `rev` (as revision) indicates the revision number of the image to use (each
+    time a new image is selected, cropped or rotated, a new image with an
+    incremented rev is generated).
 -   `imgid`, the image ID of the raw image used to generate the selected image.
 -   `angle`, `x1`, `x2`, `y1`, `y2`: rotation angle and cropping coordinates.
 
-For selected images, the file name is the image key followed by the revision number and the resolution: `front_fr.1.400.jpg`. For raw images, the file name is either the image ID (`1.jpg`) or the image ID followed by the resolution (`1.100.jpg`).
+For selected images, the file name is the image key followed by the revision
+number and the resolution: `front_fr.1.400.jpg`. For raw images, the file name
+is either the image ID (`1.jpg`) or the image ID followed by the resolution
+(`1.100.jpg`).
 
-To get the full URL, simply concatenate the product directory path and the image name.
+To get the full URL, simply concatenate the product directory path and the
+image name. Examples:
 
-If you want to download a significant number of images, let us know before on our [Slack](https://slack.openfoodfacts.org/) and don\'t be too eager to keep our servers safe!
+- [https://images.openfoodfacts.org/images/products/343/566/076/8163/1.jpg](https://images.openfoodfacts.org/images/products/343/566/076/8163/1.jpg)
+- [https://images.openfoodfacts.org/images/products/343/566/076/8163/1.400.jpg](https://images.openfoodfacts.org/images/products/343/566/076/8163/1.400.jpg)
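
The file name and URL rules described in `how-to-download-images.md` can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and the `key`, `rev` and resolution values are expected to come from the `images` field returned by the API or the database dump.

```python
import re

# Base URL taken from the documentation for the Open Food Facts server.
SERVER_URL = "https://images.openfoodfacts.org/images/products"

# Regex given in the documentation for barcodes longer than 8 characters.
BARCODE_SPLIT = re.compile(r"^(...)(...)(...)(.*)$")


def product_dir(barcode: str) -> str:
    """Directory of a product, e.g. '343/566/076/8163' or '22222222'."""
    if len(barcode) <= 8:
        return barcode
    match = BARCODE_SPLIT.match(barcode)
    if match is None:
        raise ValueError(f"unexpected barcode: {barcode!r}")
    return "/".join(match.groups())


def raw_image_name(image_id: str, resolution: str = "full") -> str:
    """File name of a raw image: '1.jpg' (full) or '1.100.jpg' (resized)."""
    if resolution == "full":
        return f"{image_id}.jpg"
    return f"{image_id}.{resolution}.jpg"


def selected_image_name(key: str, rev: str, resolution: str) -> str:
    """File name of a selected image: key, revision, then resolution."""
    return f"{key}.{rev}.{resolution}.jpg"


def image_url(barcode: str, file_name: str) -> str:
    """Concatenate the product directory path and the image file name."""
    return f"{SERVER_URL}/{product_dir(barcode)}/{file_name}"


if __name__ == "__main__":
    print(image_url("3435660768163", raw_image_name("1")))
    print(image_url("3435660768163", raw_image_name("1", "400")))
    print(selected_image_name("front_fr", "1", "400"))
```

When fetching from the Open Food Facts server, download images one at a time, as recommended above. For bulk downloads the AWS dataset is the better fit.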