Dataset

The dataset contains Amazon products from 100 product categories for open-world product attribute mining. We provide full human annotations for selected products from 10 product categories. For the rest of the data, attribute values scraped from Amazon are available and can be used for evaluation or distant supervision.

The dataset was collected in 2021. Some products may have been taken down from Amazon since then.

Download

The dataset can be downloaded here.

Format

The dataset contains the following directories:

  • raw: contains product text and product type information. Each file contains products from one category. Each line in the file is a JSON object with the following fields:
{
  "asin": A unique product identifier,
  "title": Product title,
  "bullet_point": Product highlights in bullet points,
  "description": Detailed product description.
}
  • candidate: contains phrase-segmented product titles generated by our method [1]. For each product type, there are two files: [PT].chunk.jsonl contains the segmented product titles, and [PT].asin.txt contains the product ASINs aligned with the title file.

  • seed: contains weak supervision used for training. Each file contains seed attribute values from one product category. Each line has the following format:

["attribute_name": ["attribute_value_1", "attribute_value_2", ...]]
  • dev: development labels. These labels are obtained from the Amazon catalog. Each file contains labels for one product category. They follow the same format as seed documents.

  • test: test labels. These labels are derived from full human annotation. Each file contains labels for one product category. They follow the same format as seed documents.

  • annotations: full human annotations. Each file contains annotations for products from one category. Each line has the following format:

{
  "asin": A unique product identifier,
  "title": The product title,
  "entities": [
    {
      "startOffset": the start character offset in the title,
      "endOffset": the end character offset in the title,
      "label": the attribute name,
      "value": the attribute value.
    }
  ]
}
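
As a minimal sketch of how the line-delimited JSON files described above might be read, the Python snippet below parses one JSON object per line and recovers each annotated span from the title. The helper names (`iter_jsonl`, `entity_spans`) and the sample record are hypothetical and not part of the dataset. The snippet also assumes that `endOffset` is exclusive, which this document does not state, so check it against the real files before relying on it.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def entity_spans(record: dict) -> List[Tuple[str, str, str]]:
    """Return (label, value, title_slice) for each annotated entity.

    Assumption: endOffset is exclusive, so title[start:end] is the span.
    """
    title = record["title"]
    return [
        (e["label"], e["value"], title[e["startOffset"]:e["endOffset"]])
        for e in record.get("entities", [])
    ]


# Hypothetical record that follows the annotation schema above (made-up data).
sample_line = json.dumps({
    "asin": "B000000000",
    "title": "Acme Dark Roast Coffee 12 oz",
    "entities": [
        {"startOffset": 5, "endOffset": 15, "label": "Roast", "value": "Dark Roast"},
        {"startOffset": 23, "endOffset": 28, "label": "Size", "value": "12 oz"},
    ],
})

records = list(iter_jsonl([sample_line, ""]))
spans = entity_spans(records[0])
for label, value, title_slice in spans:
    print(label, "|", value, "|", title_slice)
```

To read a real file, pass an open file handle to `iter_jsonl`, since iterating over a text file yields one line at a time. Comparing `value` with `title_slice` is a quick way to test the offset assumption.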

Annotation

We selected products from 10 product types based on product popularity. We then hired 5 MTurk crowd workers to annotate the products. After that, the labels were consolidated by an expert knowledge associate.

The full annotations are in the annotations directory. The test directory contains labels in clustering evaluation format, derived from the full annotations.
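
One plausible way to go from per-product annotations to attribute-level labels is to group annotated values by attribute name, as sketched below in Python. This is an illustrative assumption, not the authors' conversion script: the function name `group_values_by_attribute` and the sample records are hypothetical, and the exact layout of the files in the test directory should be checked against the data itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_values_by_attribute(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Collect the distinct annotated values for each attribute name.

    Assumption: each record follows the annotation schema, with an
    "entities" list whose items carry "label" and "value" keys.
    """
    grouped = defaultdict(set)
    for record in records:
        for entity in record.get("entities", []):
            grouped[entity["label"]].add(entity["value"])
    # Sort values so the output is deterministic.
    return {label: sorted(values) for label, values in grouped.items()}


# Hypothetical annotation records (made-up data).
sample_records = [
    {"asin": "B000000001", "title": "Acme Dark Roast Coffee",
     "entities": [{"startOffset": 5, "endOffset": 15,
                   "label": "Roast", "value": "Dark Roast"}]},
    {"asin": "B000000002", "title": "Acme Light Roast Coffee 12 oz",
     "entities": [{"startOffset": 5, "endOffset": 16,
                   "label": "Roast", "value": "Light Roast"},
                  {"startOffset": 24, "endOffset": 29,
                   "label": "Size", "value": "12 oz"}]},
]

clusters = group_values_by_attribute(sample_records)
print(clusters)
```

Each key in the result is an attribute name and each list holds the distinct values annotated under it, which mirrors the attribute-name-to-values shape described for the seed files.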

Citation

Please cite our paper if you are using the dataset:

@inproceedings{zhang2022oamine,
author = {Zhang, Xinyang and Zhang, Chenwei and Li, Xian and Dong, Xin Luna and Shang, Jingbo and Faloutsos, Christos and Han, Jiawei},
title = {OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision},
year = {2022},
isbn = {9781450390965},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3485447.3512035},
doi = {10.1145/3485447.3512035},
booktitle = {Proceedings of the ACM Web Conference 2022},
pages = {3153–3161},
numpages = {9},
keywords = {weak supervision., Open-world product attribute mining},
location = {Virtual Event, Lyon, France},
series = {WWW '22}
}

[1] Xinyang Zhang, Chenwei Zhang, Xian Li, Xin Luna Dong, Jingbo Shang, Christos Faloutsos, and Jiawei Han. 2022. OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision. In Proceedings of the ACM Web Conference 2022 (WWW '22). Association for Computing Machinery, New York, NY, USA, 3153–3161. https://doi.org/10.1145/3485447.3512035

License

Copyright 2021 Amazon.com, LLC, paper authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.