Adds data and README.md #5

Open · wants to merge 3 commits into base: main
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -127,3 +127,4 @@ dmypy.json

# Pyre type checker
.pyre/
.DS_Store
29 changes: 19 additions & 10 deletions README.md
@@ -1,12 +1,21 @@
# IAI Project Template
# CLEF 2024 CheckThat! Task 1 IAI Participation

[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

This repository serves as a template for software projects.

# Testing and GitHub actions

With `pre-commit` hooks, `flake8`, `black`, `mypy`, `docformatter`, and `pytest` are run locally on every commit. For more details on how to use `pre-commit` hooks see [here](https://github.com/iai-group/guidelines/tree/main/python#install-pre-commit-hooks).

Similarly, GitHub actions are used to run `flake8`, `black`, and `pytest` on every push and pull request. The `pytest` results are sent to [CodeCov](https://about.codecov.io/) using their API to get test coverage analysis. Details on GitHub actions are [here](https://github.com/iai-group/guidelines/blob/main/github/Actions.md).
This repository contains the code and data for the IAI participation in CLEF 2024 CheckThat! Task 1.

Collaborator comment: It might be helpful to add a one-line explanation of the task and also a link to CheckThat!.

## Folders
* `checkthat` - Python module for claim detection.
* `data`
  * `task1` - Contains data for Task 1, which is also uploaded to the Hugging Face Hub.
    * The dataset is currently gated/private; make sure you have run `huggingface-cli login` first.
    * Usage:
```python
from datasets import load_dataset

# English data containing political debates.
dataset_en = load_dataset("iai-group/clef2024_checkthat_task1_en")
# Spanish data containing tweets.
dataset_es = load_dataset("iai-group/clef2024_checkthat_task1_es")
# Dutch data containing tweets.
dataset_nl = load_dataset("iai-group/clef2024_checkthat_task1_nl")
# Arabic data containing tweets.
dataset_ar = load_dataset("iai-group/clef2024_checkthat_task1_ar")
```
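Once loaded, each split behaves like a sequence of dicts with the fields declared by the loading scripts (`Sentence_id`/`Text`/`class_label` for debates; `tweet_id`/`tweet_url`/`tweet_text`/`class_label` for tweets). As a minimal sketch (the helper and the toy rows below are my own illustration, not real CLEF data), a quick label-distribution check could look like:

```python
from collections import Counter

def label_distribution(examples):
    """Count class labels over an iterable of example dicts."""
    return Counter(ex["class_label"] for ex in examples)

# Toy rows standing in for a real split such as dataset_en["train"].
toy_train = [
    {"Sentence_id": "1", "Text": "Example sentence.", "class_label": "Yes"},
    {"Sentence_id": "2", "Text": "Another sentence.", "class_label": "No"},
    {"Sentence_id": "3", "Text": "A third sentence.", "class_label": "Yes"},
]

print(label_distribution(toy_train))  # Counter({'Yes': 2, 'No': 1})
```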
File renamed without changes.
1 change: 1 addition & 0 deletions checkthat/__init__.py
@@ -0,0 +1 @@
"""Code to train and evaluate the models for CheckThat tasks."""
1 change: 1 addition & 0 deletions checkthat/task1/__init__.py
@@ -0,0 +1 @@
"""Code to train and evaluate the models for CheckThat Task 1."""
46 changes: 0 additions & 46 deletions code/main.py

This file was deleted.

8 changes: 7 additions & 1 deletion data/README.md
@@ -1,3 +1,9 @@
# Data

This folder should contain the description of data that was used in the project, including datasets, queries, ground truth, run files, etc. Files under 10MB can be stored on GitHub, larger files should be stored on a server (e.g., gustav1). This README should provide a comprehensive overview of all the data that is used and where it originates from (e.g., part of an official test collection, generated using code in this repo or a third-party tool, etc.).
Task 1 data is under the folder `task1`, which also contains the code to load and prepare the Hugging Face data. The data is already uploaded to the Hugging Face Hub:
* English data: `iai-group/clef2024_checkthat_task1_en`
* Spanish data: `iai-group/clef2024_checkthat_task1_es`
* Dutch data: `iai-group/clef2024_checkthat_task1_nl`
* Arabic data: `iai-group/clef2024_checkthat_task1_ar`

This data is provided by the [CLEF 2024 CheckThat! organizers](https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task1?ref_type=heads).
1,094 changes: 1,094 additions & 0 deletions data/task1/arabic/dev.tsv

Large diffs are not rendered by default.

501 changes: 501 additions & 0 deletions data/task1/arabic/test.tsv

Large diffs are not rendered by default.

7,334 changes: 7,334 additions & 0 deletions data/task1/arabic/train.tsv

Large diffs are not rendered by default.

99 changes: 99 additions & 0 deletions data/task1/clef24_dataset_debate.py
@@ -0,0 +1,99 @@
"""English dataset loading script for CLEF 2024 CheckThat! Task 1."""

from datasets import (
    DatasetInfo,
    BuilderConfig,
    Version,
    GeneratorBasedBuilder,
    DownloadManager,
)
from datasets import SplitGenerator, Split, Features, Value

Collaborator comment: Should the functions be sorted alphabetically?

from typing import Generator, Tuple, Union

import os

_DESCRIPTION = """
This dataset includes English data for CLEF 2024 CheckThat! Lab task1.
"""

_CITATION = """\
@inproceedings{barron2024clef,
  title={The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness},
  author={Barr{\'o}n-Cede{\~n}o, Alberto and Alam, Firoj and Chakraborty, Tanmoy and Elsayed, Tamer and Nakov, Preslav and Przyby{\l}a, Piotr and Stru{\ss}, Julia Maria and Haouari, Fatima and Hasanain, Maram and Ruggeri, Federico and others},
  booktitle={European Conference on Information Retrieval},
  pages={449--458},
  year={2024},
  organization={Springer}
}
"""  # noqa E501

_LICENSE = "Your dataset's license here."


class CLEF24EnData(GeneratorBasedBuilder):
    """An English text dataset for claim detection."""

Collaborator comment: Here and further: this is not a multilingual dataset; the class can only contain English data.

    BUILDER_CONFIGS = [
        BuilderConfig(
            name="multilang_dataset",
            version=Version("1.0.0"),
            description="English dataset for text classification.",
        ),
    ]

    DEFAULT_CONFIG_NAME = "multilang_dataset"  # Default configuration name.

    def _info(self):
        """Construct the DatasetInfo object."""
        return DatasetInfo(
            description=_DESCRIPTION,
            features=Features(
                {
                    "Sentence_id": Value("string"),
                    "Text": Value("string"),
                    "class_label": Value("string"),
                }
            ),
            supervised_keys=("Text", "class_label"),
            homepage="https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task1",  # noqa E501
            citation=_CITATION,
            license=_LICENSE,
        )

    def _split_generators(
        self, dl_manager: DownloadManager
    ) -> list[SplitGenerator]:
        """Returns SplitGenerators."""
        # Assumes your dataset is located in "."
        data_dir = os.path.abspath(".")
        splits = {
            "train": Split.TRAIN,
            "dev": Split.VALIDATION,
            "test": Split.TEST,
        }

        return [
            SplitGenerator(

Collaborator comment: SplitGenerator is a bit misleading as a name, as it suggests that we are the ones generating the splits.

                name=splits[split],
                gen_kwargs={
                    "filepath": os.path.join(data_dir, f"{split}.tsv"),
                    "split": splits[split],
                },
            )
            for split in splits.keys()
        ]

    def _generate_examples(
        self, filepath: Union[str, os.PathLike], split: str
    ) -> Generator[Tuple[str, dict], None, None]:
        """Yields examples."""
        with open(filepath, encoding="utf-8") as f:
            for id_, row in enumerate(f):
                if id_ == 0:  # Skip the header row.
                    continue
                cols = row.strip().split("\t")
                # Keys must match the feature names declared in _info.
                yield f"{split}_{id_}", {
                    "Sentence_id": cols[0],
                    "Text": cols[1],
                    "class_label": cols[2],
                }
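The row-to-example mapping in `_generate_examples` can be exercised in isolation. The sketch below is my own standalone harness, not part of the PR (the sample sentence is invented); it writes a tiny debate-style TSV and replays the same header-skip and tab-split logic, using the feature names declared in `_info`:

```python
import os
import tempfile

def parse_debate_tsv(filepath, split):
    """Replay the header-skip / tab-split logic of _generate_examples."""
    examples = []
    with open(filepath, encoding="utf-8") as f:
        for id_, row in enumerate(f):
            if id_ == 0:  # Skip the header row.
                continue
            cols = row.strip().split("\t")
            examples.append(
                (f"{split}_{id_}", {
                    "Sentence_id": cols[0],
                    "Text": cols[1],
                    "class_label": cols[2],
                })
            )
    return examples

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.tsv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("Sentence_id\tText\tclass_label\n")
        f.write("1\tWe cut taxes last year.\tYes\n")
    rows = parse_debate_tsv(path, "train")

print(rows[0][0])  # train_1
```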
101 changes: 101 additions & 0 deletions data/task1/clef24_dataset_twitter.py
@@ -0,0 +1,101 @@
"""Multilingual Twitter dataset loading script for CLEF 2024 CheckThat! Task 1."""

from datasets import (
    DatasetInfo,
    BuilderConfig,
    Version,
    GeneratorBasedBuilder,
    DownloadManager,
)
from datasets import SplitGenerator, Split, Features, Value
from typing import Generator, Tuple, Union

import os

_DESCRIPTION = """
This dataset includes Arabic/Dutch/Spanish Twitter data for CLEF 2024 CheckThat! Lab task1.
"""  # noqa E501

_CITATION = """\
@inproceedings{barron2024clef,
  title={The CLEF-2024 CheckThat! Lab: Check-Worthiness, Subjectivity, Persuasion, Roles, Authorities, and Adversarial Robustness},
  author={Barr{\'o}n-Cede{\~n}o, Alberto and Alam, Firoj and Chakraborty, Tanmoy and Elsayed, Tamer and Nakov, Preslav and Przyby{\l}a, Piotr and Stru{\ss}, Julia Maria and Haouari, Fatima and Hasanain, Maram and Ruggeri, Federico and others},
  booktitle={European Conference on Information Retrieval},
  pages={449--458},
  year={2024},
  organization={Springer}
}
"""  # noqa E501

_LICENSE = "Your dataset's license here."


class CLEF24TwitterData(GeneratorBasedBuilder):
    """A multilingual Twitter dataset for claim detection."""

    BUILDER_CONFIGS = [
        BuilderConfig(
            name="clef24_tweet_data",
            version=Version("1.0.0"),
            description="Multilingual dataset for text classification.",
        ),
    ]

    DEFAULT_CONFIG_NAME = "clef24_tweet_data"  # Default configuration name.

    def _info(self):
        """Construct the DatasetInfo object."""
        return DatasetInfo(
            description=_DESCRIPTION,
            features=Features(
                {
                    "tweet_id": Value("string"),
                    "tweet_url": Value("string"),
                    "tweet_text": Value("string"),
                    "class_label": Value("string"),
                }
            ),
            supervised_keys=("tweet_text", "class_label"),
            homepage="https://gitlab.com/checkthat_lab/clef2024-checkthat-lab/-/tree/main/task1",  # noqa E501
            citation=_CITATION,
            license=_LICENSE,
        )

Collaborator comment: Again, the name sounds misleading to me.

    def _split_generators(
        self, dl_manager: DownloadManager
    ) -> list[SplitGenerator]:
        """Returns SplitGenerators."""
        # Assumes your dataset is located in "."
        data_dir = os.path.abspath(".")
        splits = {
            "train": Split.TRAIN,
            "dev": Split.VALIDATION,
            "test": Split.TEST,
        }

        return [
            SplitGenerator(
                name=splits[split],
                gen_kwargs={
                    "filepath": os.path.join(data_dir, f"{split}.tsv"),
                    "split": splits[split],
                },
            )
            for split in splits.keys()
        ]

    def _generate_examples(
        self, filepath: Union[str, os.PathLike], split: str
    ) -> Generator[Tuple[str, dict], None, None]:
        """Yields examples."""
        with open(filepath, encoding="utf-8") as f:
            for id_, row in enumerate(f):
                if id_ == 0:  # Skip the header row.
                    continue
                cols = row.strip().split("\t")
                yield f"{split}_{id_}", {
                    "tweet_id": cols[0],
                    "tweet_url": cols[1],
                    "tweet_text": cols[2],
                    "class_label": cols[3],
                }
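For the Twitter variant the same pattern applies with four columns. A minimal sketch of the per-row mapping (my own helper with an invented sample row, not real data):

```python
def parse_tweet_row(row, key):
    """Map one tab-separated tweet row to the builder's feature dict."""
    cols = row.strip().split("\t")
    return key, {
        "tweet_id": cols[0],
        "tweet_url": cols[1],
        "tweet_text": cols[2],
        "class_label": cols[3],
    }

key, example = parse_tweet_row(
    "42\thttps://example.com/t/42\tSome claim in a tweet.\tYes\n", "train_1"
)
print(example["tweet_text"])  # Some claim in a tweet.
```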