⚡ Improve the clustering performance
eriknovak committed Nov 19, 2023
1 parent 12c798d commit 38c56e9
Showing 34 changed files with 121,654 additions and 2,515 deletions.
7 changes: 5 additions & 2 deletions .vscode/settings.json
@@ -1,4 +1,7 @@
{
"python.formatting.provider": "black",
"jupyter.jupyterServerType": "local"
"python.formatting.provider": "none",
"jupyter.jupyterServerType": "local",
"[python]": {
"editor.defaultFormatter": "ms-python.black-formatter"
}
}
64 changes: 50 additions & 14 deletions README.md
@@ -11,32 +11,29 @@ for news stream clustering and topic classification.

Before starting the project make sure these requirements are available:

- [conda][conda]. For setting up your research environment and python dependencies.
- [dvc][dvc]. For versioning your data.
- [git][git]. For versioning your code.
- [python]. For setting up your research environment and python dependencies.
- [dvc]. For versioning your data.
- [git]. For versioning your code.

## 🛠️ Setup

### Create a python environment

First create the virtual environment where all the modules will be stored.

#### Using virtualenv
#### Using venv

Using the `virtualenv` command, run the following commands:
Using the `venv` module, run the following commands:

```bash
# install the virtual env command
pip install virtualenv

# create a new virtual environment
virtualenv -p python ./.venv
python -m venv venv

# activate the environment (UNIX)
./.venv/bin/activate
source ./venv/bin/activate

# activate the environment (WINDOWS)
./.venv/Scripts/activate
./venv/Scripts/activate

# deactivate the environment (UNIX & WINDOWS)
deactivate
@@ -97,7 +94,7 @@ for the project.

### 🔍️ Collect the data via Event Registry API (required conda environment)

To collect the data via the [Event Registry API][er], follow the next steps:
To collect the data via the [Event Registry API], follow the next steps:

1. **Log in to the Event Registry.** Create a user account in the Event Registry
service and retrieve the API key that has been assigned to it. The API key can be
@@ -135,6 +132,46 @@ To collect the data via the [Event Registry API][er], follow the next steps:

The data should be collected and stored in the `/data` folder.

## 🚀 Running scripts

To run the scripts, follow these steps:

**Data cleanup.** To prepare and clean up the data, run the following script:

```bash
python scripts/01_data_cleanup.py \
--raw_dir ./data/raw \
--results ./data/processed/articles.jsonl
```
This will retrieve the raw files found in the `raw_dir` folder, clean them up and store them in the `results` file.
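
As a rough illustration of what a cleanup step of this kind might do, here is a minimal sketch. It is not the code in `scripts/01_data_cleanup.py`: the function names `clean_article` and `cleanup`, the `title` and `body` field names, and the assumption that each raw file is a JSON array of article objects are all hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def clean_article(article: dict) -> Optional[dict]:
    """Collapse whitespace; return None when the article has no body text."""
    # "title" and "body" are assumed field names, not taken from the repository.
    title = " ".join(str(article.get("title", "")).split())
    body = " ".join(str(article.get("body", "")).split())
    if not body:
        return None
    return {**article, "title": title, "body": body}


def cleanup(raw_dir: str, results: str) -> int:
    """Read every *.json file in raw_dir and write cleaned articles as JSON Lines."""
    out_path = Path(results)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8") as out:
        for raw_file in sorted(Path(raw_dir).glob("*.json")):
            # Assumption: each raw file holds a JSON array of article objects.
            for article in json.loads(raw_file.read_text(encoding="utf-8")):
                cleaned = clean_article(article)
                if cleaned is not None:
                    out.write(json.dumps(cleaned, ensure_ascii=False) + "\n")
                    written += 1
    return written
```

Calling `cleanup("./data/raw", "./data/processed/articles.jsonl")` would mirror the paths used in the command above and return the number of articles written.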


**Split data into groups.** The processed `articles.jsonl` file contains all of the articles together. However, each article is associated with the set of concepts used to retrieve it from Event Registry during the news article collection step. To keep the clustering efficient, we split the articles into groups with the following script:

```bash
python scripts/02_data_concepts_split.py \
--articles_dir ./data/processed \
--concepts_dir ./data/processed/concepts
```
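
The grouping idea can be sketched as follows. This is only an illustration, not the logic of `scripts/02_data_concepts_split.py`: the function name `split_by_concept` and the `concepts` field (assumed to be a list of strings on each article) are hypothetical, and the sketch assumes an article with several concepts is placed in every matching group.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_concept(articles: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group articles by each concept listed in their (assumed) "concepts" field."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for article in articles:
        # Articles without a "concepts" field are skipped in this sketch.
        for concept in article.get("concepts", []):
            groups[concept].append(article)
    return dict(groups)
```

Each group could then be written to its own file under the `concepts_dir` folder, so that clustering only ever compares articles retrieved for the same concept.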

**Monolingual news article clustering.**

```bash
python scripts/03_data_mono_clustering.py \
--concepts_dir ./data/processed/concepts \
--mono_events_dir ./data/final/monolingual
```

**Multilingual news event clustering.**

```bash
python scripts/04_data_multi_clustering.py \
--mono_events_dir ./data/final/mono \
--multi_events_dir ./data/final/multi
```
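
The four steps above can also be chained from Python. The sketch below is a hedged example rather than part of the repository: `build_pipeline` and `run_pipeline` are hypothetical helper names, and only the script paths and flags are taken from the commands above. It assumes step 4 reads the directory that step 3 writes, so it passes the same `--mono_events_dir` value to both; the commands above show `./data/final/monolingual` and `./data/final/mono` respectively.

```python
import subprocess
import sys
from typing import List


def build_pipeline(data_dir: str = "./data") -> List[List[str]]:
    """Return the four script invocations, in order, as argument lists."""
    processed = f"{data_dir}/processed"
    concepts = f"{processed}/concepts"
    # Assumption: one shared directory for the monolingual events.
    mono = f"{data_dir}/final/monolingual"
    multi = f"{data_dir}/final/multi"
    py = sys.executable
    return [
        [py, "scripts/01_data_cleanup.py",
         "--raw_dir", f"{data_dir}/raw",
         "--results", f"{processed}/articles.jsonl"],
        [py, "scripts/02_data_concepts_split.py",
         "--articles_dir", processed,
         "--concepts_dir", concepts],
        [py, "scripts/03_data_mono_clustering.py",
         "--concepts_dir", concepts,
         "--mono_events_dir", mono],
        [py, "scripts/04_data_multi_clustering.py",
         "--mono_events_dir", mono,
         "--multi_events_dir", multi],
    ]


def run_pipeline(data_dir: str = "./data") -> None:
    """Run every step in order from the repository root."""
    for command in build_pipeline(data_dir):
        # check=True raises CalledProcessError and stops at the first failing step.
        subprocess.run(command, check=True)
```

Keeping the command construction separate from execution makes it possible to inspect or log the commands before anything runs.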



## 📣 Acknowledgments

This work is developed by the [Department of Artificial Intelligence][ailab] at the [Jozef Stefan Institute][ijs].
@@ -144,10 +181,9 @@ Humane AI Network (grant no. 952026).


[python]: https://www.python.org/
[conda]: https://www.anaconda.com/
[git]: https://git-scm.com/
[dvc]: https://dvc.org/
[er]: https://eventregistry.org/
[Event Registry API]: https://eventregistry.org/

[ailab]: http://ailab.ijs.si/
[ijs]: https://www.ijs.si/
1,649 changes: 70 additions & 1,579 deletions notebooks/01-antonk-raw-data-analysis.ipynb

Large diffs are not rendered by default.
