Reformat the project structure
eriknovak committed Nov 26, 2021
1 parent aba237c commit 51d411f
Showing 19 changed files with 90 additions and 39 deletions.
2 changes: 1 addition & 1 deletion .gitmodules
@@ -1,3 +1,3 @@
[submodule "data-collector"]
path = data-collector
path = services/data-collector
url = https://github.com/ErikNovak/event-registry-collector.git
1 change: 0 additions & 1 deletion data-collector
Submodule data-collector deleted from 3b0479
1 change: 1 addition & 0 deletions models/README.md
@@ -0,0 +1 @@
# Models
1 change: 1 addition & 0 deletions notebooks/README.md
@@ -0,0 +1 @@
# Notebooks
13 changes: 13 additions & 0 deletions requirements.txt
@@ -0,0 +1,13 @@
jupyterlab
ipywidgets>=7.5
numpy
pandas
matplotlib
python-dotenv
tqdm
transformers
scikit-learn
black
datasets
dvc
torch
15 changes: 15 additions & 0 deletions scripts/README.md
@@ -0,0 +1,15 @@
# Scripts

This folder contains the scripts for:

1. setting up the project environment;

```bash
bash setup_environment.sh {event-registry-api-key}
```

2. collecting the raw news data from Event Registry.

```bash
bash run_news_collector.sh
```
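
Both scripts use paths relative to the current working directory (for example `../data/raw`), so they appear to expect being run from inside `scripts/`. A minimal sketch, not part of this commit, of how a script could resolve its own directory first; `script_dir` is a hypothetical helper name:

```bash
#!/bin/bash

# Hypothetical helper (an assumption, not in the repository): print the
# absolute directory that contains the given script path.
script_dir() {
    # The subshell keeps the cd from changing the caller's working directory.
    ( cd "$(dirname "$1")" && pwd )
}

# Inside a script one could then anchor relative paths like this:
#   cd "$(script_dir "${BASH_SOURCE[0]}")" || exit 1
```

With that line at the top of a script, `../data/raw` would resolve the same way regardless of where the script is invoked from.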
47 changes: 15 additions & 32 deletions scripts/run_news_collector.sh
@@ -1,5 +1,8 @@
#!/bin/bash

# stop the whole script on CTRL+C
trap "exit" INT

# ===============================================
# Check for python
# ===============================================
@@ -29,8 +32,8 @@ fi;
# ===============================================

# create a new folder
if [[ ! -d "../data/news" ]]; then
mkdir ../data/news
if [[ ! -d "../data/raw" ]]; then
mkdir ../data/raw
fi;

# ===============================================
@@ -53,7 +56,14 @@ declare -a LANGUAGES=(
)

declare -a CONCEPTS=(
"ever given,container ship,suez canal"
#"Kobe Bryant,Helicopter,Basketball"
#"Container ship,Suez Canal"
"Pandora Papers"
"2020-2021 global chip shortage"
"Hong Kong,Demonstration (Protest)"
"Basketball,National Basketball Association (NBA)"
"EuroBasket"
"UEFA Champions League,Association football,Match"
)

for i in ${!CONCEPTS[@]}; do
@@ -64,13 +74,7 @@ for i in ${!CONCEPTS[@]}; do
# get the current language
LANGUAGE=${LANGUAGES[$j]}

# # prepare the files and folders
# EVENTS_PATH="../data/news/${LANGUAGE}/events/${CONCEPT// /_}.jsonl"
# ARTICLES_PATH="../data/news/${LANGUAGE}/articles/${CONCEPT// /_}"

# TODO: figure out whether we want to retrieve the articles directly
# TODO: or retrieve the events and then their articles
ARTICLES_PATH="../data/news/${LANGUAGE}/${CONCEPT// /_}.jsonl"
ARTICLES_PATH="../data/raw/${LANGUAGE}/${CONCEPT// /_}.jsonl"

# get the articles fitting the query parameters
collect articles \
@@ -80,28 +84,7 @@ for i in ${!CONCEPTS[@]}; do
--date_start=$DATE_START \
--save_to_file=$ARTICLES_PATH

awk '{print NF}' "$ARTICLES_PATH" | sort -nu | tail -n 1

# # get the events mentioning the concepts
# collect events \
# --max_repeat_request=5 \
# --concepts="$CONCEPT" \
# --languages=$LANGUAGE \
# --date_start=$DATE_START \
# --save_to_file=$EVENTS_PATH


# if [[ -f $EVENTS_PATH ]]; then
# # get the articles of the events acquired with the above command
# collect event_articles_from_file \
# --max_repeat_request=5 \
# --event_ids_file=$EVENTS_PATH \
# --save_to_file=$ARTICLES_PATH
# else
# echo "Event file non-existent! Parameters:
# --concepts='$CONCEPT'
# --languages='$LANGUAGE'
# --date_start='$DATE_START'
# "
# fi;
done
done
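
The output path above is built by replacing every space in the concept string with an underscore via `${CONCEPT// /_}`, while the script itself only creates `../data/raw`. A minimal sketch, under the assumption that the collector does not create the per-language folder on its own (the commit does not say either way); `articles_path` and `ensure_parent_dir` are hypothetical helpers, and `eng` is an assumed language code because the `LANGUAGES` array is not visible in this diff:

```bash
#!/bin/bash

# Hypothetical helper (not part of the commit): build the output path the
# same way the loop does, replacing every space in the concept with "_".
articles_path() {
    local root="$1" language="$2" concept="$3"
    printf '%s/%s/%s.jsonl\n' "$root" "$language" "${concept// /_}"
}

# Hypothetical helper: create the parent directory of a file path, including
# any missing intermediate folders (mkdir -p succeeds if it already exists).
ensure_parent_dir() {
    mkdir -p "$(dirname "$1")"
}

# Example with one of the concepts from the CONCEPTS array.
articles_path "../data/raw" "eng" "Hong Kong,Demonstration (Protest)"
# prints: ../data/raw/eng/Hong_Kong,Demonstration_(Protest).jsonl
```

Commas and parentheses stay in the file name, so the path should remain quoted wherever it is used.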
7 changes: 5 additions & 2 deletions scripts/setup_environment.sh
@@ -1,5 +1,8 @@
#!/bin/bash

# stop the whole script on CTRL+C
trap "exit" INT

# ===============================================
# Retrieve all gitmodules
# ===============================================
@@ -38,7 +41,7 @@ fi;
# ===============================================

if [[ -d "../$REPO_ENV" ]]; then
pip install -e ../data-collector
pip install -e ../services/data-collector
echo "News collector initialized"
fi;

@@ -52,7 +55,7 @@ if [[ -z "$ER_API_KEY" ]]; then
else
echo "Copying the Event Registry API Key"
# create the .env file with the API key as the content
echo "API_KEY=$ER_API_KEY" > ../data-collector/.env
echo "API_KEY=$ER_API_KEY" > ../services/data-collector/.env
fi;

# ===============================================
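
The hunk above writes the API key to `.env` inside the relocated submodule folder. A minimal sketch, not the script's actual code, of the same step with a check that the target folder exists (it may be missing if the submodule has not been initialized); `write_env_file` is a hypothetical helper:

```bash
#!/bin/bash

# Hypothetical helper: write API_KEY=<key> into <dir>/.env.
# Returns 1 without writing anything if the directory does not exist.
write_env_file() {
    local target_dir="$1" api_key="$2"
    if [[ ! -d "$target_dir" ]]; then
        echo "Directory not found: $target_dir" >&2
        return 1
    fi
    # Overwrites an existing .env, like the ">" redirect in the script.
    printf 'API_KEY=%s\n' "$api_key" > "$target_dir/.env"
}

# Usage, mirroring the path used in the script:
#   write_env_file "../services/data-collector" "$ER_API_KEY"
```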
4 changes: 4 additions & 0 deletions services/README.md
@@ -0,0 +1,4 @@
# Services

The external services used in this project, e.g. Docker containers and data
collectors.
1 change: 1 addition & 0 deletions services/data-collector
Submodule data-collector added at 1de613
@@ -14,7 +14,7 @@ services:
type: volume
volume:
nocopy: true
- ../data/label-studio:/label-studio/data:rw
- ../../data/external/label-studio:/label-studio/data:rw
- ./nginx/${NGINX_FILE:-default.conf}:/etc/nginx/conf.d/default.conf:ro
command: nginx -g "daemon off;"

@@ -36,7 +36,7 @@ services:
- LABEL_STUDIO_HOST=${LABEL_STUDIO_HOST:-}
- LABEL_STUDIO_COPY_STATIC_DATA=true
volumes:
- ../data/label-studio:/label-studio/data
- ../../data/external/label-studio:/label-studio/data
# keep in sync with deploy/docker-entrypoint.d/30-copy-static-data.sh
- source: static
target: /label-studio/static_volume
@@ -51,7 +51,7 @@ services:
environment:
- POSTGRES_HOST_AUTH_METHOD=trust
volumes:
- ${POSTGRES_DATA_DIR:-../data/postgres}:/var/lib/postgresql/data
- ${POSTGRES_DATA_DIR:-../../data/external/postgres}:/var/lib/postgresql/data

volumes:
static: {}
File renamed without changes.
File renamed without changes.
2 changes: 2 additions & 0 deletions setup.cfg
@@ -0,0 +1,2 @@
[flake8]
max-line-length=120
20 changes: 20 additions & 0 deletions setup.py
@@ -0,0 +1,20 @@
from setuptools import setup, find_packages

with open("README.md", mode="r") as fh:
    long_description = fh.read()

with open("requirements.txt", "r") as fh:
    requirements = fh.readlines()

setup(
    name="worldnews",
    version="0.1.0",
    author="Erik Novak",
    author_email="[email protected]",
    description="Setting up the worldnews data collection and preparation",
    long_description=long_description,
    long_description_content_type="text/markdown",
    packages=find_packages(),
    install_requires=[req for req in requirements if req[:2] != "# "],
    setup_requires=["flake8"],
)
3 changes: 3 additions & 0 deletions src/README.md
@@ -0,0 +1,3 @@
# Source Code

The source code for this project.
Empty file added src/__init__.py
Empty file.
3 changes: 3 additions & 0 deletions src/models/README.md
@@ -0,0 +1,3 @@
# Models

The source code for training the models.
3 changes: 3 additions & 0 deletions src/visualization/README.md
@@ -0,0 +1,3 @@
# Visualization

The source code for creating visualizations.
