news-document-pipeline-htw-berlin/Analytics


Prerequisites

  • Java 8
  • MongoDB
  • Scala 2.11.12

Quickstart

  • git clone https://github.com/news-document-pipeline-htw-berlin/Analytics
  • cd Analytics
  • Add the three Spark NLP models (NamedEntityRecognition, StopWordCleaner, Lemmatizer; download links are provided below) to the main/resources folder
  • Edit inputUri and outputUri in the App class so they point to where your input data is stored and where the processed data should be saved (see the sketch after this list)
  • Make sure the input MongoDB collection is structured according to the cheat sheet below
  • sbt run
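
The snippet below is a minimal sketch of how the two URIs in the App class might be wired up, assuming the MongoDB Spark connector is configured through the SparkSession. The database and collection names are placeholders and need to be adjusted to your environment.

```scala
import org.apache.spark.sql.SparkSession

object App {
  // Placeholder URIs - point these at your own database and collections.
  val inputUri  = "mongodb://localhost:27017/news.scraped_articles"
  val outputUri = "mongodb://localhost:27017/news.analyzed_articles"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Analytics")
      .master("local[*]")
      .config("spark.mongodb.input.uri", inputUri)   // collection the crawled articles are read from
      .config("spark.mongodb.output.uri", outputUri) // collection the processed articles are written to
      .getOrCreate()

    // ... the NLP pipeline (NER, stop word cleaning, lemmatization, ...) runs here ...
  }
}
```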

Links for required NLP Models

MongoDB Cheat Sheet

Each article document contains the following fields:
  • _id = unique hash identifying a single article
  • authors = an array containing the authors
  • crawl_time = time stamp indicating when the article was crawled
  • description = summary of the text, written by the author
  • departments = theme-based categorisation of the article
  • entities = named entities and their predicted categories (PERson, LOCation, ORGanisation, MISCellaneous)
  • image_links = an array containing the links of the images used in the article
  • intro = introduction text
  • keywords = an array containing keywords given by the author
  • keywords_extracted = an array containing keywords extracted by the analytics team
  • lemmatizer = output of the StopWordCleaner, with tokens reduced to their root/neutral form
  • links = an array containing the links used in the article
  • long_url = the complete URL of the article
  • news_site = name of the source
  • published_time = time stamp indicating when the article was published
  • read_time = estimated read time for the article
  • sentiments = calculated sentiment value for the text
  • short_url = the shortened URL of the article
  • text = body of the article
  • textsum = a generated summary of the text consisting of the three sentences with the highest calculated significance
  • title = title of the article
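
For illustration, the schema can be expressed as a Scala case class. The field names follow the cheat sheet above, while the types are assumptions and may differ from the documents actually stored by the pipeline.

```scala
// Hypothetical sketch of the article schema; field types are assumptions.
case class Article(
  _id: String,                    // unique hash identifying a single article
  authors: Seq[String],
  crawl_time: String,             // time stamp of the crawl
  description: String,
  departments: Seq[String],
  entities: Map[String, String],  // named entity -> predicted category (PER, LOC, ORG, MISC)
  image_links: Seq[String],
  intro: String,
  keywords: Seq[String],
  keywords_extracted: Seq[String],
  lemmatizer: Seq[String],        // stop-word-cleaned tokens reduced to their root form
  links: Seq[String],
  long_url: String,
  news_site: String,
  published_time: String,         // time stamp of publication
  read_time: Double,              // estimated read time
  sentiments: Double,             // calculated sentiment value
  short_url: String,
  text: String,
  textsum: String,                // three-sentence extractive summary
  title: String
)
```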
