This is a machine learning project focused on creating and serving a machine learning model to predict future waiting times in the Phantasialand amusement park.
To this end, waiting time data from wartezeiten.app, weather data from the Deutscher Wetterdienst, and data about German public and school holidays were analyzed and used to train several machine learning models (Linear Regression, XGBoost, and LightGBM). I tuned the hyperparameters of the best model (LightGBM) and serve it as a web app using Streamlit.
You can learn more about this project on our blog: CI Insights (German).
You can try the WebApp here: http://predict-phantasialand.herokuapp.com/
This is what the interface looks like:
This repository comes without the data and the trained models. In order to reproduce the results, you will have to download the data and train the models by yourself. (If you are a member of Cologne Intelligence, you can also find them in the CIDD sharepoint, look for "Einstiegsprojekt Phantasialand".)
Set up a virtual environment and install all project dependencies. Python 3.8 or higher is required.
> python3 -m venv .venv/
> source .venv/bin/activate
> pip install -r requirements_dev.txt
Unfortunately, most of the data retrieval must be done by hand.
- Download `tageswerte_KL_01327_19370101_20201231_hist.zip` and `tageswerte_KL_02667_19570701_20201231_hist.zip` from OpenData.DWD - Historical
- Download `tageswerte_KL_01327_akt.zip` and `tageswerte_KL_02667_akt.zip` from OpenData.DWD - Recent
- Place all four files in `data/raw/dwd_weather`
- Download the iCal calendar file containing all German public holidays from https://www.feiertage-deutschland.de/kalender-download/ and save it as `Feiertage Deutschland.ics`
- Copy and paste the tables with school holiday information for 2019-2024 from https://www.schulferien.org/deutschland/ferien/ and save them as `schulferien.txt`. Look at `data/raw/schulferien_template.txt` for the file structure.
- Place both files in `data/raw/`
Activate the virtual environment and run
> python src/data/download_waiting_times.py data/raw/wartezeiten_app.csv
This will download all waiting time data from https://www.wartezeiten.app/phantasialand/ and may take a moment.
Make sure that `data/raw` looks like this:
data/raw
├── Feiertage Deutschland.ics
├── dwd_weather
│ ├── tageswerte_KL_01327_19370101_20201231_hist.zip
│ ├── tageswerte_KL_01327_akt.zip
│ ├── tageswerte_KL_02667_19570701_20201231_hist.zip
│ └── tageswerte_KL_02667_akt.zip
├── schulferien.txt
├── sources.md
└── wartezeiten_app.csv
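To check the layout above programmatically, a small helper like the following can report which expected files are still missing (the file names are taken from the tree above; the helper itself is not part of this repository):

```python
from pathlib import Path

# Expected files relative to data/raw, as listed in the directory tree above.
EXPECTED = [
    "Feiertage Deutschland.ics",
    "dwd_weather/tageswerte_KL_01327_19370101_20201231_hist.zip",
    "dwd_weather/tageswerte_KL_01327_akt.zip",
    "dwd_weather/tageswerte_KL_02667_19570701_20201231_hist.zip",
    "dwd_weather/tageswerte_KL_02667_akt.zip",
    "schulferien.txt",
    "wartezeiten_app.csv",
]

def missing_raw_files(raw_dir):
    """Return the expected raw-data files that are not present in raw_dir."""
    raw = Path(raw_dir)
    return [name for name in EXPECTED if not (raw / name).exists()]

if __name__ == "__main__":
    missing = missing_raw_files("data/raw")
    if missing:
        print("Missing files:", *missing, sep="\n  ")
    else:
        print("data/raw looks complete.")
```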
Run
> make data
to process the raw data and
> python src/training/train_lightgbm.py
to train the LightGBM model.
You may also have a look at the other training scripts in `src/training` or play with the parameters. All models trained with these scripts are saved using the MLflow model registry.
If you want to use the web app, you need to copy the desired model from the MLflow model registry to `models/best/`. The models trained and saved with MLflow are placed at `mlruns/0/<some hash>/artifacts/model`. Make sure to copy all files in this folder (especially `MLmodel` and `model.pkl`).

If you trained only one model, it should be easy to see which model you want to copy. Otherwise, use the MLflow UI (`mlflow ui --backend-store-uri sqlite:///mlflow.db`) to find the path to your favorite model.
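The copy step can also be done with a short script. This is a sketch based on the paths stated above; it assumes a flat `artifacts/model` directory and is not part of the repository:

```python
import shutil
from pathlib import Path

def copy_model(run_dir, dest="models/best"):
    """Copy all MLflow model artifacts (MLmodel, model.pkl, ...) from a run
    directory (mlruns/0/<some hash>) into dest, and return the copied names."""
    src = Path(run_dir) / "artifacts" / "model"
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for item in src.iterdir():
        # Assumes the artifacts are plain files, as for the models here.
        if item.is_file():
            shutil.copy2(item, dest / item.name)
    return sorted(p.name for p in dest.iterdir())
```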
Afterwards you can open the WebApp with
> streamlit run src/app/app.py
You can deploy the web app, including the model used for prediction, as a Docker container. Follow these steps:
- Ensure that you have Docker installed and `dockerd` is running
- Ensure that you ran `make data` and placed your favorite model in `models/best/`
- If you want to deploy the container via Heroku, follow this guide and build the container using `heroku container:push -R`. This will use the `Dockerfile.web` file, which is optimized for Heroku.
- Otherwise, run `docker build -t phantasialand:latest .` to build the container. This will use the `Dockerfile` file, which exposes the web app on a fixed port (8501).
If you want to perform some data analyses or model evaluations, you may want to have a look at the notebooks in `notebooks/`.
End-to-end analysis: Note that the models are trained on exact weather data, whereas users can only see weather bins (like sunny, overcast, light rain, heavy rain), as we do not have future weather data. This causes a bias in evaluation. To get realistic evaluation results, you can use `src/evaluation/test_e2e.py`. It transforms the exact weather data into weather bins before querying the model. The script generates a CSV file containing predictions and actual values for all samples in the test set, which can then be analyzed, e.g. by using one of the `notebooks/evaluation/fm_e2e_xxx` notebooks.
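The binning idea can be sketched as follows. The thresholds and function name here are illustrative assumptions, not the actual logic in `src/evaluation/test_e2e.py`:

```python
def bin_weather(precipitation_mm, sunshine_hours):
    """Map exact daily weather values to the coarse bins shown to users.

    The cut-off values below are hypothetical examples chosen for
    illustration; the real evaluation script defines its own binning.
    """
    if precipitation_mm >= 10.0:
        return "heavy rain"
    if precipitation_mm > 0.5:
        return "light rain"
    if sunshine_hours >= 6.0:
        return "sunny"
    return "overcast"
```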