The Real-time Weather Data Pipeline is a simple data processing application that collects and analyses weather data from 60 cities around the world in real time.
It leverages Apache Kafka for data ingestion and Apache Spark for processing and analytics. The pipeline provides a continuous stream of weather information, enabling users to monitor and analyse weather patterns across multiple locations simultaneously.
The application retrieves weather data from multiple cities using the OpenWeatherMap API. It periodically fetches geolocation and weather information for a list of 60 cities.
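As a rough sketch of this step, the fetch logic might look something like the following. The helper name, the geocoding flow, and the `OPENWEATHER_API_KEY` environment variable are illustrative assumptions rather than the project's actual code:

```python
import os
import requests

API_KEY = os.getenv("OPENWEATHER_API_KEY")  # assumed to be loaded from the .env file

def fetch_weather(city: str) -> dict:
    """Resolve a city name to coordinates, then fetch its current weather."""
    # Geocoding endpoint: city name -> latitude/longitude
    geo = requests.get(
        "https://api.openweathermap.org/geo/1.0/direct",
        params={"q": city, "limit": 1, "appid": API_KEY},
        timeout=10,
    ).json()[0]

    # Current weather endpoint for the resolved coordinates
    return requests.get(
        "https://api.openweathermap.org/data/2.5/weather",
        params={"lat": geo["lat"], "lon": geo["lon"], "appid": API_KEY, "units": "metric"},
        timeout=10,
    ).json()
```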
Kafka serves as the data streaming platform. A Kafka producer component is responsible for fetching weather data and sending it to Kafka topics. It collects data for multiple cities and batches it before transmitting.
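A minimal producer sketch using the `kafka-python` client is shown below; the broker address, topic name (`weather`), and function name are assumptions and may differ from the actual implementation:

```python
import json
from kafka import KafkaProducer

# Broker address and topic name are assumptions; adjust to your setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_batch(records: list[dict], topic: str = "weather") -> None:
    """Send a batch of per-city weather records to the Kafka topic."""
    for record in records:
        producer.send(topic, value=record)
    producer.flush()  # block until the whole batch has been delivered
```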
Spark Structured Streaming is used for real-time data processing. It consumes weather data from Kafka topics and computes various statistics, such as average temperature, wind speed, humidity, and pressure.
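A simplified sketch of such a streaming job is given below. The schema fields, topic name, and broker address are assumptions chosen to match the producer sketch above, not the project's exact definitions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("weather-streaming").getOrCreate()

# Simplified record schema; the real messages likely carry more fields.
schema = StructType([
    StructField("city", StringType()),
    StructField("temperature", DoubleType()),
    StructField("wind_speed", DoubleType()),
    StructField("humidity", DoubleType()),
    StructField("pressure", DoubleType()),
])

# Read the raw Kafka stream (broker and topic are assumptions).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "weather")
       .load())

# Parse the JSON payload and compute per-city averages.
parsed = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("w")).select("w.*")
stats = parsed.groupBy("city").agg(
    F.avg("temperature").alias("avg_temperature"),
    F.avg("wind_speed").alias("avg_wind_speed"),
    F.avg("humidity").alias("avg_humidity"),
    F.avg("pressure").alias("avg_pressure"),
)

# Write the running aggregates to the console for inspection.
query = stats.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```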
- Java Development Kit (JDK) installed on your machine.
- Download and extract Kafka: https://kafka.apache.org/downloads
- Generate your own OpenWeather API key: https://openweathermap.org/
- Create a `.env` file in the project root and add your OpenWeather API key:
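For example (the variable name `OPENWEATHER_API_KEY` is an assumption; use whatever name the application actually reads):

```
OPENWEATHER_API_KEY=your_api_key_here
```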
- Start a Zookeeper instance:
```bash
cd /path/to/kafka
bin/zookeeper-server-start.sh config/zookeeper.properties
```
- Start the Kafka broker:

```bash
bin/kafka-server-start.sh config/server.properties
```
- Install dependencies and run the app:
```bash
cd /path/to/weather-data-stream
pip install -r requirements.txt
python main.py
python weather_data_streaming.py
```