A project that uses Spark and other big data components to analyze GitHub data.


Github Data Analysis by Big Data System

Repo Structure

  • /api: stores API test results for GitHub and bilibili
    • Just import the files in ./postman into Postman to try the APIs
  • /cluster: contains the automatic configuration scripts and configuration files for every component of the system
    • /configs: the main directory, containing the files above
    • /backup-config and /test are leftover directories and can be ignored
  • /data_gen: Python scripts that collect the raw data
    • /time_spider: timer_spider.py is the main script. It downloads GH Archive compressed packages and calls the GitHub Open API to collect additional data. The API data about users and repos is stored in MySQL and merged with the GH Archive data; after merging, the data is written to a JSON file that Flume watches.
      • The GitHub Open API limits anonymous access to 50 requests/hour, so to call the API continuously from Python we use GitHub developer tokens, which raise the limit to 5000 requests/hour. The token list file is auth_token.txt; all tokens in it were invalidated when this repo was made public, so you need to supply your own developer tokens.
  • /data_preprocess: processes the raw data on HDFS and extracts activity tables into Hive
    • /load2hive: a shell script calls a Python script to load the raw data files on HDFS into the Hive base table
    • /hive2event: a shell script calls a Python script to extract the Hive base table into separate activity tables
    • /mysql2hive: a shell script calls a Python script to load the user and repo data from MySQL into Hive
  • /data_analysis: analyzes the data in Hive and stores the results in Hive, MySQL, and MongoDB
    • /sql: contains the MySQL schema
    • analysis.sh calls a Python script that executes multiple tasks in a loop; Spark performs the aggregation over the Hive data
  • /data_vis: ARCHIVED; contains the configuration of Superset, a big data visualization component. We replaced it with Grafana, which only needs a package install and is configured through its graphical interface.
  • /real_data_process: contains a single Python script that runs Spark Streaming to capture data from Kafka
    • you need to set resource limits in the SparkSession config, otherwise the job takes all cluster resources and blocks other Spark tasks (see the sketch after this list)
  • /mysql_data_bk: contains some data collected from the GitHub Open API; just create the corresponding schema and import these two SQL files into MySQL
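The resource limit mentioned for /real_data_process can be set directly on the SparkSession builder. The following is only a minimal sketch: the concrete limits, the Kafka topic name gh_events, and the use of Structured Streaming are assumptions, not values taken from the actual script.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("github-stream")
    .master("spark://master01:7077")
    # Cap cores and memory so the streaming job leaves resources for batch Spark tasks.
    .config("spark.cores.max", "2")
    .config("spark.executor.cores", "1")
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)

# Read from Kafka with Structured Streaming (requires the spark-sql-kafka package).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "master01:9092")
    .option("subscribe", "gh_events")  # hypothetical topic name
    .load()
)

# Decode the Kafka message value and print it; the real script processes it further.
query = (
    events.selectExpr("CAST(value AS STRING) AS raw_event")
    .writeStream.format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```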

Data Source

Github Archive

  • Download json.gz files from GH Archive by timestamp; each run downloads the latest hour that has not been fetched yet
  • Filter out the unused url properties in the JSON files (a sketch follows this list)
  • We treat this part as offline data, but in practice it is almost real-time:
    • we download it hour by hour and replay it as a continuous data flow
    • the data is updated by the GH Archive servers, with only a few hours of latency
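A minimal sketch of one download-and-filter step, assuming GH Archive's public hourly URL pattern; the output path and the rule for dropping url-like keys are illustrative, not taken from timer_spider.py:

```python
import gzip
import json
import urllib.request

def drop_urls(obj):
    """Recursively drop keys ending in 'url' (e.g. url, avatar_url)."""
    if isinstance(obj, dict):
        return {k: drop_urls(v) for k, v in obj.items() if not k.endswith("url")}
    if isinstance(obj, list):
        return [drop_urls(v) for v in obj]
    return obj

def fetch_hour(hour_stamp: str, out_path: str) -> None:
    """Download one hourly GH Archive file and write filtered events as JSON lines."""
    url = f"https://data.gharchive.org/{hour_stamp}.json.gz"
    with urllib.request.urlopen(url) as resp, \
         gzip.open(resp, "rt", encoding="utf-8") as gz, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in gz:
            out.write(json.dumps(drop_urls(json.loads(line))) + "\n")

# Example: fetch the archive for 2023-05-01, 12:00 UTC (hypothetical timestamp).
fetch_hour("2023-05-01-12", "2023-05-01-12.filtered.json")
```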

Github Open API

  • We call GitHub's Open API to query the extra information we are interested in
    • GH Archive only contains the basic information about each activity
  • This requires a valid token from a GitHub account, which raises the limit to 5000 API calls per hour (a sketch follows this list)
    • You can test the API with Postman
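As a hedged example, a token-authenticated request looks roughly like the sketch below; the endpoint chosen here and the one-token-per-line format of auth_token.txt are assumptions, not details confirmed by the scripts.

```python
import json
import urllib.request

# Read a developer token (one token per line is assumed).
with open("auth_token.txt", encoding="utf-8") as f:
    token = f.readline().strip()

req = urllib.request.Request(
    "https://api.github.com/repos/apache/spark",   # example repo endpoint
    headers={
        "Authorization": f"token {token}",          # raises the rate limit to 5000/hour
        "Accept": "application/vnd.github+json",
    },
)
with urllib.request.urlopen(req) as resp:
    repo_info = json.load(resp)

print(repo_info["full_name"], repo_info["stargazers_count"])
```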

Architecture

Whole Framework

(figure)

Cluster and component structure

(figure)

Data Schema

(figure)

Auto Config Control Flow

(figure)

Configuration

You can check the ./cluster/configs directory for the detailed configuration of each component.

Spark

Some basic configurations of Spark are as follows:

  • MASTER=spark://master01:7077
  • SPARK_MASTER_WEBUI_PORT=8090 : to avoid port conflict
  • SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory=hdfs://master01:8020/user/hadoop/evtlogs: the history log directory is set to /user/hadoop/evtlogs on HDFS (see the sketch below)
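A PySpark application only appears in the History Server if it writes its event logs to the same HDFS directory. A minimal sketch, assuming the directory configured above (the application name is made up):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("evtlog-demo")  # hypothetical application name
    .master("spark://master01:7077")
    # Write event logs where the History Server (port 18080) reads them.
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs://master01:8020/user/hadoop/evtlogs")
    .getOrCreate()
)
```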

Hive

There are two ways to combine Hive with Spark: Spark on Hive and Hive on Spark. In the former, Spark runs the SQL and uses Hive only as a metastore; in the latter, Hive is the SQL engine and runs on Spark. In this project we use the former. The reasons are as follows:

  • Hive on Spark requires recompiling Spark and importing jar packages, which is inconvenient.
  • Our main program uses PySpark, which does not fit the Hive on Spark setup (a sketch follows this list).
  • More and more companies build their data warehouses with Spark SQL and keep Hive only as a metastore.
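A minimal sketch of the Spark on Hive setup from PySpark: the session reads table metadata from the Hive metastore (port 9083, see the Port section) and runs the SQL itself. The table name gh_events is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-hive-demo")
    # Point Spark at the Hive metastore service.
    .config("hive.metastore.uris", "thrift://master01:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Hive tables are now visible to Spark SQL.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT type, COUNT(*) AS cnt FROM gh_events GROUP BY type").show()
```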

Port

Port   Service     Description
8088   YARN        ResourceManager
9870   HDFS        NameNode
9868   HDFS        SecondaryNameNode
9864   HDFS        DataNode
8042   YARN        NodeManager
8090   Spark       Master WebUI
8091   Spark       Worker WebUI
18080  Spark       History Server
9083   Hive        Metastore
10000  Hive        HiveServer2
3306   MySQL       MySQL Server
8020   HDFS        NameNode RPC (hdfs://master01:8020)
37017  MongoDB     MongoDB Server
3000   Grafana     Grafana Server
12222  SSH         SSH port
2888   Zookeeper   Peer communication (follower to leader)
3888   Zookeeper   Leader election
9092   Kafka       Kafka broker
2181   Zookeeper   Client port (used by Kafka)
7890   Clash       VPN/proxy server
7891   Clash       VPN/proxy server

  • Some components' default ports have been changed for security. Most of them should be opened only on the internal network.

Startup Steps

Cluster Config and Start

All corresponding files and directories are located in cluster/configs.

  • local PC:

    • modify the hosts file with LF line endings; an empty line must be added at the end of the file

    • copy the configs directory to $HOME on master01

  • Servers:

    • change the permissions of the private key ali-5003.pem to 600 so that ssh can use it

    • add execute permission to all shell scripts in the configs/tools directory

    • run sudo ./0-remoteConfig.sh to configure all nodes

      • Most of this script sets up passwordless SSH login while keeping the servers secure
      • Aliyun servers are attacked frequently, so this step is necessary
      • Sometimes the script terminates for no obvious reason before the installation finishes; check the logs to make sure the installation completed
      • This takes about 5-10 minutes
    • run sudo ./1-remoteStart.sh to start all nodes

Attention

  • After running 0-remoteConfig.sh, the SSH port is changed to 12222

    • Aliyun servers are attacked very often, and port 22 is the most targeted
    • You need to add the port parameter to the ssh command: ssh user@ip -p 12222
  • Logs of initialization and startup are located in cluster/configs/logs; the shell scripts don't print much, so the details are in the logs.

Data Pipeline

  1. Run time_spider.py to provide a stable data source for Flume
    1. Flume and Kafka import the data from the local JSON files to HDFS automatically
  2. Use scripts to run data_preprocess in the background
    1. Use nohup ... >> execute.log & so the timer shell scripts aren't interrupted by SSH disconnection
  3. Use scripts to run data_analysis in the background
  4. Use a script to run real_data_process in the foreground
  5. For data visualization, create dashboards and graphs by writing SQL in the Grafana web UI
