User Behaviors Analysis Using Spark-SQL

###This is the term project of course COMP7305, Cluster and Cloud Computing, of Dept. of Computer Science, The University of Hong Kong. Supervised by Prof. Cho-Li Wong.

In this project, we analyse the user behaviours based on an online education website focused on programming techniques.

For project enviornment, we installed Xen virtual machines on our lab PCs, in order to create a 8-VM cluster. After that, we installed Hadoop and Yarn on this private cluster, and setup Spark on Yarn. Then we installed a Mysql Database on one of the VM machines to store the final cleaned data from Spark SQL, via JDBC.

Main programming language used in our project is Scala. We also wrote Crawler in Python to collect Course Labels and Catalogs from website, and then join the tabel with cleaned website log.

We debugged code using --master("local[2]") locally via IntelliJ Idea, and then assembled all the dependancies into .jar package using Mvn, pushing to the cluster. We used ./bin/spark-submit to submit the task to Yarn.

We focused on Parallel Programming features in Spark using Scala, and parameters effect when submitting, as well as discovered VCpus and Memory configuration in Xen.

After a series of improvements, we accelerate the running time form more than 3 minutes to less than 1.5 minutes.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.idea		.idea
src		src
target		target
.gitattributes		.gitattributes
README.md		README.md
pom.xml		pom.xml
sqltest1.iml		sqltest1.iml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

User Behaviors Analysis Using Spark-SQL

About

Releases

Packages

Languages

CaptainDuke/Spark-SQL

Folders and files

Latest commit

History

Repository files navigation

User Behaviors Analysis Using Spark-SQL

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages