-
Notifications
You must be signed in to change notification settings - Fork 8
yangqiang/BigDataBench-Spark
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
BigDataBench-Spark is an integrated part of the open source big data benchmark suite project: BigDataBench, publicly available from: http://prof.ict.ac.cn/BigDataBench This version is for Spark-1.3.x. If you need a citation for BigDataBench-Spark, please cite the following paper: [BigDataBench: a Big Data Benchmark Suite from Internet Services.](http://prof.ict.ac.cn/BigDataBench/wp-content/uploads/2013/10/Wang_BigDataBench.pdf) Lei Wang, Jianfeng Zhan, ChunjieLuo, Yuqing Zhu, Qiang Yang, Yongqiang He, WanlingGao, Zhen Jia, Yingjie Shi, Shujie Zhang, Cheng Zhen, Gang Lu, Kent Zhan, Xiaona Li, and BizhuQiu. The 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, USA. How to use BigDataBench's Spark workloads? Compile the source code or download a pre-build package(can be found is the `pre-build' folder). For compiling, please refer to: how-to-compile.txt Preparations: Make sure Spark-1.3.x has been successfully installed. Configure you bash environment: $SPARK_HOME points to the path where spark installed; Add $SPARK_HOME/bin to the $PATH variable. The workloads inculde: Sort, Grep, Word Count, NaiveBayesTrainer, BayesClassifier, ConnectedComponent, PageRank, KMeans, and CF(Collaborate Filtering -- ALS) How to run: Assume the bigdatabench-spark_*-1.3.0.jar file locates in $JAR_FILE. Sort run: spark-submit --class cn.ac.ict.bigdatabench.Sort $JAR_FILE <data_file> <save_file> [<slices>] parameters: <data_file>: the HDFS path of input data, for example: /test/data.txt <save_file>: the HDFS path to save the result [<slices>]: optional, times of number of workers input data format: ordinary text files Grep run: spark-submit --class cn.ac.ict.bigdatabench.Grep $JAR_FILE <data_file> <keyword> <save_file> [<slices>] parameters: <data_file>: the HDFS path of input data, for example: /test/data.txt <keyword>: the keyword to filter the text <save_file>: the HDFS path to save the result [<slices>]: optional, times of number of workers input data format: ordinary text files WordCount run: spark-submit --class cn.ac.ict.bigdatabench.WordCount $JAR_FILE <data_file> <save_file> [<slices>] parameters: <data_file>: the HDFS path of input data, for example: /test/data.txt <save_file>: the HDFS path to save the result [<slices>]: optional, times of number of workers input data format: ordinary text files NaiveBayesTrainer run: spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesTrainer $JAR_FILE <data_file> <save_file> [<slices>] parameters: <data_file>: the HDFS path of input data, for example: /test/data.txt <save_file>: the HDFS path to save the result [<slices>]: optional, times of number of workers input data format: classname text_content for example: (class: dog/cat) dog Dogs are awesome, cats too. I love my dog cat Cats are more preferred by software developers. I never could stand cats. I have a dog dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs cat Cats are difficult animals, unlike dogs, really annoying, I hate them all NaiveBayesClassifier run: spark-submit --class cn.ac.ict.bigdatabench.NaiveBayesClassifier $JAR_FILE <data_file> <model_file> <save_file> [<slices>] parameters: <data_file>: the HDFS path of input data, for example: /test/data.txt <model_file>: the HDFS path of Bayes model data(generated with the training program), for example: /test/bayes_model <save_file>: the HDFS path to save the classification result [<slices>]: optional, times of number of workers input data format: text_content for example: Dogs are awesome, cats too. I love my dog Cats are more preferred by software developers. I never could stand cats. I have a dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs Cats are difficult animals, unlike dogs, really annoying, I hate them all output data format: classname text_content for example: (class: dog/cat) dog Dogs are awesome, cats too. I love my dog cat Cats are more preferred by software developers. I never could stand cats. I have a dog dog My dog's name is Willy. He likes to play with my wife's cat all day long. I love dogs cat Cats are difficult animals, unlike dogs, really annoying, I hate them all ConnectedComponent run: spark-submit --class cn.ac.ict.bigdatabench.ConnectedComponent $JAR_FILE <data_file> [<slices>] parameters: <data_file>: the HDFS path of input data, for example: /test/data.txt [<slices>]: optional, times of number of workers input data format: from_vertex to_vertex for example: 1 2 1 3 2 5 4 6 6 7 PageRank run: spark-submit --class cn.ac.ict.bigdatabench.PageRank $JAR_FILE <file> <number_of_iterations> <save_path> [<slices>] parameters: <file>: the HDFS path of input data, for example: /test/data.txt <number_of_iterations>: number of iterations to run the algorithm <save_path>: path to save the result [<slices>]: optional, times of number of workers input data format page neighbour_page for example: a b a c b d CF(Collaborate Filtering, ALS) run: spark-submit --class cn.ac.ict.bigdatabench.ALS $JAR_FILE <ratings_file> <rank> <iterations> [<splits>] parameters: <ratings_file>: path of input data file <rank>: number of features to train the model <iterations>: number of iterations to run the algorithm [<splits>]: optional, level of parallelism to split computation into input data: userID,productID,rating for example: 1,1,5 1,3,4 1,5,1 2,1,4 2,5,5 KMeans run spark-submit --class cn.ac.ict.bigdatabench.KMeans $JAR_FILE <input_file> <k> <max_iterations> [<splits>] parameters: <input_file>: the HDFS path of input data, for example: /test/data.txt <k>: number of centers <max_iterations>: number of iterations to run the algorithm [<splits>]: optional, level of parallelism to split computation into input data: x11 x12 x13 ... x1n x21 x22 x23 ... x2n for example 1.0 1.1 1.3 1.4 2.1 2.4 2.6 2.7 3.1 3.3 3.6 3.7