> In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
>
> —Grace Hopper
Volume: The quantity of data.
- Usually too big to fit into the memory of a single machine.
Veracity: The quality of the data can vary; data is often inconsistent.
Variety: Data often comes in a variety of formats and from a variety of sources, and often needs to be combined for analysis.
Velocity: The speed of generation of new data.
- 100 terabytes of data are uploaded to Facebook daily.
- 90% of all data ever created was generated in the past 2 years.
Problem: It is impossible to do data analysis on data of this size with a single computer.
Solution: A distributed computing system.
### Veracity
- Big Data sets do not have the controls of regular studies
- Naming inconsistency
- Inconsistency in signal strength (e.g., Boston's Street Bump app, Google Flu Trends)
- Cannot simply assume data missing at random
Example of a veracity naming inconsistency: the same musician named several different ways in several different files.
### Variety
- Most of Big Data is unstructured or semi-structured data
- Doesn't have the guarantees of SQL
- Data can be structured, semi-structured, or unstructured
- Often have to combine various datasets from a variety of sources and formats
### Velocity
- The speed at which data is created, stored, and analyzed to create actionable intelligence.
- Every minute:
  - 100 hours of video are uploaded to YouTube
  - 200 million emails are sent
  - 20 million photos are viewed
- Often need to be very agile in creating a data product.
Hadoop: A distributed file system (HDFS), a MapReduce engine, and a scheduler (YARN).
Spark: An in-memory based alternative to Hadoop's MapReduce which is better for machine learning algorithms.
- Spark SQL, MLlib (machine learning), GraphX (graph-parallel computation), and Spark Streaming.
Storm: Distributed tool for processing fast, large streams of data.
Cassandra: NoSQL system implemented on Hadoop.
Hive: Allows users to create SQL-like queries (HQL) and convert them to MapReduce.
HCatalog: A centralized metadata management and sharing service for Hadoop, allowing a unified view of all data in Hadoop clusters.
Pig: An easy-to-learn, Hadoop-based language that is adept at very deep, very long data pipelines.
Mahout: A data mining library implementing the most popular data mining algorithms using the MapReduce model.
NoSQL (Not Only SQL): A database that is not based on the storage and retrieval of tabular relations used in relational databases. Some can provide a distributed database.
Examples: MongoDB, CouchDB, Accumulo, and some NoSQL databases are implemented on Hadoop: Cassandra, HBase.
SQL: Can spill to disk allowing datasets to be larger than memory size.
MADlib: Machine learning library extension for PostgreSQL.
- The blue components (in the figure) are the necessary components of a Hadoop ecosystem.
- Some tools provide several functionalities.
  - e.g., Hadoop is a distributed file system with a MapReduce engine and a scheduler.
HDFS (Hadoop Distributed File System) is a distributed file-system across multiple interconnected computer systems (nodes).
- Data is stored across multiple hard drives.
Lustre: DFS used by most enterprise High Performance Computing (HPC) clusters. Usually uses a shared networked drive.
Google File System (GFS): Google's proprietary distributed file system.
MapR: DFS inspired by HDFS but written in C++ instead of Java.
- MapReduce is the engine that processes data in a Hadoop Ecosystem.
- Spark and Tez use a more flexible in-memory model of MapReduce, which is better for machine learning algorithms.
In order for multiple people to run jobs on the same cluster, a scheduler is needed.
These are the tools to help parse, transform, and combine various datasets.
- Hive, Spark SQL, Impala, Cassandra, and HBase all use a SQL-like language to help manipulate data.
- Hive can be implemented using the Spark MapReduce engine, significantly speeding up its processing (a small Spark SQL sketch follows this list).
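As a rough illustration of this SQL-like interface, here is a minimal Spark SQL sketch in PySpark (assuming Spark 2+ with a SparkSession; the `songs` table and its columns are made up for the example, not taken from any dataset in these notes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# A tiny illustrative DataFrame; in practice this would be read from HDFS, Hive, etc.
songs = spark.createDataFrame(
    [("Beatles", 1969), ("Beatles", 1966), ("Nirvana", 1991)],
    ["artist", "year"])
songs.createOrReplaceTempView("songs")

# SQL-like query, planned and executed by the Spark engine.
spark.sql("SELECT artist, COUNT(*) AS n FROM songs GROUP BY artist").show()
```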
There are several Machine Learning algorithms already in place for the Hadoop Ecosystem.
- Mahout can be implemented on Spark, Tez, and Hadoop
- Spark also has GraphX, which uses graphs to perform analytics (PageRank, etc.)
There are also specialized tools:
Hadoop Image Processing Interface (HIPI): Image processing package helping to determine image similarity.
SpatialHadoop: Extension to process datasets of spatial data in Hadoop.
Parsing, transforming and combining the data into a useable dataset can be time consuming. Thus, once a suitable amount of work is done to create a useable dataset it is best to save it for future work.
Serialization saves the state of the data, allowing it to be recreated at a later date.
- Java serialization is the least efficient of these formats and should only be used for legacy reasons.
Avro: Serialization made for Hadoop.
JSON: JavaScript Object Notation is a convenient way of describing, serializing, and transferring data.
Protocol Buffers: A more optimized serialization format that requires the precise structure of the data to be known when the job is run. Has less support across programming languages.
Parquet: A columnar data storage format, allowing it to perform well for structured data with a fair amount of repetition (see the sketch below).
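As one hedged example of serializing a prepared dataset, a Spark DataFrame (such as the hypothetical `songs` table from the sketch above) can be written to and read back from Parquet; the path below is illustrative:

```python
# Persist the cleaned DataFrame in a columnar (Parquet) format for later work.
songs.write.mode("overwrite").parquet("/tmp/songs.parquet")

# Later, or in another job: read it back without redoing the parsing/combining work.
songs_back = spark.read.parquet("/tmp/songs.parquet")
songs_back.show()
```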
Data transfer tools move large amounts of data to and from the distributed file system.
Flume, DistCp: Move files and flat text into Hadoop.
Sqoop: Move data between Hadoop and SQL.
Streaming provides new calculations based on incoming data.
Example: Netflix's 'Trending Now' feature. Possibly in personalized medicine, using medical devices to detect heart attacks before they happen.
Spark Streaming: Uses a micro-batch model that checks for updates every 0.5-10 seconds and updates its model (a minimal sketch follows).
Storm: Uses either streaming or micro-batch updates to update its model.
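A minimal Spark Streaming (micro-batch) sketch, assuming a SparkContext `sc` and a text source on localhost port 9999 (e.g. started with `nc -lk 9999`); the 2-second batch interval is an arbitrary choice within the 0.5-10 second range mentioned above:

```python
from pyspark.streaming import StreamingContext

# Micro-batch interval of 2 seconds: each batch of incoming lines is a small RDD.
ssc = StreamingContext(sc, 2)
lines = ssc.socketTextStream("localhost", 9999)

# Word counts recomputed for every micro-batch.
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```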
Node configuration management: Puppet, Chef. Change operating system parameters and install software.
Resource Tracking: Monitor the performance of many tools.
Coordination: Helps synchronize many tools in a single application: Zookeeper.
Ambari: Tool to help install, start, stop, and reconfigure a Hadoop cluster.
HCatalog: Central catalog of file formats and locations of data. Data looks table-like to the user.
Nagios: Alerts on failures and problems through a graphical interface and alerts administrators of problems through email.
Puppet, Chef: Manager for configuration of a large number of machines.
Zookeeper: Helps coordination of tools.
Oozie: Workflow scheduler to start, stop, suspend and restart jobs, controlling the workflow so that no task is performed before it is ready.
Ganglia: Visualizes how systems are being used and keeps track of the general health of the cluster.
Hadoop in itself doesn't provide much security. As Hadoop has increased in popularity, so have security projects.
Kerberos, Sentry, and Knox are such projects.
Sometimes you only need intermittent use of a cluster, and creating/maintaining one of your own is prohibitively expensive.
Cloud computing and virtualization tools allow Hadoop environments to be constructed with relative ease on cloud platforms like AWS.
Distribution platforms provide (for a cost) easy installation and software maintenance of a Hadoop cluster.
- Tool versions are checked for compatibility, usually meaning that they are not the newest versions.
Some of these are: Cloudera, MapR, and Hortonworks.
Problem: A single computer can't be used to process the data (it would take too long).
Solution: Use a group of interconnected computers (processor and memory independent).
Problem: Conventional algorithms are not designed around memory independence.
Solution: MapReduce
Definition. MapReduce is a programming paradigm that uses parallel, distributed algorithms to process or generate data sets. MapReduce is composed of two main functions:
Map(k,v): Filters and sorts data.
Reduce(k,v): Aggregates data according to keys (k).
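To make the two functions concrete, here is a minimal word-count sketch in plain Python (no Hadoop involved); the function names and the use of a dictionary to stand in for the shuffle step are illustrative choices, not part of any particular framework:

```python
from collections import defaultdict

def map_fn(offset, line):
    # Map(k, v): k is the byte offset of the line (unused here), v is the line.
    # Emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce(k, v): aggregate all intermediate values for one key.
    yield (word, sum(counts))

def run_mapreduce(lines):
    # Stand-in for shuffle and sort: group intermediate values by key.
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    output = []
    for key, values in grouped.items():
        output.extend(reduce_fn(key, values))
    return output

print(run_mapreduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```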
MapReduce is broken down into several steps:
- Record Reader
- Map
- Combiner (Optional)
- Partitioner
- Shuffle and Sort
- Reduce
- Output Format
Record Reader: Splits input into fixed-size pieces for each mapper.
- The key is positional information (the number of bytes from start of file) and the value is the chunk of data composing a single record.
- In Hadoop, each map task's input is an input split, which is usually simply an HDFS block.
- Hadoop tries to schedule map tasks on nodes where that block is stored (data locality).
- If a record is broken across blocks, Hadoop requests the additional information from the next block in the series.
Map: A user-defined function outputting intermediate key-value pairs.
key (k_2): Later, MapReduce will group and possibly aggregate data according to these keys; choosing the right keys here is important for a good MapReduce job.
value (v_2): The data to be grouped according to its keys.
Combiner: A user-defined function that aggregates data according to intermediate keys on a mapper node.
- This can usually reduce the amount of data to be sent over the network, increasing efficiency.
- The combiner should be written with the idea that it is executed over most, but not all, map tasks.
- Usually very similar to, or the same code as, the reduce method (see the sketch after this list).
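Continuing the hypothetical word-count sketch from earlier, a combiner can pre-aggregate one mapper's output locally before anything crosses the network; the helper below is illustrative only:

```python
from collections import Counter

def combine_fn(mapper_output):
    # Locally sum the (word, 1) pairs emitted by a single map task,
    # so far fewer pairs are shuffled to the reducers.
    local_counts = Counter()
    for word, count in mapper_output:
        local_counts[word] += count
    return list(local_counts.items())

# One mapper's intermediate output for the line "the cat sat on the mat":
print(combine_fn([("the", 1), ("cat", 1), ("sat", 1),
                  ("on", 1), ("the", 1), ("mat", 1)]))
# [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]
```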
Partitioner: Sends intermediate key-value pairs (k_2, v_2) to a reducer, typically by taking a hash of the key modulo the number of reducers (sketched below).
- This will usually result in a roughly balanced load across the reducers while ensuring that all key-value pairs sharing a key end up on a single reducer.
- A balancer system is in place for the cases when the key-values are too unevenly distributed.
- In Hadoop, the intermediate key-value pairs are written to the local hard drive and grouped by which reducer they will be sent to and by their key.
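A minimal illustration of the hash-based routing described above (a modulus over the number of reducers is the common default; the exact behavior depends on the framework's configured partitioner):

```python
def partition(key, num_reducers):
    # The same key always maps to the same reducer, and distinct keys
    # spread roughly evenly across the available reducers.
    return hash(key) % num_reducers

# Within one run, "cat" is always routed to the same of the 4 reducers.
print(partition("cat", 4) == partition("cat", 4))  # True
```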
Shuffle and Sort: On the reducer node, sorts by key to help group equivalent keys.
Reduce: A user-defined function that aggregates data (v) according to keys (k) and sends the resulting key-value pairs to the output.
Output Format: Translates the final key-value pairs to a file format (tab-separated by default).
Image Source: Xiaochong Zhang's Blog
A more flexible form of MapReduce is used by Spark, via Directed Acyclic Graphs (DAGs); a small PySpark example follows the list below.
For a set of operations:
- Create a DAG for operations
- Divide DAG into tasks
- Assign tasks to nodes
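As a rough PySpark illustration of the DAG model (assuming a SparkContext `sc`, e.g. from the pyspark shell): transformations only record operations in the graph, and nothing executes until an action is called.

```python
# Transformations build up the DAG; no computation happens yet.
numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x)                # one node in the DAG
even_squares = squares.filter(lambda x: x % 2 == 0)   # another node

# The action triggers Spark to divide the DAG into stages/tasks
# and assign them to nodes in the cluster.
print(even_squares.count())  # 500 (squares of the 500 even numbers)
```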
- Looking for parameter(s) (theta) of a model (mean, parameters of regression, etc.)
- Partition and Model:
  - Partition data,
  - Apply unbiased estimator,
  - Average results.
- Sketching / Sufficient Statistics:
  - Partition data,
  - Reduce dimensionality of data applicable to model (sufficient statistic or sketch),
  - Construct model from sufficient statistic / sketch.
theta_hat_bar (the average of the per-partition estimates theta_hat_i, written out below) is as efficient as theta_hat computed on the whole dataset when:
- theta_hat comes from a normal distribution
- x is IID and
- x is equally partitioned
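For reference, the averaged estimator used in the code further down (the `weighted_avg` function) weights each partition's estimate theta_hat_i by its number of observations n_i:

```latex
\bar{\hat{\theta}} = \frac{\sum_{i=1}^{k} n_i \, \hat{\theta}_i}{\sum_{i=1}^{k} n_i}
```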
- Can use algorithms already in R, Python to get theta_hat_i
- Values must be IID (i.e. not sorted, etc.)
- Model must produce an unbiased estimator of theta, denoted theta_hat
- The estimator must have finite variance.
Streaming (Online) Algorithm: An algorithm in which the input is processed item by item. Due to limited memory and processing time, the algorithm produces a summary or “sketch” of the data.
Sufficient Statistic: A statistic with respect to a model and parameter theta, such that no other statistic from the sample will provide additional information.
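For example (matching the sketches in the code below), to estimate the mean and variance the sums and the count are all that is needed; in the normal model they are the sufficient statistics:

```latex
T(x) = \Big( \sum_{i=1}^{n} x_i,\; \sum_{i=1}^{n} x_i^2,\; n \Big)
\quad\Rightarrow\quad
\hat{\mu} = \frac{\sum_i x_i}{n}, \qquad
\hat{\sigma}^2 = \frac{\sum_i x_i^2 - n\hat{\mu}^2}{n - 1}
```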
- The Sketching / Sufficient Statistics programming model aims to break the algorithm down into sketches, which are then passed on to the next phase of the algorithm.
- Not all algorithms can be broken down this way.
- All sketches must be commutative and associative.
- For each item to be processed, the order of operations must not matter
from numpy.random import randint, chisquare
import numpy as np

# sc is the SparkContext (e.g., provided by the pyspark shell).
# Generate 200,000 random integers in [0, 100] and spread them over 10 partitions.
X = sc.parallelize(randint(0, 101, 200000))
X_part = X.repartition(10).cache()

def mean_part(iterator):
    # Unbiased estimator applied to one partition: its mean, returned with the count.
    x = 0
    n = 0
    for i in iterator:
        x += i
        n += 1
    avg = x / float(n)
    return [[avg, n]]

# [mean, count] for each of the 10 partitions
model_mean_part = X_part.mapPartitions(mean_part) \
                        .collect()
model_mean_part

[[50.336245888157897, 19456],
 [50.215136718750003, 20480],
 [50.007421874999999, 20480],
 [50.135214401294498, 19776],
 [50.141858552631582, 19456],
 [50.08115748355263, 19456],
 [50.2578125, 20480],
 [50.243945312500003, 20480],
 [49.786543996710527, 19456],
 [50.072363281249999, 20480]]
The partitions are not equally sized, thus we'll use a weighted average.
def weighted_avg(theta):
    # Combine per-partition estimates [estimate, count] into one estimate,
    # weighting each partition by its number of observations.
    total = 0
    weighted_sum = 0
    for est, n in theta:
        weighted_sum += est * n
        total += n
    theta_bar = weighted_sum / total
    return theta_bar

print("Mean via Partition and Model:")
weighted_avg(model_mean_part)

Mean via Partition and Model:
50.128590000000003
# Sketch for the mean: the sufficient statistic is (sum of values, count).
sketch_mean = X_part.map(lambda num: (num, 1)) \
                    .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]))
x_bar_sketch = sketch_mean[0] / float(sketch_mean[1])

print("Mean via Sketching:")
x_bar_sketch

Mean via Sketching:
50.128590000000003
def var_part(iterator):
    # Per-partition variance: track the sum, the sum of squares, and the count.
    x = 0
    x2 = 0
    n = 0
    for i in iterator:
        x += i
        x2 += i ** 2
        n += 1
    avg = x / float(n)
    var = (x2 - n * avg ** 2) / (n - 1)
    return [[var, n]]

var_part_model = X_part.mapPartitions(var_part) \
                       .collect()

print("Variance via Partitioning:")
print(weighted_avg(var_part_model))

Variance via Partitioning:
851.095985421
# Sketch for the variance: the sufficient statistic is (sum, sum of squares, count).
sketch_var = X_part.map(lambda num: (num, num ** 2, 1)) \
                   .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
x_bar_4 = sketch_var[0] / float(sketch_var[2])
N = sketch_var[2]

print("Variance via Sketching:")
(sketch_var[1] - N * x_bar_4 ** 2) / (N - 1)

Variance via Sketching:
851.07917000774989