> In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
>
> —Grace Hopper
Volume: The quantity of data.
- Usually too big to fit into the memory of a single machine.
Veracity: The quality of the data can vary; data is often inconsistent.
Variety: Data often comes in a variety of formats and from a variety of sources, and often needs to be combined for analysis.
Velocity: The speed of generation of new data.
- 100 terabytes of data are uploaded to Facebook daily.
- 90% of all data ever created was generated in the past 2 years.
Problem: It is impossible to do data analysis on data of this size with a single computer.
Solution: A distributed computing system.
### Veracity
- Big Data sets do not have the controls of regular studies
- Naming inconsistency
- Inconsistency in signal strength (e.g., Boston's Street Bump app, Google Flu Trends)
- Cannot simply assume data missing at random
Example of a veracity naming inconsistency: the same musician named several different ways in several different files.
### Variety
- Most of Big Data is unstructured or semi-structured data
- Doesn't have the guarantees of SQL
- Data can be structured, semi-structured, or unstructured
- Often have to combine various datasets from a variety of sources and formats
### Velocity
- The speed at which data is created, stored, and analyzed to create actionable intelligence.
- Every minute:
  - 100 hours of video are uploaded to YouTube
  - 200 million emails are sent
  - 20 million photos are viewed
- Often need to be very agile in creating a data product.
Hadoop: A distributed file system (HDFS), a MapReduce engine, and a scheduler (YARN).
Spark: An in-memory based alternative to Hadoop's MapReduce which is better for machine learning algorithms.
- Spark SQL, MLlib (machine learning), GraphX (graph-parallel computation), and Spark Streaming.
Storm: Distributed tool for processing fast, large streams of data.
Cassandra: NoSQL system implemented on Hadoop.
Hive: Allows users to create SQL-like queries (HQL) and convert them to MapReduce.
HCatalog: A centralized metadata management and sharing service for Hadoop, allowing a unified view of all data in Hadoop clusters.
Pig: An easy-to-learn, Hadoop-based language that is adept at very deep, very long data pipelines.
Mahout: A data mining library implementing the most popular data mining algorithms using the MapReduce model.
NoSQL (Not Only SQL): A database that is not based on the storage and retrieval of tabular relations used in relational databases. Some can provide a distributed database.
Examples: MongoDB, CouchDB, Accumulo, and some NoSQL databases are implemented on Hadoop: Cassandra, HBase.
SQL: Can spill to disk allowing datasets to be larger than memory size.
MADlib: Machine learning library extension for PostgreSQL.
- The blue components (in the figure) are the necessary components of a Hadoop ecosystem.
- Some tools provide several functionalities.
  - e.g., Hadoop is a distributed file system with a MapReduce engine and a scheduler.
HDFS (Hadoop Distributed File System) is a distributed file-system across multiple interconnected computer systems (nodes).
- Data is stored across multiple hard drives.
Lustre: DFS used by most enterprise High Performance Computing (HPC) clusters. Usually uses a shared networked drive.
Google File System (GFS): Google's proprietary distributed file system.
MapR: DFS inspired by HDFS but written in C++ instead of Java.
- MapReduce is the engine that processes data in a Hadoop Ecosystem.
- Spark and Tez use a more flexible in-memory model of MapReduce, which is better for machine learning algorithms.
In order for multiple people to run jobs on the same cluster, a scheduler is needed.
These are the tools to help parse, transform, and combine various datasets.
- Hive, Spark SQL, Impala, Cassandra, and HBase all use a SQL-like language to help manipulate data.
- Hive can be implemented using the Spark MapReduce engine, significantly speeding up its processing (a small Spark SQL sketch follows this list).
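As a rough illustration of this SQL-like interface, here is a minimal Spark SQL sketch in PySpark (assuming Spark 2+ with a SparkSession; the `songs` table and its columns are made up for the example, not taken from any dataset in these notes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# A tiny illustrative DataFrame; in practice this would be read from HDFS, Hive, etc.
songs = spark.createDataFrame(
    [("Beatles", 1969), ("Beatles", 1966), ("Nirvana", 1991)],
    ["artist", "year"])
songs.createOrReplaceTempView("songs")

# SQL-like query, planned and executed by the Spark engine.
spark.sql("SELECT artist, COUNT(*) AS n FROM songs GROUP BY artist").show()
```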
There are several Machine Learning algorithms already in place for the Hadoop Ecosystem.
- Mahout can be implemented on Spark, Tez, and Hadoop
- Spark also has GraphX, which uses graphs to perform analytics (PageRank, etc.)
There are also specialized tools:
Hadoop Image Processing Interface (HIPI): Image processing package helping to determine image similarity.
SpatialHadoop: Extension to process datasets of spatial data in Hadoop.
Parsing, transforming and combining the data into a useable dataset can be time consuming. Thus, once a suitable amount of work is done to create a useable dataset it is best to save it for future work.
Serialization saves the state of the data, allowing it to be recreated at a later date.
- Java serialization is the least efficient of these formats and should only be used for legacy reasons.
Avro: Serialization made for Hadoop.
JSON: JavaScript Object Notation is a convenient way of describing, serializing, and transferring data.
Protocol Buffers: A more optimized serialization format that requires the precise structure of the data to be known when the job is run. Has less support across programming languages.
Parquet: A columnar data storage format, allowing it to perform well for structured data with a fair amount of repetition (see the sketch below).
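As one hedged example of serializing a prepared dataset, a Spark DataFrame (such as the hypothetical `songs` table from the sketch above) can be written to and read back from Parquet; the path below is illustrative:

```python
# Persist the cleaned DataFrame in a columnar (Parquet) format for later work.
songs.write.mode("overwrite").parquet("/tmp/songs.parquet")

# Later, or in another job: read it back without redoing the parsing/combining work.
songs_back = spark.read.parquet("/tmp/songs.parquet")
songs_back.show()
```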
Data transfer tools move large amounts of data to and from the distributed file system.
Flume, DistCp: Move files and flat text into Hadoop.
Sqoop: Move data between Hadoop and SQL.
Streaming provides new calculations based on incoming data.
Example: Netflix's 'Trending Now' feature. Possibly in personalized medicine, using medical devices to detect heart attacks before they happen.
Spark Streaming: Uses a micro-batch model that checks for updates every 0.5-10 seconds and updates its model (a minimal sketch follows).
Storm: Uses either streaming or micro-batch updates to update its model.
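A minimal Spark Streaming (micro-batch) sketch, assuming a SparkContext `sc` and a text source on localhost port 9999 (e.g. started with `nc -lk 9999`); the 2-second batch interval is an arbitrary choice within the 0.5-10 second range mentioned above:

```python
from pyspark.streaming import StreamingContext

# Micro-batch interval of 2 seconds: each batch of incoming lines is a small RDD.
ssc = StreamingContext(sc, 2)
lines = ssc.socketTextStream("localhost", 9999)

# Word counts recomputed for every micro-batch.
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```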
Node configuration management: Puppet, Chef. Change operating system parameters and install software.
Resource Tracking: Monitor the performance of many tools.
Coordination: Helps synchronize many tools in a single application: Zookeeper.
Ambari: Tool to help install, start, stop, and reconfigure a Hadoop cluster.
HCatalog: Central catalog of file formats and locations of data. Data looks table-like to the user.
Nagios: Alerts on failures and problems through a graphical interface and alerts administrators of problems through email.
Puppet, Chef: Manager for configuration of a large number of machines.
Zookeeper: Helps coordination of tools.
Oozie: Workflow scheduler to start, stop, suspend and restart jobs, controlling the workflow so that no task is performed before it is ready.
Ganglia: Visualizes how systems are being used and keeps track of the general health of the cluster.
Hadoop in itself doesn't provide much security. As Hadoop has increased in popularity, so have security projects.
Kerberos, Sentry, and Knox are such projects.
Sometimes you only need intermittent use of a cluster, and creating/maintaining one of your own is prohibitively expensive.
Cloud computing and virtualization tools allow Hadoop environments to be constructed with relative ease on cloud platforms like AWS.
Distribution platforms provide (for a cost) easy installation and software maintenance of a Hadoop cluster.
- Tool versions are checked for compatibility, usually meaning that they are not the newest versions.
Some of these are: Cloudera, MapR, and Hortonworks.
Problem: A single computer can't be used to process the data (it would take too long).
Solution: Use a group of interconnected computers (processor and memory independent).
Problem: Conventional algorithms are not designed around memory independence.
Solution: MapReduce
Definition. MapReduce is a programming paradigm that uses parallel, distributed algorithms to process or generate data sets. MapReduce is composed of two main functions:
Map(k,v): Filters and sorts data.
Reduce(k,v): Aggregates data according to keys (k).
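To make the two functions concrete, here is a minimal word-count sketch in plain Python (no Hadoop involved); the function names and the use of a dictionary to stand in for the shuffle step are illustrative choices, not part of any particular framework:

```python
from collections import defaultdict

def map_fn(offset, line):
    # Map(k, v): k is the byte offset of the line (unused here), v is the line.
    # Emit an intermediate (word, 1) pair for every word.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce(k, v): aggregate all intermediate values for one key.
    yield (word, sum(counts))

def run_mapreduce(lines):
    # Stand-in for shuffle and sort: group intermediate values by key.
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped[key].append(value)
    output = []
    for key, values in grouped.items():
        output.extend(reduce_fn(key, values))
    return output

print(run_mapreduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```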
MapReduce is broken down into several steps:
- Record Reader
- Map
- Combiner (Optional)
- Partitioner
- Shuffle and Sort
- Reduce
- Output Format
Record Reader: Splits input into fixed-size pieces for each mapper.
- The key is positional information (the number of bytes from start of file) and the value is the chunk of data composing a single record.
- In Hadoop, each map task's input is an input split, which is usually simply an HDFS block.
- Hadoop tries to schedule map tasks on nodes where that block is stored (data locality).
- If a record is broken across blocks, Hadoop requests the additional information from the next block in the series.
Map: A user-defined function outputting intermediate key-value pairs.
key (k_2): Later, MapReduce will group and possibly aggregate data according to these keys; choosing the right keys here is important for a good MapReduce job.
value (v_2): The data to be grouped according to its keys.
Combiner: A user-defined function that aggregates data according to intermediate keys on a mapper node.
- This can usually reduce the amount of data to be sent over the network, increasing efficiency.
- The combiner should be written with the idea that it is executed over most, but not all, map tasks.
- Usually very similar to, or the same code as, the reduce method (see the sketch after this list).
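Continuing the hypothetical word-count sketch from earlier, a combiner can pre-aggregate one mapper's output locally before anything crosses the network; the helper below is illustrative only:

```python
from collections import Counter

def combine_fn(mapper_output):
    # Locally sum the (word, 1) pairs emitted by a single map task,
    # so far fewer pairs are shuffled to the reducers.
    local_counts = Counter()
    for word, count in mapper_output:
        local_counts[word] += count
    return list(local_counts.items())

# One mapper's intermediate output for the line "the cat sat on the mat":
print(combine_fn([("the", 1), ("cat", 1), ("sat", 1),
                  ("on", 1), ("the", 1), ("mat", 1)]))
# [('the', 2), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1)]
```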
Partitioner: Sends intermediate key-value pairs (k_2, v_2) to a reducer, typically by taking a hash of the key modulo the number of reducers (sketched below).
- This will usually result in a roughly balanced load across the reducers while ensuring that all key-value pairs sharing a key end up on a single reducer.
- A balancer system is in place for the cases when the key-values are too unevenly distributed.
- In Hadoop, the intermediate key-value pairs are written to the local hard drive and grouped by which reducer they will be sent to and by their key.
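A minimal illustration of the hash-based routing described above (a modulus over the number of reducers is the common default; the exact behavior depends on the framework's configured partitioner):

```python
def partition(key, num_reducers):
    # The same key always maps to the same reducer, and distinct keys
    # spread roughly evenly across the available reducers.
    return hash(key) % num_reducers

# Within one run, "cat" is always routed to the same of the 4 reducers.
print(partition("cat", 4) == partition("cat", 4))  # True
```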
Shuffle and Sort: On the reducer node, sorts by key to help group equivalent keys.
Reduce: A user-defined function that aggregates data (v) according to keys (k) and sends the resulting key-value pairs to the output.
Output Format: Translates the final key-value pairs to a file format (tab-separated by default).
Image Source: Xiaochong Zhang's Blog
A more flexible form of MapReduce is used by Spark, via Directed Acyclic Graphs (DAGs); a small PySpark example follows the list below.
For a set of operations:
- Create a DAG for operations
- Divide DAG into tasks
- Assign tasks to nodes
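As a rough PySpark illustration of the DAG model (assuming a SparkContext `sc`, e.g. from the pyspark shell): transformations only record operations in the graph, and nothing executes until an action is called.

```python
# Transformations build up the DAG; no computation happens yet.
numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x)                # one node in the DAG
even_squares = squares.filter(lambda x: x % 2 == 0)   # another node

# The action triggers Spark to divide the DAG into stages/tasks
# and assign them to nodes in the cluster.
print(even_squares.count())  # 500 (squares of the 500 even numbers)
```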
- Looking for parameter(s) (theta) of a model (mean, parameters of regression, etc.)
- Partition and Model:
  - Partition data,
  - Apply unbiased estimator,
  - Average results.
- Sketching / Sufficient Statistics:
  - Partition data,
  - Reduce dimensionality of data applicable to model (sufficient statistic or sketch),
  - Construct model from sufficient statistic / sketch.
theta_hat_bar (the average of the per-partition estimates theta_hat_i, written out below) is as efficient as theta_hat computed on the whole dataset when:
- theta_hat comes from a normal distribution
- x is IID and
- x is equally partitioned
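For reference, the averaged estimator used in the code further down (the `weighted_avg` function) weights each partition's estimate theta_hat_i by its number of observations n_i:

```latex
\bar{\hat{\theta}} = \frac{\sum_{i=1}^{k} n_i \, \hat{\theta}_i}{\sum_{i=1}^{k} n_i}
```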
- Can use algorithms already in R, Python to get theta_hat_i
- Values must be IID (i.e. not sorted, etc.)
- Model must produce an unbiased estimator of theta, denoted theta_hat
- The estimator must have finite variance.
Streaming (Online) Algorithm: An algorithm in which the input is processed item by item. Due to limited memory and processing time, the algorithm produces a summary or “sketch” of the data.
Sufficient Statistic: A statistic with respect to a model and parameter theta, such that no other statistic from the sample will provide additional information.
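For example (matching the sketches in the code below), to estimate the mean and variance the sums and the count are all that is needed; in the normal model they are the sufficient statistics:

```latex
T(x) = \Big( \sum_{i=1}^{n} x_i,\; \sum_{i=1}^{n} x_i^2,\; n \Big)
\quad\Rightarrow\quad
\hat{\mu} = \frac{\sum_i x_i}{n}, \qquad
\hat{\sigma}^2 = \frac{\sum_i x_i^2 - n\hat{\mu}^2}{n - 1}
```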
- The Sketching / Sufficient Statistics programming model aims to break the algorithm down into sketches, which are then passed on to the next phase of the algorithm.
- Not all algorithms can be broken down this way.
- All sketches must be commutative and associative.
- For each item to be processed, the order of operations must not matter
from numpy.random import randint, chisquare
import numpy as np

# sc is the SparkContext (e.g., provided by the pyspark shell).
# Generate 200,000 random integers in [0, 100] and spread them over 10 partitions.
X = sc.parallelize(randint(0, 101, 200000))
X_part = X.repartition(10).cache()

def mean_part(iterator):
    # Unbiased estimator applied to one partition: its mean, returned with the count.
    x = 0
    n = 0
    for i in iterator:
        x += i
        n += 1
    avg = x / float(n)
    return [[avg, n]]

# [mean, count] for each of the 10 partitions
model_mean_part = X_part.mapPartitions(mean_part) \
                        .collect()
model_mean_part

[[50.336245888157897, 19456],
 [50.215136718750003, 20480],
 [50.007421874999999, 20480],
 [50.135214401294498, 19776],
 [50.141858552631582, 19456],
 [50.08115748355263, 19456],
 [50.2578125, 20480],
 [50.243945312500003, 20480],
 [49.786543996710527, 19456],
 [50.072363281249999, 20480]]
The partitions are not equally sized, thus we'll use a weighted average.
def weighted_avg(theta):
    # Combine per-partition estimates [estimate, count] into one estimate,
    # weighting each partition by its number of observations.
    total = 0
    weighted_sum = 0
    for est, n in theta:
        weighted_sum += est * n
        total += n
    theta_bar = weighted_sum / total
    return theta_bar

print("Mean via Partition and Model:")
weighted_avg(model_mean_part)

Mean via Partition and Model:
50.128590000000003
# Sketch for the mean: the sufficient statistic is (sum of values, count).
sketch_mean = X_part.map(lambda num: (num, 1)) \
                    .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]))
x_bar_sketch = sketch_mean[0] / float(sketch_mean[1])

print("Mean via Sketching:")
x_bar_sketch

Mean via Sketching:
50.128590000000003
def var_part(iterator):
    # Per-partition variance: track the sum, the sum of squares, and the count.
    x = 0
    x2 = 0
    n = 0
    for i in iterator:
        x += i
        x2 += i ** 2
        n += 1
    avg = x / float(n)
    var = (x2 - n * avg ** 2) / (n - 1)
    return [[var, n]]

var_part_model = X_part.mapPartitions(var_part) \
                       .collect()

print("Variance via Partitioning:")
print(weighted_avg(var_part_model))

Variance via Partitioning:
851.095985421
# Sketch for the variance: the sufficient statistic is (sum, sum of squares, count).
sketch_var = X_part.map(lambda num: (num, num ** 2, 1)) \
                   .reduce(lambda x, y: (x[0] + y[0], x[1] + y[1], x[2] + y[2]))
x_bar_4 = sketch_var[0] / float(sketch_var[2])
N = sketch_var[2]

print("Variance via Sketching:")
(sketch_var[1] - N * x_bar_4 ** 2) / (N - 1)

Variance via Sketching:
851.07917000774989