Wednesday, 19 February 2014

All About Hadoop : Issue#1

If you are new to Hadoop, this post is for you. Although it is very difficult to cover everything about Hadoop in a few pages, I have tried to touch on every important term and concept that defines Hadoop. In this issue #1 of a two-part series, I will cover the facts and components of Hadoop. In the next post, we will discuss application development and learn how to get data into Hadoop. So let's get started.

Just Some Facts


Hadoop is a computing environment built on top of a distributed, clustered file system that was designed specifically for very large-scale data operations. Hadoop was inspired by Google's work on the Google File System and the MapReduce programming paradigm, in which work is broken down into mapper and reducer tasks that manipulate data stored across a cluster of servers for massive parallelism. Unlike transaction systems, Hadoop can scan through large data sets to produce its results through a scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds; it is about discovery, and about making the once near-impossible possible from a scalability and analysis perspective. The Hadoop methodology is built around a function-to-data model as opposed to data-to-function: because there is so much data, the analysis programs are sent to the data.

Hadoop basically has two parts: a file system (the Hadoop Distributed File System) and a programming paradigm (MapReduce). One of the key components of Hadoop is redundancy. Not only is the data redundantly stored in multiple places across the cluster, but the programming model is such that failures are expected and are resolved automatically by running portions of the program on various servers in the cluster. This redundancy makes it possible to distribute the data and its associated programming across a very large cluster of commodity components. Obviously, out of a large number of commodity hardware components a few will fail, but this is where the redundancy comes into play: it provides fault tolerance and a capability for the Hadoop cluster to heal itself. This allows Hadoop to scale workloads out across large clusters of inexpensive machines to work on Big Data problems.

Components Of Hadoop


Hadoop is comprised of three components: the Hadoop Distributed File System, the Hadoop MapReduce model, and Hadoop Common. Let us understand these components one by one.

Hadoop Distributed File System

Data in a Hadoop cluster is broken down into smaller pieces called blocks and distributed throughout the cluster. Because of this, the map and reduce functions can be executed on smaller subsets of our larger data sets, and this provides the scalability that is needed for Big Data processing. It can be inferred that the goal of Hadoop is to use commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives. To achieve higher performance, MapReduce tries to assign workloads to the servers where the data to be processed is stored. This concept is known as data locality. The cool thing about Hadoop is that it has built-in fault tolerance and fault compensation capabilities. In HDFS too, data is divided into blocks, and copies of those blocks are stored on other servers in the Hadoop cluster. This means that a single file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster. This redundancy provides various benefits, chief among them higher availability. In addition, redundancy allows the Hadoop cluster to break work up into smaller chunks and run those jobs on all the servers in the cluster for better scalability.

A data file in HDFS is divided into blocks, and the default size of these blocks for Apache Hadoop is 64 MB. For larger files, a higher block size is a good idea, as this greatly reduces the amount of metadata the NameNode must keep. The expected workload is another consideration, as nonsequential access patterns perform better with a smaller block size. Coordination across a cluster has significant overhead, so the ability to process large chunks of work locally, without sending data to other nodes, improves both performance and the ratio of real work to overhead. Each data block is stored on 3 different servers in Hadoop; behind the scenes, HDFS makes sure at least two replicas are stored on a separate server rack, to improve reliability in the event of losing an entire rack of servers.
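To get a feel for the numbers, here is a small back-of-the-envelope sketch in plain Python (an illustration of the arithmetic, not Hadoop code; the names are made up) of how a file breaks into 64 MB blocks and how many replicas that produces at a replication factor of 3:

```python
# Illustrative sketch only: how many blocks and replicas a file occupies
# under Apache Hadoop's defaults (64 MB blocks, 3-way replication).
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size
REPLICATION = 3                 # default replication factor

def blocks_for(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks a file of the given size is split into (ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size

one_gb = 1024 * 1024 * 1024
print(blocks_for(one_gb))               # a 1 GB file -> 16 blocks
print(blocks_for(one_gb) * REPLICATION) # -> 48 block replicas across the cluster
```

The NameNode keeps metadata per block, which is why fewer, larger blocks mean less metadata for the same amount of data.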

All of Hadoop's data placement logic is managed by a special server called the NameNode. The NameNode keeps track of all the data files in HDFS, such as where the blocks are stored and more. All of the NameNode's information is stored in memory, which allows it to provide quick response times to storage manipulation and read requests. Since there is only one NameNode for the entire cluster, storing this information in memory creates a single point of failure, so the server acting as the NameNode must be more robust than the rest of the servers in the cluster, to minimize the possibility of failure. One more thing: it is highly recommended to have a regular backup process for the cluster metadata stored in the NameNode.
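Conceptually, the NameNode's in-memory metadata is just a mapping from each file to its blocks, and from each block to the servers holding its replicas. A toy sketch (purely illustrative; the paths, block IDs, and node names are invented, and real NameNode internals are far more involved):

```python
# Toy model of NameNode metadata: file -> block -> list of replica locations.
# All names below are made up for illustration.
block_map = {
    "/logs/2014-02-19.txt": {
        "blk_0001": ["datanode1", "datanode3", "datanode7"],
        "blk_0002": ["datanode2", "datanode5", "datanode7"],
    }
}

def locate(path):
    """Answer a client's read request: which servers hold each block of this file?"""
    return block_map[path]

print(locate("/logs/2014-02-19.txt")["blk_0001"])
# ['datanode1', 'datanode3', 'datanode7']
```

Because this lookup is a pure in-memory operation, the NameNode can answer it quickly; losing that memory without a backup would mean losing the map of where every block lives.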

MapReduce

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or even thousands of servers in a Hadoop cluster. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. The reduce job takes the output from a map as its input and combines those key/value pairs into a smaller set of tuples. As the name implies, the reduce job is always performed after the map job.
Let's take an example to understand this more clearly. We have 4 files, each having 2 columns, City and Temperature. Sample data from one file is given below:
London, 12
New Delhi, 33
Venice, 22
Mumbai, 26
London, 22
New Delhi, 43
Venice, 27
Mumbai, 27

Now suppose we want to calculate the maximum temperature of each city across all the files. Using the MapReduce framework, we can achieve this by breaking the work down into 4 map tasks, each working on one file. Each map task will return the maximum temperature for each city in its file. For the sample file above, the map will return:

(London, 22) , (New Delhi, 43) , (Venice, 27) , (Mumbai,27)
Now assume that the map tasks working on the other 3 files return this data.
(London, 23) , (New Delhi, 44) , (Venice, 23) , (Mumbai,29)
(London, 22) , (New Delhi, 41) , (Venice, 24) , (Mumbai,21)
(London, 24) , (New Delhi, 42) , (Venice, 25) , (Mumbai,24)
All the output data of the 4 map tasks is fed into the reduce tasks, which combine the input results and return a single output value for each city, i.e. the maximum. The key thing to notice here is that there may be multiple reducers working in parallel. In that case, all the key/value pairs for a given city should go to the same reducer to find its maximum temperature. This directing of records to reduce tasks is known as the shuffle (learn more about shuffle), which takes input from the map tasks and directs the output to a specific reduce task. So the final result generated by the job will be:
(London, 24) , (New Delhi, 44) , (Venice, 27) , (Mumbai,29)
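The whole job above can be simulated in a few lines. The sketch below uses plain Python rather than Hadoop's Java API, and the map_task/shuffle/reduce_task names are illustrative, but the data and the result match the example:

```python
# Plain-Python simulation of the max-temperature MapReduce job (illustrative;
# a real Hadoop job would be written in Java against the MapReduce API).
from collections import defaultdict

# One list per input file; files 2-4 hold the values whose per-city maxima
# match the map outputs shown above.
files = [
    [("London", 12), ("New Delhi", 33), ("Venice", 22), ("Mumbai", 26),
     ("London", 22), ("New Delhi", 43), ("Venice", 27), ("Mumbai", 27)],
    [("London", 23), ("New Delhi", 44), ("Venice", 23), ("Mumbai", 29)],
    [("London", 22), ("New Delhi", 41), ("Venice", 24), ("Mumbai", 21)],
    [("London", 24), ("New Delhi", 42), ("Venice", 25), ("Mumbai", 24)],
]

def map_task(records):
    # Each map task emits the maximum temperature per city found in its file.
    local_max = {}
    for city, temp in records:
        local_max[city] = max(temp, local_max.get(city, temp))
    return list(local_max.items())

def shuffle(map_outputs):
    # Direct all values for the same key (city) to the same reduce task.
    grouped = defaultdict(list)
    for output in map_outputs:
        for city, temp in output:
            grouped[city].append(temp)
    return grouped

def reduce_task(city, temps):
    # Combine the shuffled values for one key into a single result.
    return city, max(temps)

results = dict(reduce_task(city, temps)
               for city, temps in shuffle([map_task(f) for f in files]).items())
print(results)  # {'London': 24, 'New Delhi': 44, 'Venice': 27, 'Mumbai': 29}
```

Note how the shuffle step is what guarantees that every temperature for, say, New Delhi ends up at the same reducer, no matter which map task emitted it.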


A MapReduce program is referred to as a job. A job is executed by breaking it down into small pieces called tasks. An application submits the job to a specific node in the cluster, which is running a daemon (software) called the JobTracker. The JobTracker communicates with the NameNode to find out where all of the data required for the job exists in the cluster, and then breaks the job down into map and reduce tasks for each node in the cluster where the data exists. The JobTracker tries to avoid the case where a node is given a task whose data is not local to that node; it does so by attempting to schedule tasks where the data is stored. This is the concept of data locality, and it is vital when working with large volumes of data.

In a cluster, a set of continually running daemons, known as TaskTracker agents, monitor the status of each task. If a task fails to complete, the failure is reported back to the JobTracker, which then reschedules the task on another node in the cluster. Hadoop also gives you the option to perform local aggregation on the output of each map task before the results are sent off to a reduce task, through a facility called a Combiner (learn more about Combiner). The Combiner adds some processing overhead of its own, but for large data sets it improves overall performance by shrinking the data sent to the reducers. All MapReduce programs that run natively under Hadoop are written in Java, and it is the Java Archive file (jar) that is distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks.
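The Combiner idea can be sketched in the same plain-Python style as before (illustrative names, not the Hadoop API): because taking a maximum is associative and commutative, each map task can safely pre-aggregate its own output before anything crosses the network.

```python
# Illustrative sketch of a Combiner: a local "mini reduce" applied to one map
# task's raw output before it is shipped to the reducers.
def combiner(pairs):
    # Keep only the max temperature per city from this single map task,
    # shrinking the data that must travel across the network to the reducers.
    local_max = {}
    for city, temp in pairs:
        local_max[city] = max(temp, local_max.get(city, temp))
    return list(local_max.items())

raw_map_output = [("London", 12), ("London", 22), ("Venice", 22), ("Venice", 27)]
print(combiner(raw_map_output))  # [('London', 22), ('Venice', 27)]
```

This works for max because combining partial maxima gives the same answer as taking one global maximum; an operation without that property (e.g. a plain average) cannot be combined this naively.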

Check out the First Program in MapReduce and you will get an idea of what we just discussed. Issue #2 of this All About Hadoop series will cover applications and getting data into Hadoop.