Hadoop Soup: hadoop introduction

Showing posts with label hadoop introduction. Show all posts

Wednesday, 19 February 2014

All About Hadoop : Issue#1

If you are new to Hadoop, then this post is for you. Altough, it is very difficult to cover everything about Hadoop in few pages, but I have tried to touch every important term and concept that defines Hadoop. In this issue #1 of two part series, I will cover the facts and components of Hadoop. In the next post, we will discuss the application development and will learn how to get the data into Hadoop. So let's get started.

Just Some Facts

Hadoop is a computing environmemnt built on top of a distributed clustered file system that was designed specifically for very large-scale data operations. Hadoop was inspired by Google's work on its Google File System and the MapReduce programming paradigm, in which work is broken down into mapper and reducer taks to manipulate data that is stored across a cluster of servers for massive parallelism. Unlike transaction systems, Hadoop can scan through large data sets to produce it's results through scalable, distributed batch processing system. Hadoop is not about speed-of-thought response times, real-time warehousing, or blazing transactional speeds, it is about discover and making the once near-impossible possible from a scalability and analysis perspective. The Hadoop methodologhy is built around a function-to-data model as opposed to data-to-function, because there is so much data, the analysis programs are sent to the data.

Hadoop, basically has two parts: a file system (Hadoop Distributed File Sytem) and a programming paradigm (MapReduce). One of the key components of Hadoop is the redundancy. Not only data is the data redundantly stored i multiple places across the cluster, but the programming model is such that failures are expected and are resolved automatically by running portions of the program on various servers in the cluster.This Redundancy makes possible to distribute the data and it's associated programming across a very large cluster of commodity components. Obviously, from a large number of commodity hardware components, a few will fail, but this redundancy comes into play and provides fault tolerance, and a capability for the Hadoop cluster to heal itself. This allows Hadoop to scale out worloads across large clusters of inexpensive machines to work on Big Data problems.

Components Of Hadoop

Hadoop is comprised of three components :Hadoop Distributed File System, Hadoop MapReduce model and Hadoop Common.Let us undertand these compnents one by one.

Hadoop Distributed File System

Data in Hadoop cluster is broken down into smaller pieces called as blocks and distributed throughout the cluster. Due to this, the map and reduce functions can be executed on smaller subsets of our larger data sets, and this provides the scalability that is needed for Big Data processing. It can be infered that the goal of Hadoop is to use commonly available servers in a very large cluster where each server has a set of inexpensive interal disk drives. For achieving higher performance, MapReduce tries to asign workloads to these servers where the data to be processed is stored. This concept is known as data locality. The cool thing about Hadoop is that it has built-in fault tolerance and fault compensation capabilities. In HDFS too, data is divided into blocks and copies of the blocks are stored on other servers in the Hadoop cluster. It means, that an single file is actually stored as smaller blocks that are replicated across multiple servers in the entire cluster. The redundancy provides various benefits, among them higher availability comes on top. In addition, redundancy allows the Hadoop cluster to break work up into smaller chunks and run those jobs on all the servers in the cluster for better scalability.

A data file in HDFS is divided into blocks, and the deafult size of these blocks for Apache Hadoop is 64 MB. For larger files, a higher block size is a good idea, as this will greatly reduce the amount of metadata required by NameNode. The expected workload is another consideration, as nonsequenctial access patterns will perform more optimally with smaller block size. Coordination across a cluster has significant overhead, so the ability to process large chunks of work locally without sending data to other nodes helps improve both performance and the overhead to real work ratio. Each data block is stored on 3 differnt servers in Hadoop, this is implemented by HDFS working behind the scenes to make sure at least two blocks are atored on a separate server rack to improve realiability in the event of losing an entire rack of servers.

All of Hadop's data placement logic is managed by a special server called NameNode. This NameNode server keep track of all the data files in HDFS, such as where the blocks are stoed and more. All of the NameNode's information is stored in memory, which allows it to provide quick response times to storage manipulation to read requests. Now as there is only one NameNode for entire cluster, storing this information in memory creates a single point of failure. So server acting as NameNode must be more robust than the rest of the servers in cluster, obviously to minimize the possibility of failures. One more thing, it is highly recommended to have regular backup process for the cluster metadat stored in the NameNode.

Map Reduce

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or even thousands of servers in a Hadoop cluster. The term MapReduce actually refers to two separate and distinct task Hadoop program perform. The first is the map job, which takes a set of data and coverts it into another set of data, in which individual elements are broken down into key/value pairs. The reduce job takes the output from a map as input and combines those data key/value pairs into smaller aet of tuples. As the nsme implies, the reduce job is always performed after the map job.
Take an example to understand it more clearly. We have 4 files, each having 2 columns, City and Temperature. Sample data of a file is given below:

London, 12
New Delhi, 33
Venice, 22
Mumbai, 26
London, 22
New Delhi, 43
Venice, 27
Mumbai, 27

Now if we want to calculate maximum temperature of each city, of all the files. By using MapReduce framework, we can acheive this by breaking down into 4 map tasks, each working on one file. Each map task will return maximum temperature for each city in the file. Like from the sample file above, the map will return.

(London, 22) , (New Delhi, 43) , (Venice, 27) , (Mumbai,27)

Now assume that other map tasks working on other 3 files return this data.

(London, 23) , (New Delhi, 44) , (Venice, 23) , (Mumbai,29)
(London, 22) , (New Delhi, 41) , (Venice, 24) , (Mumbai,21)
(London, 24) , (New Delhi, 42) , (Venice, 25) , (Mumbai,24)

All the output data of 4 map tasks are fed into reduce tasks , which combine the input result and return output as single value for each city i.e. the maximum value. The key thing here to notice that, there may be more than multiple reducer working in parralel. In such case. all the key/ value pairs of each city sholud go to the same reducer to find the maximum temerature. This directing of records to reduce tasks is known as shuffle (learn more about shuffle) , which takes input from map tasks and and direct the output to a specific reduce task. So the final result generated by job will be:

(London, 24) , (New Delhi, 44) , (Venice, 27) , (Mumbai,29)

A MapReduce program is referred as job. A job is executed by subsequently breaking it down into small pieces called tasks. An application submits the job to a specific node in cluster, which is running a daemon (software) called JobTracker. The JobTracker communicates with the NameNode to find out where all of the data required for this job exists in the cluster and then breaks the job down into map and reduce tasks for each node to work on the cluster where the data exists. JobTracker tries to avoid the case where a node is given a task for which the data needed by that task is not local to that node. It does so by attempting to shedule tasks where the data is stored. This is the concept of data locality and it is very vital when working with large volumes of data.

In a cluster, a set of continually running daemons, known as TaskTracker agents, moniter the status of each task. If a task fails to complete, the status of that failure is reported back to the JobTracker, which will then reschedule that task on another node in the cluster. Hadoop gives the option to perform local aggregation on the output of each map task before sending the results off to a reduce task through a local aggregation called a Combiner (learn more about Combiner). Obviously when multiple reduce tasks are running overhead increases but for large datasets it improves overall performance. All MapReduce programs that run natively under Hadoop are written in Java, and it is the Java Archive file (jar) that's distributed by the JobTracker to the various Hadoop cluster nodes to execute the map and reduce tasks.

Check out the Fisrt Program in MapReduce and you will an idea what we just discussed. Issue#2 of this All About Hadoop series will cover applications and getting data into Hadoop.

Wednesday, 15 January 2014

Hadoop Introduction

What is Hadoop ? “Hadoop”, the name itself is weird, isn’t it? the term Apache Hadoop was created by Doug Cutting. Hadoop came from the name of a toy elephant. Hadoop is all about processing large amount of data irrespective of whether its structured or unstructured, huge data means hundreds of GIGs and more. Traditional RDBMS system may not be apt when you have to deal with huge data sets. Even though “database sharding” is trying to address this issue, chances of node failure makes this less approachable.

Hadoop is an open-source software framework which enables applications to work with multiple nodes which can store enormous amount of data. It comprises of Two components:

Apache Hadoop Distributed File System (HDFS)
Google’s MapReduce Framework

Apache Hadoop was created by Doug Cutting, he named it after his son’s toy elephant. Hadoop’s original purpose was to support the Nutch search engine project. But Hadoop’s significance grown too far from that, now its a top level Apache project and is being used by a large community of users, to name a few, Facebook, New York times, Yahoo are some of the examples of Apache Hadoop implementations.

Hadoop is written in the Java programming language!! The Apache Hadoop framework is composed of the following modules :

Hadoop Common - contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce - a programming model for large scale data processing.

Significance of Hadoop

The data on the World Wide Web is growing at an enormous rate. As the number of active internet users increases, the amount of data getting uploaded is increasing. Some of the estimates related to the growth of data are as follows:

In 2006, the total estimated size of the digital data stood at 0.18 zettabytes
By 2011, a forecast estimates it to stand at 1.8 zettabytes.
One zettabyte = 1000 exabytes = 1 million petabytes = 1 billion terabytes

Social networking sites hosting photos, Video streaming sites, Stock exchange transactions are some of the major reasons of this huge amount of data. The growth of data also brings some challenges with it. Even though the amount of data storage has increased over time, the data access speed has not increased at the same rate.

If all the data resides on one node, then it deteriorates the overall data access time. Reading becomes slower; writing becomes even slower. As a solution to this, if the same data is accessed from multiple nodes in parallel, then the overall data access time can be reduced. In order to implement this, we need the data to be distributed among multiple nodes and there should be a framework to control these multiple nodes’ Read and write. Here comes the role of Hadoop kind of system.

Let’s see the problems that can happen with shared storage and how Apache Hadoop framework overcomes it.
Hardware Failure Hadoop is not expecting all nodes to be up and running all the time. Hapoop has a mechanism to handle the node failures, it replicates the data. Combining the data retrieved from multiple nodes Combining the output of each worker node is a challenge, Google’s MapReduce framework helps to solve this problem. Map is more like a key-value pair. MapReduce framework has a mechanism of mapping the data retrieved from the multiple disks and then, combining them to generate one output

Components Of Apache Hadoop: Hadoop framework is consisting of 2 parts Apache Hadoop Distributed File System (HDFS) and MapReduce.

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System is a distributed file system which is designed to run on commodity hardware. Since the Hadoop treats node failures as a norm rather than an exception, HDFS has been designed to be highly fault tolerant. And moreover, it is designed to run on low cost shared hardware.

HDFS is designed to reliably store very large files across machines in a large cluster
HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance and this replication is configurable.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode

Replica placement is crucial for faster retrieval of data by the clients, For this, HDFS uses a technique known as Rack Awareness HDFS tries to satisfy a read request from a replica that is closest to the client. All HDFS communication protocols are layered on top of the TCP/IP protocol.

MapReduce :

MapReduce is the framework which helps in the data analysis part of Apache Hadoop implementation. Following are the notable points of MapReduce.

MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers
MapReduce framework is inspired by map and reduce functions commonly used in functional programming
MapReduce is consisting of a Map step and a Reduce step to solve a given problem.

Map Step:

The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

Reduce Step:

The master node then takes the answers to all the sub-problems and combines them in a way to get the output. All Maps steps execute in a parallel fashion The Reduce step takes in the input from the Map step. All the Maps with the same key fall under one reducer. However, there are multiple reducers and it will work in parallel. This parallel execution offers the possibility of recovery from partial failure. If one node (Mapper/Reducer) fails, then its work can be re-scheduled to another node.