Wednesday 15 January 2014

Hadoop Introduction


What is Hadoop ? “Hadoop”, the name itself is weird, isn’t it? the term Apache Hadoop was created by Doug Cutting. Hadoop came from the name of a toy elephant. Hadoop is all about processing large amount of data irrespective of whether its structured or unstructured, huge data means hundreds of GIGs and more. Traditional RDBMS system may not be apt when you have to deal with huge data sets. Even though “database sharding” is trying to address this issue, chances of node failure makes this less approachable.

Hadoop is an open-source software framework which enables applications to work with multiple nodes which can store enormous amount of data. It comprises of Two components:

Apache Hadoop Distributed File System (HDFS)
Google’s MapReduce Framework

Apache Hadoop was created by Doug Cutting, he named it after his son’s toy elephant. Hadoop’s original purpose was to support the Nutch search engine project. But Hadoop’s significance grown too far from that, now its a top level Apache project and is being used by a large community of users, to name a few, Facebook, New York times, Yahoo are some of the examples of Apache Hadoop implementations.

Hadoop is written in the Java programming language!! The Apache Hadoop framework is composed of the following modules :

Hadoop Common - contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) - a distributed file-system that stores data on the commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce - a programming model for large scale data processing.

Significance of Hadoop

The data on the World Wide Web is growing at an enormous rate. As the number of active internet users increases, the amount of data getting uploaded is increasing. Some of the estimates related to the growth of data are as follows:

In 2006, the total estimated size of the digital data stood at 0.18 zettabytes
By 2011, a forecast estimates it to stand at 1.8 zettabytes.
One zettabyte = 1000 exabytes = 1 million petabytes = 1 billion terabytes

Social networking sites hosting photos, Video streaming sites, Stock exchange transactions are some of the major reasons of this huge amount of data. The growth of data also brings some challenges with it. Even though the amount of data storage has increased over time, the data access speed has not increased at the same rate.

If all the data resides on one node, then it deteriorates the overall data access time. Reading becomes slower; writing becomes even slower. As a solution to this, if the same data is accessed from multiple nodes in parallel, then the overall data access time can be reduced. In order to implement this, we need the data to be distributed among multiple nodes and there should be a framework to control these multiple nodes’ Read and write. Here comes the role of Hadoop kind of system.

Let’s see the problems that can happen with shared storage and how Apache Hadoop framework overcomes it.
Hardware Failure Hadoop is not expecting all nodes to be up and running all the time. Hapoop has a mechanism to handle the node failures, it replicates the data. Combining the data retrieved from multiple nodes Combining the output of each worker node is a challenge, Google’s MapReduce framework helps to solve this problem. Map is more like a key-value pair. MapReduce framework has a mechanism of mapping the data retrieved from the multiple disks and then, combining them to generate one output

Components Of Apache Hadoop: Hadoop framework is consisting of 2 parts Apache Hadoop Distributed File System (HDFS) and MapReduce.

Hadoop Distributed File System (HDFS)

Hadoop Distributed File System is a distributed file system which is designed to run on commodity hardware. Since the Hadoop treats node failures as a norm rather than an exception, HDFS has been designed to be highly fault tolerant. And moreover, it is designed to run on low cost shared hardware.

HDFS is designed to reliably store very large files across machines in a large cluster
HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated for fault tolerance and this replication is configurable.
An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later.

The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode

Replica placement is crucial for faster retrieval of data by the clients, For this, HDFS uses a technique known as Rack Awareness HDFS tries to satisfy a read request from a replica that is closest to the client. All HDFS communication protocols are layered on top of the TCP/IP protocol.

MapReduce :

MapReduce is the framework which helps in the data analysis part of Apache Hadoop implementation. Following are the notable points of MapReduce.

MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers
MapReduce framework is inspired by map and reduce functions commonly used in functional programming
MapReduce is consisting of a Map step and a Reduce step to solve a given problem.

Map Step:

The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.

Reduce Step:

The master node then takes the answers to all the sub-problems and combines them in a way to get the output. All Maps steps execute in a parallel fashion The Reduce step takes in the input from the Map step. All the Maps with the same key fall under one reducer. However, there are multiple reducers and it will work in parallel. This parallel execution offers the possibility of recovery from partial failure. If one node (Mapper/Reducer) fails, then its work can be re-scheduled to another node.