Monday, 30 December 2013

MapReduce: Introduction and Need

Data is everywhere. There is no way to measure the exact total volume of data stored electronically but it should be in zettabytes (A zettabyte is 1 billion terabytes). So we have a lot of data and we are struggling to store and analyze it. Now let's try to figure out what is wrong with dealing with such an enormous volume of data. The answer is same what's wrong in data from PC to portable drives: Speed. Over the years the storage capacities of hard drives have increased like anything but the rate at which data can be read from drives have not kept up and writing is even slower.

Solution: To read from multiple disks at a time. For instance if we have to store 100 GB of data and we have 100 drives with 100 GB storage space , then it would be faster to read data from 100 drives, each holding 1 GB of data than a single drive holding 100 GB of data.

Problem with solution: First, when using a large number of hardware pieces (hard disks) there is high possibility that one or two might fail.Second, correctly combining the data from different hard disks. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into computation over sets of keys and values. So , MapReduce is a programming model for data processing.

What Hadoop provides and why is MapReduce needed? Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. But Why can’t we use databases with lots of disks to do large-scale batch analysis? This is due to the fact that seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. We can define MapReduce as a complement to a Rational Database Management(RDBMS).

Comparison with RDBMS:
  1. MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but they are chosen by the person analyzing the data.
  2. MapReduce works well for the applications where data is written once and read many times,whereas a relational database is good for datasets that are continually updated.
  3. MapReduce works on petabytes of data. On the other hand tradiional RDBMS works with gigabytes of data.
In MapReduce the programmer writes two functions: a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are unaffected to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. One more important thing to remember is, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries. It is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.The data used in MapReduce is semi structured and record oriented. MapReduce works by breaking the processing into two phases: Map phase and Reduce phase. Each phase has a key value pair as input and output,types of which are choosen by programmer. The programmer also specifies two functions: Map function and Reduce function.

To take advantage of the parallel processing that Hadoop provides we need to express our query as a MapReduce job. After some local,small scale testing we can run it on a cluster of machines.Broadly MapReduce working can be broken down into three components:
  • Map Method
  • Reduce Method
  • Code to run
The map function is represented by the mapper class , which declares an abstract map() method.Mapper class is agenric type,with four formal type parameters that specify the input key,output key,input value and output value.

Map function


It's just a data preparation phase,setting up data in such a way that the reducer function can do its work on it.The map function is also a good place to drop bad records.The output from the map function is processed by the MapReduce framework before being sent to the reduce function.

The map() function is passed a key and value.We convert the Text value containing the line of input into a Java String,then use its substring() method to extract the columns we are interested in.The map() function also provides an instance of Context(Context object is used to store the data which is used by reduce method) to write the output to.

Sample code for mapper class and map method.
public static class WordCountMap extends Mapper <LongWritable,Text,Text,IntWritable> { @Override public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException { // ********some code*********** context.write(new Text(Count), new IntWritable(1)); } } }

Point to remember: Rather than using bult-in Java types,Hadoop provides its own set of basic types that are optimized for network serialization.Few are given below:
  • LongWritable-Corresponds to Java Long
  • Text -Corresponds to Java String
  • IntWritable -Corresponds to Java Integer

Reduce function


Just like map() function, four formal parameters are used to specify the input and output types for reduce() function.The input types of reduce function must match the output types of the map function. Reduces a set of intermediate values which share a key to a smaller set of values. The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int). Reducer implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) method and initialize themselves. Similarly they can use the Closeable.close() method for de-initialization.

Sample code for reducer class and reduce method:
public static class WordCountReduce extends Reducer <Text,IntWritable,Text,IntWritable> { public void reduce(Text key,Iterable values,Context context) throws IOException, InterruptedException { // ******some code********* context.write(key, new IntWritable(sum)); } }

Reducer has 3 primary phases:Shuffle, Sort and Reduce.

Shuffle: Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
Sort:The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Reduce: In this phase the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each ltkey, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(Object, Object). The output of the Reducer is not re-sorted.

24 comments:

  1. Did I miss anything? Or you have any doubt.. !!! feel free to ask !!

    ReplyDelete
  2. Can we start processing Reduce workers before finishing all mapper workers execution

    ReplyDelete
    Replies
    1. No we can't. But if you have seen reducer progressing simultaneously with mapper ( something like this : map 80% reduce 10% ) then it might be because input data is very large. Due to the fact that mapper begins emitting it's output and reducer starts receiving the output of mapper, although reducer can't operate on that data unless complete input data is processed by the mapper and received by the reducer.

      Delete
  3. These are some really nice articles written to explain concepts of hadoop, perfect for beginners. Thanks again Aman

    ReplyDelete
  4. The expansion of internet and intelligence in business process lead the way to huge volume of data. It is important to maintain and process these data to be efficient in data handling.

    ReplyDelete

  5. The way you have explained the Big Data concept is really superb! I have never come across such a informative blog in my career. Nowadays, Hadoop course is in high demand by most of the professional to enhance their career.

    Regards:

    Big Data Training |
    Big Data Course in Chennai

    ReplyDelete
  6. As technology grows, we have to be updated in our domain to sustain in the IT industry. Your post made me impressed in the way of your writing.
    I was searching for the right blog to get Hadoop updates to know what is happening in the Big Data industry. I found your blog where I can get a lot of new updates in storing and retrieving the data concept. Thank you admin I would like to share this blog with my friends also. Keep updating, waiting for your next article.
    Regards
    Hadoop Training Chennai | Hadoop Training in Chennai

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. I appreciate your efforts because it conveys the message of what you are trying to say. It's a great skill to make even the person who doesn't know about the subject could able to understand the subject . Your blogs are understandable and also elaborately described. I hope to read more and more interesting articles from your blog. keep it up guys
    Ai & Artificial Intelligence Course in Chennai
    PHP Training in Chennai
    Ethical Hacking Course in Chennai Blue Prism Training in Chennai
    UiPath Training in Chennai

    ReplyDelete
  9. Thanks for sharing...Thank you for your post. This is excellent information. oracle training in chennai

    ReplyDelete
  10. Great post. keep sharing such a worthy information.
    Python Coaching Centre In Chennai

    ReplyDelete