Wednesday 5 February 2014

MapReduce Inputs And Splitting

Up till now we have used only text files as input. Can we use any other file format, such as a binary or XML format? Let us find out. First, we need to get familiar with a few terms.

Input split: An input split is the chunk of the input that is processed by a single map. Each split is processed by a single map task; in other words, an InputSplit represents the data to be processed by an individual Mapper. Each split is divided into records, and the map processes each record, which is a key-value pair. Loosely speaking, a split is a set of rows and a record is a single row.

The length of an InputSplit is measured in bytes. Every InputSplit has a set of storage locations (hostname strings). The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible. The splits are processed in order of decreasing size, largest first (a greedy approximation algorithm), in order to minimize the job runtime. One important thing to remember is that an InputSplit doesn't contain the input data itself, only a reference to the data.
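The largest-first ordering described above can be sketched in plain Java (the split lengths here are hypothetical, standing in for what real code would obtain from InputSplit.getLength()):

```java
import java.util.Arrays;
import java.util.Comparator;

public class GreedySplitOrder {
    // Returns split lengths ordered largest-first, mirroring the greedy
    // scheduling heuristic: start the longest-running map tasks earliest.
    static Long[] orderBySizeDesc(long[] splitLengths) {
        Long[] ordered = Arrays.stream(splitLengths).boxed().toArray(Long[]::new);
        Arrays.sort(ordered, Comparator.reverseOrder());
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical split lengths in bytes (1 MB, 64 MB, 32 MB).
        long[] lengths = {1048576L, 67108864L, 33554432L};
        System.out.println(Arrays.toString(orderBySizeDesc(lengths)));
    }
}
```

The 64 MB split comes out first, so the task likely to run longest is scheduled earliest.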

public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
As a user, we don't have to use InputSplits directly; InputFormat does that job. An InputFormat is a class that provides the following functionality:
  • Selects the files or other objects that should be used for input.
  • Defines the InputSplits that break a file into tasks.
  • Provides a factory for RecordReader objects that read the file.
The overall process can be explained in the following points:
  • The client that runs the job calculates the splits by calling getSplits().
  • The client then sends the splits to the jobtracker, which uses their storage locations to schedule map tasks that will process them on the tasktrackers.
  • On a tasktracker, the map task passes the split to the createRecordReader() method on InputFormat to obtain a RecordReader for that split.
  • The map task uses the RecordReader to generate record key-value pairs, which it passes to the map function. We can see this by looking at the Mapper’s run() method:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
First, the setup() method is called; then nextKeyValue() is called repeatedly on the Context to populate the key and value objects for the mapper. Each key-value pair is retrieved from the RecordReader and passed to the map() method. The nextKeyValue() method returns false when there are no more key-value pairs left to read, at which point the map task runs its cleanup() method.

FileInputFormat


It is the base class for all implementations of InputFormat. It provides two things: a place to define which files are included as the input to a job and an implementation for generating splits for the input files.

How to split input files?


FileInputFormat splits only those files that are larger than an HDFS block. The split size can be controlled by various Hadoop properties, which are given below.
  • mapred.min.split.size (int, default 1): the smallest valid size in bytes for a file split.
  • mapred.max.split.size (long, default Long.MAX_VALUE, that is, 9223372036854775807): the largest valid size in bytes for a file split.
  • dfs.block.size (long, default 64 MB): the size of a block in HDFS.

The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. We may also impose a minimum split size: by setting it to a value larger than the block size, we can force splits to be larger than a block. This is not a good idea when using HDFS, because doing so will increase the number of blocks that are not local to a map task. The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block. The split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat):

max(minimumSize, min(maximumSize, blockSize))
and by default:
minimumSize < blockSize < maximumSize
so the split size is blockSize.
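The formula and the three cases above can be checked with a minimal sketch that mirrors FileInputFormat's computeSplitSize() (the values here are illustrative, not from the post):

```java
public class SplitSizeDemo {
    // Mirrors FileInputFormat.computeSplitSize():
    // max(minimumSize, min(maximumSize, blockSize)).
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block

        // Defaults: minimumSize (1) < blockSize < maximumSize -> split = block size.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, blockSize));

        // Raising the minimum above the block size forces splits larger than a block.
        System.out.println(computeSplitSize(128L * 1024 * 1024, Long.MAX_VALUE, blockSize));

        // Lowering the maximum below the block size forces splits smaller than a block.
        System.out.println(computeSplitSize(1, 32L * 1024 * 1024, blockSize));
    }
}
```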

How to prevent splitting?


Some applications don’t want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one. Implemented as a map task, this algorithm will work only if one map processes the whole file. There are a couple of ways to ensure that an existing file is not split. The first (quick and dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the concrete subclass of FileInputFormat that you want to use and override the isSplitable() method to return false. For example, here’s a nonsplittable TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}

Text Input


TextInputFormat: As you may know, text files are the most common input files, and TextInputFormat is the default InputFormat. Each record is a line of input; the key (a LongWritable) is the byte offset within the file of the beginning of the line, and the value is the contents of the line (excluding any line terminators). For instance, consider a file with the following content:

A king should hunt regularly
A queen should shop daily,
Other people should just try.

The records are interpreted as the following key-value pairs.
(0, A king should hunt regularly)
(29, A queen should shop daily,)
(55, Other people should just try.)

Remember, the keys are not line numbers but byte offsets from the beginning of the file. The offset is sufficient to serve as a unique identifier for each line within a file, and combined with the file name it is unique within the filesystem.
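How those offset keys come about can be sketched in plain Java. This is a simplified illustration (a tiny made-up file, single-byte encoding, '\n' terminators), not TextInputFormat's actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineOffsetDemo {
    // Computes (byte offset -> line) pairs the way TextInputFormat keys records.
    // The +1 accounts for the '\n' terminator consumed with each line.
    static Map<Long, String> offsets(String file) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n")) {
            records.put(offset, line);
            offset += line.length() + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Tiny illustrative file: "one", "two", "three" on separate lines.
        offsets("one\ntwo\nthree\n").forEach((k, v) -> System.out.println(k + ", " + v));
    }
}
```

Here "one" starts at offset 0, "two" at 4 (3 bytes plus the newline), and "three" at 8.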

NLineInputFormat: With TextInputFormat, the number of lines that each mapper receives depends on the size of the split and the length of the lines. If we want to fix this number, NLineInputFormat is the InputFormat to use. As with TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves. Here N refers to the number of lines of input that each mapper receives. The default value of N is 1, so each mapper receives exactly one line of input.
mapreduce.input.lineinputformat.linespermap

The property written above controls the value of N.
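The effect of N can be simulated in plain Java: partition the input lines into groups of N, one group per mapper (a sketch only; real NLineInputFormat also carries the byte-offset keys, omitted here for brevity):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NLineDemo {
    // Simulates NLineInputFormat: each mapper receives at most n lines.
    static List<List<String>> groupsOfN(List<String> lines, int n) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            groups.add(lines.subList(i, Math.min(i + n, lines.size())));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        // With N = 2, three mappers receive 2, 2, and 1 line(s) respectively.
        System.out.println(groupsOfN(lines, 2));
    }
}
```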

XML: Hadoop provides a class called StreamXmlRecordReader for XML input. We can use it by setting our input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.StreamXmlRecordReader. The reader is configured by setting job configuration properties that tell it the patterns for the start and end tags.
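A hedged sketch of what such a Streaming invocation might look like (the jar path, input/output paths, tag names, and mapper/reducer commands are placeholders, not from the post; it cannot run without a Hadoop installation):

```shell
# StreamXmlRecordReader treats everything between the begin and end
# patterns as one record passed to the mapper.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D stream.recordreader.class=org.apache.hadoop.streaming.StreamXmlRecordReader \
  -D stream.recordreader.begin="<record>" \
  -D stream.recordreader.end="</record>" \
  -inputformat org.apache.hadoop.streaming.StreamInputFormat \
  -input /data/input.xml \
  -output /data/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
```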

Binary Input


Here is the answer to the question asked at the beginning of this post: yes, we can use data other than textual data. Hadoop MapReduce also supports binary formats.

SequenceFileInputFormat: A sequence file stores sequences of binary key-value pairs. To use data from sequence files as the input to MapReduce, we use SequenceFileInputFormat. SequenceFileInputFormat can read MapFiles as well as sequence files: when it encounters a MapFile, it uses its data file. This is why there is no MapFileInputFormat class.

SequenceFileAsTextInputFormat: SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects. The conversion is performed by calling the toString() method on the keys and values. This format makes sequence files suitable input for Streaming.

Database Input and Output


DBInputFormat is an input format for reading data from a relational database using JDBC. We need to be careful not to overwhelm the database we are reading from by running too many mappers, as DBInputFormat doesn’t have any sharding capabilities. For this reason, it is best used for loading relatively small datasets, perhaps for joining with larger datasets from HDFS using MultipleInputs. The corresponding output format is DBOutputFormat, which is useful for dumping job outputs (of modest size) into a database.
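A hedged sketch of a DBInputFormat job setup (the driver class, connection URL, credentials, table, and column names are placeholders; MyRecord is a hypothetical class implementing DBWritable; this fragment cannot run without a Hadoop cluster and a database):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

Job job = Job.getInstance();
// Placeholder JDBC driver and connection details.
DBConfiguration.configureDB(job.getConfiguration(),
    "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost/mydb", "user", "pass");
job.setInputFormatClass(DBInputFormat.class);
// MyRecord (hypothetical) must implement DBWritable to map columns to fields.
DBInputFormat.setInput(job, MyRecord.class,
    "mytable", null /* conditions */, "id" /* orderBy */, "id", "name");
```

Keeping the number of mappers small here is the practical way to respect the warning above about overwhelming the database.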
