Hadoop Soup: 01/08/14

The first thing that comes to mind while writing a MapReduce program is the types you are going to use in the code for the Mapper and Reducer classes. There are a few points that should be followed when writing and understanding a MapReduce program. Here is a recap of the data types used in MapReduce (in case you missed the MapReduce Introduction post).

Broadly, the data types used in MapReduce are as follows.

LongWritable - Corresponds to Java Long
Text - Corresponds to Java String
IntWritable - Corresponds to Java Integer
NullWritable - Corresponds to null values

With that quick overview done, we can move on to the key topic: data types in MapReduce. MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs.

The map and reduce functions in MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
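- K1 - Map input key
- V1 - Map input value
- K2 - Map output key (also the reduce input key)
- V2 - Map output value (also the reduce input value)
- K3 - Reduce output key
- V3 - Reduce output value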
In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3).
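To make the general form above concrete, here is a minimal sketch in plain Java. It does not use any Hadoop classes; the class name TypeFlowSketch, its methods, and the word-count task are assumptions chosen only for illustration. Here K1 is Long (a line offset), V1 is String (a line), K2 and K3 are String (a word), and V2 and V3 are Integer (a count).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the MapReduce type flow; no Hadoop classes are used.
final class TypeFlowSketch {

    // map: (K1, V1) -> list(K2, V2)
    // K1 = Long (line offset), V1 = String (line), K2 = String (word), V2 = Integer (1).
    static List<Map.Entry<String, Integer>> map(Long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> (K3, V3); here K3 = String and V3 = Integer.
    static Map.Entry<String, Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return new SimpleEntry<>(word, sum);
    }

    // Groups the map output by key (a stand-in for the shuffle), then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            offset += line.length() + 1; // +1 for the newline, assuming one byte per character
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> kv = reduce(e.getKey(), e.getValue());
            result.put(kv.getKey(), kv.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to be or", "not to be")));
    }
}
```

Notice that the value list handed to reduce has exactly the element type that map emitted (V2). That is the matching requirement described above.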
As noted above, the map output types and the reduce input types must match, but this is not enforced by the Java compiler. If the reduce output types (K3 and V3) differ from the map output types (K2 and V2), both sets of types must be specified in the driver code, otherwise a type-mismatch error is thrown at run time. If K2 and K3 are the same, there is no need to call setMapOutputKeyClass(); setOutputKeyClass() alone is enough. Similarly, if V2 and V3 are the same, only setOutputValueClass() is needed.
NullWritable is used when the user wants to pass either the key or the value (generally the key) of a map/reduce method as null.
If a combine function is used, then it has the same form as the reduce function (and is an implementation of Reducer), except that its output types are the intermediate key and value types (K2 and V2), so that they can feed the reduce function:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Often the combine and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.
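Here is a small plain-Java sketch of why the combine output has to stay in the intermediate types. It assumes a summing job, and the names CombinerSketch, reduceDirectly and combineThenReduce are made up for this example. Each inner list stands for the values one map task emitted for a single key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of a combiner; no Hadoop classes are used.
final class CombinerSketch {

    // Shared by combine and reduce: takes list(V2) for one key and returns a single V2.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // No combiner: every map output value travels to the reducer unchanged.
    static int reduceDirectly(List<List<Integer>> valuesPerMapTask) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> taskValues : valuesPerMapTask) {
            all.addAll(taskValues);
        }
        return sum(all);
    }

    // With a combiner: each map task pre-aggregates its own values, and the
    // reducer receives one partial sum (still of type V2) per map task.
    static int combineThenReduce(List<List<Integer>> valuesPerMapTask) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> taskValues : valuesPerMapTask) {
            partials.add(sum(taskValues));
        }
        return sum(partials);
    }

    public static void main(String[] args) {
        List<List<Integer>> values = Arrays.asList(Arrays.asList(1, 1, 1), Arrays.asList(1, 1));
        System.out.println(reduceDirectly(values));    // prints 5
        System.out.println(combineThenReduce(values)); // prints 5
    }
}
```

Both paths give the same answer because the partial sums are still plain V2 values, so the reduce function cannot tell whether a combiner ran or not.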
The partition function operates on the intermediate key and value types (K2 and V2) and returns the partition index. In practice, the partition is determined solely by the key (the value is ignored):
partition: (K2, V2) → integer
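A partition function of this shape can be sketched in plain Java as follows. The class and method names are assumptions for illustration. The arithmetic is the same hash-based scheme that Hadoop's default HashPartitioner uses: clear the sign bit of the key's hash code, then take the remainder modulo the number of partitions.

```java
// Plain-Java illustration of a hash partitioner; no Hadoop classes are used.
final class PartitionSketch {

    // partition: (K2, V2) -> integer in the range [0, numPartitions).
    // The value is accepted but deliberately ignored, so all records
    // with the same key land in the same partition.
    static <K, V> int partition(K key, V value, int numPartitions) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "a".hashCode() is 97, and 97 % 4 is 1, whatever the value is.
        System.out.println(partition("a", 1, 4));  // prints 1
        System.out.println(partition("a", 99, 4)); // prints 1
    }
}
```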

Default MapReduce Job: No Mapper, No Reducer

Ever tried to run a MapReduce program without setting a mapper or a reducer? Here is the minimal MapReduce program.

Run it over a small data set and check the output. Here is the small data set I used and the final result. You can use a larger data set.

Notice the result file we get after running the above code on the given data. It has an extra column with some numbers in it. The newly added column contains the key for every line. The number is the offset of the start of the line from the beginning of the file: the first line starts at offset 0 (of course), and the second line starts as many characters in as the first line takes up, including its newline. Count the characters: it will be 16, and so on.

This offset is taken as a key and emitted in the result.
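The offset arithmetic can be checked with a small plain-Java simulation. It does not run Hadoop, and the class name OffsetKeySketch and the three sample lines are hypothetical, chosen so that the first line is 15 characters long. Each output line is the offset key, a tab, and the original line, which is how the default text output writes a key and a value.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the offset keys in the default job's output.
final class OffsetKeySketch {

    // Emits "offset<TAB>line" for every line, where offset is the position of
    // the line's first byte counted from the start of the input.
    static List<String> identityJob(String fileContents) {
        List<String> output = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            output.add(offset + "\t" + line);
            // Advance past the line's bytes plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not the data used in the post.
        String data = "Hadoop is great\nPig and Hive\nMapReduce\n";
        for (String record : identityJob(data)) {
            System.out.println(record);
        }
        // Prints:
        // 0    Hadoop is great
        // 16   Pig and Hive
        // 29   MapReduce
    }
}
```

The first line has 15 characters plus a newline, so the second line starts at offset 16, matching the count described above.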

Wednesday, 8 January 2014
