Tuesday 7 January 2014

Combiner: Introduction & Use (Or Not To Use)

  • When a mapper runs and produces its intermediate output, that output is first written to disk before being sent over the network through the shuffle/sort and on to the reducer.
  • If we were to do a word count on a book, our mapper would read the text line by line and emit a key/value pair consisting of an individual word as the key and 1 as the value. All those key/value pairs are first written to disk before being sent along over the network. For one book that may not be a problem, but we are talking Big Data here, so this is sub-optimal.
  • To work around this issue we can use a concept known as local aggregation, which simply means that we want to consolidate the data before writing it to disk. Local aggregation can be implemented in two ways. First, we could use internal data structures to store data directly in the mapper (see the sketch below). The downside to this approach is memory pressure and the potential to exceed the amount of memory allocated to the JVM that the map task runs in.
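  As an illustration, here is a minimal sketch of that first approach (sometimes called in-mapper combining) for word count, assuming the newer org.apache.hadoop.mapreduce API; the class name and tokenization are illustrative, not taken from a real job.

      import java.io.IOException;
      import java.util.HashMap;
      import java.util.Map;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Hypothetical mapper that aggregates counts in memory before emitting them.
      public class InMapperCombiningMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {

          // Per-task counts held in memory; this is the internal structure that
          // creates memory pressure if the key space grows too large.
          private final Map<String, Integer> counts = new HashMap<String, Integer>();

          @Override
          protected void map(LongWritable key, Text value, Context context) {
              for (String word : value.toString().split("\\s+")) {
                  if (word.isEmpty()) {
                      continue;
                  }
                  Integer current = counts.get(word);
                  counts.put(word, current == null ? 1 : current + 1);
              }
          }

          @Override
          protected void cleanup(Context context) throws IOException, InterruptedException {
              // Emit one aggregated pair per word instead of one pair per occurrence.
              for (Map.Entry<String, Integer> e : counts.entrySet()) {
                  context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
              }
          }
      }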
  • A better method is to make use of a combiner. Combiners act as local reducers, aggregating data by key while it is in memory. The difference between the two methods is that a combiner will spill, or write to disk, as buffer limits are reached, which resolves the potential out-of-memory issue. It can, however, result in duplicate keys being emitted to the shuffle/sort, which is generally not a problem considering where we started from.
  • The primary goal of combiners is to minimize the number of key/value pairs that will be shuffled across the network between mappers and reducers. A combiner can be considered a “mini reducer” that may be applied several times during the map phase before the new (reduced) set of key/value pairs is sent to the reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class as of Hadoop 0.20). If a combiner is used, its key/value datatypes must match the reducer's input key/value datatypes (a sketch follows the type signatures below):
    • map:(k1,v1)->list(k2,v2)
    • combine:(k2,list(v2))->list(k2,v2)
    • reduce:(k2,list(v2))->list(k3,v3)
  • Often, as in the word count example, the combine and reduce functions are the same, in which case k3 is the same as k2 and v3 is the same as v2.
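  To make the type constraint concrete, here is a minimal word-count combiner sketch, assuming the newer org.apache.hadoop.mapreduce API (where a combiner extends the Reducer class); the class name is illustrative.

      import java.io.IOException;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      // Input key/value types (Text, IntWritable) match the mapper's output types,
      // and the output types stay (Text, IntWritable), so the reducer still sees
      // the same k2/v2 types whether or not the combiner runs.
      public class IntSumCombiner
              extends Reducer<Text, IntWritable, Text, IntWritable> {

          @Override
          protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              // Emit a partial sum for this key; the real reducer adds these
              // partial sums together, so the final result is unchanged.
              context.write(key, new IntWritable(sum));
          }
      }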
  • Suppose five key/value pairs are emitted from the mapper for a given key k: (k,40), (k,30), (k,20), (k,2), (k,8), and the reducer computes a mean. Without a combiner, the reducer receives the list (k,{40,30,20,2,8}) and outputs a mean of 20. But if a combiner were applied first to the two sets ((k,40), (k,30), (k,20)) and ((k,2), (k,8)) separately, the reducer would receive the list (k,{30,5}) and the output would be different (17.5), which is unexpected behavior.
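  The same arithmetic can be checked with a few lines of plain Java (no Hadoop involved); this is only an illustration of why the mean is not safe to pre-aggregate.

      public class MeanIsNotAssociative {
          public static void main(String[] args) {
              double direct = mean(new double[]{40, 30, 20, 2, 8});                  // 20.0
              double firstPartial = mean(new double[]{40, 30, 20});                  // 30.0
              double secondPartial = mean(new double[]{2, 8});                       // 5.0
              double meanOfMeans = mean(new double[]{firstPartial, secondPartial});  // 17.5
              System.out.println(direct + " vs " + meanOfMeans);
          }

          static double mean(double[] xs) {
              double sum = 0;
              for (double x : xs) {
                  sum += x;
              }
              return sum / xs.length;
          }
      }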
  • More generally, combiners can be used when the function you want to apply is both commutative and associative. That is the case for addition, which is why the word count example can benefit from combiners, but not for the mean function. Note that if the reducer function is both associative and commutative (e.g. summing word counts), the same class can serve as both the reducer and the combiner (see the driver sketch below).
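  Wiring this up is a one-line change in the job driver. The sketch below assumes the newer org.apache.hadoop.mapreduce API; WordCountMapper and IntSumReducer are hypothetical class names standing in for the word count job's mapper and reducer.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCountDriver.class);

              job.setMapperClass(WordCountMapper.class);   // hypothetical mapper class
              // Sum is commutative and associative, so the same (hypothetical)
              // reducer class can also serve as the combiner.
              job.setCombinerClass(IntSumReducer.class);
              job.setReducerClass(IntSumReducer.class);

              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);

              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));

              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }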
  • Do not assume that the combiner will run; treat it purely as an optimization. The combiner is not guaranteed to run over all of your data: in some cases, when the data does not need to be spilled to disk, MapReduce will skip the combiner entirely. Note also that the combiner may be run multiple times over subsets of the data, since it runs once per spill.