Friday, 24 January 2014

Secondary Sort In MapReduce: "Sort By Value"

As you may know, MapReduce by defult sorts the keys( Shuffle and Sort Phase) before sending the records to reducers. However, the values are not sorted. The order in which values appear to reducers differ from run to run. This is due to the fact that values are emitted from different map tasks, which may finish at different times from run to run. Generally, MapReduce programs are written in such a way that the order of values reaching reduce method doesn't matter. But if we want to impose an order on the values by sorting and grouping the keys in a particular way? Or if we also want to sort by value?

Let's understand the concept of secondary sorting with the help of an example. Consider the MapReduce program for calculating the maximum temperature for each year ( I shamelessly admit that I am taking this example from " Hadoop, The definitive Guide" and the data used is weather data set). With a slight modification in the format of the keys, secondary sorting gives us the ability to take the value into account during the sort phase. There are two possible approaches which can be followed.

The first approach involves having the reducer buffer all of the values for a given key and do an in-reducer sort on the values. Since the reducer will be receiving all values for a given key, this approach could possibly cause the reducer to run out of memory. The second approach involves creating a composite key by adding a part of, or the entire value to the natural key to achieve your sorting objectives.

We will stick to the second approach for the time being. For this we will need to write a custom partitioner to ensure all the data with same key (the natural key not including the composite key with the value) is sent to the same reducer and a custom Comparator so the data is grouped by the natural key once it arrives at the reducer. To achieve this, we change our keys to be composite: a combination of year and temperature. We want the sort order for keys to be by year (ascending) and then by temperature (descending): favorite According to the definitive guide example of secondary sorting We want the sort order for keys to be by year (ascending) and then by temperature (descending):

1900 35°C
1900 34°C 
1900 34°C
 ... 
1901 36°C 
1901 35°C

By setting a partitioner to partition by the year part of the key, we can guarantee that records for the same year go to the same reducer. This still isn’t enough to achieve our goal, however. A partitioner ensures only that one reducer receives all the records for a year; it doesn’t change the fact that the reducer groups by key within the partition Since we would have already written our own partitioner which would take care of the map output keys going to particular reducer". So, in order to get the desired reult we are going to need 3 main components:

  1. Key should be composite, having both year(natural key) and temperature(natural value).
  2. A partitioner which would pass common years to same partition.
  3. Two comparator,one for comparing year and another for comparing temperature.