Hadoop Soup: 08/27/14

Wednesday, 27 August 2014

Ranker: Top N values

Ranker finds the top N values in a column, for each group.

Mapper:

The mapper emits each row's group as the key and the entire row as the value. The user supplies the column delimiter, the group column number, the value column number, N, and the order (true for ascending, false for descending).
Suppose our input file is:

1,11,a
2,12,a
3,13,b
4,14,c
5,15,d
6,16,c
7,17,g
8,18,e
9,19,a

A sample command:

hadoop jar /root/Documents/iv3.jar iv3.TopValues  MapReduceInput/xy1.txt  MapReduceOutput/TopVal "," 3 1 1 true

Here "," is the column delimiter.
"3" is the group key, i.e. the 3rd column.
The first "1" is the column on which ranking is done.
The second "1" is N, i.e. keep the top 1 value per group.
"true" means we expect the result in ascending order.
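
To make the argument order concrete, here is a minimal sketch of how the seven arguments after the class name could be read. The class name RankerArgsSketch and its field names are hypothetical and only for illustration; the actual job may parse its arguments differently.

```java
// Hypothetical holder for the job's arguments; not taken from the original code.
class RankerArgsSketch {
    final String inputPath;
    final String outputPath;
    final String delimiter;
    final int groupColumn;   // 1-based column used as the group key
    final int valueColumn;   // 1-based column used for ranking
    final int topN;          // how many rows to keep per group
    final boolean ascending; // true = ascending, false = descending

    RankerArgsSketch(String[] args) {
        if (args.length != 7) {
            throw new IllegalArgumentException(
                "usage: <input> <output> <delimiter> <groupCol> <valueCol> <N> <ascending>");
        }
        inputPath = args[0];
        outputPath = args[1];
        delimiter = args[2];
        groupColumn = Integer.parseInt(args[3]);
        valueColumn = Integer.parseInt(args[4]);
        topN = Integer.parseInt(args[5]);
        ascending = Boolean.parseBoolean(args[6]);
    }

    public static void main(String[] args) {
        // Same values as the sample command above.
        RankerArgsSketch parsed = new RankerArgsSketch(new String[] {
            "MapReduceInput/xy1.txt", "MapReduceOutput/TopVal", ",", "3", "1", "1", "true"});
        System.out.println("group column = " + parsed.groupColumn
            + ", value column = " + parsed.valueColumn
            + ", N = " + parsed.topN
            + ", ascending = " + parsed.ascending);
    }
}
```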

The mapper then emits these key-value pairs:

a,(1,11,a)
a,(2,12,a)
b,(3,13,b)
...
...
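
The map step can be sketched in plain Java, without any Hadoop classes, to show how the key is picked out of each row. The class and method names (RankerMapSketch, mapLine) are made up for this illustration; in the real job this logic would sit inside the Mapper's map method, which is an assumption about the original code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java stand-in for the map step; no Hadoop classes are used here.
class RankerMapSketch {

    // Returns (group, whole row) for one input line.
    // groupColumn is 1-based, matching the "3" in the sample command.
    static Map.Entry<String, String> mapLine(String line, String delimiter, int groupColumn) {
        // Pattern.quote treats the delimiter literally, so "|" or "." also work.
        String[] columns = line.split(Pattern.quote(delimiter), -1);
        return new SimpleEntry<>(columns[groupColumn - 1], line);
    }

    public static void main(String[] args) {
        List<String> input = List.of("1,11,a", "2,12,a", "3,13,b");
        for (String line : input) {
            Map.Entry<String, String> pair = mapLine(line, ",", 3);
            // Prints a,(1,11,a) then a,(2,12,a) then b,(3,13,b)
            System.out.println(pair.getKey() + ",(" + pair.getValue() + ")");
        }
    }
}
```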

Reducer:

The reducer stores each group's rows in a TreeMap, keyed by the value of the ranking column. For group "a" it holds:

key:value
1: 1,11,a
2: 2,12,a
9: 9,19,a

When the number of entries exceeds N, one entry is deleted: for ascending order the entry with the highest key, and for descending order the entry with the lowest key.
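
The eviction rule can be sketched in plain Java as follows. This is a minimal illustration and not the original reducer: the names RankerReduceSketch and topN are hypothetical, the ranking column is assumed to hold whole numbers, and the rows are assumed to be emitted in the requested order. Note that because the TreeMap is keyed on the ranking value, two rows with the same value in one group would overwrite each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Plain-Java stand-in for the reduce step of one group.
class RankerReduceSketch {

    // Keeps the top n rows of one group, keyed by the numeric value column.
    // valueColumn is 1-based; ascending=true keeps the n smallest keys.
    static List<String> topN(List<String> rows, String delimiter,
                             int valueColumn, int n, boolean ascending) {
        TreeMap<Long, String> ranked = new TreeMap<>();
        for (String row : rows) {
            String[] columns = row.split(Pattern.quote(delimiter), -1);
            long key = Long.parseLong(columns[valueColumn - 1].trim());
            ranked.put(key, row); // rows with equal keys overwrite each other
            if (ranked.size() > n) {
                if (ascending) {
                    ranked.pollLastEntry();  // drop the highest key
                } else {
                    ranked.pollFirstEntry(); // drop the lowest key
                }
            }
        }
        // Assumption: results are emitted in the requested order.
        if (ascending) {
            return new ArrayList<>(ranked.values());
        }
        return new ArrayList<>(ranked.descendingMap().values());
    }

    public static void main(String[] args) {
        List<String> groupA = List.of("1,11,a", "2,12,a", "9,19,a");
        System.out.println(topN(groupA, ",", 1, 1, true));  // [1,11,a]
        System.out.println(topN(groupA, ",", 1, 2, false)); // [9,19,a, 2,12,a]
    }
}
```

Since the map never holds more than N + 1 entries, memory per group stays small even when a group has many rows.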

So for key "a" it keeps just one entry and deletes the entries with keys "2" and "9". Each key (group) is processed the same way, so the output is:

1,11,a
3,13,b
4,14,c
5,15,d
8,18,e
7,17,g

Here is the entire code.