Thursday 23 January 2014

Partitioning in MapReduce


As you may know, a job (the MapReduce term for a program) first runs through the mapper, and the mapper's output then goes to the reducer. Ever wondered how many mappers and how many reducers are required for a job to execute? What parameters are taken into consideration when deciding the number of mappers and reducers needed to complete a job successfully? Can we decide it ourselves? If yes, then how?

Let us walk through the whole process, from passing the job to the jobtracker up to the final result produced by the reducer.

Initially the job's input resides on HDFS. When the job is executed, it goes to the jobtracker. The jobtracker then decides, on the basis of the size of the input, how many mappers are required. Here the block size is 128 MB, so if the input is 256 MB the jobtracker splits it into two blocks of 128 MB each. These blocks are sent to the datanodes (tasktrackers) for execution. Each datanode has 2 mapper slots and 2 reducer slots. The jobtracker can then choose which mapper slot to assign each block to.
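The mapper count described above is just arithmetic on the input size. A minimal, Hadoop-free sketch (the class and method names are illustrative, not Hadoop API):

```java
public class SplitCount {
    // One map task per block; a partial last block still needs its own mapper,
    // hence the round-up division.
    static long numSplits(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;                          // 128 MB block size
        System.out.println(numSplits(256L * 1024 * 1024, block)); // 256 MB input -> 2 mappers
    }
}
```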

How does the jobtracker decide which mapper slot to use, and on which datanode?

In our example, the 256 MB input was split into two 128 MB blocks. Suppose there are two datanodes available, each with all mapper slots empty. Now there are two possibilities:
  1. Either jobtracker assigns both blocks to a single tasktracker.
  2. Or jobtracker assigns one block to one tasktracker and one to another.

The jobtracker will follow the second approach. It will assign one 128 MB block to a mapper slot of one tasktracker/datanode and the other 128 MB block to another tasktracker/datanode. If another job comes in, the jobtracker can use the unused mapper slots of each datanode. After a mapper's work is done, its output goes to one of the reducers. Which one? The mechanism that sends specific key-value pairs to specific reducers is called partitioning. In Hadoop, the default partitioner is HashPartitioner, which hashes a record's key to determine which partition (and thus which reducer) the record belongs in. The number of partitions is equal to the number of reduce tasks for the job.
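The default HashPartitioner logic can be sketched without any Hadoop dependency. This is a simplified stand-in, not the Hadoop class itself: the key's hash code, forced non-negative, modulo the number of reduce tasks picks the partition.

```java
public class HashPartitionDemo {
    // Simplified version of Hadoop's HashPartitioner rule:
    // (hash & Integer.MAX_VALUE) clears the sign bit so the modulo
    // result is always a valid partition index in [0, numReduceTasks).
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every record with the same key maps to the same partition,
        // so a single reducer sees all values for that key.
        System.out.println(getPartition("the", 2));
        System.out.println(getPartition("the", 2)); // same key -> same partition
    }
}
```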

Why is partitioning important? First, partitioning has a direct impact on the overall performance of the job we want to run. Second, it may sometimes be required to control how the key/value pairs emitted from the mappers are partitioned over the reducers. Let's understand this with the help of a simple example. Suppose we want to sort the output of wordcount by the number of occurrences of each token. Assume our job will be handled by 2 reducers (we can specify that with job.setNumReduceTasks(2);). If we run the job without any user-defined partitioner, we will get output like this:
No_Of_Occur   Word/Token                No_Of_Occur   Word/Token

1                Hi                      2              so
3                a                       4              because
6                the                     5              is

      Reducer 1                               Reducer 2

This is certainly not what we wanted. Instead, we were expecting output like this:
No_Of_Occur   Word/Token                No_Of_Occur   Word/Token

1                Hi                         4              because
2                so                         5              is
3                a                          6              the

      Reducer 1                               Reducer 2

This is what happens when we use a correct partitioning function: all the tokens with a number of occurrences less than N (here 4) are sent to reducer 1, and the others are sent to reducer 2, resulting in two partitions. Since the tokens are sorted within each partition, we get the expected total order on the number of occurrences. Now suppose we have this sample data:

aman 1
ashima 2
kaushik 3
sood 4
tony 5
stark 6
bruce 7
wayne 8
james 9
bond 10
mark 11
zuckerberg 12
saun 13
parker 14

And we want the result in such a way that names with numbers from 1 to 5 appear in one file and the rest in another. Here is the code to achieve that:

package Partitioner;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UserDefinedPartitioner {

    public static class PartitionerMap extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input lines look like "aman<TAB>1": name as key, number as value.
            String[] line = value.toString().split("\t");
            context.write(new Text(line[0]), new Text(line[1]));
        }
    }

    public static class MyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // Parse the number from the record's own value; sharing a static
            // field with the mapper is unreliable, since the partitioner is
            // not guaranteed to see the mapper's last-parsed line.
            int seed = Integer.parseInt(value.toString());
            if (seed >= 1 && seed <= 5)
                return 0;
            else
                return 1;
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(UserDefinedPartitioner.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(PartitionerMap.class);
        job.setPartitionerClass(MyPartitioner.class);
        job.setNumReduceTasks(2);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
When we list the contents of the output folder, we will see two files, each holding the result of one partition:

$ hadoop fs -ls training/aman/nwc14
Found 3 items
-rw-r--r--   2 gpadmin hadoop     0 2014-01-24 16:17 training/aman/nwc14/_SUCCESS
-rw-r--r--   2 gpadmin hadoop     0 2014-01-24 16:17 training/aman/nwc14/part-r-00000
-rw-r--r--   2 gpadmin hadoop   120 2014-01-24 16:17 training/aman/nwc14/part-r-00001
$ hadoop fs -cat training/aman/nwc15/part-r-00001
bond 10
bruce 7
james 9
mark 11
parker 14
saun 13
stark 6
wayne 8
zuckerberg 12
$ hadoop fs -cat training/aman/nwc15/part-r-00000
aman 1
ashima 2
kaushik 3
sood 4
tony 5
Observe the job.setNumReduceTasks(2); line in the main method. If we don't write this line, the code will still run, but with the default single reduce task both partitions would end up in one output file. To deliberately tell the framework to run two reducers, one per partition, this line must be used.
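The partitioning rule used above is easy to check in isolation. A minimal, Hadoop-free sketch of the same logic (the class and method names here are illustrative):

```java
public class SeedPartitionCheck {
    // Same rule as MyPartitioner: seeds 1..5 go to partition 0, the rest to 1.
    static int getPartition(int seed) {
        return (seed >= 1 && seed <= 5) ? 0 : 1;
    }

    public static void main(String[] args) {
        System.out.println(getPartition(3));  // kaushik 3  -> partition 0 (part-r-00000)
        System.out.println(getPartition(12)); // zuckerberg 12 -> partition 1 (part-r-00001)
    }
}
```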
