Hadoop Soup: Multiple Input Files In MapReduce: Side Data Distribution

Monday, 27 January 2014

Multiple Input Files In MapReduce: Side Data Distribution

Some problems require more than one input file. For instance, you may want to join records from two input files. In such cases, we have the following options.

First, we can put all the input files we want to use in a single directory and give the path of that directory as the input path.
Second, we can use side data distribution, which is implemented with the distributed cache API.
Third, we can simply use more than one input file and specify each of their paths.

Let us look at the first two approaches here (the third will be explained in my next post).

In the first approach, we put all the input files in a single directory and give the path of that directory. This approach has the limitation that we cannot use input files with different record structures, because every file is read by the same mapper. Thus it is of limited use. In the second approach, we use a main (usually large) input file, the main dataset, together with other small input files. Ever heard the term "lookup file"? Here it means a file containing much less data than the main input file, small enough to be placed in the distributed cache. This approach implements the concept of side data distribution. Side data can be defined as extra read-only data needed by a job to process the main dataset.

Distributed Cache

Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job. To understand this concept more clearly, take this example: suppose we have two input files, one small and the other comparatively large. Let us assume this is the larger file, i.e. the main input file.

101 Vince 12000
102 James 33
103 Tony 32
104 John 25
105 Nataliya 19
106 Anna 20
107 Harold 29

And this is the smaller file.

101 Vince 12000
102 James 10000
103 Tony 20000
104 John 25000
105 Nataliya 15000

What we want is the records whose ID number appears in both files. To achieve this, we use the smaller file as the lookup file and the larger file as the input file. The complete Java code and an explanation of each component are given below:



import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Join extends Configured implements Tool
{

    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text>
    {
        Path[] cachefiles = new Path[0];                    // To store the paths of lookup files
        List<String> exEmployees = new ArrayList<String>(); // To store the data of lookup files

        /******************** Setup Method ********************/
        @Override
        public void setup(Context context)
        {
            Configuration conf = context.getConfiguration();

            try
            {
                cachefiles = DistributedCache.getLocalCacheFiles(conf);
                BufferedReader reader = new BufferedReader(new FileReader(cachefiles[0].toString()));

                try
                {
                    String line;
                    while ((line = reader.readLine()) != null)
                    {
                        exEmployees.add(line); // Data of the lookup file gets stored in the list
                    }
                }
                finally
                {
                    reader.close(); // Always release the file handle
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        } // setup method ends

        /******************** Map Method ********************/
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String[] line = value.toString().split("\t");

            for (String e : exEmployees)
            {
                String[] listLine = e.toString().split("\t");

                if (line[0].equals(listLine[0]))
                {
                    context.write(new Text(line[0]), new Text(line[1] + "\t" + line[2] + "\t" + listLine[2]));
                }
            }
        } // map method ends
    }

    /******************** run Method ********************/
    public int run(String[] args) throws Exception
    {
        Configuration conf = getConf(); // Configuration prepared by ToolRunner
        Job job = new Job(conf, "aggprog");
        job.setJarByClass(Join.class);
        DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());

        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setMapperClass(JoinMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception
    {
        int ecode = ToolRunner.run(new Join(), args);
        System.exit(ecode);
    }
}

This is the result we will get after running the above code.


102     James   33      10000
103     Tony    32      20000
104     John    25      25000
105     Nataliya        19      15000

run() method:

In the run() method, we used the

public static void addCacheFile(URI uri, Configuration conf)

method of the DistributedCache class to add the lookup file to the distributed cache. If you go through the code carefully, you will notice there is no reduce() method, and hence no job.setReducerClass() call in the run() method. In our example there is no need to write a reducer, as the common ID numbers are identified in the map() method itself. Because no reducer class is set, Hadoop runs its default identity reducer, which passes the mapper output through unchanged; calling job.setNumReduceTasks(0) would skip the reduce phase entirely. For the same reason, job.setOutputKeyClass(Text.class); and job.setOutputValueClass(Text.class); specify the output key and value types of the mapper rather than those of a reducer.

setup() method:

In the setup() method,


cachefiles = DistributedCache.getLocalCacheFiles(conf);

is very important to understand. Here we obtain the local paths of the files in the distributed cache.


BufferedReader reader = new BufferedReader(new FileReader(cachefiles[0].toString()));

After that, we read the contents of the first cached file with a BufferedReader and store each line in a List object for further operations. Remember that when the input files were created, we used tab ("\t") as the delimiter so that they can be split properly later.
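The loading step can also be sketched in plain Java, outside Hadoop. This is only an illustration under assumptions: LookupLoader and load() are hypothetical names that do not appear in the Join class, and the records are indexed by ID in a map instead of being kept in a list.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Join class.
class LookupLoader {

    // Reads tab-delimited lines and indexes them by their first field (the ID).
    static Map<String, String[]> load(BufferedReader reader) {
        Map<String, String[]> byId = new HashMap<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] fields = line.split("\t");
                byId.put(fields[0], fields);
            }
        } catch (IOException e) {
            // Fail loudly instead of continuing with a partial lookup table.
            throw new UncheckedIOException(e);
        }
        return byId;
    }

    public static void main(String[] args) throws IOException {
        String sample = "102\tJames\t10000\n103\tTony\t20000\n";
        // try-with-resources closes the reader even if loading fails.
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            Map<String, String[]> lookup = load(reader);
            System.out.println(lookup.get("103")[2]); // salary field of ID 103
        }
    }
}
```

Inside setup(), the same helper could presumably be fed a BufferedReader wrapped around a FileReader for cachefiles[0] instead of the in-memory sample used here.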

map() method:

In the map() method, we receive the lines of the main dataset one by one, convert each to a string, split it into fields using tab ("\t") as the delimiter, and store the fields in a string array (String[] line).



String[] line = value.toString().split("\t");

We do the same processing with each line of the lookup list, so that we can compare the ID, i.e. the first column, of the main dataset and the lookup file.


String[] listLine = e.toString().split("\t");
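Both splits assume every line has exactly three tab-separated fields; a shorter line would make line[2] or listLine[2] throw ArrayIndexOutOfBoundsException. A minimal sketch of a defensive parse follows. RecordParser, parse() and EXPECTED_FIELDS are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the Join class.
class RecordParser {

    static final int EXPECTED_FIELDS = 3; // assumption: ID, name, third column

    // Returns the fields of a tab-delimited record, or null if it is malformed.
    static String[] parse(String record) {
        // A limit of -1 keeps trailing empty fields, which split("\t") would drop.
        String[] fields = record.split("\t", -1);
        return fields.length == EXPECTED_FIELDS ? fields : null;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("105\tNataliya\t19")));
        System.out.println(parse("106\tAnna") == null); // a field is missing
    }
}
```

In a mapper, a null result would typically mean skipping the record rather than failing the whole task.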

If the ID matches, i.e. the ID of a record in the main dataset is also present in the lookup file, then the fields from both files are emitted using the context object.


if (line[0].equals(listLine[0]))
{
    context.write(new Text(line[0]), new Text(line[1]+"\t"+line[2]+"\t"+listLine[2]));
}
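The loop over the list scans the whole lookup file for every input record. As a possible alternative, the lookup data could be held in a HashMap keyed by ID, so that each record needs a single lookup. The sketch below is plain Java and rests on assumptions: HashLookupJoin and join() are hypothetical names, and IDs are assumed to be unique in the lookup file (with duplicates, only one lookup entry per ID would be kept, unlike the list version).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch for illustration; not part of the Join class.
class HashLookupJoin {

    // Returns "id<TAB>name<TAB>age<TAB>salary", or null when the ID has no lookup entry.
    static String join(String record, Map<String, String> salaryById) {
        String[] line = record.split("\t");
        String salary = salaryById.get(line[0]); // one hash lookup instead of a list scan
        if (salary == null) {
            return null; // no match, nothing to emit
        }
        return line[0] + "\t" + line[1] + "\t" + line[2] + "\t" + salary;
    }

    public static void main(String[] args) {
        Map<String, String> salaryById = new HashMap<>();
        salaryById.put("102", "10000");
        salaryById.put("103", "20000");

        System.out.println(join("102\tJames\t33", salaryById));
        System.out.println(join("106\tAnna\t20", salaryById)); // null: ID 106 is not in the lookup data
    }
}
```

In the mapper, a non-null result would correspond to one context.write() call, and a null result to emitting nothing for that record.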
