- First, we can put all the input files we want to use in a single directory and give the path of that directory as the input path.
- Second, we can use side data distribution, which relies on Hadoop's distributed cache API.
- Third, we can simply specify more than one input file and give the path of each.
In the first approach, we put all input files in a single directory and give the path of that directory. The limitation is that every file is read by the same mapper, so we cannot use input files with different data structures. This makes the approach of very limited use.
In the second approach, we use one main (usually large) input file, the main dataset, together with other small input files. Ever heard the term "lookup file"? In our case, understand it this way: it is a file that contains far less data than the main input file and is placed in the distributed cache. This approach implements the concept of side data distribution. Side data can be defined as extra read-only data needed by a job to process the main dataset.
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job. To understand this concept more clearly, take this example: suppose we have two input files, one small and the other comparatively large. Let us assume this is the larger file, i.e. the input file:
101 Vince 12000
102 James 33
103 Tony 32
104 John 25
105 Nataliya 19
106 Anna 20
107 Harold 29
And this is the smaller file:
101 Vince 12000
102 James 10000
103 Tony 20000
104 John 25000
105 Nataliya 15000
Now we want the records whose Id number appears in both files. To achieve this, we use the smaller file as the lookup file and the larger file as the input file. The complete Java code and an explanation of each component are given below:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Join extends Configured implements Tool
{
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text>
    {
        Path[] cachefiles = new Path[0]; // To store the paths of the lookup files
        List<String> exEmployees = new ArrayList<String>(); // To store the lines of the lookup file

        /******************** Setup Method ********************/
        @Override
        public void setup(Context context)
        {
            Configuration conf = context.getConfiguration();
            try
            {
                cachefiles = DistributedCache.getLocalCacheFiles(conf);
                // try-with-resources closes the reader once the lookup file has been read
                try (BufferedReader reader = new BufferedReader(new FileReader(cachefiles[0].toString())))
                {
                    String line;
                    while ((line = reader.readLine()) != null)
                    {
                        exEmployees.add(line); // Each line of the lookup file is stored in the list
                    }
                }
            }
            catch (IOException e)
            {
                e.printStackTrace();
            }
        } // setup method ends

        /******************** Map Method ********************/
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException
        {
            String[] line = value.toString().split("\t");
            for (String e : exEmployees)
            {
                String[] listLine = e.toString().split("\t");
                if (line[0].equals(listLine[0]))
                {
                    // Emit the Id as key and the joined columns of both files as value
                    context.write(new Text(line[0]), new Text(line[1] + "\t" + line[2] + "\t" + listLine[2]));
                }
            }
        } // map method ends
    }

    /******************** run Method ********************/
    public int run(String[] args) throws Exception
    {
        Configuration conf = getConf(); // Configuration supplied by ToolRunner
        Job job = new Job(conf, "aggprog");
        job.setJarByClass(Join.class);
        // args[0]: lookup file, args[1]: main input, args[2]: output directory
        DistributedCache.addCacheFile(new Path(args[0]).toUri(), job.getConfiguration());
        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setMapperClass(JoinMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        return (job.waitForCompletion(true) ? 0 : 1);
    }

    public static void main(String[] args) throws Exception
    {
        int ecode = ToolRunner.run(new Join(), args);
        System.exit(ecode);
    }
}
This is the result we will get after running the above code.
102 James 33 10000
103 Tony 32 20000
104 John 25 25000
105 Nataliya 19 15000
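To check the join logic without a cluster, the comparison done in the map method can be tried on a few in-memory records. The following is only a minimal sketch under stated assumptions: the class LocalJoinSketch and its join method are hypothetical names that are not part of the job above, the records are a small subset of the sample data, and plain Java lists stand in for the input split and the cached lookup file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hadoop-free sketch of the comparison performed in JoinMapper.map().
// LocalJoinSketch and join() are hypothetical names used only for illustration.
class LocalJoinSketch {

    // Joins every main record with every lookup record that shares its first column.
    static List<String> join(List<String> mainRecords, List<String> lookupRecords) {
        List<String> output = new ArrayList<>();
        for (String record : mainRecords) {
            String[] line = record.split("\t");
            for (String lookup : lookupRecords) {
                String[] listLine = lookup.split("\t");
                if (line[0].equals(listLine[0])) {
                    // Key first, then the value fields, all separated by tabs
                    output.add(line[0] + "\t" + line[1] + "\t" + line[2] + "\t" + listLine[2]);
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // A subset of the sample records, with tab as the delimiter
        List<String> mainRecords = Arrays.asList("102\tJames\t33", "103\tTony\t32", "106\tAnna\t20");
        List<String> lookupRecords = Arrays.asList("102\tJames\t10000", "103\tTony\t20000");
        for (String row : join(mainRecords, lookupRecords)) {
            System.out.println(row);
        }
    }
}
```

Record 106 has no partner in the lookup list, so it does not appear in the joined rows.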
Run() method:
In the run() method, we used the
public static void addCacheFile(URI uri, Configuration conf)
method of the DistributedCache class to add the lookup file to the distributed cache. If you go through the code carefully, you will notice there is no reduce() method, and hence no job.setReducerClass() call in the run method. In our example there is no need to write a reducer, because the common Id numbers are identified in the map method itself. (Since the number of reduce tasks is not set to zero, Hadoop still runs its default identity reducer, which writes the mapper output unchanged, sorted by key.) For the same reason, job.setOutputKeyClass(Text.class); and job.setOutputValueClass(Text.class); specify the key_out and value_out data types of the mapper, not those of a reducer.
Setup() method:
In the setup() method, the call
cachefiles = DistributedCache.getLocalCacheFiles(conf);
is very important to understand. Here we retrieve the local paths of the files in the distributed cache.
BufferedReader reader = new BufferedReader(new FileReader(cachefiles[0].toString()));
After that, we read the first cached file with a BufferedReader and store each of its lines in a List object for further operations. Remember that when the input files were created, we used tab ("\t") as the delimiter so that the fields can be split properly later.
Map() method:
In the map method, we receive the lines of the main dataset one by one, convert each line to a string, split it into fields using tab ("\t") as the delimiter, and store the fields in a string array (String[] line).
String[] line = value.toString().split("\t");
We process each line of the lookup file in the same way, so that we can compare the Id, i.e. the first column, of the main dataset and the lookup file.
String[] listLine = e.toString().split("\t");
If the Id numbers match, i.e. the Id of a record in the main dataset is also present in the lookup file, the contents of both records are emitted using the context object.
if (line[0].equals(listLine[0]))
context.write(new Text(line[0]), new Text(line[1]+"\t"+line[2]+"\t"+listLine[2]));
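The map method above scans the whole lookup list for every input record. One possible alternative, shown here only as a sketch and not as the approach used in this post, is to index the lookup lines by Id in a HashMap so that each record needs a single lookup. The class LookupTableSketch and its methods are hypothetical names, and the sketch assumes each Id occurs at most once in the lookup file (with duplicates, only the last line for an Id is kept, whereas the list scan emits one row per match).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a hash-based lookup table; all names here are hypothetical.
class LookupTableSketch {
    // Maps the Id (first column) to the third column of the lookup file
    private final Map<String, String> valueById = new HashMap<>();

    // Indexes one tab-separated lookup line by its Id; malformed lines are skipped.
    void add(String lookupLine) {
        String[] fields = lookupLine.split("\t");
        if (fields.length >= 3) {
            valueById.put(fields[0], fields[2]);
        }
    }

    // Returns the joined row, or null if the record is malformed or its Id is not in the table.
    String joinRecord(String mainLine) {
        String[] line = mainLine.split("\t");
        if (line.length < 3) {
            return null;
        }
        String lookupValue = valueById.get(line[0]);
        if (lookupValue == null) {
            return null;
        }
        return line[0] + "\t" + line[1] + "\t" + line[2] + "\t" + lookupValue;
    }
}
```

In a mapper, add() would be called for each line read in setup(), and joinRecord() once per call to map(), writing to the context only when the result is not null. The length checks also avoid an ArrayIndexOutOfBoundsException on lines with fewer than three fields.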
If anything in the code is not clear, feel free to ask!