Data is everywhere. There is no way to measure the exact total volume of data stored electronically but it
should be in zettabytes (A zettabyte is 1 billion terabytes). So we have a lot of data and we are struggling
to store and analyze it. Now let's try to figure out what is wrong with dealing with such an enormous volume
of data. The answer is same what's wrong in data from PC to portable drives: Speed. Over the years the storage capacities
of hard drives have increased like anything but the rate at which data can be read from drives have not kept up and writing is even slower.
Solution: To read from multiple disks at a time. For instance if we have to store 100 GB of data and we have 100 drives with 100 GB storage space , then it would be faster to read data from 100 drives, each holding 1 GB of data than a single drive holding 100 GB of data.
Problem with solution: First, when using a large number of hardware pieces (hard disks) there is high possibility that one or two might fail.Second, correctly combining the data from different hard disks. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into computation over sets of keys and values. So , MapReduce is a programming model for data processing.
What Hadoop provides and why is MapReduce needed? Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. But Why can’t we use databases with lots of disks to do large-scale batch analysis? This is due to the fact that seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. We can define MapReduce as a complement to a Rational Database Management(RDBMS).
Comparison with RDBMS:
To take advantage of the parallel processing that Hadoop provides we need to express our query as a MapReduce job. After some local,small scale testing we can run it on a cluster of machines.Broadly MapReduce working can be broken down into three components:
It's just a data preparation phase,setting up data in such a way that the reducer function can do its work on it.The map function is also a good place to drop bad records.The output from the map function is processed by the MapReduce framework before being sent to the reduce function.
The map() function is passed a key and value.We convert the Text value containing the line of input into a Java String,then use its substring() method to extract the columns we are interested in.The map() function also provides an instance of Context(Context object is used to store the data which is used by reduce method) to write the output to.
Sample code for mapper class and map method.
Point to remember: Rather than using bult-in Java types,Hadoop provides its own set of basic types that are optimized for network serialization.Few are given below:
Just like map() function, four formal parameters are used to specify the input and output types for reduce() function.The input types of reduce function must match the output types of the map function. Reduces a set of intermediate values which share a key to a smaller set of values. The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int). Reducer implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) method and initialize themselves. Similarly they can use the Closeable.close() method for de-initialization.
Sample code for reducer class and reduce method:
Reducer has 3 primary phases:Shuffle, Sort and Reduce.
Shuffle: Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
Sort:The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Reduce: In this phase the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each ltkey, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(Object, Object). The output of the Reducer is not re-sorted.
Solution: To read from multiple disks at a time. For instance if we have to store 100 GB of data and we have 100 drives with 100 GB storage space , then it would be faster to read data from 100 drives, each holding 1 GB of data than a single drive holding 100 GB of data.
Problem with solution: First, when using a large number of hardware pieces (hard disks) there is high possibility that one or two might fail.Second, correctly combining the data from different hard disks. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into computation over sets of keys and values. So , MapReduce is a programming model for data processing.
What Hadoop provides and why is MapReduce needed? Hadoop provides a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. But Why can’t we use databases with lots of disks to do large-scale batch analysis? This is due to the fact that seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. We can define MapReduce as a complement to a Rational Database Management(RDBMS).
Comparison with RDBMS:
- MapReduce works well on unstructured or semistructured data because it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not intrinsic properties of the data, but they are chosen by the person analyzing the data.
- MapReduce works well for the applications where data is written once and read many times,whereas a relational database is good for datasets that are continually updated.
- MapReduce works on petabytes of data. On the other hand tradiional RDBMS works with gigabytes of data.
To take advantage of the parallel processing that Hadoop provides we need to express our query as a MapReduce job. After some local,small scale testing we can run it on a cluster of machines.Broadly MapReduce working can be broken down into three components:
- Map Method
- Reduce Method
- Code to run
Map function
It's just a data preparation phase,setting up data in such a way that the reducer function can do its work on it.The map function is also a good place to drop bad records.The output from the map function is processed by the MapReduce framework before being sent to the reduce function.
The map() function is passed a key and value.We convert the Text value containing the line of input into a Java String,then use its substring() method to extract the columns we are interested in.The map() function also provides an instance of Context(Context object is used to store the data which is used by reduce method) to write the output to.
Sample code for mapper class and map method.
public static class WordCountMap extends Mapper <LongWritable,Text,Text,IntWritable>
{
@Override
public void map(LongWritable key,Text value,Context context)
throws IOException, InterruptedException
{
// ********some code***********
context.write(new Text(Count), new IntWritable(1));
}
}
}
Point to remember: Rather than using bult-in Java types,Hadoop provides its own set of basic types that are optimized for network serialization.Few are given below:
- LongWritable-Corresponds to Java Long
- Text -Corresponds to Java String
- IntWritable -Corresponds to Java Integer
Reduce function
Just like map() function, four formal parameters are used to specify the input and output types for reduce() function.The input types of reduce function must match the output types of the map function. Reduces a set of intermediate values which share a key to a smaller set of values. The number of Reducers for the job is set by the user via JobConf.setNumReduceTasks(int). Reducer implementations can access the JobConf for the job via the JobConfigurable.configure(JobConf) method and initialize themselves. Similarly they can use the Closeable.close() method for de-initialization.
Sample code for reducer class and reduce method:
public static class WordCountReduce extends Reducer <Text,IntWritable,Text,IntWritable>
{
public void reduce(Text key,Iterable values,Context context)
throws IOException, InterruptedException
{
// ******some code*********
context.write(key, new IntWritable(sum));
}
}
Reducer has 3 primary phases:Shuffle, Sort and Reduce.
Shuffle: Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.
Sort:The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Reduce: In this phase the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each ltkey, (list of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via OutputCollector.collect(Object, Object). The output of the Reducer is not re-sorted.
Did I miss anything? Or you have any doubt.. !!! feel free to ask !!
ReplyDeleteCan we start processing Reduce workers before finishing all mapper workers execution
ReplyDeleteNo we can't. But if you have seen reducer progressing simultaneously with mapper ( something like this : map 80% reduce 10% ) then it might be because input data is very large. Due to the fact that mapper begins emitting it's output and reducer starts receiving the output of mapper, although reducer can't operate on that data unless complete input data is processed by the mapper and received by the reducer.
DeleteThese are some really nice articles written to explain concepts of hadoop, perfect for beginners. Thanks again Aman
ReplyDeleteThe expansion of internet and intelligence in business process lead the way to huge volume of data. It is important to maintain and process these data to be efficient in data handling.
ReplyDelete
ReplyDeleteThe way you have explained the Big Data concept is really superb! I have never come across such a informative blog in my career. Nowadays, Hadoop course is in high demand by most of the professional to enhance their career.
Regards:
Big Data Training |
Big Data Course in Chennai
As technology grows, we have to be updated in our domain to sustain in the IT industry. Your post made me impressed in the way of your writing.
ReplyDeleteI was searching for the right blog to get Hadoop updates to know what is happening in the Big Data industry. I found your blog where I can get a lot of new updates in storing and retrieving the data concept. Thank you admin I would like to share this blog with my friends also. Keep updating, waiting for your next article.
Regards
Hadoop Training Chennai | Hadoop Training in Chennai
realy great..
ReplyDeletePython Internship
Dotnet Internship
Java Internship
Web Design Internship
Php Internship
Android Internship
Big Data Internship
Cloud Internship
Hacking Internship
Robotics Internship
This comment has been removed by the author.
ReplyDeleteGREAT...
ReplyDeleteOracle Internship
R Programming Internship
CCNA Internship
Networking Internship
Artificial Intelligence Internship
Machine Learning Internship
Blockchain Internship
Sql Server Internship
Iot Internship
Data Science Internship
NICE...
ReplyDeleteSelenium Testing Internship
Linux Internship
C Internship
CPP Internship
Embedded System Internship
Matlab Internship
great.....
ReplyDeleteFREE Internship in Nagpur For Computer Engineering Students
Internship For MCA Students
Final Year Projects For Information Technology
Web Design Class
Mechanical Engineering Internship Certificate
Inplant Training For Mechanical Engineering Students
Inplant Training Certificate
Ethical Hacking Course in Chennai
Winter Internship For ECE Students
Internships For ECE Students in Bangalore
good post....
ReplyDeleteHow To Hack On Crosh
Request Letter For Air Ticket Booking To HR
Zeus Learning Aptitude Paper For Software Developer
Cimpress Interview Questions
VCB Rating
Appreciation Letter To Vendor
JS MAX Safe Integer
Why Do You Consider Yourself Suitable For The Position
How To Hack Android Phone From PC
About Bangalore Traffic Essay
GOOD POST
ReplyDeletehacking course
internship for it students
ccna course chennai
civil engineering internship report pdf india
kashi infotech
internships in hyderabad for cse students 2018
cse internships in hyderabad
inplant training for diploma students
internship in hyderabad for cse students
good post..
ReplyDeletecoronavirus update
inplant training in chennai
inplant training
inplant training in chennai for cse
inplant training in chennai for ece
inplant training in chennai for eee
inplant training in chennai for mechanical
internship in chennai
online internships
For Worldwide Remote IT Support
ReplyDeleteContact Quadsel Systems Pvt Ltd
+91 9841016631 / +919841283352
girish@quadsel.in / kalyan@quadsel.in
https://www.quadsel.in/networking/>
https://twitter.com/quadsel/
https://www.linkedin.com/company/quadsel-systems-private-limited/
https://www.facebook.com/quadselsystems/
#quadsel #network #security #technologies #managedservices #Infrastructure #Networking #OnsiteResources #ServiceDeskSupport #StorageServices #WarrantyAMCServices #datacentersolutions #DataCenterBuild #EWaste #InfraConsolidation #DisasterRecovery #NetworkingServices #ImagingServices #MPS #Consulting #WANOptimisation #enduserservices
Transformation of marketing strategy
ReplyDeletewww.bluebase.in
https://www.facebook.com/bluebasesoftware/
https://www.linkedin.com/…/bluebase-software-services-pvt-…/
https://twitter.com/BluebaseL/
#applications #EnterpriseSolutions #CloudApplication #HostingServices #MobileAppDevelopment #Testing #QA #UIdesign #DigitalMarketing #SocialMediaOptimisation #SMO #SocialMediaMarketing #SMM #SearchEngineOptimisation #SEO #SearchEngineMarketing #SEM #WebsiteDevelopment #WebsiteDesigning #WebsiteRevamping
I appreciate your efforts because it conveys the message of what you are trying to say. It's a great skill to make even the person who doesn't know about the subject could able to understand the subject . Your blogs are understandable and also elaborately described. I hope to read more and more interesting articles from your blog. keep it up guys
ReplyDeleteAi & Artificial Intelligence Course in Chennai
PHP Training in Chennai
Ethical Hacking Course in Chennai Blue Prism Training in Chennai
UiPath Training in Chennai
Thanks for sharing this post. I really appreciate the one who has written the article. Keep posting.
ReplyDeleteOracle Training | Online Course | Certification in chennai | Oracle Training | Online Course | Certification in bangalore | Oracle Training | Online Course | Certification in hyderabad | Oracle Training | Online Course | Certification in pune | Oracle Training | Online Course | Certification in coimbatore
Thanks for sharing...Thank you for your post. This is excellent information. oracle training in chennai
ReplyDeleteMindblowing blog very useful thanks
ReplyDeleteAWS Training in Velachery
AWS Training in Chennai
Great post. keep sharing such a worthy information.
ReplyDeletePython Coaching Centre In Chennai
kralbet
ReplyDeletebetpark
tipobet
slot siteleri
kibris bahis siteleri
poker siteleri
bonus veren siteler
mobil ödeme bahis
betmatik
PEİHB