Friday, 21 February 2014

All About Hadoop: Issue #2

In Issue #1 of this "All About Hadoop" series, we discussed some basic facts about Hadoop and its components. This post covers application development and how to get data into Hadoop. So let's get started.

Application Development


To abstract away some of the complexity of the Hadoop programming model, several application development languages that run on top of Hadoop have emerged. Here we will cover the most popular ones: Pig, Hive, and Jaql.

Pig And PigLatin


A little history first: Pig was developed by Yahoo. Just as real pigs can eat almost anything, the Pig programming language is designed to deal with any kind of data. Pig is made up of two components: the language itself, called PigLatin, and the run-time environment in which PigLatin programs are executed; you can think of this as the relationship between a Java application and the JVM. The first step in a Pig program is to LOAD the data we want to manipulate from HDFS. We then run the data through a set of transformations. Finally, we DUMP the data to the screen or STORE the result in a file.

LOAD: As with any other Hadoop feature, the data Pig works on is stored in HDFS. Before a Pig program can access this data, it must first tell Pig which file or files it will use. This is done through the LOAD 'data_file' command, where data_file can be any HDFS file or directory. If data_file is a directory, all the files in that directory are loaded into the program. If the input file format is not one Pig accepts natively, we can add the USING clause to the LOAD statement to specify a user-defined function that can read and interpret the data.

TRANSFORM: This step is responsible for all the data manipulation. Here we can FILTER out rows that are of no use, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.

DUMP and STORE: If we don't specify a DUMP or STORE command, the results of a Pig program are never produced. We would typically use the DUMP command, which sends the output to the screen, while debugging Pig programs. To save the results to a file rather than display them, we use STORE instead of DUMP.
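To make this concrete, here is a small PigLatin sketch that follows the LOAD, TRANSFORM, DUMP/STORE pattern described above; the file name, schema, and output path are hypothetical:

-- LOAD a tab-delimited file from HDFS (file name and schema are made up for illustration)
logs = LOAD '/user/hadoop/logs.txt' USING PigStorage('\t')
       AS (userid:chararray, url:chararray, view_time:long);

-- TRANSFORM: filter out useless rows, group, aggregate, and order the results
good    = FILTER logs BY url IS NOT NULL;
by_user = GROUP good BY userid;
counts  = FOREACH by_user GENERATE group AS userid, COUNT(good) AS hits;
sorted  = ORDER counts BY hits DESC;

-- DUMP to the screen while debugging, or STORE the result in HDFS
DUMP sorted;
STORE sorted INTO '/user/hadoop/output/user_hits';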

After writing a Pig program, we need to run it in the Hadoop environment. This is where the Pig run-time comes in. There are three ways to run a Pig program: embedded in a script, embedded in a Java program, or from the Pig command line, called Grunt. Whichever way we choose, the Pig run-time environment translates the program into a set of map and reduce tasks and runs them under the covers for us. This simplifies the work associated with analyzing massive volumes of data and lets the developer focus on the analysis rather than on the individual map and reduce tasks.
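For example, if the sketch above were saved as user_hits.pig (a hypothetical name), the script and Grunt options look like this from the command line:

pig -x local user_hits.pig    # run against the local file system, handy for testing
pig user_hits.pig             # run as a batch script on the Hadoop cluster
pig                           # start the interactive Grunt shell and enter statements by hand

Running a Pig program embedded in a Java application is typically done through Pig's PigServer API.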

HIVE


Some history and a few facts first: although Pig can be a very powerful and simple language to use, its disadvantage is that it is new. Hive allows developers to write Hive Query Language (HQL) statements that are very similar to SQL statements. The Hive service breaks HQL statements down into MapReduce jobs, which are then executed across the cluster. We can run our Hive queries in several ways:
  • The command-line interface, known as the Hive shell
  • Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) applications
  • The Hive Thrift Client
The Hive Thrift Client is much like any database client that gets installed on a user's machine: it communicates with the Hive services running on the server. We can use the Hive Thrift Client from within applications written in C++, Java, PHP, Python, or Ruby. Hive looks very much like traditional database code with SQL access, but because Hive is based on MapReduce, there are important differences. Hadoop is intended for long sequential scans, and because Hive is built on Hadoop, we should expect queries to have very high latency. This means Hive is not appropriate for applications that need very fast response times, as we would expect with a database such as DB2. Moreover, Hive is read-based and therefore not appropriate for transaction processing, which typically involves a high percentage of write operations.
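As a rough sketch of what HQL looks like (the table, columns, and paths below are made up for illustration), the following session creates a table, loads a file from HDFS into it, and runs an aggregation that Hive executes as one or more MapReduce jobs:

CREATE TABLE page_views (userid STRING, url STRING, view_time BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/hadoop/logs.txt' INTO TABLE page_views;

-- Count page views per URL; familiar SQL, but run as MapReduce under the covers
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC;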

Getting Data Into Hadoop


One of the challenges with HDFS is that we can't use the typical file-system operations we are used to, such as copying, creating, moving, deleting, or accessing a file, with ordinary operating-system tools. To do anything with a file in HDFS, we must use the HDFS interfaces or APIs directly. In this section, we will discuss the basics of getting our data into HDFS, and cover Flume, a distributed data collection service for flowing data into a Hadoop cluster.

Basic Copy Data: The most common way to move files from a local file system into HDFS is through the copyFromLocal command. To get files out of HDFS and back onto the local file system, use the copyToLocal command. An example of these two commands is shown below:
hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.com/dir/hdfsfile
hdfs dfs -copyToLocal hdfs://s1.com/dir/hdfsfile /user/dir/file
These commands are run through the HDFS shell program, which is simply a Java application. The shell uses the Java APIs for getting data in and out of HDFS. The APIs can be called from any Java application.
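The same shell also provides the everyday file-system operations mentioned earlier; for example (all paths are hypothetical):

hdfs dfs -ls /user/dir                 # list a directory
hdfs dfs -mkdir /user/dir/newdir       # create a directory
hdfs dfs -mv /user/dir/a /user/dir/b   # move or rename a file
hdfs dfs -cat /user/dir/hdfsfile       # print a file's contents
hdfs dfs -rm /user/dir/hdfsfile        # delete a file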

The problem with this approach is that we must have Java developers write the logic and programs to read and write data from HDFS. If we need to access HDFS files from our Java applications, we use the APIs in the org.apache.hadoop.fs package. This allows us to incorporate read and write operations directly to and from HDFS from within our MapReduce applications.
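As a minimal sketch, assuming the Hadoop client libraries are on the classpath (the class name and paths are hypothetical), the same copy operations look roughly like this with the org.apache.hadoop.fs API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS, then copy it back out again
        fs.copyFromLocalFile(new Path("/user/dir/file"), new Path("/dir/hdfsfile"));
        fs.copyToLocalFile(new Path("/dir/hdfsfile"), new Path("/user/dir/file"));

        fs.close();
    }
}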

Flume: A flume is a channel that directs water from a source to some other location where it is needed. Flume was created to let you flow data from a source into your Hadoop environment. In Flume, the entities we work with are called sources, decorators, and sinks.

A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation. A decorator is an operation on the stream that can transform the stream in some manner, such as compressing or decompressing data, or modifying it by adding or removing pieces of information.

A number of predefined source adapters are built into Flume. For example, some adapters allow everything coming off a TCP port, or anything arriving on standard input (stdin), to enter the flow. A number of text-file source adapters give granular control to grab a specific file and feed it into a data flow, or even tail a file and continuously feed the flow with whatever new data is written to it.

There are three kinds of sinks in Flume. One is the final destination of a flow and is known as a Collector Tier Event sink; this is where we would land a flow into an HDFS-formatted file system. Another sink type is called an Agent Tier Event sink; it is used when we want the sink to be the input source for another operation. When we use these sinks, Flume ensures the integrity of the flow by sending back acknowledgements that the data has actually arrived at the sink. The final sink type is known as a Basic sink, which can be a text file, the console display, a simple HDFS path, or a null bucket, where the data is simply deleted.
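The exact configuration syntax depends heavily on the Flume version. As a rough sketch only, a newer Flume (1.x, "Flume NG") agent that tails a log file and lands it in HDFS, roughly a tail-style source feeding a collector-style HDFS sink as described above, might be configured with a properties file like this (the agent name, paths, and file names are hypothetical):

# Hypothetical agent 'a1' with one source, one channel, and one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: continuously tail a log file and feed each new line into the flow
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between the source and the sink
a1.channels.c1.type = memory

# Sink: land the flow in HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://s1.com/flume/applogs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1

The agent would then be started with something like: flume-ng agent --conf conf --conf-file applog.conf --name a1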
