Friday 21 February 2014

All About Hadoop: Issue #2

In Issue #1 of this "All About Hadoop" series, we discussed some basic facts about Hadoop and its components. This post will cover application development. We will also learn how to get data into Hadoop. So let's get started.

Application Development


To abstract away some of the complexity of the Hadoop programming model, several application development languages have emerged that run on top of Hadoop. Here, we will cover two of the most popular: Pig and Hive.

Pig And PigLatin


A little history first: Pig was originally developed at Yahoo!. Just like real pigs, who can eat almost anything, the Pig programming language is designed to deal with any kind of data. Pig is basically made of two components: the first is the language (PigLatin) and the second is the run-time environment where PigLatin programs are executed. We can compare this to the relationship between a Java application and the JVM. The first step in a Pig program is to LOAD the data we want to manipulate from HDFS. Then we run the data through a set of transformations. Finally, we DUMP the results to the screen or STORE them in a file.

LOAD: Just like any other Hadoop feature, the objects or data that Pig works on are stored in HDFS. To give a Pig program access to this data, the program first must tell Pig what file or files it will use. This is achieved through the LOAD 'data_file' command, where data_file can be any HDFS directory or file. If data_file is a directory, then all the files in the directory are loaded into the program. If the input file format is not natively readable by Pig, we can add the USING clause to the LOAD statement to specify a user-defined function that can read and interpret the data.
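As a small sketch, a LOAD statement might look like the following PigLatin (the 'logs' path, the field names and the comma delimiter are all hypothetical; PigStorage is Pig's built-in delimited-text loader):

```
-- load every file under the (hypothetical) HDFS directory 'logs',
-- parsing comma-separated fields into a named, typed schema
raw = LOAD 'logs' USING PigStorage(',')
      AS (user:chararray, url:chararray, view_time:long);
```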

TRANSFORM: The transformation steps are responsible for all of the data manipulation. Here we can FILTER out rows that are of no use, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and much more.

DUMP and STORE: If we don't specify a DUMP or STORE command, the results of a Pig program are not generated. We would typically use the DUMP command, which sends the output to the screen, while debugging Pig programs. If we want to save the results to a file, then instead of DUMP we use STORE.
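Putting the three phases together, a complete PigLatin program might look like this sketch (the input directory, field names and output path are made up for illustration):

```
-- LOAD: read comma-separated records from a hypothetical HDFS directory
raw     = LOAD 'logs' USING PigStorage(',')
          AS (user:chararray, url:chararray);

-- TRANSFORM: filter, group and aggregate
valid   = FILTER raw BY url IS NOT NULL;
by_user = GROUP valid BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(valid) AS visits;
ordered = ORDER counts BY visits DESC;

-- DUMP to the screen while debugging...
DUMP ordered;
-- ...or STORE the result in HDFS instead:
-- STORE ordered INTO 'output/user_visits';
```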

After creating a Pig program, we need to run it in the Hadoop environment. This is where the Pig run-time comes in. There are three ways to run a Pig program: embedded in a script, embedded in a Java program, or from the Pig command line, called Grunt. Whichever way we choose to run the Pig program, the Pig run-time environment translates the program into a set of map and reduce tasks and runs them under the covers for us. This simplifies the work associated with analyzing massive volumes of data and lets the developer focus on the analysis itself rather than on individual map and reduce tasks.
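As a sketch, the script and Grunt options look like this from a shell (the script name is hypothetical; the third option, embedding in Java, goes through Pig's PigServer class instead):

```
# batch mode: run a PigLatin script on the cluster
pig user_visits.pig

# interactive mode: start the Grunt shell and type statements one at a time
pig
grunt> raw = LOAD 'logs' USING PigStorage(',');
```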

HIVE


Some history and a few facts first: although Pig can be a very powerful and simple language to use, its disadvantage is that it is new. Hive, which originated at Facebook, allows developers who already know SQL to write Hive Query Language (HQL) statements that are similar to SQL statements. HQL statements are broken down by the Hive service into MapReduce jobs and executed across the cluster. We can run our Hive queries in many ways:
  • Command line interface known as Hive shell
  • Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) applications
  • Hive Thrift Client
The Hive Thrift Client is much like any database client that gets installed on a user's machine: it communicates with the Hive services running on the server. We can use the Hive Thrift Client within applications written in C++, Java, PHP, Python or Ruby. Hive looks very much like traditional database code with SQL access. But because Hive is based on MapReduce, there are important differences. Hadoop is intended for long sequential scans, and since Hive is built on Hadoop, we can expect queries to have very high latency. This means Hive is not appropriate for applications that need very fast response times, as we would expect from a database such as DB2. Moreover, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
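As an illustration, an HQL session might look like the following sketch (the table, columns and input path are hypothetical):

```
-- define a table over comma-separated text files
CREATE TABLE page_views (user_id STRING, url STRING, view_time BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- load a file already sitting in HDFS into the table
LOAD DATA INPATH '/user/dir/pageviews.csv' INTO TABLE page_views;

-- this SELECT is compiled by Hive into one or more MapReduce jobs
SELECT user_id, COUNT(*) AS visits
FROM page_views
GROUP BY user_id;
```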

Getting Data Into Hadoop


One of the challenges with HDFS is that it is not a regular file system: we can't use the usual operating-system tools for copying, creating, moving, deleting, or accessing a file. To do anything with a file in HDFS, we must use the HDFS interfaces or APIs directly. In this section, we will discuss the basics of getting our data into HDFS, and cover Flume, a distributed data-collection service for flowing data into a Hadoop cluster.

Basic Copy Data: The most common way to move files from a local file system into HDFS is through the copyFromLocal command. To get files out of HDFS to the local file system, use the copyToLocal command. An example of these two commands is shown below:
hdfs dfs -copyFromLocal /user/dir/file hdfs://s1.com/dir/hdfsfile
hdfs dfs -copyToLocal  hdfs://s1.com/dir/hdfsfile /user/dir/file
These commands are run through the HDFS shell program, which is simply a Java application. The shell uses the Java APIs for getting data in and out of HDFS. The APIs can be called from any Java application.

The problem with this approach is that we must have Java application developers write the logic and programs to read and write data from HDFS. If we need to access HDFS files from our Java applications, we would use the classes in the org.apache.hadoop.fs package. This allows us to incorporate read and write operations directly, to and from HDFS, within our MapReduce applications.
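A minimal sketch of that API is shown below, assuming the Hadoop client libraries are on the classpath and that core-site.xml/hdfs-site.xml point at the cluster; the class name and file path are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // write a file into HDFS
        Path file = new Path("/user/dir/example.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello hdfs\n");
        }

        // read it back and print to the console
        try (FSDataInputStream in = fs.open(file)) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.write(b);
            }
        }
        fs.close();
    }
}
```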

Flume: A flume, in the physical world, is a channel that directs water from a source to some other location where water is needed. Flume was created to do the same for data: it lets us flow data from a source into the Hadoop environment. In Flume, the entities we work with are called sources, decorators and sinks.

A source can be any data source, and Flume has many predefined source adapters. A sink is the target of a specific operation. A decorator is an operation on the stream that can transform the stream in some manner, which could be to compress or decompress data, modify the data by adding or removing pieces of information, and more.

A number of predefined source adapters are built into Flume. For example, some adapters allow everything coming off a TCP port, or anything arriving on standard input (stdin), to enter the flow. A number of text-file source adapters give granular control to grab a specific file and feed it into a data flow, or even to tail a file and continuously feed the flow with whatever new data is written to it.

There are three kinds of sinks in Flume. The first is the final flow destination, known as a Collector Tier Event sink; this is where we would land a flow into an HDFS-formatted file system. The second sink type is called an Agent Tier Event sink, used when we want the sink to be the input source for another operation. When we use these sinks, Flume ensures the integrity of the flow by sending back acknowledgements that the data has actually arrived at the sink. The final sink type is known as a Basic sink, which can be a text file, the console display, a simple HDFS path, or a null bucket where the data is simply deleted.
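Note that the newer Flume releases (1.x, "Flume NG") rename these pieces: flows are built from sources, channels and sinks declared in a properties file, rather than decorators and tiered sinks. As a rough sketch of a tail-a-log-into-HDFS flow in that style (the agent name, log path and HDFS URL are all hypothetical):

```
# a single agent with one source, one channel and one sink
agent1.sources  = tail1
agent1.channels = ch1
agent1.sinks    = hdfs1

# source: continuously tail a log file
agent1.sources.tail1.type     = exec
agent1.sources.tail1.command  = tail -F /var/log/app.log
agent1.sources.tail1.channels = ch1

# channel: buffer events in memory between source and sink
agent1.channels.ch1.type = memory

# sink: land the flow in HDFS
agent1.sinks.hdfs1.type      = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://s1.com/flume/events
agent1.sinks.hdfs1.channel   = ch1
```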