Hadoop Soup: 01/13/14

Monday, 13 January 2014

Hadoop Archives

In HDFS, each file is stored as one or more blocks, and the metadata for every block is held in memory by the namenode. This holds no matter how small the file is, so HDFS ends up storing small files inefficiently: a large number of small files can consume a lot of memory on the namenode. HDFS provides a file archiving facility, Hadoop Archives (HAR files), a special archive format that packs files into HDFS blocks more efficiently. This reduces namenode memory usage while still allowing transparent access to the files.
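To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It assumes about 150 bytes of namenode memory per file or block object, a commonly quoted rule of thumb that does not come from this post, and it treats an archive as a single part file while ignoring the index files and directory entries. The function names are made up for illustration.

```python
import math

# ASSUMPTION: ~150 bytes of namenode heap per file or block object.
# This is a rule of thumb, not a figure from this post.
BYTES_PER_NAMENODE_OBJECT = 150


def namenode_objects(file_sizes, block_size):
    """Count one object per file plus one per block the file occupies."""
    return sum(1 + math.ceil(size / block_size) for size in file_sizes)


def estimate_bytes(object_count):
    """Rough namenode memory for the given number of objects."""
    return object_count * BYTES_PER_NAMENODE_OBJECT


block_size = 128 * 1024 * 1024           # 128 MB block size
small_files = [10 * 1024] * 1_000_000    # one million 10 KB files

# Stored individually: every file costs a file object and a block object.
small_objects = namenode_objects(small_files, block_size)

# Packed together (simplified): one big part file holding all the bytes.
archived_objects = namenode_objects([sum(small_files)], block_size)

print("individual files:", small_objects, "objects,",
      estimate_bytes(small_objects), "bytes")
print("single part file:", archived_objects, "objects,",
      estimate_bytes(archived_objects), "bytes")
```

The exact numbers matter less than the ratio: the object count follows the number of files when they are stored individually, and the volume of data when they are packed together.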

Point to remember: in HDFS, a small file does not take up any more disk space than its own size. For example, a 2 MB file stored with a block size of 128 MB uses 2 MB of disk space, not 128 MB.
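The same point as a small calculation. This is only a sketch of the arithmetic, with a hypothetical function name, and it counts a single copy of the data (replication would multiply the disk figure).

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size, block_size):
    """Return (blocks occupied, bytes on disk) for one copy of a file.

    A partially filled block only occupies as many bytes as were written,
    so the disk figure is the file size, not blocks * block_size.
    """
    blocks = math.ceil(file_size / block_size)
    disk_bytes = file_size
    return blocks, disk_bytes


blocks, disk_bytes = hdfs_footprint(2 * MB, 128 * MB)
print(blocks, "block,", disk_bytes // MB, "MB on disk")
```

The cost of a small file is therefore the block entry the namenode has to track, not wasted disk space.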

A Hadoop archive is created from a collection of files using the archive tool. The tool runs a MapReduce job to process the input files in parallel, so a running MapReduce cluster is needed to use it. A Hadoop archive maps to a file system directory and always has a *.har extension. The archive directory contains metadata files (_index and _masterindex) and data files (part-*). The _index file contains the names of the files that are part of the archive and their locations within the part files.

How to Create an Archive

Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>

For example
% hadoop fs -lsr /old/files
-rw-r--r-- 1 tom supergp 1 2013-05-09 09:03 /old/files/a
drwxr-xr-x - tom supergp 0 2013-05-09 09:03 /old/files/dir
-rw-r--r-- 1 tom supergp 1 2013-05-09 09:03 /old/files/dir/b

Now run the archive command:
% hadoop archive -archiveName files.har /old/files /old

The argument to -archiveName is the name of the archive, here files.har. The next argument names the files to put in the archive. Here we are archiving only one source tree, the files in /old/files in HDFS, but the tool accepts multiple source trees. The final argument is the output directory for the HAR file. In releases where -p is required, as in the usage line, the sources are given relative to the parent path, for example -p /old files /old.
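The argument order can be captured in a small helper. This is a hypothetical sketch that only assembles the argument list in the order described above and does not run anything. The function name and the optional parent parameter (mapped to the -p flag from the usage line) are made up for illustration.

```python
from typing import List, Optional


def build_archive_command(archive_name: str,
                          sources: List[str],
                          dest: str,
                          parent: Optional[str] = None) -> List[str]:
    """Assemble a `hadoop archive` argument list (it is not executed here)."""
    if not archive_name.endswith(".har"):
        raise ValueError("archive name must end with .har")
    if not sources:
        raise ValueError("at least one source path is required")

    command = ["hadoop", "archive", "-archiveName", archive_name]
    if parent is not None:
        # With -p, the source paths are relative to the parent directory.
        command += ["-p", parent]
    command += list(sources)   # one or more source trees
    command.append(dest)       # output directory for the HAR file
    return command


# The command used in this post:
print(" ".join(build_archive_command("files.har", ["/old/files"], "/old")))

# The same archive expressed with a parent path:
print(" ".join(build_archive_command("files.har", ["files"], "/old", parent="/old")))
```

The resulting list could be passed to something like subprocess.run on a machine with a Hadoop client installed. That step is left out here because it depends on the cluster setup.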

% hadoop fs -ls /old
Found 2 items
drwxr-xr-x - tom supergp 0 2013-05-09 09:03 /old/files
drwxr-xr-x - tom supergp 0 2009-04-09 19:13 /old/files.har

% hadoop fs -ls /old/files.har
Found 3 items
-rw-r--r-- 10 tom supergp 165 2013-05-09 09:03 /old/files.har/_index
-rw-r--r-- 10 tom supergp 23 2013-05-09 09:03 /old/files.har/_masterindex
-rw-r--r-- 1 tom supergp 2 2013-05-09 09:03 /old/files.har/part-0

The directory listing shows what a HAR file is made of: two index files and a collection of part files (this example has just one of the latter). The part files contain the contents of a number of the original files concatenated together, and the indexes make it possible to look up the part file that an archived file is contained in, as well as its offset and length. All these details are hidden from the application, however, which uses the har URI scheme to interact with HAR files, using a HAR filesystem that is layered on top of the underlying filesystem (HDFS in this case).
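The idea of part files plus an index can be illustrated with a toy in-memory model. This is not the real on-disk format of _index or _masterindex, and the function names are invented. It only shows how concatenated contents, together with an offset and a length per file, are enough to recover each original file.

```python
from typing import Dict, Tuple

# index entry: (part file name, offset within the part, length in bytes)
IndexEntry = Tuple[str, int, int]


def build_toy_archive(files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, IndexEntry]]:
    """Concatenate file contents into one part and record where each one lives."""
    part = bytearray()
    index: Dict[str, IndexEntry] = {}
    for path in sorted(files):
        data = files[path]
        index[path] = ("part-0", len(part), len(data))
        part.extend(data)
    return bytes(part), index


def read_from_toy_archive(part: bytes, index: Dict[str, IndexEntry], path: str) -> bytes:
    """Look up the offset and length in the index, then slice the part data."""
    _part_name, offset, length = index[path]
    return part[offset:offset + length]


# Two one-byte files, mirroring the sizes in the listing above.
part, index = build_toy_archive({
    "/old/files/a": b"A",
    "/old/files/dir/b": b"B",
})

print(len(part))    # size of the toy part file in bytes
print(index["/old/files/dir/b"])
print(read_from_toy_archive(part, index, "/old/files/dir/b"))
```

In the real system the HAR filesystem performs this lookup on behalf of the application, which is why archived files can be read without knowing anything about the part files.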