Hadoop Soup: Sequence File

What Is SequenceFile?

SequenceFile is a flat file consisting of binary key/value pairs. It is widely used in MapReduce as an input and output format. In fact, the temporary output of each map task is stored internally as a SequenceFile.

SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively.
There are three SequenceFile formats:

Uncompressed key/value records.
Record-compressed key/value records - only the values are compressed.
Block-compressed key/value records - keys and values are collected in separate blocks and compressed. The block size is configurable.

The recommended way is to use the SequenceFile.createWriter methods to construct the preferred writer implementation. The SequenceFile.Reader acts as a bridge and can read any of the above SequenceFile formats.

Why Do We Need A Sequence File?

HDFS is a distributed file system designed mainly for batch processing of large amounts of data. The default HDFS block size is 64 MB (128 MB from Hadoop 2 onwards). When files are much smaller than the block size, performance degrades badly: reading them involves a large number of seeks and a lot of hopping from one datanode to another, each time to retrieve only a small file.

When files are very small, each map task receives very little input and the job needs a large number of map tasks. For example, 20 GB of data stored as 100 KB files is roughly 200,000 files, and each file gets a map task of its own. The time taken to finish the job increases considerably.
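The arithmetic behind this example can be sketched in a few lines of Java. This is only a back-of-the-envelope model: it assumes the default behaviour where every input file produces at least one split, and the class and method names are made up for illustration.

```java
final class SmallFileMath {

    /** Map tasks needed when each small file becomes its own split. */
    static long mapTasksForSmallFiles(long totalBytes, long fileBytes) {
        return (totalBytes + fileBytes - 1) / fileBytes; // ceiling division
    }

    /** Map tasks needed when the same data is stored in large files split per block. */
    static long mapTasksForBlocks(long totalBytes, long blockBytes) {
        return (totalBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long total = 20L * 1024 * 1024 * 1024; // 20 GB of input
        long small = 100L * 1024;              // 100 KB per file
        long block = 64L * 1024 * 1024;        // 64 MB block size

        System.out.println("Map tasks with 100 KB files: " + mapTasksForSmallFiles(total, small));
        System.out.println("Map tasks with 64 MB blocks: " + mapTasksForBlocks(total, block));
    }
}
```

With these numbers the small-file layout needs 209,716 map tasks, while the same data split into 64 MB blocks needs 320.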

A sequence file solves both of these problems. It is a data structure for binary key-value pairs and can be used as a common format for transferring data between MapReduce jobs. Another important advantage is that a sequence file can serve as an archive that packs many small files into one large file, which avoids the small-file problems described above.

How To Write And Read A Sequence File?

To create a sequence file, use one of the static createWriter() methods, which return a SequenceFile.Writer instance. Write key-value pairs with the append() method and call close() when done. To read a sequence file, create an instance of SequenceFile.Reader and iterate over the records by calling next() repeatedly. There are several versions of next(), and which one to use depends on the serialization framework. With Writable types, next(key, value) returns true if a key-value pair was read and false at the end of the file. With other serialization frameworks, next(key) returns the key object, or null at the end of the file, and the value is then retrieved with getCurrentValue().
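A minimal sketch of these steps is shown here. It assumes the Hadoop 2 (or later) client libraries are on the classpath and uses the option-based createWriter() and Reader constructors; older releases take a FileSystem argument instead. The class name, the sample words and the path passed in args[0] are made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

final class SequenceFileRoundTrip {

    /** Writes a few word/count pairs to a new sequence file. */
    static void write(Configuration conf, Path path) throws IOException {
        String[] words = {"apple", "banana", "cherry"}; // sample data only
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            for (int i = 0; i < words.length; i++) {
                writer.append(new Text(words[i]), new IntWritable(i + 1));
            }
        } // try-with-resources calls close() for us
    }

    /** Reads every pair back and returns them as "key<TAB>value" strings. */
    static List<String> read(Configuration conf, Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        Text key = new Text();
        IntWritable value = new IntWritable();
        try (SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            while (reader.next(key, value)) { // returns false at end of file
                lines.add(key + "\t" + value);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // where to create the sequence file
        write(conf, path);
        read(conf, path).forEach(System.out::println);
    }
}
```

The key and value objects are reused across calls to next(), which is the usual pattern with Writable types, so copy them if you need to keep a record after the next iteration.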

How SequenceFile Is Stored Internally?

All of the above formats (see What Is SequenceFile?) share a common header, which SequenceFile.Reader uses to return the appropriate key/value pairs. The header is summarized below:
SequenceFile Common Header

version - A byte array: 3 bytes of magic header 'SEQ', followed by 1 byte holding the actual version number (for example SEQ4 or SEQ6).
keyClassName - String
valueClassName - String
compression - A boolean that specifies whether compression is turned on for keys/values in this file.
blockCompression - A boolean that specifies whether block compression is turned on for keys/values in this file.
compressor class - The class name of the CompressionCodec used to compress/decompress keys and/or values in this SequenceFile (only if compression is enabled).
metadata - SequenceFile.Metadata for this file (key/value pairs)
sync - A sync marker to denote the end of the header.

All strings are serialized using the Text.writeString API.
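As a small illustration of the version field, the following sketch checks the first four bytes of a file for the SEQ magic and extracts the version byte. It uses only the JDK and does not parse the rest of the header; the class and method names are hypothetical.

```java
final class SeqHeaderPeek {

    /**
     * Returns the version number stored in the fourth byte if the data
     * starts with the magic bytes 'S', 'E', 'Q'; returns -1 otherwise.
     */
    static int readVersion(byte[] head) {
        if (head.length < 4 || head[0] != 'S' || head[1] != 'E' || head[2] != 'Q') {
            return -1;
        }
        return head[3] & 0xFF; // the version is a raw byte value, not an ASCII digit
    }

    public static void main(String[] args) {
        byte[] sequenceFileStart = {'S', 'E', 'Q', 6}; // what a version 6 file starts with
        byte[] somethingElse = {'P', 'K', 3, 4};       // some other file type

        System.out.println("version: " + readVersion(sequenceFileStart));
        System.out.println("version: " + readVersion(somethingElse));
    }
}
```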

The formats for Uncompressed and RecordCompressed Writers are very similar and are explained below:

Uncompressed and RecordCompressed Writer Format

Header
Record
- Record length
- Key length
- Key
- (Compressed?) Value
A sync marker every few kilobytes or so.

The sync marker permits seeking to a random point in a file and then re-synchronizing input with record boundaries. This is required to be able to split large files efficiently for MapReduce processing.

The format for the BlockCompressedWriter is as follows:

BlockCompressed Writer Format

Header
Record Block
- A sync-marker to help in seeking to a random point in the file and then seeking to next record block.
- CompressedKeyLengthsBlockSize
- CompressedKeyLengthsBlock
- CompressedKeysBlockSize
- CompressedKeysBlock
- CompressedValueLengthsBlockSize
- CompressedValueLengthsBlock
- CompressedValuesBlockSize
- CompressedValuesBlock

The compressed blocks of key lengths and value lengths consist of the actual lengths of the individual keys/values, encoded in ZeroCompressedInteger format.
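To give a feel for what a zero-compressed integer looks like, here is a JDK-only sketch of the encoding. It is modeled on my understanding of Hadoop's WritableUtils.writeVLong and should be read as an illustration, not as the reference implementation; the class name is made up.

```java
import java.io.ByteArrayOutputStream;

final class ZeroCompressedInt {

    /** Encodes a long so that small values take one byte and larger ones take 2 to 9 bytes. */
    static byte[] encode(long i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (i >= -112 && i <= 127) {
            out.write((int) i); // small values are stored directly in a single byte
            return out.toByteArray();
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;   // take the one's complement of negative values
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;      // count how many payload bytes are needed
        }
        out.write(len); // first byte encodes the sign and the payload length
        int payloadBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = payloadBytes; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((int) ((i >> shift) & 0xFF)); // most significant byte first
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (long n : new long[] {5, 127, 128, 300, 1_000_000}) {
            System.out.println(n + " takes " + encode(n).length + " byte(s)");
        }
    }
}
```

Since most keys and values are short, their lengths usually fit in a single byte this way, which keeps the length blocks small even before compression.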

A sequence file is composed of a header and one or more records. The first three bytes of a sequence file are the bytes SEQ, which act as a magic number, followed by a single byte representing the version number. The header contains other fields, including the names of the key and value classes, compression details and user-defined metadata. Each file has a randomly generated sync marker, whose value is stored in the header. Sync markers appear between records in the sequence file, but not necessarily between every pair of records.
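The idea of re-synchronizing on a sync marker can be sketched with plain byte arrays. This toy version simply scans forward from an arbitrary offset until it finds the marker. It uses a short made-up marker and ignores details of the real on-disk layout; in a real program SequenceFile.Reader.sync(position) does this work for you.

```java
final class SyncScan {

    /** Returns the index of the first occurrence of sync at or after from, or -1 if there is none. */
    static int nextSync(byte[] data, byte[] sync, int from) {
        outer:
        for (int i = Math.max(from, 0); i + sync.length <= data.length; i++) {
            for (int j = 0; j < sync.length; j++) {
                if (data[i + j] != sync[j]) {
                    continue outer; // mismatch, try the next starting offset
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF}; // toy marker
        byte[] data = new byte[20];
        java.util.Arrays.fill(data, (byte) 'x');  // stand-in for record bytes
        System.arraycopy(sync, 0, data, 4, sync.length);
        System.arraycopy(sync, 0, data, 14, sync.length);

        // A reader dropped at offset 7 (mid-record) skips ahead to the next marker.
        System.out.println("next sync after offset 7: " + nextSync(data, sync, 7));
    }
}
```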

The internal format of the records depends on whether compression is enabled, and if it is, whether it is record compression or block compression. If no compression is enabled (the default), each record is made up of the record length (in bytes), the key length, the key, and then the value. The format for record compression is almost identical to no compression, except the value bytes are compressed using the codec defined in the header. Keys are not compressed.
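The uncompressed record layout can be mimicked with a ByteBuffer. This sketch assumes that the record length field counts the key bytes plus the value bytes, and it treats the key and value as already-serialized byte arrays; the class name is hypothetical and the real writer handles serialization itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class RecordLayout {

    /** Lays out one record as: record length, key length, key bytes, value bytes. */
    static byte[] encodeRecord(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + key.length + value.length);
        buf.putInt(key.length + value.length); // record length (assumed: key + value bytes)
        buf.putInt(key.length);                // key length
        buf.put(key);                          // key bytes
        buf.put(value);                        // value bytes (compressed under record compression)
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] key = "hadoop".getBytes(StandardCharsets.UTF_8);
        byte[] value = "42".getBytes(StandardCharsets.UTF_8);
        System.out.println("record size in bytes: " + encodeRecord(key, value).length);
    }
}
```

Because the key length is stored separately, a reader can work out the value length by subtracting it from the record length.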

Block compression compresses multiple records at once. It is therefore more compact than record compression, and should generally be preferred, because it can take advantage of similarities between records. Records are added to a block until it reaches a minimum size in bytes, defined by the io.seqfile.compress.blocksize property (the default is 1 million bytes). A sync marker is written before the start of every block. A block consists of a field indicating the number of records in the block, followed by four compressed fields: the key lengths, the keys, the value lengths, and the values.
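The size advantage of compressing many records together is easy to see with a small experiment. The sketch below uses the JDK's Deflater as a stand-in for whatever codec Hadoop is configured with, and the record contents are invented sample data, so the exact numbers mean nothing; only the comparison matters.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class BlockVsRecord {

    /** Returns the number of bytes produced by deflating the input in one go. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // count output bytes, discard the data
        }
        deflater.end();
        return total;
    }

    /** Returns {bytes when each value is compressed alone, bytes when all values are compressed together}. */
    static int[] compare(int records) {
        StringBuilder block = new StringBuilder();
        int perRecordTotal = 0;
        for (int i = 0; i < records; i++) {
            String value = "status=OK user=" + i + "\n"; // similar-looking sample values
            perRecordTotal += deflatedSize(value.getBytes(StandardCharsets.UTF_8));
            block.append(value);
        }
        int blockTotal = deflatedSize(block.toString().getBytes(StandardCharsets.UTF_8));
        return new int[] {perRecordTotal, blockTotal};
    }

    public static void main(String[] args) {
        int[] sizes = compare(1000);
        System.out.println("compressed one record at a time: " + sizes[0] + " bytes");
        System.out.println("compressed as a single block:    " + sizes[1] + " bytes");
    }
}
```

Compressing each short value on its own gives the codec almost nothing to work with, while the single block lets it exploit the repeated text across records.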

Enough theory. Let us write some code and use SequenceFile in a program.

We will start with the simple WordCount program. Write the complete WordCount program as usual and add just one line to the main method:



  job.setOutputFormatClass(SequenceFileOutputFormat.class);

The final main method will look like this:




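// Requires: import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
public static void main(String[] args) throws Exception
 {
  Job job = new Job();
  job.setJarByClass(WordCount.class);
  // args[0] is the input path, args[1] is the output path.
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.setMapperClass(WordCountMap.class);
  job.setReducerClass(WordCountReduce.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);

  // Write the reducer output as a SequenceFile instead of plain text.
  job.setOutputFormatClass(SequenceFileOutputFormat.class);

  // Run the job and exit with 0 on success, 1 on failure.
  System.exit(job.waitForCompletion(true) ? 0 : 1);
 }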

Now try to view the output the usual way:



$ hadoop fs -cat traing/aman/nwc8/part-r-00000

Instead of showing the result, this prints unreadable binary data, because a sequence file cannot be viewed with -cat. Use the -text option instead, which decodes sequence files:



$ hadoop fs -text traing/aman/nwc8/part-r-00000

This prints the word counts as readable text.

Tuesday, 14 January 2014
