Friday, 10 January 2014

Compression and Decompression in MapReduce


File compression plays a very important role: it reduces the space needed to store files, and it speeds up data transfer across the network and to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop. There are many different compression formats, tools, and algorithms, each with different characteristics.

Compression format    CompressionCodec
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec
LZ4                   org.apache.hadoop.io.compress.Lz4Codec
Snappy                org.apache.hadoop.io.compress.SnappyCodec

All compression algorithms exhibit a space/time trade-off, i.e., faster compression and decompression speeds usually come at the expense of smaller space savings. The different tools have very different compression characteristics.
  • Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off.
  • Bzip2 compresses more effectively than gzip, but is slower. Bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats.
  • LZO, LZ4, and Snappy, on the other hand, all optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for decompression.

The tools listed above typically give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed, and -9 means optimize for space. For example, the following command creates a compressed file file.gz using the fastest compression method: gzip -1 file
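The same speed-versus-space knob can be tried from Java without Hadoop. The sketch below is only an illustration: it uses java.util.zip.Deflater (the DEFLATE algorithm that gzip also uses) at levels 1 and 9 on some made-up log lines. The class name DeflateLevels and the sample data are assumptions for the example, and the timings it prints will vary from machine to machine.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class DeflateLevels {
    // Compress the input with DEFLATE at the given level (1 = fastest, 9 = smallest).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Made-up, repetitive log lines so that there is something to compress.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("2014-01-10 INFO record ").append(i)
              .append(" value=").append(i % 17).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        for (int level : new int[] {1, 9}) {
            long start = System.nanoTime();
            byte[] packed = deflate(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, data.length, packed.length, micros);
        }
    }
}
```

Running it prints the compressed size and elapsed time for each level, which gives a rough feel for the trade-off on your own data.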

Simplest program for compression.
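A minimal sketch of what such a program might look like is given here. It is an assumption-laden illustration, not necessarily the original listing: it assumes the org.apache.hadoop.mapreduce API of Hadoop 2.x, a hypothetical class name CompressJob, and input and output paths passed as the two command-line arguments. Because the default identity mapper is used with the default text input format, each output line carries the byte offset of the input line as its key.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: no mapper or reducer is set, so Hadoop's identity defaults run.
public class CompressJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compress");
        job.setJarByClass(CompressJob.class);

        // The default TextInputFormat emits (byte offset, line) pairs.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The two lines that do the job of compression (DEFLATE here).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Swapping DefaultCodec for GzipCodec or BZip2Codec (with the matching import) would change the output format.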

In the above code there is no mapper or reducer. Notice the two lines that do the job of compression. Instead of DEFLATE, any other compression format can be used.

Simplest program for decompression.

In the decompression code we are not using any special call to handle decompression. In fact, we are not doing anything at all. But if you take a compressed file as the input file for the above code and run it, it will decompress the file. The decompressed output is produced as a plain file.
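The reason nothing special is needed is that Hadoop's text input picks a codec from the input file's extension (through CompressionCodecFactory) and decompresses as it reads. The sketch below imitates that idea with plain java.util.zip so it can run without Hadoop. It is not Hadoop's implementation: the class ExtensionCodec and its helper methods are hypothetical names, and only the .gz and .deflate extensions are handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

class ExtensionCodec {
    // Choose a decompressing wrapper from the file name; unknown extensions are read as-is.
    static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) return new GZIPInputStream(raw);
        if (fileName.endsWith(".deflate")) return new InflaterInputStream(raw);
        return raw;
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Read the stored bytes as text, decompressing if the extension calls for it.
    static String readAll(String fileName, byte[] stored) {
        try (InputStream in = open(fileName, new ByteArrayInputStream(stored))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = gzip("hello hadoop\n");
        // The caller never mentions gzip; the ".gz" suffix alone selects the decompressor.
        System.out.print(readAll("part-r-00000.gz", stored));
    }
}
```

The caller only supplies a file name and bytes, which mirrors why the decompression job needs no compression-specific code of its own.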

2 comments:

  1. Aman please have a look also at https://github.com/carlomedas/4mc : splittable LZ4 power unleashed in hadoop at any stage of M/R.

  2. Good article. I have a question. Assume we have data in compressed form on HDFS and we are using some splittable codec (bzip2). When exactly does the decompression take place? Is it during getSplits() at the client side? Are the input splits compressed, and does the RecordReader decompress them?
