Friday 10 January 2014

Compression and Decompression in MapReduce


File compression plays a very important role because it reduces the space needed to store files and speeds up data transfer across the network or to and from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop. There are many different compression formats, tools, and algorithms, each with different characteristics.

Compression format    CompressionCodec
DEFLATE               org.apache.hadoop.io.compress.DefaultCodec
gzip                  org.apache.hadoop.io.compress.GzipCodec
bzip2                 org.apache.hadoop.io.compress.BZip2Codec
LZO                   com.hadoop.compression.lzo.LzopCodec
LZ4                   org.apache.hadoop.io.compress.Lz4Codec
Snappy                org.apache.hadoop.io.compress.SnappyCodec

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The different tools have very different compression characteristics.
  • Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off.
  • Bzip2 compresses more effectively than gzip, but is slower. Bzip2’s decompression speed is faster than its compression speed, but it is still slower than the other formats.
  • LZO, LZ4, and Snappy, on the other hand, all optimize for speed and are around an order of magnitude faster than gzip, but compress less effectively. Snappy and LZ4 are also significantly faster than LZO for decompression.

The tools listed above typically give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed, and -9 means optimize for space. For example, the following command creates a compressed file file.gz using the fastest compression method: gzip -1 file
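The same speed versus space knob is exposed by Java's standard java.util.zip.Deflater, which implements the DEFLATE algorithm that gzip is built on. The sketch below is only an illustration and not part of any Hadoop job: the class name CompressionLevels, its helper methods, and the sample data are made up for this example. It compresses the same bytes at level 1 and at level 9 and prints the sizes and timings, which will vary from machine to machine.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, written only to illustrate compression levels.
final class CompressionLevels {

    // Compress the input with DEFLATE at the given level (1 = fastest, 9 = smallest).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of deflate(); a corrupt stream is reported as an unchecked exception.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input, stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt DEFLATE stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Made-up, log-like sample data with plenty of repetition.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("record ").append(i)
              .append(": the quick brown fox jumps over the lazy dog\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        int[] levels = {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}; // 1 and 9
        for (int level : levels) {
            long start = System.nanoTime();
            byte[] packed = deflate(data, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level %d: %d -> %d bytes in %d us%n",
                    level, data.length, packed.length, micros);
        }
    }
}
```

On small inputs like this the timing difference is mostly noise. The trade-off becomes visible on the large files that Hadoop jobs usually process.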

Simplest program for compression.

In the above code there is no mapper or reducer. Notice the two lines that do the job of compression. Any other compression format can be used in place of DEFLATE.

Simplest program for decompression.

The decompression code contains no special statement that does the job of decompression. In fact, we are not doing anything at all. But if you give a compressed file as the input file to the above code and run it, it will decompress the file. The decompressed output is produced as a plain, uncompressed file.
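A likely reason nothing special is needed: Hadoop's file-based input formats look at the input file name and pick a codec from its extension (this is the job of CompressionCodecFactory). The sketch below imitates that lookup in plain Java so that it runs without Hadoop. The class CodecByExtension and its method are hypothetical, and the extensions are assumed to be the default ones for the codecs in the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for Hadoop's extension-based codec lookup.
final class CodecByExtension {

    // Assumed default file extensions for the codecs listed in this post.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put(".lz4", "org.apache.hadoop.io.compress.Lz4Codec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Return the codec class name for a file, or empty if it looks uncompressed.
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] names = {"input/part-00000.gz", "input/data.bz2", "input/plain.txt"};
        for (String name : names) {
            System.out.println(name + " -> " + codecFor(name).orElse("(no codec, read as-is)"));
        }
    }
}
```

A file with an unknown extension is simply read as-is, which is why the same job works for both compressed and uncompressed input.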

13 comments:

  1. Aman please have a look also at https://github.com/carlomedas/4mc : splittable LZ4 power unleashed in hadoop at any stage of M/R.

  2. Good article. I have a question. Assume we have data in compressed form on HDFS and we are using some splittable codec (bzip2). When exactly does the decompression take place? Is it during getSplit() at the client side? Are the input splits compressed, with the RecordReader decompressing them?
