Monday 20 January 2014

Counters In MapReduce: "A Channel For Statistics Gathering"

In MapReduce, counters provide a useful channel for gathering statistics about a job and for problem diagnosis. Statistics gathering may serve quality control or application-level control. Hadoop maintains some built-in counters for every job, which report various metrics.

Advantages of using counters:
  • A counter should be used to record whether a particular condition occurred, rather than writing a log message in the map or reduce task.
  • Counter values are much easier to retrieve than log output for large distributed jobs.

Disadvantages of using counters:
  • Counter values may go down if a task fails during a job run, so they are definitive only once the job has completed successfully.

Built-in Counters


As mentioned above, Hadoop maintains built-in counters for every job. Counters are divided into groups. Each group contains either task counters, which are updated as a task progresses, or job counters, which are updated as a job progresses.

Task Counters: These gather information about tasks over the course of their execution, and the results are aggregated over all the tasks in a job. For example, the MAP_INPUT_RECORDS counter counts the input records read by each map task and aggregates over all map tasks in a job, giving the total number of input records for the whole job. Task counters are maintained by each task attempt and periodically sent to the tasktracker and then to the jobtracker so they can be globally aggregated (for more detail, see the "Progress And Status Update" section of the YARN: MapReduce 2 post). To guard against errors due to lost messages, task counters are sent in full rather than as incremental counts since the last transmission.
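Why sending counters "in full" guards against lost messages can be shown with a small sketch. This is plain Java with no Hadoop dependencies; the class and method names are hypothetical illustrations, not Hadoop APIs. Because every heartbeat carries the task's complete counter value, the aggregator only keeps the latest snapshot per task, and a dropped message delays the total but never corrupts it.

```java
import java.util.HashMap;
import java.util.Map;

public class CounterSnapshotDemo {
    // Jobtracker-side view: the latest full counter snapshot per task attempt.
    static Map<String, Long> latestSnapshot = new HashMap<>();

    // A task attempt reports its complete counter value, not a delta.
    static void reportFullCount(String taskId, long fullValue) {
        latestSnapshot.put(taskId, fullValue);
    }

    // Global aggregation: sum the latest snapshot from every task.
    static long globalTotal() {
        return latestSnapshot.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        reportFullCount("task_1", 100);  // first heartbeat from task_1
        reportFullCount("task_2", 50);   // first heartbeat from task_2
        // Suppose task_1's next heartbeat (carrying 150) is lost in transit.
        // The following heartbeat carries the full count again, so the
        // aggregate recovers completely on its own:
        reportFullCount("task_1", 200);
        System.out.println(globalTotal()); // prints 250
    }
}
```

Had the task sent only increments, the lost message would have left the global total permanently short by 50.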

Although counter values are definitive only after a job has finished executing successfully, some counters provide useful information while the job is still running, which makes them handy for monitoring a job through the web UI. For example, PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES and COMMITTED_HEAP_BYTES give an indication of how memory usage varies over the course of a particular task attempt.

Job Counters: Job counters are maintained by the jobtracker (or the application master in YARN). This is because, unlike all other counters (including user-defined ones), they don't need to be sent across the network. They measure job-level statistics, not values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map tasks that were launched over the course of a job, including tasks that failed.

User-Defined Java Counters


MapReduce allows users to define their own set of counters, which are incremented as required in the mapper or reducer. Counters are defined by a Java enum, which serves to group related counters. A job may define an arbitrary number of enums, each with an arbitrary number of fields. The name of the enum is the group name, and the enum's fields are the counter names. Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.
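The enum idiom can be sketched as follows. This is a minimal, Hadoop-free stand-in: the Temperature enum and its MISSING/MALFORMED fields are hypothetical examples, and the increment helper plays the role that context.getCounter(enumField).increment(1) plays inside a real mapper or reducer.

```java
import java.util.EnumMap;
import java.util.Map;

public class EnumCounterDemo {
    // Group name = enum name; counter names = enum fields.
    enum Temperature { MISSING, MALFORMED }

    // Stand-in for the per-task counter context.
    static Map<Temperature, Long> counters = new EnumMap<>(Temperature.class);

    static void increment(Temperature c) {
        counters.merge(c, 1L, Long::sum);
    }

    // A mapper-like routine: flag records that are missing or malformed.
    static void map(String record) {
        if (record.isEmpty()) {
            increment(Temperature.MISSING);
        } else if (!record.matches("-?\\d+")) {
            increment(Temperature.MALFORMED);
        }
        // otherwise: emit the parsed temperature (omitted in this sketch)
    }

    public static void main(String[] args) {
        for (String rec : new String[] {"25", "", "abc", "-3", ""}) {
            map(rec);
        }
        System.out.println(counters); // prints {MISSING=2, MALFORMED=1}
    }
}
```

In a real job, each task keeps such per-task counts and the framework aggregates them globally, as described above.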

Dynamic counters: A dynamic counter is one that isn't defined by a Java enum. Because a Java enum's fields are defined at compile time, you can't create new counters on the fly using enums. Consider counting the distribution of temperature quality codes: although the format specification defines the values the quality code can take, it is more convenient to use a dynamic counter to emit the values it actually takes.
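A sketch of the dynamic, String-named style follows. Again this is a Hadoop-free stand-in: the nested map mirrors the group/counter structure, and incrCounter here imitates the String-based interface described below, with the "TemperatureQuality" group name chosen purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class DynamicCounterDemo {
    // group name -> (counter name -> value)
    static Map<String, Map<String, Long>> counters = new HashMap<>();

    // Imitates incrCounter(String group, String counter, long amount):
    // unknown group or counter names are created on first use.
    static void incrCounter(String group, String counter, long amount) {
        counters.computeIfAbsent(group, g -> new HashMap<>())
                .merge(counter, amount, Long::sum);
    }

    public static void main(String[] args) {
        // Quality codes come from the data itself, so counter names are
        // only known at runtime, which is exactly what enums cannot do.
        for (String code : new String[] {"0", "1", "0", "9"}) {
            incrCounter("TemperatureQuality", code, 1);
        }
        System.out.println(counters.get("TemperatureQuality").get("0")); // prints 2
    }
}
```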

The method we use on the Reporter object takes a group and counter name as Strings:

public void incrCounter(String group, String counter, long amount)

The two ways of creating and accessing counters, using enums and using Strings, are actually equivalent, because Hadoop turns enums into Strings to send counters over RPC. Enums are slightly easier to work with, provide type safety, and are suitable for most jobs. For the odd occasion when you need to create counters dynamically, you can use the String interface.