Tuesday, 11 February 2014

Big Data: The Next Big Thing

Big Data refers to data that can't be processed or analysed using traditional tools and processes. Organizations today are dealing with more and more Big Data and the challenges that come with it. This enormous volume of data sits in semi-structured or unstructured formats, and organizations are even wondering whether it's worth keeping. These challenges are made more complicated by a climate in which organizations can store anything and are generating data like never before in history.

Let's talk about the characteristics of Big Data and how it fits into the current information management landscape. Take the example of railway cars, which have hundreds of sensors. These sensors track things like the conditions experienced by the cars, the state of individual parts, and GPS-based data for shipment tracking and logistics. Processors have been added to interpret sensor data on parts prone to wear, such as bearings, and to identify parts that need repair before they fail and cause more damage. Rail tracks have also been fitted with sensors, every few feet, to detect damage to the track and avoid accidents. Now add the tracking of a rail car's load and its arrival and departure times, and you get an idea of the Big Data problem. All of this data is stored every day and kept for further analysis. Rail cars are just one example, but everywhere we look, we see domains with velocity, volume and variety combining to create the Big Data problem.
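To make the bearing example concrete, here is a minimal sketch of how a processor might flag parts for repair before they fail. The record layout, the `bearings_needing_inspection` helper, the 90 degree limit and the sample readings are all assumptions made up for illustration; real rail systems will use their own data formats and wear criteria.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class BearingReading:
    # Hypothetical record layout, not taken from any real rail system.
    car_id: str
    bearing_id: str
    temperature_c: float


def bearings_needing_inspection(readings, limit_c=90.0):
    """Return sorted (car_id, bearing_id) pairs whose average temperature exceeds limit_c.

    The 90 degree default is an assumed threshold chosen only for this sketch.
    """
    by_bearing = {}
    for r in readings:
        by_bearing.setdefault((r.car_id, r.bearing_id), []).append(r.temperature_c)
    return sorted(key for key, temps in by_bearing.items() if mean(temps) > limit_c)


# Made-up sample readings for one car with two bearings.
readings = [
    BearingReading("car-17", "b1", 71.0),
    BearingReading("car-17", "b1", 73.0),
    BearingReading("car-17", "b2", 95.0),
    BearingReading("car-17", "b2", 99.0),
]
flagged = bearings_needing_inspection(readings)
print(flagged)
```

With hundreds of sensors per car reporting continuously, the same grouping step has to run over far more readings than fit comfortably in one process, which is where the volume and velocity problems begin.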

Characteristics of Big Data


Three characteristics define what we refer to as "Big Data": volume, variety and velocity.

Volume
As the term "Big Data" implies, companies are facing enormous amounts of data. Companies that don't know how to manage this data are overwhelmed by it. They are missing an opportunity that the right technology platform could help them seize: to analyse almost all of the data, or at least the data that is useful, and gain a better understanding of their business and their customers. As the amount of data available to an organization increases, the percentage of that data it can process declines, creating a blind zone.

Variety
The sheer volume associated with Big Data brings a new challenge: variety. The data we face today is not only traditional data but also raw, unstructured and semi-structured data (social media, search indexes, web log files, sensor data from active and passive systems, etc.). Traditional analytic platforms struggle to analyse this raw and unstructured data and to derive understanding from it that can be put to further use. In simple words, variety represents all kinds of data, a fundamental shift from traditional structured data. Because traditional analytic platforms can't deal with unstructured data, organizations are struggling, since their success depends on their ability to draw insights from the various kinds of data. If we look at the data, about 20 percent of it is relational, i.e. structured data that fits a traditional schema, and that is where we have spent most of our time. The other 80 percent of the world's data is unstructured, or semi-structured at best. For instance, videos and pictures are not relational data, yet videos and pictures are what we see every day.

Velocity
Just as the volume and variety of data have changed, so has the velocity at which it is generated and handled. A conventional definition of velocity deals with questions such as: how quickly is the data arriving and being stored, and how quickly can it be retrieved? That gives only a rough idea of velocity; with massive amounts of data, the idea is far more compelling than this definition suggests. Broadly, think of it as the speed at which data is flowing. Traditional technology platforms are incapable of dealing with data that is this huge (Big Data huge) and arriving this fast, and sometimes knowing something first is everything. Identifying a trend, a need or a problem seconds before someone else does gives an edge over competitors. In addition, more and more of the data produced today has a short shelf life, so organizations need to analyse it quickly to gain insights from it. For instance, traffic management wants to know which vehicles are heading towards already crowded highways, where a traffic jam is highly likely, or towards areas where there is already a massive jam. Getting the data in real time, within seconds (by tracking the GPS in vehicles), helps achieve that, because within minutes the locations of the cars will have changed. Dealing effectively with Big Data requires analysing a massive volume of varied data while it is still in motion.
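The traffic example can be sketched as a sliding-window count over incoming GPS pings. This is only an illustration under stated assumptions: the `SlidingWindowCounter` class, the 60-second window, the segment names and the sample pings are all hypothetical, and the sketch assumes pings arrive in timestamp order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events per key over the most recent window_s seconds.

    Assumes events are added in non-decreasing timestamp order.
    """

    def __init__(self, window_s):
        self.window_s = window_s
        self.events = deque()  # (timestamp, key) pairs, oldest first
        self.counts = {}       # key -> number of events still inside the window

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        self.counts[key] = self.counts.get(key, 0) + 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def congested(self, threshold):
        """Return the keys with at least `threshold` events in the window."""
        return sorted(k for k, n in self.counts.items() if n >= threshold)


# Made-up pings: (seconds since start, highway segment the vehicle is heading to).
counter = SlidingWindowCounter(window_s=60)
pings = [(0, "highway-A"), (10, "highway-A"), (20, "highway-B"),
         (70, "highway-A"), (75, "highway-A")]
for ts, segment in pings:
    counter.add(ts, segment)

print(counter.congested(threshold=2))
```

Old pings fall out of the window as new ones arrive, so the answer always reflects where vehicles are heading now, not where they were heading minutes ago.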

Data In Warehousing And Data In Hadoop


Traditional warehouses are capable of dealing only with traditional structured data. The Hadoop platform is well suited to dealing with semi-structured and unstructured data. Data that goes into a warehouse first has to pass through a lot of rigor to get there. Of course it's a costly process, but it makes sure that the data that lands in the warehouse is of high quality and has a broad purpose. Data in Hadoop, on the other hand, rarely undergoes the quality-control rigor applied to data that goes into the warehouse. Why? Given the massive volume and variety of data in Hadoop, there is no way to cleanse and properly document every piece of it, and it would not be economical either. Hadoop data is not trusted in the same way. It might seem to be low in value, but it can in fact be the key to the question not yet asked.
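The contrast between the two approaches can be sketched in a few lines. This is not Hadoop or warehouse code; it is a plain-Python illustration in which the field names, the `load_into_warehouse`, `store_raw` and `read_raw` helpers, and the sample records are all assumptions. The warehouse-style path validates every record before storing it, while the Hadoop-style path keeps raw lines as they arrive and interprets them only when a question is asked.

```python
import json

# Hypothetical warehouse schema: field name -> type every stored row must have.
REQUIRED_FIELDS = {"car_id": str, "load_kg": float}


def load_into_warehouse(record):
    """Validate and convert a record before storing it; reject anything that does not fit."""
    row = {}
    for name, kind in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        row[name] = kind(record[name])
    return row


def store_raw(line, raw_store):
    """Keep the line exactly as it arrived, with no cleansing."""
    raw_store.append(line)


def read_raw(raw_store, field):
    """Interpret raw lines at question time, skipping those that do not parse."""
    values = []
    for line in raw_store:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(doc, dict) and field in doc:
            values.append(doc[field])
    return values


# Warehouse path: the record is cleaned on the way in.
warehouse = [load_into_warehouse({"car_id": "car-17", "load_kg": "1200"})]

# Raw path: everything is kept, including lines nobody planned for.
raw_store = []
for line in ['{"car_id": "car-17", "note": "door sensor flapping"}',
             "not json at all",
             '{"car_id": "car-18"}']:
    store_raw(line, raw_store)

print(warehouse)
print(read_raw(raw_store, "note"))
```

The raw store can answer a question about `note` that the warehouse schema never anticipated, which is the sense in which low-trust data can hold the key to a question not yet asked.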

Conclusion


The term Big Data applies to information that cannot be processed or analyzed using traditional processes and tools. Increasingly, organizations today are facing more and more Big Data challenges. They have access to a wealth of information, but they don't know how to get value out of it because it is sitting in its rawest form or in a semi-structured or unstructured format, and as a result they don't know whether it is worth keeping.