Sunday, September 16, 2012

Big Data analytics - Need of the hour... oops, minute

Let me talk about the computing experience that some of our professors' generation faced. They had to use punch cards to enter programs and data: a punch card was a piece of stiff paper carrying digital information, captured through the presence or absence of holes at specific locations on the card.

The world of computing has changed since then. The sheer volume of digital information now captured in every nook and corner of the world opens up exciting new opportunities for data scientists and the machine learning community.

Companies now talk about terabytes and petabytes of data, present in diverse forms in their databases. Obviously, such an enormous volume of data cannot reside on one server, since there are limits to vertical scaling: hardware cost and low redundancy, to name a couple. Maybe there is no machine available that could cater to your data requirements right from the word go.

Even though vertical scaling is the simplest of scaling techniques, it too can hit a 'wall'. The limits can come from the operating system itself or from an operational constraint such as security, management, or a provider's architecture. For example, the table below shows the physical memory limits of various operating systems.

| Operating System | OS type | Physical memory limit |
| --- | --- | --- |
| Windows Server 2008 Standard | 32-bit | 4 GB |
| Windows Server 2008 Standard | 64-bit | 32 GB |
| Linux | 32-bit | 1 GB to 4 GB |
| Linux | 64-bit | 4 GB to 32 GB |

The Googles of the world faced a similar problem when they had to deal with massive volumes of data. They came up with an excellent solution:

- distribute the data across multiple commodity servers, and
- build a mechanism by which the data on each commodity server can be processed independently.

While the former resolves the storage limits of vertical scaling, the latter resolves its processing limits. The sketch below illustrates the idea.
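To make this concrete, here is a minimal word-count sketch in plain Python. It is a toy, not Google's implementation: the list of strings stands in for data spread across servers, and a real framework such as Hadoop would run the map calls in parallel on the machines that actually hold the data.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one chunk of data.
    Each chunk could be processed on a different commodity server."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: combine all the counts emitted for one word."""
    return word, sum(counts)

def map_reduce(chunks):
    # Shuffle: group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for chunk in chunks:  # in a real cluster, these run in parallel
        for word, count in map_phase(chunk):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

# Each string stands in for data stored on a separate server.
chunks = ["big data needs big storage", "storage and processing scale out"]
print(map_reduce(chunks))
# {'big': 2, 'data': 1, 'needs': 1, 'storage': 2, 'and': 1,
#  'processing': 1, 'scale': 1, 'out': 1}
```

The key design point is that the map phase touches only local data, so adding more servers adds both storage and processing capacity at the same time.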

The classical 'MapReduce' framework (at least the name) was born out of this approach, and six years later Google was awarded a patent on the concept. Here is the link to the patent.

The framework opened up opportunities in a big way by turning the traditional concepts of storage and processing upside down. It allows enormous data-handling capability (read: storage and processing) using cheap commodity servers. Data analysis at such enormous volume had never happened before, and so the term 'Big Data' was coined.

Much of the analytics is now real time, because time is money. Identifying a lead quickly, or spotting an outlier or a pattern in a constantly running stream of data, is now possible thanks to Big Data technology.
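As an illustration, here is a toy outlier detector for a stream of numbers. It is a sketch of the idea rather than any particular Big Data product: it keeps only a running mean and variance (Welford's online algorithm), so it can flag anomalies on the fly without ever storing the stream.

```python
import math

class StreamOutlierDetector:
    """Flag values far from the running mean of a data stream,
    updating statistics online so nothing needs to be stored."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
        self.threshold = threshold

    def observe(self, x):
        # Test the new value against the statistics seen so far.
        is_outlier = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                is_outlier = True
        # Welford's online update of mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier

detector = StreamOutlierDetector()
stream = [10, 11, 9, 10, 12, 10, 11, 95, 10]  # 95 is the anomaly
for value in stream:
    if detector.observe(value):
        print("outlier:", value)  # prints: outlier: 95
```

Because each value is processed in constant time and constant memory, the same pattern scales to streams that never stop, which is exactly the real-time setting described above.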

Big Data analytics, therefore, is the need not of the hour but of the minute, or even the second, for companies that cannot afford to lose the race of data analysis while their competitors are already making handsome profits from it.