Monday, September 16, 2013

Twitter Data Analytics eBook from Springer


The pre-print edition of the book "Twitter Data Analytics", slated for release in December 2013, is available now. Download your copy from this link before it expires:

Link to download ebook

Download the code examples here

The publisher is Springer, so it must be good. I had a quick look: it covers the various APIs for accessing user-specific data and streaming data, with code snippets in Java. Please pass it on to anyone interested.

Tuesday, July 9, 2013

Text Mining made simple with R

I started following the R programming language with more interest and stumbled upon a great link. The simplicity of the language is extended further by the tm package for text mining, which has neat (short and sweet) commands that break down the complex problems of text mining, document classification, and the like, and make them look very simple.

Things like stop-word removal, stemming, and the creation of a term-document frequency matrix can each be accomplished in no more than two commands, as the sketch below shows.

Have a look at it:

Text Mining package in R
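
To give a flavour of just how short this gets, here is a minimal sketch (assuming the tm and SnowballC packages are installed; the toy documents are made up for illustration) that takes a few raw strings through stop-word removal, stemming, and a term-document matrix:

library(tm)         # text mining framework
library(SnowballC)  # stemmer backend used by stemDocument()

# A toy corpus of three tiny "documents"
docs <- c("Text mining is made simple with R",
          "The tm package simplifies document classification",
          "Mining text documents in R is simple")
corpus <- Corpus(VectorSource(docs))

# Clean-up: lower-case, strip punctuation, drop English stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Stemming collapses word variants ("mining", "mined" -> "mine")
corpus <- tm_map(corpus, stemDocument)

# Term-document matrix: rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)

Stop-word removal, stemming, and the matrix really are one call each, which is exactly the "short and sweet" quality mentioned above.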

Sunday, July 7, 2013

Results from KDnuggets' 2012 annual poll on data mining software


Survey results for: What analytics/data mining software have you used in the past 12 months for a real project (not just evaluation)?



Survey results for: What low-level programming language have you used for analytics/data mining in the past 12 months?



It is heartening to see that the big guns providing state-of-the-art industry tools for data mining (yes, you guessed it: IBM and SAS) did not take the top spot in this poll. The most startling thing is the #1 spot grabbed by R in both polls. It is the language of choice for statisticians and data miners, especially because of the simplicity it offers through its generic functions (Java borrowed the "generics" name, for a rather different concept, in version 5 back in 2004), and its excellent support for graphs, statistics, and mining features such as classification, clustering, and time-series analysis.
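
For the curious, here is a tiny sketch (plain base R, nothing else assumed) of what those generic functions buy you: one call name dispatches to a class-specific method, and defining your own takes only a couple of lines.

# One generic, two behaviours: dispatch is driven by the object's class
x <- rnorm(100)
summary(x)        # five-number summary plus mean, for a numeric vector

fit <- lm(dist ~ speed, data = cars)  # 'cars' ships with base R
summary(fit)      # coefficients, R-squared, residuals, for a model

# Rolling your own S3 generic
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
area.square <- function(shape) shape$side^2

area(structure(list(r = 2), class = "circle"))  # dispatches to area.circle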

Tuesday, July 2, 2013

Big Data space flooded with Hadoop offerings

The Big Data space is flooded with offerings built around Hadoop, the open-source framework. Developers from Yahoo who were heavily involved in the early development of Hadoop spun off a company called Hortonworks, which promotes the framework by providing data and operational services to users and vendors. Competition in the space is intense, with Cloudera, IBM, MapR, EMC, and Amazon also in the game.

Here is a look at the various distributions:


Proprietary
Cloudera Distribution for Hadoop (CDH)
Amazon’s Elastic MapReduce on EC2
MapR M3, M5 and M7
EMC’s Greenplum HD

Open source
Hortonworks
Pentaho

Latest entrants
Pivotal HD from EMC
Intel Distribution for Hadoop
Microsoft’s HDInsight distribution of Hadoop for Windows

Differentiating points of each distribution


IBM InfoSphere BigInsights
Deepest Hadoop platform and application portfolio
Powerful and super fast unstructured analytics engine
Built-in browser-based spreadsheet tool called BigSheets
Adaptive real-time analytics enabled through integration with InfoSphere Streams
App store with a number of re-usable jobs and examples
GPFS file placement optimizer
Accelerators – Social Data and Machine Data Analytics
Software bundles (IBM InfoSphere Streams, IBM InfoSphere Data Explorer, and Cognos Business Intelligence)
CDH
A pure-play Hadoop distribution with the greatest adoption, thanks to its early entry
Recently introduced the Cloudera Development Kit (CDK), a collection of libraries, tools, and examples
Recently introduced Impala, an open-source interactive SQL query engine for analyzing data stored in a Hadoop cluster
Oracle has adopted it as the distribution of choice in its Big Data Appliance
Hortonworks Data Platform
The Yahoo spinoff promoting Hadoop by providing data and operational services to users and vendors
Scalable to meet custom demands
100% open source, with no proprietary licensing
Widest range of deployment options: Linux, Windows, and cloud
MapR
Strong OEM business for its Hadoop Distribution
Provides a NoSQL solution alongside Hadoop in its latest release, M7
Amazon’s Elastic MapReduce offers MapR as an option
Greenplum’s HD enterprise edition has so far used the MapR distribution for Hadoop; the picture may change after EMC’s announcement of its own distribution, Pivotal HD
Amazon’s EMR on EC2
Most prominent Hadoop cloud service provider
Usage-based pricing, so costs can be minimal
Easy to set up, with an enormous amount of documentation
Intel’s Hadoop distribution
New to the market (April 2013)
Allows analytics on encrypted data
Hadoop tweaked to take advantage of Intel hardware: Xeon processors optimized for high-performance I/O, with solid-state-drive storage and 10Gb Ethernet
EMC’s Pivotal HD
New to the market (April 2013)
Takes the radical approach of changing the underlying file system of the Greenplum RDBMS to HDFS, which means:
- Hadoop operations can be performed using native SQL queries on the Greenplum MPP database, whose file system is modified from NFS to HDFS
- Addresses a barrier to Big Data adoption by giving enterprises the opportunity to extend their existing database environment into a Big Data environment
- The scalability of the Greenplum MPP architecture can limit data capacity
Microsoft’s HDInsight
Brings Hadoop to Windows server platform
Choice of deployment options: the Windows Azure cloud, a VM, or an on-premises server

Availability of a spreadsheet tool (Data Explorer with Excel 2013) for data discovery, transformation, and analysis



Apache Hadoop subprojects used in most distributions:

Function                                        Hadoop subprojects
Modeling & development                          MapReduce, Pig, Mahout
Storage & data management                       HDFS, HBase
Data warehousing & querying                     Hive, Sqoop
Data collection & aggregation                   Flume
Cluster management, job scheduling, workflow    ZooKeeper, Oozie, Ambari

Sunday, September 16, 2012

Big Data analytics - Need of the hour .. oops minute

Let me talk about the computing experience that some of our professors' generation faced. They had to use punch cards to enter programs and data: stiff pieces of paper carrying digital information, captured through the presence or absence of holes at specific locations on the card.

The world of computing has changed since then. The volume of digital information now captured in every nook and corner of the world opens up exciting new opportunities for the data science and machine learning communities.

Companies now talk about terabytes and petabytes of data sitting in diverse forms in their databases. Obviously, such an enormous volume of data cannot reside on one server, since there are limits to vertical scaling: hardware cost and low redundancy, to name a couple. Maybe there is no machine available that could cater to your data requirements right from the word go.

Even though vertical scaling is the simplest scaling technique, it too can hit a 'wall'. The limits can come from the operating system itself or from an operational constraint such as security, management, or a provider's architecture. For example, the table below shows the physical memory limits of various operating systems.

Operating System                OS type    Physical memory limit
Windows Server 2008 Standard    32-bit     4 GB
Windows Server 2008 Standard    64-bit     32 GB
Linux                           32-bit     1 GB ~ 4 GB
Linux                           64-bit     4 GB ~ 32 GB

The Googles of the world faced a similar problem when they had to deal with massive volumes of data. They came up with an excellent solution:

- distribute the data onto multiple commodity servers, and
- build a mechanism by which the data on each commodity server can be processed independently.

While the former resolves the storage limits of vertical scaling, the latter resolves its processing limits.

The classic 'MapReduce' framework (at least the name) was born out of this approach, and six years later Google was awarded a patent on the concept. Here is the link to the patent.

The framework opened up opportunities in a big way by turning the traditional concepts of storage and processing upside down. It allows for enormous data handling (read: storage and processing) capability using cheap commodity servers. Data analysis at such volume had never happened before, and so the term 'Big Data' was coined.
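
The shape of the idea is easy to mimic in miniature. Base R happens to ship Map() and Reduce() functions, so a toy word count over independently processed chunks, purely a sketch of the paradigm and not the real distributed framework, looks like this:

# Pretend each element is the slice of text living on one commodity server
chunks <- c("big data is big",
            "data analytics needs big data",
            "analytics is the need of the minute")

# Map step: each chunk is turned into local word counts, independently
count_words <- function(chunk) table(strsplit(chunk, " ")[[1]])
partials <- Map(count_words, chunks)

# Reduce step: merge the per-chunk counts into one global tally
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}
Reduce(merge_counts, partials)

The real framework adds the hard parts (shuffling keys between machines, fault tolerance, data locality), but the map-then-reduce shape is exactly this.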

Much of the analytics is now real-time, because time is money. How quickly a company can identify a lead, or how effectively it can spot an outlier or a pattern in a constantly running stream of data, is now achievable thanks to Big Data technology.

Big Data analytics, therefore, is not the need of the hour but of the minute, or even the second, for companies that cannot afford to lose the data-analysis race while their competitors are already making handsome profits from it.



Wednesday, May 23, 2012

Star vs snowflake schema


Main difference

The snowflake schema is similar to the star schema. However, in the snowflake schema each dimension is normalized into multiple related tables, whereas in the star schema each dimension is denormalized into a single table.
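
A minimal sketch in R (with made-up sales data, just for illustration) makes the difference concrete: in the star layout a single join reaches all the product attributes, while in the snowflake layout the product dimension must itself be joined to a separate category table.

# Fact table, common to both designs
sales <- data.frame(product_id = c(1, 2, 1), amount = c(10, 25, 7))

# Star: one denormalized table per dimension
product_star <- data.frame(product_id = c(1, 2),
                           name       = c("pen", "book"),
                           category   = c("stationery", "print"))
merge(sales, product_star, by = "product_id")        # a single join

# Snowflake: the dimension is normalized into related tables
product_snow <- data.frame(product_id  = c(1, 2),
                           name        = c("pen", "book"),
                           category_id = c(10, 20))
category <- data.frame(category_id = c(10, 20),
                       category    = c("stationery", "print"))
merge(merge(sales, product_snow, by = "product_id"), # two joins for
      category, by = "category_id")                  # the same answer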

Snowflake

The snowflake schema's normalization makes it closer to an OLTP design, and it is therefore better suited to INSERT, UPDATE, and DELETE operations. An example of a snowflake schema:



Star

Star schemas are designed to optimize ease of use and retrieval performance by minimizing the number of tables that must be joined to materialize a query. OLAP cubes are typically built on top of star schemas. A sample star schema: