Monday, September 16, 2013

Twitter Data Analytics eBook from Springer


The pre-print edition of the book "Twitter Data Analytics", slated for release in December 2013, is available now. Download your copy from this link before it expires:

Link to download ebook

Download the code examples here

The publisher is Springer, so it must be good. I had a quick look: it covers the various APIs for accessing user-specific data and streaming data, with code snippets in Java. Please pass it on to anyone interested.

Tuesday, July 9, 2013

Text Mining made simple with R

I started following the R programming language with more interest and stumbled upon a great link. The simplicity of the language is extended further by the tm package for text mining, which has neat (short and sweet) commands that break down the complex problems of text mining, document classification, and the like, and make them look very simple.

Things like stop-word removal, stemming, and the creation of a term-document frequency matrix can each be accomplished in no more than two commands, as the sketch below shows.

Have a look at it:

Text Mining package in R
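
To give a flavour of just how short this gets, here is a minimal sketch (assuming the tm and SnowballC packages are installed; the toy documents are made up for illustration) that takes a few raw strings through stop-word removal, stemming, and a term-document matrix:

library(tm)         # text mining framework
library(SnowballC)  # stemmer backend used by stemDocument()

# A toy corpus of three tiny "documents"
docs <- c("Text mining is made simple with R",
          "The tm package simplifies document classification",
          "Mining text documents in R is simple")
corpus <- Corpus(VectorSource(docs))

# Clean-up: lower-case, strip punctuation, drop English stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Stemming collapses word variants ("mining", "mined" -> "mine")
corpus <- tm_map(corpus, stemDocument)

# Term-document matrix: rows are terms, columns are documents
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)

Stop-word removal, stemming, and the matrix really are one call each, which is exactly the "short and sweet" quality mentioned above.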

Sunday, July 7, 2013

Results from KDnuggets' 2012 annual poll on data mining software


Survey results for: What analytics/data mining software have you used in the past 12 months for a real project (not just evaluation)?



Survey results for: What low-level programming language have you used for analytics/data mining in the past 12 months?



It is heartening to see that the big guns providing state-of-the-art industry tools for data mining (yes, you guessed it: IBM and SAS) did not take the top spot in this poll. The most startling thing is the #1 spot grabbed by R in both polls. It is the language of choice for statisticians and data miners, especially because of the simplicity it offers through its generic functions (Java borrowed the "generics" name, for a rather different concept, in version 5 back in 2004), and its excellent support for graphs, statistics, and mining features such as classification, clustering, and time-series analysis.
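
For the curious, here is a tiny sketch (plain base R, nothing else assumed) of what those generic functions buy you: one call name dispatches to a class-specific method, and defining your own takes only a couple of lines.

# One generic, two behaviours: dispatch is driven by the object's class
x <- rnorm(100)
summary(x)        # five-number summary plus mean, for a numeric vector

fit <- lm(dist ~ speed, data = cars)  # 'cars' ships with base R
summary(fit)      # coefficients, R-squared, residuals, for a model

# Rolling your own S3 generic
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
area.square <- function(shape) shape$side^2

area(structure(list(r = 2), class = "circle"))  # dispatches to area.circle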

Tuesday, July 2, 2013

Big Data space flooded with Hadoop offerings

The Big Data space is flooded with offerings built around Hadoop, the open-source framework. Developers from Yahoo who were heavily involved in the early development of Hadoop spun off a company called Hortonworks, which promotes the framework by providing data and operational services to users and vendors. Competition in the space is intense, with Cloudera, IBM, MapR, EMC, and Amazon also in the game.

Here is a look at the various distributions:


Proprietary
Cloudera Distribution for Hadoop (CDH)
Amazon’s Elastic MapReduce on EC2
MapR M3, M5 and M7
EMC’s Greenplum HD

Open source
Hortonworks
Pentaho

Latest entrants
Pivotal HD from EMC
Intel Distribution for Hadoop
Microsoft’s HDInsight distribution of Hadoop for Windows

Differentiating points of each distribution


IBM InfoSphere BigInsights
Deepest Hadoop platform and application portfolio
Powerful and super fast unstructured analytics engine
Built-in browser-based spreadsheet tool called BigSheets
Adaptive real-time analytics enabled through integration with InfoSphere Streams
App store with a number of re-usable jobs and examples
GPFS file placement optimizer
Accelerators – Social Data and Machine Data Analytics
Software bundles (IBM InfoSphere Streams, IBM InfoSphere Data Explorer, and Cognos Business Intelligence)
CDH
A pure-play Hadoop distribution with the greatest adoption, thanks to its early entry
Recently introduced the Cloudera Development Kit (CDK), a collection of libraries, tools, and examples
Recently introduced Impala, an open-source interactive SQL query engine for analyzing data stored in a Hadoop cluster
Oracle has adopted it as the distribution of choice in its Big Data Appliance
Hortonworks Data Platform
The Yahoo spinoff promoting Hadoop by providing data and operational services to users and vendors
Scalable to meet custom demands
100% open source, with no proprietary licensing
Widest range of deployment options: Linux, Windows, and cloud
MapR
Strong OEM business for its Hadoop Distribution
Provides a NoSQL solution alongside Hadoop in its latest release, M7
Amazon’s Elastic MapReduce offers MapR as an option
Greenplum’s HD enterprise edition has so far used the MapR distribution for Hadoop; the picture may change after EMC’s announcement of its own distribution, Pivotal HD
Amazon’s EMR on EC2
Most prominent Hadoop cloud service provider
Usage-based pricing, so costs can be minimal
Easy to set up, with an enormous amount of documentation
Intel’s Hadoop distribution
New to the market (April 2013)
Allows analytics on encrypted data
Hadoop tweaked to take advantage of Intel hardware: Xeon processors optimized for high-performance I/O, with solid-state-drive storage and 10Gb Ethernet
EMC’s Pivotal HD
New to the market (April 2013)
Takes the radical approach of changing the underlying file system of the Greenplum RDBMS to HDFS, which means:
- Hadoop operations can be performed using native SQL queries on the Greenplum MPP database, whose file system is modified from NFS to HDFS
- Addresses a barrier to Big Data adoption by giving enterprises the opportunity to extend their existing database environment into a Big Data environment
- The scalability of the Greenplum MPP architecture can limit data capacity
Microsoft’s HDInsight
Brings Hadoop to Windows server platform
Choice of deployment options: the Windows Azure cloud, a VM, or an on-premises server

Availability of a spreadsheet tool (Data Explorer with Excel 2013) for data discovery, transformation, and analysis



Apache Hadoop subprojects used in most distributions:

Function                                        Hadoop subprojects
Modeling & development                          MapReduce, Pig, Mahout
Storage & data management                       HDFS, HBase
Data warehousing & querying                     Hive, Sqoop
Data collection & aggregation                   Flume
Cluster management, job scheduling, workflow    ZooKeeper, Oozie, Ambari

Sunday, September 16, 2012

Big Data analytics - Need of the hour .. oops minute

Let me talk about the computing experience that some of our professors' generation faced. They had to use punch cards to enter programs and data: stiff pieces of paper carrying digital information, captured through the presence or absence of holes at specific locations on the card.

The world of computing has changed since then. The volume of digital information now captured in every nook and corner of the world opens up exciting new opportunities for the data science and machine learning communities.

Companies now talk about terabytes and petabytes of data sitting in diverse forms in their databases. Obviously, such an enormous volume of data cannot reside on one server, since there are limits to vertical scaling: hardware cost and low redundancy, to name a couple. Maybe there is no machine available that could cater to your data requirements right from the word go.

Even though vertical scaling is the simplest scaling technique, it too can hit a 'wall'. The limits can come from the operating system itself or from an operational constraint such as security, management, or a provider's architecture. For example, the table below shows the physical memory limits of various operating systems.

Operating System                OS type    Physical memory limit
Windows Server 2008 Standard    32-bit     4 GB
Windows Server 2008 Standard    64-bit     32 GB
Linux                           32-bit     1 GB ~ 4 GB
Linux                           64-bit     4 GB ~ 32 GB

The Googles of the world faced a similar problem when they had to deal with massive volumes of data. They came up with an excellent solution:

- distribute the data onto multiple commodity servers, and
- build a mechanism by which the data on each commodity server can be processed independently.

While the former resolves the storage limits of vertical scaling, the latter resolves its processing limits.

The classic 'MapReduce' framework (at least the name) was born out of this approach, and six years later Google was awarded a patent on the concept. Here is the link to the patent.

The framework opened up opportunities in a big way by turning the traditional concepts of storage and processing upside down. It allows for enormous data handling (read: storage and processing) capability using cheap commodity servers. Data analysis at such volume had never happened before, and so the term 'Big Data' was coined.
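
The shape of the idea is easy to mimic in miniature. Base R happens to ship Map() and Reduce() functions, so a toy word count over independently processed chunks, purely a sketch of the paradigm and not the real distributed framework, looks like this:

# Pretend each element is the slice of text living on one commodity server
chunks <- c("big data is big",
            "data analytics needs big data",
            "analytics is the need of the minute")

# Map step: each chunk is turned into local word counts, independently
count_words <- function(chunk) table(strsplit(chunk, " ")[[1]])
partials <- Map(count_words, chunks)

# Reduce step: merge the per-chunk counts into one global tally
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}
Reduce(merge_counts, partials)

The real framework adds the hard parts (shuffling keys between machines, fault tolerance, data locality), but the map-then-reduce shape is exactly this.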

Much of the analytics is now real-time, because time is money. How quickly a company can identify a lead, or how effectively it can spot an outlier or a pattern in a constantly running stream of data, is now achievable thanks to Big Data technology.

Big Data analytics, therefore, is not the need of the hour but of the minute, or even the second, for companies that cannot afford to lose the data-analysis race while their competitors are already making handsome profits from it.



Wednesday, May 23, 2012

Star vs snowflake schema


Main difference

The snowflake schema is similar to the star schema. However, in the snowflake schema each dimension is normalized into multiple related tables, whereas in the star schema each dimension is denormalized into a single table.
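
A minimal sketch in R (with made-up sales data, just for illustration) makes the difference concrete: in the star layout a single join reaches all the product attributes, while in the snowflake layout the product dimension must itself be joined to a separate category table.

# Fact table, common to both designs
sales <- data.frame(product_id = c(1, 2, 1), amount = c(10, 25, 7))

# Star: one denormalized table per dimension
product_star <- data.frame(product_id = c(1, 2),
                           name       = c("pen", "book"),
                           category   = c("stationery", "print"))
merge(sales, product_star, by = "product_id")        # a single join

# Snowflake: the dimension is normalized into related tables
product_snow <- data.frame(product_id  = c(1, 2),
                           name        = c("pen", "book"),
                           category_id = c(10, 20))
category <- data.frame(category_id = c(10, 20),
                       category    = c("stationery", "print"))
merge(merge(sales, product_snow, by = "product_id"), # two joins for
      category, by = "category_id")                  # the same answer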

Snowflake

The snowflake schema's normalization makes it closer to an OLTP design, and it is therefore better suited to INSERT, UPDATE, and DELETE operations. An example of a snowflake schema:



Star

Star schemas are designed to optimize ease of use and retrieval performance by minimizing the number of tables that must be joined to materialize a query. OLAP cubes are typically built on top of star schemas. A sample star schema: