Tuesday, July 9, 2013

Text Mining made simple with R

I recently started following the R programming language with more interest and stumbled upon a great link. The simplicity of the language is extended further by the tm package for text mining, which has neat (short and sweet) commands that break down complex problems such as text mining and document classification and make them look very simple.

Things like stop word removal, stemming, and the creation of a term-document frequency matrix can each be accomplished in no more than two commands.

Have a look at it:

Text Mining package in R
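
To make the three steps mentioned above concrete, here is a minimal sketch in plain Python (not R, and not the tm package itself). Everything in it is an assumption made for illustration: the tiny stop-word list, the crude suffix-stripping `naive_stem` helper (a stand-in for a real stemmer such as Porter's), and the two sample sentences.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "with"}


def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (sorted terms, matrix) where matrix[i][j] counts term i in doc j."""
    counts = [Counter(preprocess(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Hypothetical sample documents, one string per document.
docs = [
    "Text mining is simple with R",
    "Mining text and classifying documents",
]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Each row of the printed matrix is a term and each column a document. In tm the same pipeline is much shorter, since functions such as `removeWords`, `stemDocument` and `TermDocumentMatrix` do this work for you.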

Sunday, July 7, 2013

Results from the 2012 annual KDnuggets software poll related to data mining


Survey results for: What analytics/data mining software did you use in the past 12 months for a real project (not just evaluation)?



Survey results for: What low-level programming language did you use for analytics/data mining in the past 12 months?



It is heartening to see that the big guns providing state-of-the-art industry tools for data mining (yes, you guessed it right: IBM and SAS) are not finding the top spot in this poll. The most startling thing is that R grabbed the #1 spot in both polls. It is a language of choice for statisticians and data miners, especially because of the simplicity it offers through its various generic functions (not to be confused with Java's generics, which arrived in Java 5 in 2004) and its excellent support for graphs, statistics and mining features such as classification, clustering and time series analysis.

Tuesday, July 2, 2013

Big Data space flooded with Hadoop offerings

The Big Data space is flooded with various offerings around Hadoop (the open source framework). Developers from Yahoo who were heavily involved in the early development of the Hadoop framework started a spinoff called Hortonworks and began promoting the framework by providing data and operational services to users and vendors. Competition has become intense, with Cloudera, IBM, MapR, EMC and Amazon all in the space.

Here is a look at the various distributions:


Proprietary
Cloudera Distribution for Hadoop (CDH)
Amazon’s Elastic MapReduce (EMR) on EC2
MapR M3, M5 and M7
EMC’s Greenplum HD

Open source
Hortonworks
Pentaho

Latest entrants
Pivotal HD from EMC
Intel Distribution for Hadoop
Microsoft’s HDInsight distribution of Hadoop for Windows

Differentiating points of each distribution


InfoSphere BigInsights
Deepest Hadoop platform and application portfolio
Powerful and very fast unstructured analytics engine
Built-in browser-based spreadsheet tool called BigSheets
Adaptive real-time analytics enabled through integration with Streams
App store with a number of reusable jobs and examples
GPFS file placement optimizer
Accelerators – Social Data and Machine Data Analytics
Software bundles (IBM InfoSphere Streams, IBM InfoSphere Data Explorer, and Cognos Business Intelligence)

CDH
Hadoop pure play with the greatest adoption due to early entry
Recently introduced the Cloudera Development Kit (CDK), a collection of libraries, tools and examples
Recently introduced Impala, the open source interactive SQL query engine for analyzing data stored in a Hadoop cluster
Oracle has adopted it as the distribution of choice in its Big Data Appliance

Hortonworks Data Platform
Yahoo spinoff trying to promote Hadoop by providing data and operational services to users and vendors
Scalable to meet custom demands
100% open source and free, with no proprietary license
Widest range of deployment options – Linux, Windows, and cloud

MapR
Strong OEM business for its Hadoop distribution
Provides a NoSQL solution besides Hadoop in its latest release, M7
Amazon’s Elastic MapReduce offers the MapR distribution as an option
Greenplum’s HD enterprise edition has used the MapR distribution for Hadoop so far. The picture may change after EMC’s announcement of its own distribution – Pivotal HD

Amazon’s EMR on EC2
Most prominent Hadoop cloud service provider
Pricing is based on usage and can therefore be minimal
Easy to set up, with an enormous amount of documentation

Intel’s Hadoop distribution
New in market – April 2013
Allows analytics on encrypted data
Tweaked Hadoop to take advantage of its hardware: Xeon components optimized for high-performance I/O and storage using solid-state drives and 10Gb Ethernet

EMC’s Pivotal HD
New in market – April 2013
It is a radical approach of changing the underlying file system of the RDBMS (Greenplum) to HDFS, which means:
  - Hadoop operations can be performed using native SQL queries on the Greenplum MPP database, whose file system is modified from NFS to HDFS
  - It addresses a barrier to Big Data by giving enterprises the opportunity to extend their existing database environment into a Big Data environment
  - The scalability of Greenplum MPP can limit data capacity

Microsoft’s HDInsight
Brings Hadoop to the Windows Server platform
Choice of deployment on the Windows Azure cloud, a VM, or a server
Availability of a spreadsheet tool (Data Explorer with Excel 2013) for data discovery, transformation and analysis



Apache Hadoop subprojects used in most distributions

Functions                                    Hadoop subprojects
Modeling & Development                       MapReduce, Pig, Mahout
Storage & Data Management                    HDFS, HBase
Data Warehousing & Querying                  Hive, Sqoop
Data Collection, Aggregation and Analysis    Flume
Cluster Mgmt, Job Scheduling, Workflow       ZooKeeper, Oozie, Ambari