Tuesday, July 2, 2013

Big Data space flooded with Hadoop offerings

The Big Data space is flooded with offerings built around Hadoop, the open source framework. Developers at Yahoo who were heavily involved in the early development of Hadoop started a spinoff called Hortonworks and began promoting the framework by providing data and operational services to users and vendors. Competition has since become intense, with Cloudera, IBM, MapR, EMC and Amazon all active in the space.

Here is a look at the various distributions:


Proprietary
Cloudera Distribution for Hadoop (CDH)
Amazon’s Elastic MapReduce (EMR) on EC2
MapR M3, M5 and M7
EMC’s Greenplum HD

Open source
Hortonworks
Pentaho

Latest entrants
Pivotal HD from EMC
Intel Distribution for Hadoop
Microsoft’s HDInsight distribution of Hadoop for Windows

Differentiating points of each distribution


IBM InfoSphere BigInsights
Deepest Hadoop platform and application portfolio
Powerful and very fast unstructured analytics engine
Built-in, browser-based spreadsheet tool called BigSheets
Adaptive real-time analytics enabled through integration with InfoSphere Streams
App store with a number of reusable jobs and examples
GPFS file placement optimizer
Accelerators – Social Data and Machine Data Analytics
Software bundles (IBM InfoSphere Streams, IBM InfoSphere Data Explorer, and Cognos Business Intelligence)

CDH
Hadoop pure play with the greatest adoption due to early entry
Recently introduced the Cloudera Development Kit (CDK), a collection of libraries, tools and examples
Recently introduced Impala, an open source interactive SQL query engine for analyzing data stored in a Hadoop cluster
Oracle has adopted it as the distribution of choice in its Big Data Appliance

Hortonworks Data Platform
Yahoo spinoff trying to promote Hadoop by providing data and operational services to users and vendors
Scalable to meet custom demands
100% open source and free, with no proprietary license
Widest range of deployment options: Linux, Windows, and cloud

MapR
Strong OEM business for its Hadoop Distribution
Provides a NoSQL solution alongside Hadoop in its latest release, M7
Amazon’s Elastic MapReduce offers the MapR distribution as an option
Greenplum HD Enterprise Edition has used the MapR distribution for Hadoop so far. The picture may change after EMC’s announcement of its own distribution, Pivotal HD

Amazon’s EMR on EC2
Most prominent Hadoop cloud service provider
Pricing is based on usage, so costs can be kept minimal
Easy to set up, with an enormous amount of documentation

Intel’s Hadoop distribution
New in Market – April 2013
Allows analytics on encrypted data
Tweaks Hadoop to take advantage of Intel’s own hardware: Xeon components optimized for high-performance I/O and storage using solid-state drives and 10Gb Ethernet

EMC’s Pivotal HD
New in Market – April 2013
Takes the radical approach of changing the underlying file system of the Greenplum RDBMS to HDFS, which means:
  - Hadoop operations can be performed using native SQL queries on the Greenplum MPP database, whose file system is changed from NFS to HDFS
  - Addresses a barrier to Big Data adoption by giving enterprises the opportunity to extend their existing database environment into a Big Data environment
  - Scalability of the Greenplum MPP database can limit data capacity

Microsoft’s HDInsight
Brings Hadoop to the Windows Server platform
Choice of deployment options: Windows Azure cloud, a VM, or Windows Server
Spreadsheet tool (Data Explorer with Excel 2013) available for data discovery, transformation and analysis



Apache Hadoop subprojects used in most distributions

Functions                                     Hadoop subprojects
Modeling & Development                        MapReduce, Pig, Mahout
Storage & Data Management                     HDFS, HBase
Data Warehousing & Querying                   Hive, Sqoop
Data collection, aggregation and analysis     Flume
Cluster management, job scheduling, workflow  ZooKeeper, Oozie, Ambari
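
For readers new to the MapReduce model listed under Modeling & Development, here is a rough sketch of the idea in plain Python. It does not use Hadoop or any distribution's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data in HDFS", "Hive queries data in Hadoop"]))
```

In a real cluster the map and reduce steps run in parallel on many machines over data held in HDFS; the sketch only shows the shape of the computation.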
