Thursday, January 26, 2012

Row-Based Vs Columnar Vs NoSQL

There are various database players in the market. Here is a quick comparison of Row-Based vs. Columnar vs. NoSQL.
Row-Based
Description: Data is structured and stored in rows.
Common Use Case: Used in transaction processing, interactive transaction applications.
Strength: Robust, proven technology to capture intermediate transactions.
Weakness: Scalability and query processing time for huge data.
Size of DB: Several GB to TB.
Key Players: Sybase, Oracle, MySQL, DB2
Columnar
Description: Data is vertically partitioned and stored in columns.
Common Use Case: Historical data analysis, data warehousing, and business intelligence.
Strength: Faster queries (especially ad-hoc queries) on large data sets.
Weakness: Not suitable for transaction processing; slow import/export speed and heavy computing resource utilization.
Size of DB: Several GB to 50 TB.
Key Players: Infobright, Aster Data, Vertica, Sybase IQ, ParAccel
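The row-vs.-column trade-off above can be sketched in a few lines of Python. This is a toy illustration of the storage layouts, not any particular product: an analytical query that touches one column scans far less data in the columnar layout, while the row layout keeps each record together for transactional access.

```python
# The same table stored row-wise and column-wise.
rows = [
    {"id": 1, "name": "alice", "amount": 120},
    {"id": 2, "name": "bob",   "amount": 80},
    {"id": 3, "name": "carol", "amount": 200},
]

# Columnar layout: one contiguous list per column.
columns = {
    "id":     [r["id"] for r in rows],
    "name":   [r["name"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

# "SELECT SUM(amount)" touches only one column in the columnar layout...
total = sum(columns["amount"])

# ...whereas the row layout forces a walk over every full record.
total_rowwise = sum(r["amount"] for r in rows)

print(total, total_rowwise)  # 400 400
```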
NoSQL - Key-Value Store
Description: Data stored in memory with some persistent backup.
Common Use Case: Used in cache for storing frequently requested data in applications.
Strength: Scalable, faster retrieval of data; supports unstructured and partially structured data.
Weakness: All data must fit in memory; does not support complex queries.
Size of DB: Several GBs to several TBs.
Key Players: Amazon S3, Memcached, Redis, Voldemort
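The caching use case above follows a simple pattern: store a value under a key with a time-to-live, and let stale entries expire. A minimal in-memory sketch of that pattern (illustrative only; not the actual Memcached or Redis API, and the `KVCache` name and keys are made up):

```python
import time

class KVCache:
    """Toy key-value cache with TTL-based lazy expiry."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if time.time() > expires_at:  # lazily evict stale entries on read
            del self._store[key]
            return None
        return value

cache = KVCache()
cache.set("user:42", {"name": "alice"})
print(cache.get("user:42"))   # {'name': 'alice'}
print(cache.get("user:99"))   # None (never set)
```

Real key-value stores add distribution, persistence, and eviction policies on top of this core get/set-by-key idea.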
NoSQL - Document Store
Description: Persistent storage of unstructured or semi-structured data, along with some SQL-like querying functionality.
Common Use Case: Web applications, or any application that needs better performance and scalability without defining columns in an RDBMS.
Strength: Persistent store with scalability and better query support than key-value store.
Weakness: Query capabilities still less sophisticated than SQL in an RDBMS.
Size of DB: Several TBs to PBs.
Key Players: MongoDB, CouchDB, SimpleDB
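The core document-store idea is schema-less records queried by matching fields. A minimal sketch of that idea in Python (illustrative only; this mimics the flavor of a MongoDB-style field-match query but is not MongoDB's actual API, and the sample documents are invented):

```python
# A "collection" of schema-less documents: each dict can carry
# different fields with no table definition up front.
docs = [
    {"_id": 1, "type": "order", "customer": "alice", "total": 120},
    {"_id": 2, "type": "order", "customer": "bob", "total": 80, "rush": True},
    {"_id": 3, "type": "refund", "customer": "alice", "total": -40},
]

def find(collection, query):
    """Return documents whose fields match every key/value in `query`."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in query.items())]

print(find(docs, {"type": "order", "customer": "alice"}))
# [{'_id': 1, 'type': 'order', 'customer': 'alice', 'total': 120}]
```

Note that document 2 carries a `rush` field the others lack; nothing breaks, which is exactly the flexibility the description above refers to.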
NoSQL - Column Store
Description: Very large, distributed data store with support for Map-Reduce processing.
Common Use Case: Real time data logging in Finance and web analytics.
Strength: Very high throughput for Big Data, Strong Partitioning Support, random read-write access.
Weakness: Complex queries, limited availability of APIs, response time.
Size of DB: Several TBs to PBs.
Key Players: HBase, Bigtable, Cassandra
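The Map-Reduce processing mentioned for these stores boils down to three phases: a map step emits (key, value) pairs, a shuffle step groups values by key, and a reduce step aggregates each group. A single-machine sketch of that contract (the web-log example is invented for illustration; real frameworks distribute each phase across a cluster):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map: emit (key, value) pairs
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(values)            # reduce: aggregate each group
            for key, values in groups.items()}

# Example: count hits per URL path in a toy web-analytics log.
log_lines = ["GET /home", "GET /cart", "POST /cart", "GET /home"]
hits = map_reduce(
    log_lines,
    mapper=lambda line: [(line.split()[1], 1)],  # emit (path, 1) per hit
    reducer=sum,
)
print(hits)  # {'/home': 2, '/cart': 2}
```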

Wednesday, January 18, 2012

Cloud BI: The Reality (from Wayne's blog)

Three Types of Cloud Services with BI Examples

Software-as-a-Service (SaaS). SaaS delivers packaged applications tailored to specific workflows and users. SaaS was first popularized by Salesforce.com, which was founded in 1999 to deliver online sales applications to small- and medium-sized businesses. Salesforce.com now has 92,000 customers of all sizes and has spawned a multitude of imitators. A big benefit of SaaS is that it obviates the need for customers to maintain and upgrade application code and infrastructure. Many SaaS customers are astonished to see new software features automatically appear in their application without notice or additional expense.
Within the BI market, many startups and established BI players offer SaaS BI services that deliver ready-made reports and dashboards for specific commercial applications, such as Salesforce, NetSuite, Microsoft Dynamics, and others. SaaS BI vendors include Birst, PivotLink, GoodData, Indicee, Rosslyn Analytics, and SAP, among others.
Platform-as-a-Service (PaaS). PaaS enables developers to build applications online. PaaS services provide development environments, such as programming languages and databases, so developers can create and deliver applications without having to purchase and install hardware. In the BI market, the SaaS BI vendors (above) for the most part double as PaaS BI vendors.
In a PaaS environment, a developer must first build a data mart, which is often tedious and highly customized work, since it involves integrating data from multiple sources, cleaning and standardizing the data, and finally modeling and transforming the data. Although SaaS BI applications deploy quickly, PaaS BI applications do not. Basically, SaaS BI applications are packaged, while PaaS BI applications are custom. In the world of BI, most applications are custom. This is the primary reason why growth of Cloud BI in general is slower than anticipated.
Infrastructure-as-a-Service (IaaS). IaaS provides online computing resources (servers, storage, and networking) which customers use to augment or replace their existing compute resources. In 2006, Amazon popularized IaaS when it began renting space in its own data centers to outside parties using virtualization services. Some BI vendors are beginning to offer software infrastructure within public cloud or hosted environments. For example, analytic databases Vertica and Teradata are now available as public services within Amazon EC2, while Kognitio offers a private hosted service. ETL vendors Informatica and SnapLogic also offer services in the cloud.
Key Characteristics of the Cloud
Virtualization is the foundation of cloud computing. You can’t do cloud computing without virtualization; but virtualization by itself doesn’t constitute cloud computing.
Virtualization abstracts, or virtualizes, the underlying compute infrastructure using a piece of software called a hypervisor. With virtualization, you create virtual servers (or virtual machines) to run your applications. Your virtual server can have a different operating system than the physical hardware upon which it is running. For the most part, users no longer have to worry whether they have the right operating system, hardware, and networking to support a BI or other application. Virtualization shields users and developers from the underlying complexity of the compute infrastructure (as long as the IT department has created appropriate virtual machines for them to use).
Deployment Options for Cloud Computing

Public Cloud. Application and compute resources are managed by a third party services provider.
Private Cloud. Application and compute resources are managed by an internal data center team.
Hybrid Cloud. Either a private cloud that leverages the public cloud to handle peak capacity, or a reserved “private” space within a public cloud, or a hybrid architecture in which some components run in a data center and others in the public cloud.

Monday, January 9, 2012

Types of Analytical Platforms

MPP analytical databases : Row-based databases designed to scale out on a cluster of commodity servers and run complex queries in parallel against large volumes of data.
[Teradata Active Data Warehouse, Greenplum (EMC), Microsoft Parallel Data Warehouse, Aster Data (Teradata), Kognitio, Dataupia]
Columnar databases : Database management systems that store data in columns, not rows, and support high data compression ratios.
[ParAccel, Infobright, Sand Technology, Sybase IQ (SAP), Vertica (Hewlett-Packard), 1010data, Exasol, Calpont]
Analytical appliances : Preconfigured hardware-software systems designed for query processing and analytics that require little tuning.
[Netezza (IBM), Teradata Appliances, Oracle Exadata, Greenplum Data Computing Appliance (EMC)]
Analytical bundles : Predefined hardware and software configurations that are certified to meet specific performance criteria, but that the customer must purchase and configure themselves.
[IBM SmartAnalytics, Microsoft FastTrack]
In-memory databases : Systems that load data into memory to execute complex queries.
[SAP HANA, Cognos TM1 (IBM), QlikView, Membase]
Distributed file-based systems : Distributed file systems designed for storing, indexing, manipulating and querying large volumes of unstructured and semi-structured data.
[Hadoop (Apache, Cloudera, MapR, IBM, HortonWorks), Apache Hive, Apache Pig]
Analytical services : Analytical platforms delivered as a hosted or public-cloud-based service.
[1010data, Kognitio]
Nonrelational : Nonrelational databases optimized for querying unstructured data as well as structured data.
[MarkLogic Server, MongoDB, Splunk, Attivio, Endeca, Apache Cassandra, Apache HBase]
CEP/streaming engines: Ingest, filter, calculate, and correlate large volumes of discrete events and apply rules that trigger alerts when conditions are met.
[IBM, Tibco, StreamBase, Sybase (Aleri), Opalma, Vitria, Informatica]
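One reason the columnar platforms in this list achieve the high compression ratios mentioned above is that a single column holds homogeneous, often repetitive values, which compress extremely well. Run-length encoding is the classic example; a minimal sketch (the `status` column is invented for illustration):

```python
def rle_encode(column):
    """Compress a column into (value, run_length) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1] = (value, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((value, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]

status = ["OK", "OK", "OK", "OK", "FAIL", "OK", "OK"]
encoded = rle_encode(status)
print(encoded)  # [('OK', 4), ('FAIL', 1), ('OK', 2)]
assert rle_decode(encoded) == status
```

In a row-based layout the same values are interleaved with other columns' data, so runs like this rarely form; storing columns contiguously is what makes the technique pay off.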

Why do many BI implementations across organizations fail?

# Failure to tie the Business Intelligence Strategy with the Enterprise vision & Strategy.
# Failure to have a flexible BI architecture that aligns with the enterprise IT architecture.
# Failure to have an enterprise vision of data quality, master data management, metadata management, portals, etc.
# Poor business adoption.
# Failure to integrate the various BI tools seamlessly.
# The overall business pattern as well as the technology is changing. (For example, envisioning DW 2.0 as well as the business need while architecting.)
# Calculating and realizing the ROI for any BI implementation takes longer as compared to other programs.
# Failure due to projects being driven by IT instead of the business.
# Failure due to business not being involved heavily right from the beginning.
# Lack of good understanding of the BI and Data Warehousing fundamentals and concepts.
# Failure by deviating from the objective and focus during the phases of BI project.
# Lack of BI skills in the implementation team.
# Failure to have a clear working process between the Infrastructure, ETL & Reporting teams.
# Lack of continuous support & direction from the leadership team.
# Lack of a clear and detailed BI strategy and road map.
# Failure to gather complete requirements and do system study.
# Lack of a clear understanding of the end goal.
# Lack of an end-to-end understanding of the initiative, with the source systems team, infrastructure team, ETL team, reporting & analytics team and the business all functioning in silos.
# Lack of continuous training for the implementation team and the business.
# Lack of a clear support structure.
# Lack of sufficient budget, which delays upgrades and feature enhancements.
# Implementation of BI plans, direction and work structure by a PMO or Business team without the involvement or consultation of the BI team.
# Lack of clear testing by the UAT team.
# An incorrect and inflexible data model.
# Communication and coordination issues between different teams and the business stakeholders.
# Lack of a proper change management process.
# Failure to control scope creep and break the deliverable into phases; requirements sometimes never end, as BI is iterative and the business expects constant delivery of value.
# Failure to include performance criteria as part of the BI and Analytics deliverable.
# Failure to determine and include SLAs for each piece of the BI.

Thursday, January 5, 2012


There has been a lot of talk about “big data” in the past year, which I find a bit puzzling. I’ve been in the data warehousing field for more than 15 years, and data warehousing has always been about big data. So what’s new in 2011? Why are we talking about “big data” today? There are several reasons:

Changing data types. Organizations are capturing different types of data today. Until about five years ago, most data was transactional in nature, consisting of numeric data that fit easily into rows and columns of relational databases. Today, the growth in data is fueled largely by unstructured data from Web sites as well as machine-generated data from an exploding number of sensors.
Technology advances. Hardware has finally caught up with software. The exponential gains in price/performance exhibited by computer processors, memory, and disk storage have finally made it possible to store and analyze large volumes of data at an affordable price. Organizations are storing and analyzing more data because they can!
Insourcing and outsourcing. Because of the complexity and cost of storing and analyzing Web traffic data, most organizations have outsourced these functions to third-party service bureaus. But as the size and importance of corporate e-commerce channels have increased, many are now eager to insource this data to gain greater insights about customers. At the same time, virtualization technology is making it attractive for organizations to move large-scale data processing to private hosted networks or public clouds.
Developers discover data. The biggest reason for the popularity of the term “big data” is that Web and application developers have discovered the value of building new data-intensive applications. To application developers, “big data” is new and exciting. Of course, for those of us who have made our careers in the data world, the new era of “big data” is simply another step in the evolution of data management systems that support reporting and analysis applications.

Analytics against Big Data
Big data by itself, regardless of the type, is worthless unless business users do something with it that delivers value to their organizations. That’s where analytics comes in. Although organizations have always run reports against data warehouses, most haven’t opened these repositories to ad hoc exploration. This is partly because analysis tools are too complex for the average user but also because the repositories often don’t contain all the data needed by the power user. But this is changing.
• Patterns. A valuable characteristic of “big data” is that it contains more patterns and interesting anomalies than “small” data. Thus, organizations can gain greater value by mining large data volumes than small ones. Fortunately, techniques already exist to mine big data thanks to companies, such as SAS Institute and SPSS (now part of IBM), that ship analytical workbenches.
• Real-time. Organizations that accumulate big data quickly recognize that they need to change the way they capture, transform, and move data: from a nightly batch process to a continuous process using micro-batch loads or event-driven updates. This technical constraint pays big business dividends because it makes it possible to deliver critical information to users in near-real-time.
• Complex analytics. In addition, during the past 15 years, the “analytical IQ” of many organizations has evolved from reporting and dashboarding to lightweight analysis. Many are now on the verge of upping their analytical IQ by implementing predictive analytics against both structured and unstructured data. This type of analytics can be used to do everything from delivering highly tailored cross-sell recommendations to predicting failure rates of aircraft engines.
• Sustainable advantage. At the same time, executives have recognized the power of analytics to deliver a competitive advantage, thanks to the pioneering work of thought leaders, such as Tom Davenport, who co-wrote the book, “Competing on Analytics.” In fact, forward-thinking executives recognize that analytics may be the only true source of sustainable advantage since it empowers employees at all levels of an organization with information to help them make smarter decisions.
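The micro-batch loading described in the "Real-time" point above is a simple pattern: buffer incoming events and flush a small batch whenever either a row-count threshold or a time window is reached. A minimal sketch of that pattern (all names and thresholds here are illustrative, not from any specific ETL product):

```python
import time

class MicroBatchLoader:
    """Buffer events and flush small batches by size or elapsed time."""

    def __init__(self, flush, max_rows=1000, max_seconds=5.0):
        self.flush = flush          # callback that loads one batch downstream
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.buffer = []
        self.last_flush = time.time()

    def add(self, row):
        self.buffer.append(row)
        if (len(self.buffer) >= self.max_rows
                or time.time() - self.last_flush >= self.max_seconds):
            self.drain()

    def drain(self):
        """Flush whatever is buffered and reset the time window."""
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
        self.last_flush = time.time()

# Usage: collect 7 events with a batch size of 3.
batches = []
loader = MicroBatchLoader(batches.append, max_rows=3)
for i in range(7):
    loader.add({"event": i})
loader.drain()  # final drain at shutdown
print([len(b) for b in batches])  # [3, 3, 1]
```

Compared with a nightly bulk load, the same total volume moves downstream, but the warehouse is never more than one small batch (or one time window) behind the source.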
For more refer: