Wednesday, March 21, 2012

Busting 10 Myths About Hadoop, by Philip Russom (TDWI)

Fact #1. Hadoop consists of multiple products.
We talk about Hadoop as if it’s one monolithic thing, whereas it’s actually a family of open-source products and technologies overseen by the Apache Software Foundation (ASF). (Some Hadoop products are also available via vendor distributions; more on that later.)
The Apache Hadoop library includes (in BI priority order): the Hadoop Distributed File System (HDFS), MapReduce, Hive, HBase, Pig, ZooKeeper, Flume, Sqoop, Oozie, Hue, and so on. You can combine these in various ways, but HDFS and MapReduce (perhaps with HBase and Hive) constitute a useful technology stack for applications in BI, DW, and analytics.
Fact #2. Hadoop is open source but available from vendors, too.
Apache Hadoop’s open-source software library is available from ASF at http://www.apache.org. For users desiring a more enterprise-ready package, a few vendors now offer Hadoop distributions that include additional administrative tools and technical support.
Fact #3. Hadoop is an ecosystem, not a single product.
In addition to products from Apache, the extended Hadoop ecosystem includes a growing list of vendor products that integrate with or expand Hadoop technologies. One minute on your favorite search engine will reveal these.
Fact #4. HDFS is a file system, not a database management system (DBMS).
HDFS is a distributed file system and lacks capabilities we'd associate with a DBMS, such as indexing, random access to data, and support for SQL. That's okay, because HDFS does things a DBMS cannot do.
Fact #5. Hive resembles SQL but is not standard SQL.
Many of us are handcuffed to SQL because we know it well and our tools demand it. People who know SQL can quickly learn to hand-code Hive, but that doesn’t solve compatibility issues with SQL-based tools. TDWI feels that over time, Hadoop products will support standard SQL, so this issue will soon be moot.
Fact #6. Hadoop and MapReduce are related but don’t require each other.
Google developed MapReduce before HDFS existed, and variations of MapReduce work with a variety of storage technologies, including HDFS, other file systems, and some DBMSs.
Fact #7. MapReduce provides control for analytics, not analytics per se.
MapReduce is a general-purpose execution engine that handles the complexities of network communication, parallel programming, and fault-tolerance for any kind of application that you can hand-code – not just analytics.
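To make the hand-coding point concrete, here is a minimal sketch of the map and reduce phases that MapReduce coordinates, written as plain Python rather than an actual Hadoop job. Word counting is the canonical example; in a real cluster, the framework would run these functions in parallel across HDFS blocks and handle the shuffle, network communication, and fault tolerance for you.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each key (the framework normally
    # groups pairs by key during the shuffle/sort step)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Hadoop stores data", "MapReduce processes data"]
result = reduce_phase(map_phase(lines))
print(result["data"])  # -> 2
```

Note that nothing in the map or reduce functions is analytic per se; the analytics live entirely in the code you supply, which is exactly the point of Fact #7.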
Fact #8. Hadoop is about data diversity, not just data volume.
Theoretically, HDFS can manage the storage and access of any data type as long as you can put the data in a file and copy that file into HDFS. As outrageously simplistic as that sounds, it’s largely true, and it’s exactly what brings many users to Apache HDFS.
Fact #9. Hadoop complements a DW; it’s rarely a replacement.
Most organizations have designed their DW for structured, relational data, which makes it difficult to wring BI value from unstructured and semistructured data. Hadoop promises to complement DWs by handling the multi-structured data types most DWs can’t.
Fact #10. Hadoop enables many types of analytics, not just Web analytics.
Hadoop gets a lot of press about how Internet companies use it for analyzing Web logs and other Web data. But other use cases exist. For example, consider the big data coming from sensory devices, such as robotics in manufacturing, RFID in retail, or grid monitoring in utilities. Older analytic applications that need large data samples -- such as customer base segmentation, fraud detection, and risk analysis -- can benefit from the additional big data managed by Hadoop. Likewise, Hadoop’s additional data can expand 360-degree views to create a more complete and granular view.

Tuesday, March 6, 2012

The Forrester Wave™ Enterprise ETL, Q1 2012



Data movement is critical in any organization to support data management initiatives, such as data warehousing (DW), business intelligence (BI), application migrations and upgrades, master data management (MDM), and other initiatives that focus on data integration. Besides moving data, ETL supports complex transformations like cleansing, reformatting, aggregating, and converting very large volumes of data from many sources. In a mature integration architecture, ETL complements change data capture (CDC) and data replication technologies to support real-time data requirements and combines with application integration tools to support messaging and transactional integration. Although ETL is still used extensively to support traditional scheduled batch data feeds into DW and BI environments, the scope of ETL has evolved over the past five years to support new and emerging data management initiatives, including:


- Data virtualization

- Cloud integration

- Big Data

- Real-time data warehousing

- Data migration and application retirement

- Master data management
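The extract, transform, and load steps described above can be sketched in a few lines of Python. This is an illustrative toy, not how an enterprise ETL tool works internally: the source data, column names, and the in-memory SQLite target are all invented for the example.

```python
import csv, io, sqlite3

# Hypothetical source feed; a real ETL job would pull from many systems.
raw = "region,amount\nEMEA, 100\nemea,250\nAPAC,75\n"

def extract(text):
    # Extract: parse the raw feed into rows
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: cleanse (trim whitespace, normalize case) and aggregate
    totals = {}
    for r in rows:
        region = r["region"].strip().upper()
        totals[region] = totals.get(region, 0) + int(r["amount"].strip())
    return totals

def load(totals, db):
    # Load: write the conformed result into the target warehouse table
    db.execute("CREATE TABLE sales (region TEXT PRIMARY KEY, total INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", totals.items())

db = sqlite3.connect(":memory:")
load(transform(extract(raw)), db)
print(db.execute("SELECT total FROM sales WHERE region='EMEA'").fetchone()[0])
```

The cleansing step (merging "EMEA" and "emea" into one region) is a miniature version of the reformatting and conversion work the tools in the Wave perform at scale.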
IBM (DataStage), Informatica, Oracle (Oracle Data Integrator - ODI and Oracle Warehouse Builder - OWB), SAP (BusinessObjects Data Integrator - BODI), SAS, Ab Initio, and Talend lead, with Pervasive, Microsoft (SSIS), and iWay close behind in The Forrester Wave™: Enterprise ETL, Q1 2012.

Real-Time, Near Real-Time & Right-Time Business Intelligence

It is important for an enterprise to have data, reports, alerts, predictions, and information at the right time, rather than struggling with real-time and near-real-time data acquisition and integration. Everyone boasts about real time and near real time, but what is actually being realized? Are we using the data at the right time?
Real-time business intelligence (RTBI) is the process of delivering information about business operations as they occur.
The speed of today's processing systems has moved classical data warehousing into the realm of real-time. The result is real-time business intelligence. Business transactions, as they occur, are fed to a real-time business intelligence system that maintains the current state of the enterprise. The RTBI system not only supports the classic strategic functions of data warehousing for deriving information and knowledge from past enterprise activity, but it also provides real-time tactical support to drive enterprise actions that react immediately to events as they occur. As such, it replaces both the classic data warehouse and the enterprise application integration (EAI) functions. Such event-driven processing is a basic tenet of real-time business intelligence.
All real-time business intelligence systems have some latency, but the goal is to minimize the time from the business event happening to a corrective action or notification being initiated. Analyst Richard Hackathorn describes three types of latency:
1. Data latency: the time taken to collect and store the data
2. Analysis latency: the time taken to analyze the data and turn it into actionable information
3. Action latency: the time taken to react to the information and take action
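The three latencies above are additive: the total delay from a business event to the corrective action is their sum. A simple sketch, using illustrative figures that are assumed for the example and not taken from Hackathorn:

```python
# Illustrative latency figures in seconds (assumed, not from the article)
data_latency = 5.0      # collect and store the event data
analysis_latency = 2.0  # turn the data into actionable information
action_latency = 3.0    # react to the information and take action

# Total delay: event occurrence -> corrective action initiated.
# Real-time BI tries to push all three components toward zero;
# traditional BI attacks only the first.
action_time = data_latency + analysis_latency + action_latency
print(action_time)  # -> 10.0
```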
Real-time business intelligence technologies are designed to reduce all three latencies to as close to zero as possible, whereas traditional business intelligence only seeks to reduce data latency and does not address analysis latency or action latency since both are governed by manual processes.
The term "near real-time" or "nearly real-time" (NRT), in telecommunications and computing, refers to the time delay introduced, by automated data processing or network transmission, between the occurrence of an event and the use of the processed data, such as for display or feedback and control purposes. For example, a near-real-time display depicts an event or situation as it existed at the current time minus the processing time, so it is close to the time of the live event.
The distinction between the terms "near real time" and "real time" is somewhat nebulous and must be defined for the situation at hand. The term implies that there are no significant delays. In many cases, processing described as "real-time" would be more accurately described as "near-real-time".
Change data capture (CDC) tools are booming as a way to address real-time and near-real-time data integration. As data grows rapidly, to address business needs we must understand the case for right-time data and information.