Sector Roadmap: Hadoop/Data Warehouse Interoperability

Table of Contents

  1. Summary
  2. Introduction and Methodology
  3. Usage Scenarios
  4. Disruption Vectors
  5. Company Analysis
  6. Key Takeaways
  7. About George Gilbert

1. Summary

SQL-on-Hadoop capabilities played a key role in the big data market in 2013. In 2014, their importance only grew, as did their ubiquity, making possible new use cases for big data. Now, with virtually every Hadoop distribution vendor and incumbent database vendor offering SQL-on-Hadoop solutions, the key factor in the market is no longer mere SQL query capability; it is the quality and economics of the resulting integration between Hadoop and data warehouse technology.

This Sector Roadmap™ examines that integration, reviewing SQL-on-Hadoop solutions on offer from the three major Hadoop vendors: Cloudera, Hortonworks, and MapR; incumbent data warehouse vendor Teradata; relational-database juggernaut Oracle; and Hadoop/data warehouse hybrid vendor Pivotal. The analysis identifies the key usage scenarios these solutions make possible, as well as the architectural distinctions between them.

Vendor solutions are evaluated over six Disruption Vectors: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization, and query engine maturity. These vectors collectively measure not just how well a SQL-on-Hadoop solution can facilitate Hadoop-data warehouse integration, but how successfully it does so with respect to the emerging usage patterns discussed in this report.

Key findings in our analysis include:

  • In addition to the widely discussed data lake, the adjunct data warehouse is a key concept, and one with greater near-term relevance to pragmatist customers.
  • The adjunct data warehouse provides for production ETL, reporting, and BI on the data sources first explored in the data lake. It also offloads production ETL from the core data warehouse in order to avoid costly capacity additions on proprietary platforms at a 10- to 30-times cost premium.
  • MapR fared best in our comparison due to the integration powers of Apache Drill’s technology. It would have fared better still were Drill not in such a relatively early phase of development.
  • Hortonworks, given its enhancements to Apache Hive, and Cloudera, with its dominant Impala SQL-on-Hadoop engine, follow closely behind MapR.
  • Despite their conventional data warehouse pedigrees, Teradata, Pivotal, and Oracle are very much in the game as they make their comprehensive SQL languages available as a query interface over data in Hadoop.

[Figure: company strength across the Disruption Vectors]

Key:

  • Number indicates company’s relative strength across all vectors
  • Size of ball indicates company’s relative strength along individual vector

Source: Gigaom Research

Image courtesy of 3dmentat/iStock.