Hadoop – Hive, Impala, Zookeeper, and a Data Strategy

Tera-Tom here!

This post will discuss Hive, Impala, and ZooKeeper, along with the best technology to query, load, and manage Hadoop projects and integrate them with traditional systems.

The name Hadoop came from Doug Cutting, the original leader of the open source project. Doug’s son had named his toy elephant Hadoop! Hadoop is an open source implementation of ideas from Google's internal tools (described in the Google File System and MapReduce papers). Much of its early development happened at Yahoo, and it is now a project of the Apache Software Foundation, which provides software products for the public good.

Let’s first discuss the differences between Hive and Impala.

Hive was incubated at Facebook and given to the Apache Foundation. It represents the earliest solution on Hadoop to work with SQL. It is written in Java, and Hive SQL was originally translated under the hood into MapReduce jobs (newer versions can also run on engines such as Tez or Spark). This provides an excellent batch processing solution, but the high latency of launching those jobs makes interactive querying slow.
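
To make "translated into MapReduce" a little more concrete, here is a minimal sketch in plain Python of how a query like SELECT page, COUNT(*) FROM clicks GROUP BY page breaks down into map, shuffle, and reduce steps. The clicks table, its columns, and the function names are all assumptions for illustration. This is a conceptual model, not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows from a "clicks" table: (page, user_id)
clicks = [("home", 101), ("cart", 102), ("home", 103)]

def map_phase(rows):
    """Emit a (key, 1) pair for every row, keyed by the GROUP BY column."""
    return [(page, 1) for page, _user in rows]

def shuffle(pairs):
    """Group emitted values by key, as happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key to produce the COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}

page_counts = reduce_phase(shuffle(map_phase(clicks)))
```

On a real cluster each phase runs as distributed tasks that read and write to disk, which is where the batch-style latency comes from.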

Cloudera produced Impala and has also provided it to the Apache Foundation. It is designed much like an MPP platform and built for speed. Impala is written in C++ and makes heavy use of memory, so queries can run 5-80% faster than Hive.

Who is more likely to use Hive vs. Impala? Data engineers and software developers are more likely to use Hive, whereas data analysts and data scientists are more likely to use Impala.

Hive’s MapReduce is great for batch processing and is often compared to a Mack truck, which is great at moving large amounts of data around (web clicks, large fact tables, etc.). Impala is more like a jet engine, built for speed and low latency!

What is the best model for Hive vs. Impala? Hive is better for ETL or long-running queries. Impala works well with a fact table that joins to many different small dimension tables. Impala works even better with Apache Parquet, which stores table data in a columnar format.
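
Why does a columnar format help? Here is a toy sketch in plain Python, with made-up table and column names. It only illustrates the idea of reading a single column instead of whole rows; it is not how Parquet actually encodes data on disk.

```python
# Hypothetical fact-table rows, stored row by row.
rows = [
    {"sale_id": 1, "store": "A", "amount": 10.0},
    {"sale_id": 2, "store": "B", "amount": 25.5},
    {"sale_id": 3, "store": "A", "amount": 4.5},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: every whole row is visited just to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar layout: only the "amount" column is touched.
total_from_columns = sum(columns["amount"])
```

Both totals are the same, but an analytic query that needs one or two columns out of a very wide fact table has far less data to scan in the columnar layout.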

Many companies take a hybrid approach to Hadoop, using both Hive and Impala together. You see, anytime you create a table in Hive or Impala, its definition is stored in the Hive Metastore and its data lives in the Hadoop Distributed File System (HDFS). So, whether you create a table using Hive or Impala, it is the same table in HDFS. You can create the table once and then switch between querying it via Hive or Impala, depending on the need (a long-running batch query or speed).
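
As a rough sketch of that create-once, query-from-either workflow, the Python below simply assembles the SQL statements you might send to each engine. The table name, columns, and helper function are hypothetical. One assumption worth calling out: Impala caches metastore metadata, so a table created through Hive typically needs an INVALIDATE METADATA statement before Impala can see it, although some deployments sync this automatically.

```python
# Hypothetical table name, for illustration only.
TABLE = "web_clicks"

# Run once, from either engine; the definition lands in the Hive Metastore.
CREATE_TABLE = f"""
CREATE TABLE {TABLE} (
  click_id BIGINT,
  page_url STRING,
  click_day STRING
)
STORED AS PARQUET
"""

def statements_for(engine, created_in):
    """Return the statements to run on `engine` for a table created in `created_in`."""
    statements = []
    if engine == "impala" and created_in == "hive":
        # Assumption: Impala has not yet picked up the new table from the metastore.
        statements.append(f"INVALIDATE METADATA {TABLE}")
    statements.append(f"SELECT COUNT(*) FROM {TABLE}")
    return statements

hive_batch = statements_for("hive", created_in="impala")
impala_fast = statements_for("impala", created_in="hive")
```

The same SELECT works on both sides because both engines read the same table definition and the same files in HDFS.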

ZooKeeper is the Apache project that provides a centralized coordination service for distributed applications, enabling synchronization across large clusters of commodity Hadoop servers. Distributed applications require coordination services, such as a naming service that lets one node find a specific server in a cluster of thousands, or a way to serialize updates.
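
To picture those two services, here is a toy in-memory registry in plain Python. It is only a stand-in for the concepts and is not the ZooKeeper API; the class, method names, paths, and addresses are all made up. It does borrow one real idea: ZooKeeper's sequential nodes get a zero-padded, ever-increasing number appended to their name, which gives updates a single agreed order. (Real ZooKeeper keeps that counter per parent node; this sketch uses one counter for simplicity.)

```python
class ToyRegistry:
    """In-memory stand-in for a coordination service; not the ZooKeeper API."""

    def __init__(self):
        self._nodes = {}    # path -> data
        self._counter = 0   # source of sequence numbers

    def register(self, path, address):
        """Naming service: publish where a server can be reached."""
        self._nodes[path] = address

    def lookup(self, path):
        """Naming service: find a server by its well-known path."""
        return self._nodes[path]

    def create_sequential(self, prefix, data):
        """Serialized updates: append a zero-padded sequence number to the path."""
        path = f"{prefix}{self._counter:010d}"
        self._counter += 1
        self._nodes[path] = data
        return path

    def paths_under(self, parent):
        """Return every stored path below `parent`, in sorted (sequence) order."""
        prefix = parent.rstrip("/") + "/"
        return sorted(p for p in self._nodes if p.startswith(prefix))


registry = ToyRegistry()
registry.register("/services/namenode", "10.0.0.5:8020")
first = registry.create_sequential("/updates/u-", "set x=1")
second = registry.create_sequential("/updates/u-", "set x=2")
```

Any node can call lookup with a well-known path instead of hard-coding an address, and everyone who lists the updates sees them in the same order.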

What is the best tool to query, convert, and move data to and from Hadoop and share results with other users? The Nexus Query Chameleon! It has been used in production by many of the largest corporations in the world.

Nexus has developed four different foundations that build upon one another to provide large enterprises the perfect enterprise data strategy.

Nexus converts table structures and data types between all systems automatically so anyone can move a single table or an entire database between any systems, whether on-premises or in the cloud.
Nexus shows tables/views visually and their relationships with other tables and builds the SQL automatically as users point-and-click. Since Nexus can also automatically convert and move tables between systems, users can perform automatic cross-system joins between any combination of systems, including Hadoop. Nexus even allows the user to choose on which system they want the cross-system joins processed.
Nexus takes every returned answer set and places it in the Garden of Analysis, where a user can join answer sets or re-query them with point-and-click templates to get analytics, graphs, charts, and additional reports that are processed inside the user’s PC. This allows a user to query a system once and then generate up to 50 additional reports with sub-second response time because all of the processing is done on the user’s PC.
Nexus has BizStar, which allows users to receive reports, Excel spreadsheets, Word documents, videos, and unstructured data so data can be shared. One person can run a query and thousands can see the results! BizStar also has a series of menus that allow users to run queries with a single click of a button. BizStar even has the Multi-Step Process Builder so users can perform a wide variety of repetitive tasks on data and automate the entire process from start to finish. This is the 8th wonder of the world!

It is the combination of these four foundations that provides the world’s largest financial institution, the world’s most successful PC maker, and the most prominent telecommunications company with the most sophisticated data strategy.

Imagine doing a cross-system join between any combination of on-premises or cloud computing systems made up of Teradata, Aster Data, Oracle, SQL Server, DB2, Amazon Redshift, Azure SQL Data Warehouse, Hive, Impala, and even Excel. Imagine doing it in a single query that has been built automatically by merely pointing at the tables needed and selecting the columns desired for the report.

Now imagine taking that answer set and using the Garden of Analysis to generate another 50 reports in minutes and then sharing those reports with hundreds or even thousands of team members in their BizStar menus.

Now imagine setting up this entire process in the BizStar Multi-Step Process Builder so it is automated and can be scheduled to run.

Our advice: Integrate your on-premises traditional systems with your cloud strategy and combine Hadoop among them all. Automate as much as possible and process additional analytics, graphs, charts, and reports on the user’s PC when it makes sense. And share information across entire teams so everyone can work as one cohesive unit. That is 22nd-century processing in the 21st century!
