Hive-metastore on Kubernetes with S3 External Table

This article covers how to set up Hive metastore on Kubernetes and then leverage external S3 data sets.

This blog on hive-metastore originally appeared on Medium. It has been republished with the author’s credit and consent. 

In this post, I set up a Hive metastore on Kubernetes and then create an external table over data sets stored in S3.

Installation

To deploy Hive components on the Kubernetes cluster, I first add the required Helm chart repo:
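The repo name and URL are not shown in this copy of the post, so the following is only a sketch with placeholder values. `REPO_NAME` and `REPO_URL` are assumptions to replace with the repo that actually hosts the Hive charts; `helm repo add` and `helm repo update` are standard Helm 3 commands.

```shell
# Placeholder values (assumptions, not taken from the original post).
REPO_NAME="bigdata-charts"
REPO_URL="https://charts.example.com/bigdata"

# Register the chart repo locally, then refresh the local chart index.
helm repo add "$REPO_NAME" "$REPO_URL"
helm repo update
```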

I can then search the new repo for the available Helm charts. At the time of writing, these are:
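The chart listing itself is not reproduced here. As a hedged sketch, a search scoped to one repo might look like the following, where `REPO_NAME` is a placeholder alias rather than a name from the original post.

```shell
# Placeholder alias for the chart repo (assumption).
REPO_NAME="bigdata-charts"

# List charts whose full name (repo/chart) or keywords match the alias.
helm search repo "$REPO_NAME"
```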

To link Trino to S3 data sets, I deploy a Hive metastore:
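A minimal sketch of such an install, assuming the chart is named `hive-metastore` in the repo. The release name, namespace, and repo alias below are all placeholders, not values confirmed by the original post.

```shell
# Placeholder values (assumptions): release name, namespace, repo alias.
RELEASE="hivems"
NAMESPACE="analytics"
REPO_NAME="bigdata-charts"

# Install the chart into its own namespace, creating the namespace if needed.
helm install "$RELEASE" "$REPO_NAME/hive-metastore" \
  --namespace "$NAMESPACE" --create-namespace
```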

I check that my pods have started correctly within Kubernetes:
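One way to run this check, sketched under the assumption that the release lives in a namespace called `analytics` (a placeholder):

```shell
# Placeholder namespace (assumption).
NAMESPACE="analytics"

# Show pod status; the pods should reach the Running state.
kubectl get pods --namespace "$NAMESPACE"

# Optionally block until every pod in the namespace reports Ready.
kubectl wait --for=condition=Ready pod --all \
  --namespace "$NAMESPACE" --timeout=180s
```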

I then export the generated ConfigMap and append the following properties to the hive-site.xml entry in its data section (replace the values with your own S3 endpoint details):
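The original property list is not shown here. As a rough sketch, the fragment below generates a typical set of S3A properties into a scratch file that can be pasted into hive-site.xml before the closing `</configuration>` tag. The endpoint, keys, and the choice to disable SSL are placeholder assumptions; the property names are standard Hadoop S3A settings.

```shell
# Placeholder connection details (assumptions): use your own values.
S3_ENDPOINT="s3.example.internal"
S3_ACCESS_KEY="REPLACE_WITH_ACCESS_KEY"
S3_SECRET_KEY="REPLACE_WITH_SECRET_KEY"
S3_USE_SSL="false"   # assumes a plain-HTTP endpoint; set to true for HTTPS

# Write the S3A properties to a scratch file for pasting into hive-site.xml.
cat > s3a-properties.xml <<EOF
  <property>
    <name>fs.s3a.endpoint</name>
    <value>${S3_ENDPOINT}</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>${S3_ACCESS_KEY}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>${S3_SECRET_KEY}</value>
  </property>
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>${S3_USE_SSL}</value>
  </property>
EOF
```

The ConfigMap itself can be exported with `kubectl get configmap <name> -n <namespace> -o yaml` and re-applied with `kubectl apply -f <file>` once edited; the ConfigMap name depends on the chart and release.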

Note: The image was missing the hadoop-aws-2.7.4.jar file, which provides the S3A filesystem connector, so I built a new container image that includes it. The image is available from my Docker repo as jboothomas/hive-metastore-s3:v6, and I set it as the image for the hive-metastore deployment.

Table Creation

To create an external table, I exec into the Hive metastore pod and start the Hive CLI:
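A sketch of this step, assuming placeholder pod and namespace names (take the real pod name from `kubectl get pods`):

```shell
# Placeholder values (assumptions): namespace and pod name.
NAMESPACE="analytics"
POD="hivems-hive-metastore-0"

# Open an interactive shell in the metastore pod.
kubectl exec -it "$POD" --namespace "$NAMESPACE" -- /bin/bash

# Then, inside the pod, start the Hive CLI:
#   hive
```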

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true

Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

Now I can create my external table:
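The original statement is not reproduced here, so the following is an illustrative sketch only. The table name, columns, file format, and bucket path are all hypothetical; the point is the `EXTERNAL` keyword and an `s3a://` location, which tell Hive to register existing S3 data without taking ownership of the files.

```shell
# Placeholder bucket and path (assumption).
S3_LOCATION="s3a://my-bucket/path/to/data/"

# Generate the DDL; the table name and columns are illustrative only.
cat > create_table.hql <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS example_events (
  event_id   STRING,
  event_time TIMESTAMP,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '${S3_LOCATION}';
EOF

# Inside the metastore pod, the file could then be run with:
#   hive -f create_table.hql
```

Dropping an external table removes only the metastore entry; the objects in the bucket stay in place.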

From this point on, I can point analytics tools such as Trino at my hive-metastore service and run queries against this table.
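As a hedged example with the Trino CLI, assuming a Trino catalog named `hive` has already been configured with `hive.metastore.uri` pointing at the metastore service (the Thrift port defaults to 9083). The server URL and table name are placeholders.

```shell
# Placeholder values (assumptions): coordinator URL, catalog, schema, table.
TRINO_SERVER="http://trino.example.internal:8080"
QUERY="SELECT count(*) FROM hive.default.example_events"

# Run a single query against the external table and print the result.
trino --server "$TRINO_SERVER" --execute "$QUERY"
```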