This blog on Hive metastore originally appeared on Medium. It has been republished with the author's credit and consent.
In this blog, I’ll cover how to set up Hive metastore on Kubernetes and then leverage external S3 data sets.
Installation
In order to deploy the Hive components on the Kubernetes cluster, I first add the required Helm chart repo:
$ helm repo add bigdata-gradiant https://gradiant.github.io/bigdata-charts/
$ helm repo update
I can then search the new repo for the available helm charts. At the time of this writing, these are:
$ helm search repo bigdata-gradiant
NAME                                CHART VERSION   APP VERSION   DESCRIPTION
bigdata-gradiant/hbase              0.1.6                         HBase is an open-source non-relational distribu...
bigdata-gradiant/hdfs               0.1.10                        The Apache Hadoop software library is a framewo...
bigdata-gradiant/hive               0.1.6           2.3.6         The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/hive-metastore     0.1.3           2.3.6         The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/jupyter            0.1.11          6.0.3         Helm for jupyter single server with pyspark sup...
bigdata-gradiant/kafka-connect-ui   0.1.0           0.9.7         Helm for Landoop/kafka-connect-ui
bigdata-gradiant/opentsdb           0.1.7           2.4.0         Store and serve massive amounts of time series ...
bigdata-gradiant/spark-standalone   0.1.0           2.4.4         Apache Spark™ is a unified analytics engine for...
To link Trino to some S3 data sets, I’ll be deploying a Hive metastore:
$ helm install hivems bigdata-gradiant/hive-metastore -n analytics
NAME: hivems
LAST DEPLOYED: Thu Aug 10 10:21:47 2023
NAMESPACE: analytics
STATUS: deployed
REVISION: 1
TEST SUITE: None
I check that my pods have started correctly within Kubernetes:
$ kubectl -n analytics get pods
NAME                      READY   STATUS    RESTARTS   AGE
hivems-hive-metastore-0   1/1     Running   0          42m
hivems-postgresql-0       1/1     Running   ...        ...
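The chart also creates a service in front of the metastore, which is what Trino will connect to later, so it is worth noting its name and the standard metastore Thrift port (9083). A quick check (the service name below is what I would expect from this release name; verify against your own output):

$ kubectl -n analytics get svc
# expect a hivems-hive-metastore ClusterIP service exposing 9083/TCP,
# the standard Hive metastore Thrift port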
I then copy out the created configmap and append the following properties to its hive-site.xml data section (replace the values with your own S3 endpoint and credentials):
$ kubectl -n analytics get configmap hivems-hive-metastore -o yaml > hivems-hive-metastore.yaml
$ vi hivems-hive-metastore.yaml

### I MAKE THE FOLLOWING CHANGE/ADDITION ###
data:
  hive-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>192.168.2.2</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>PSFB....JEIA</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>A121....eJOEN</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
      </property>
    </configuration>
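For the metastore to pick up these settings, the edited configmap has to be applied back to the cluster and the pod restarted so it re-reads hive-site.xml. A minimal sketch using standard kubectl commands and the resource names from above:

$ kubectl -n analytics apply -f hivems-hive-metastore.yaml
# deleting the pod forces the StatefulSet to recreate it with the new config
$ kubectl -n analytics delete pod hivems-hive-metastore-0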
Note: I had to create a new container image that includes the missing hadoop-aws-2.7.4.jar file. It is available from my Docker repo as jboothomas/hive-metastore-s3:v6; I simply set this as the image for the hive-metastore deployment.
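For reference, such an image can be built with a Dockerfile along these lines. This is only a sketch, not the exact recipe behind jboothomas/hive-metastore-s3:v6: the base image name is an assumption (use whatever image your chart deploys by default), while the jar path matches the /opt/hadoop-2.7.4 layout visible in the logs further down.

# Sketch only: the FROM line is an assumption, substitute your chart's default image
FROM gradiant/hive-metastore:2.3.6
# hadoop-aws provides the s3a:// filesystem implementation; 2.7.4 matches the
# bundled Hadoop version (jar pulled from Maven Central). Depending on the base
# image, the matching aws-java-sdk jar may be needed alongside it.
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar \
    /opt/hadoop-2.7.4/share/hadoop/common/lib/hadoop-aws-2.7.4.jar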
Table Creation
To create an external table, I exec into the hive metastore pod and connect to hive:
$ kubectl -n analytics exec -it hivems-hive-metastore-0 -- /bin/sh
# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive>
Now I can create my external table:
create external table if not exists nyctaxi(
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    RatecodeID double,
    store_and_fwd_flag string,
    PULocationID bigint,
    DOLocationID bigint,
    payment_type bigint,
    fare_amount double,
    extra double,
    mta_tax double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double
)
STORED AS PARQUET
LOCATION 's3a://nyctaxi/';
Time taken: 2.756 seconds
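As a quick sanity check (my own verification step, not part of the original session), the table definition and its S3 location can be inspected from the same prompt:

hive> describe formatted nyctaxi;
hive> select * from nyctaxi limit 5;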
From this point onwards, I can point various analytics tools (for example, Trino) at my hive-metastore service and run queries against this table.
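As a sketch of that last step, a Trino catalog file along these lines would wire Trino to this metastore. The metastore service hostname is an assumption based on the release name above, and the hive.s3.* values simply mirror the hive-site.xml settings; adjust both for your cluster:

# etc/catalog/hive.properties (sketch)
connector.name=hive
# Thrift endpoint of the metastore service deployed above (name assumed)
hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.cluster.local:9083
# same S3 endpoint and credentials as configured in hive-site.xml
hive.s3.endpoint=http://192.168.2.2
hive.s3.aws-access-key=PSFB....JEIA
hive.s3.aws-secret-key=A121....eJOEN
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false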