This blog on hive-metastore originally appeared on Medium. It has been republished with the author’s credit and consent.
In this blog, I’ll cover how to set up Hive metastore on Kubernetes and then leverage external S3 data sets.
Installation
In order to deploy Hive components on the Kubernetes cluster, I first add the required Helm chart repo:
$ helm repo add bigdata-gradiant https://gradiant.github.io/bigdata-charts/
$ helm repo update
I can then search the new repo for the available Helm charts. At the time of this writing, these are:
$ helm search repo bigdata-gradiant
NAME                                CHART VERSION   APP VERSION   DESCRIPTION
bigdata-gradiant/hbase              0.1.6                         HBase is an open-source non-relational distribu...
bigdata-gradiant/hdfs               0.1.10                        The Apache Hadoop software library is a framewo...
bigdata-gradiant/hive               0.1.6           2.3.6         The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/hive-metastore     0.1.3           2.3.6         The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/jupyter            0.1.11          6.0.3         Helm for jupyter single server with pyspark sup...
bigdata-gradiant/kafka-connect-ui   0.1.0           0.9.7         Helm for Landoop/kafka-connect-ui
bigdata-gradiant/opentsdb           0.1.7           2.4.0         Store and serve massive amounts of time series ...
bigdata-gradiant/spark-standalone   0.1.0           2.4.4         Apache Spark™ is a unified analytics engine for...
To link Trino to some S3 data sets, I’ll be deploying a Hive metastore:
$ helm install hivems bigdata-gradiant/hive-metastore -n analytics
NAME: hivems
LAST DEPLOYED: Thu Aug 10 10:21:47 2023
NAMESPACE: analytics
STATUS: deployed
REVISION: 1
TEST SUITE: None
I check that my pods have started correctly within Kubernetes:
$ kubectl -n analytics get pods
NAME                      READY   STATUS    RESTARTS   AGE
hivems-hive-metastore-0   1/1     Running   0          42m
hivems-postgresql-0       1/1     Running   0          42m
...
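The chart also creates a service in front of the metastore. Clients such as Trino will later connect to it over the Thrift protocol, typically on port 9083, so it is worth confirming the service name and port up front (the name below matches my hivems release; yours may differ):

$ kubectl -n analytics get svc hivems-hive-metastore

The service name and namespace together give the metastore URI, in my case thrift://hivems-hive-metastore.analytics.svc.cluster.local:9083, which I reuse further below.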
I then export the created configmap and append the following properties to the hive-site.xml entry in its data section (replace the values with your own S3 endpoint and credentials):
$ kubectl -n analytics get configmap hivems-hive-metastore -o yaml > hivems-hive-metastore.yaml
$ vi hivems-hive-metastore.yaml

### I MAKE THE FOLLOWING CHANGE/ADDITION ###
data:
  hive-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>192.168.2.2</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>PSFB....JEIA</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>A121....eJOEN</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
      </property>
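For the change to take effect, the edited configmap has to be applied back to the cluster and the metastore pod restarted so it reads the new hive-site.xml. A minimal sketch, assuming the same resource names as above:

$ kubectl -n analytics apply -f hivems-hive-metastore.yaml
$ kubectl -n analytics delete pod hivems-hive-metastore-0
# the statefulset controller recreates the pod with the updated config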
Note: I had to create a new container image that includes the missing hadoop-aws-2.7.4.jar file. It can be pulled from my Docker repo, jboothomas/hive-metastore-s3:v6. I then simply provided this as the image for the hive-metastore deployment.
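For reference, such an image can be built with a short Dockerfile along these lines. This is only a sketch: the base image tag is a placeholder for whatever image your chart deploys, and the target path is inferred from the /opt/hadoop-2.7.4 classpath visible in the shell output further below:

# Hypothetical sketch: base image and target path are assumptions
FROM gradiant/hive-metastore:2.3.6
# Add the S3A filesystem implementation from Maven Central to the Hadoop classpath
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar \
    /opt/hadoop-2.7.4/share/hadoop/common/lib/hadoop-aws-2.7.4.jar
# Note: hadoop-aws 2.7.x also expects a matching aws-java-sdk jar if one is not already present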
Table Creation
To create an external table, I exec into the hive metastore pod and connect to hive:
$ kubectl -n analytics exec -it hivems-hive-metastore-0 -- /bin/sh
# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive>
Now I can create my external table:
create external table if not exists nyctaxi(
  VendorID bigint,
  tpep_pickup_datetime timestamp,
  tpep_dropoff_datetime timestamp,
  passenger_count double,
  trip_distance double,
  RatecodeID double,
  store_and_fwd_flag string,
  PULocationID bigint,
  DOLocationID bigint,
  payment_type bigint,
  fare_amount double,
  extra double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  improvement_surcharge double,
  total_amount double
)
STORED AS PARQUET
LOCATION 's3a://nyctaxi/';

Time taken: 2.756 seconds
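With the table registered, a couple of quick sanity checks from the same hive prompt confirm that the schema and the s3a location were stored as expected (an illustrative example, not part of the original session):

hive> describe formatted nyctaxi;
hive> select count(*) from nyctaxi;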
From this point onwards, I can point various analytics tools (Trino, for example) at my hive-metastore service and run queries against this table.
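As a sketch of that last step, a Trino catalog pointing at this metastore would look roughly like the properties file below. The metastore service DNS name, the endpoint, and the credentials are assumptions that must match your deployment:

# etc/catalog/hive.properties (hypothetical values)
connector.name=hive
hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.cluster.local:9083
hive.s3.endpoint=http://192.168.2.2
hive.s3.aws-access-key=PSFB....JEIA
hive.s3.aws-secret-key=A121....eJOEN
hive.s3.path-style-access=true

After restarting Trino, the table should be reachable as hive.default.nyctaxi.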