This blog on hive-metastore originally appeared on Medium. It has been republished with the author’s credit and consent.
In this blog, I’ll cover how to set up a Hive metastore on Kubernetes and then leverage external S3 data sets.
Installation
To deploy the Hive components on the Kubernetes cluster, I first add the required Helm chart repository:
$ helm repo add bigdata-gradiant https://gradiant.github.io/bigdata-charts/
$ helm repo update
I can then search the new repo for the available helm charts. At the time of this writing, these are:
$ helm search repo bigdata-gradiant
NAME                               CHART VERSION  APP VERSION  DESCRIPTION
bigdata-gradiant/hbase             0.1.6          2.0.1        HBase is an open-source non-relational distribu...
bigdata-gradiant/hdfs              0.1.10         2.7.7        The Apache Hadoop software library is a framewo...
bigdata-gradiant/hive              0.1.6          2.3.6        The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/hive-metastore    0.1.3          2.3.6        The Apache Hive ™ data warehouse software facil...
bigdata-gradiant/jupyter           0.1.11         6.0.3        Helm for jupyter single server with pyspark sup...
bigdata-gradiant/kafka-connect-ui  0.1.0          0.9.7        Helm for Landoop/kafka-connect-ui
bigdata-gradiant/opentsdb          0.1.7          2.4.0        Store and serve massive amounts of time series ...
bigdata-gradiant/spark-standalone  0.1.0          2.4.4        Apache Spark™ is a unified analytics engine for...
To link Trino to some S3 data sets, I’ll be deploying a Hive metastore:
$ helm install hivems bigdata-gradiant/hive-metastore -n analytics
NAME: hivems
LAST DEPLOYED: Thu Aug 10 10:21:47 2023
NAMESPACE: analytics
STATUS: deployed
REVISION: 1
TEST SUITE: None
I check that my pods have started correctly within Kubernetes:
$ kubectl -n analytics get pods
NAME                      READY   STATUS
...
hivems-hive-metastore-0   1/1     Running
hivems-postgresql-0       1/1     Running
...
I then copy out the created ConfigMap and append the following properties to the hive-site.xml entry under data (replace the values with your own S3 endpoint and credentials):
$ kubectl -n analytics get configmap hivems-hive-metastore -o yaml > hivems-hive-metastore.yaml
$ vi hivems-hive-metastore.yaml

### I MAKE THE FOLLOWING CHANGE/ADDITION ###
data:
  hive-site.xml: |
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>fs.s3a.endpoint</name>
        <value>192.168.2.2</value>
      </property>
      <property>
        <name>fs.s3a.access.key</name>
        <value>PSFB....JEIA</value>
      </property>
      <property>
        <name>fs.s3a.secret.key</name>
        <value>A121....eJOEN</value>
      </property>
      <property>
        <name>fs.s3a.connection.ssl.enabled</name>
        <value>false</value>
      </property>
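For the change to take effect, the edited ConfigMap has to be applied back to the cluster and the metastore pod restarted so it picks up the new hive-site.xml. A minimal sketch of that step (restarting by deleting the pod is my assumption of the simplest route; the StatefulSet recreates it):

$ kubectl -n analytics apply -f hivems-hive-metastore.yaml
$ kubectl -n analytics delete pod hivems-hive-metastore-0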
Note: I had to build a new container image that includes the missing hadoop-aws-2.7.4.jar file. It is available from my Docker repo, jboothomas/hive-metastore-s3:v6. I then provided this as the image to use for the hive-metastore deployment.
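To wire that image into the chart-managed deployment, a Helm override along these lines should work; the image.repository and image.tag value names are an assumption on my part, so check the chart's values.yaml for the exact keys:

$ helm upgrade hivems bigdata-gradiant/hive-metastore -n analytics \
    --set image.repository=jboothomas/hive-metastore-s3 \
    --set image.tag=v6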
Table Creation
To create an external table, I exec into the hive metastore pod and connect to hive:
$ kubectl -n analytics exec -it hivems-hive-metastore-0 -- /bin/sh
# hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/hive/lib/hive-common-2.3.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

hive>
Now I can create my external table:
create external table if not exists nyctaxi(
  VendorID bigint,
  tpep_pickup_datetime timestamp,
  tpep_dropoff_datetime timestamp,
  passenger_count double,
  trip_distance double,
  RatecodeID double,
  store_and_fwd_flag string,
  PULocationID bigint,
  DOLocationID bigint,
  payment_type bigint,
  fare_amount double,
  extra double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  improvement_surcharge double,
  total_amount double
)
STORED AS PARQUET
LOCATION 's3a://nyctaxi/';

Time taken: 2.756 seconds
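As a quick sanity check, describing the table from the same hive prompt will show the schema and the s3a location that were just registered (output omitted here):

hive> describe formatted nyctaxi;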
From this point onwards, I can point various analytics tools (for example, Trino) at my hive-metastore service and run queries against this table.
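For illustration, a Trino Hive catalog pointing at this metastore could look like the sketch below. The service DNS name, the Thrift port 9083, the catalog file path, and the catalog name hivems are assumptions based on the chart's defaults and a standard Trino install; the S3 values mirror what was added to hive-site.xml above:

$ cat > etc/catalog/hivems.properties <<'EOF'
connector.name=hive
hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.cluster.local:9083
hive.s3.endpoint=http://192.168.2.2
hive.s3.aws-access-key=PSFB....JEIA
hive.s3.aws-secret-key=A121....eJOEN
hive.s3.path-style-access=true
hive.s3.ssl.enabled=false
EOF

With that catalog loaded, the table created above becomes queryable from Trino as hivems.default.nyctaxi.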