Trino S3 via hive-metastore Integration

This blog on Trino S3 initially appeared on Medium. It was republished with the author’s credit and consent.

In this blog, I’ll go over how to use S3 storage on a Pure Storage® FlashBlade® with Trino, the fast distributed SQL query engine for big data.

I deploy Trino using the hive chart and provide a values.yaml file with the following configuration:

additionalCatalogs:
lakehouse: |
connector.name=hive
hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.local:9083
hive.s3.aws-access-key=PSFB….BJEIA
hive.s3.aws-secret-key=A121….eJOEN
hive.s3.endpoint=192.168.2.2
hive.s3.ssl.enabled=false

additionalCatalogs:

lakehouse: |

connector.name=hive

hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.local:9083

hive.s3.aws–access–key=PSFB....BJEIA

hive.s3.aws–secret–key=A121....eJOEN

hive.s3.endpoint=192.168.2.2

hive.s3.ssl.enabled=false

This is pointing to my hive-metastore server. See this blog post for more information on setting that up. I then edit the Trino service to switch from ClusterIP to NodePort to facilitate external access.

As usual, I use the helm install command:

helm -n analytics install trino trino/trino -f values.yaml

1	helm –n analytics install trino trino/trino –f values.yaml

On a Linux client with the trino-cli installed, I use the following command to connect to my in Kubernetes running instance, and list the current catalogs available:

$ ./trino-cli-422-executable.jar –server https://10.225.114.50:32140
trino&gt; show catalogs;
Catalog
———–
hive&lt;
system
tpcds
tpch
(4rows)
Query 20230810_122520_00000_gb9kf, FINISHED, 2 nodes
Splits: 68 total, 68 done (100.00%)
1.15 [0rows, 0B] [0rows/s, 0B/s]
trino&gt;

$ ./trino–cli–422–executable.jar —server https://10.225.114.50:32140

trino> show catalogs;

Catalog

—————–

hive<

system

tpcds

tpch

(4rows)

Query 20230810_122520_00000_gb9kf, FINISHED, 2 nodes

Splits: 68 total, 68 done (100.00%)

1.15 [0rows, 0B] [0rows/s, 0B/s]

trino>

I can then select my hive source and check the available tables:

trino&gt; use hive.default;
USE
trino:default&gt; show tables;
Table
—————
nyctaxi_table
(1row)
Query 20230810_140256_00004_uapkp, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.45 [1 rows, 30B] [2 rows/s, 67B/s]
trino:default

trino> use hive.default;

USE

trino:default> show tables;

Table

———————–

nyctaxi_table

(1row)

Query 20230810_140256_00004_uapkp, FINISHED, 3 nodes

Splits: 68 total, 68 done (100.00%)

0.45 [1 rows, 30B] [2 rows/s, 67B/s]

trino:default

Note that to see the available schemas, you can use:

trino:default&gt; describe nyctaxi;
Column  |  | Extra | Comment
———————–+————–+——-+———
vendori | bigint  |  |
tpep_pickup_datetime | timestamp(3) | 
tpep_dropoff_datetime | timestamp(3) | |
passenger_count  | double  |  |
trip_distance  | double  |  |
ratecodeid | double  |  |
store_and_fwd_flag  | varchar |  |
pulocationid  | bigint  |  |
dolocationid | bigint  | |
payment_type | bigint  | |
fare_amount  | double  | |
extra  | double |  |
mta_tax  | double  | ; |
tip_amount | double  ||
tolls_amount  | double  |  |
improvement_surcharge | double  | |
total_amount; | double |  |
(17rows)
Query 20230810_163505_00104_a45jd, FINISHED, 3 nodes
Splits: 68 total, 68 done (100.00%)
0.28 [17rows, 1.05KB] [60rows/s, 3.72KB/s]

trino:default> describe nyctaxi;

Column | | Extra | Comment

———————————–+———————+———–+————–

vendori | bigint | |

tpep_pickup_datetime | timestamp(3) |

tpep_dropoff_datetime | timestamp(3) | |

passenger_count | double | |

trip_distance | double | |

ratecodeid | double | |

store_and_fwd_flag | varchar | |

pulocationid | bigint | |

dolocationid | bigint | |

payment_type | bigint | |

fare_amount | double | |

extra | double | |

mta_tax | double | ; |

tip_amount | double ||

tolls_amount | double | |

improvement_surcharge | double | |

total_amount; | double | |

(17rows)

Query 20230810_163505_00104_a45jd, FINISHED, 3 nodes

Splits: 68 total, 68 done (100.00%)

0.28 [17rows, 1.05KB] [60rows/s, 3.72KB/s]

Modern Hybrid Cloud Solutions

Seamless Cloud Mobility

Unify cloud and on-premises infrastructure with a modern data platform.

I can now run various queries on the data set. Please note this is from a very limited lab Kubernetes cluster with low resources and network connectivity, so performance was not the aim:

trino:default&gt; selectcount(*) from nyctaxi;
_col0
———–
899294289
(1row)
Query 20230815_132013_00029_4suvq, FINISHED, 1 node
Splits: 343 total,343 done (100.00%)
1.04 [899M rows, 3.56MB] [865M rows/s, 3.43MB/s]
trino:default&gt; select * from nyctaxi WHERE tolls_amount &gt; 100 order by tolls_amount desc&lt; limit 10;
vendorid | tpep_pickup_datetime  | tpep_dropoff_datetime | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta
———-+————————-+————————-+—————–+—————+————+——————–+————–+————–+————–+————-+——-+—-
2 | 2017-07-03&lt;/ 17:15:15.000 | 2017-07-03 17:35:57.000 |  1 | 9.74 | 1 |  | 138 | 107 | 2 | 27.5 | 1.0 |
2 | 2015-12-1200:17:39.000 | 2015-12-12 01:14:18.000 |  1 |  38.48 | 5 |  |  68 | 265 | 2 |  200.0 |  0.0 |
1 | 2016-12-29 07:46:33.000 | 2016-12-29&lt; 08:03:36.000 |  1 |  4.7 | 1 |  | 151 | 223 | 2 | 16.5 | 0.0 |
1 | 2014-08-15 10:37:42.000 | 2014-08-15 11:07:21.000 |   1 | 10.0 | 1 |  | 230 | 138 | 3 | 32.0 |  0.0 |
1 | 2015-01-06 18:27:38.000 | 2015-01-06 19:04:04.000 |  | 18.5 | 2 |  | 264 | 264 | 2 | 52.0 |  0.0 |
1 | 2016-03-14 15:16:32.000 | 2016-03-1416:27:11.000 |  1 | 18.5 | 2 |  | 132 | 230 | 3 | 52.0 |  0.0 |
1 | 2016-05-02 09:10:21.000 | 2016-05-0209:58:00.000 |  1 | 18.0 | 2 |  | 229 | 132 | 3 | 52.0 |  0.0 |
1 | 2017-11-28 23:24:24.000 | 2017-11-29 00:09:00.000 |  1 |  6.2 | 1 |  | 246 | 112 | 3 | 26.5 |  0.5 |
1 | 2014-09-14 12:37:44.000 | 2014-09-14 12:49:49.000 |  2 |  1.4 |1 |  | 186 | 170 | 3 |  9.5 |  0.0 |1 | 2015-11-20 21:35:41.000 | 2015-11-20 22:08:29.000 |  1 |  8.5 | 1 |  | 264| 264 | 3 | 30.5 |  0.5&lt; |
(10 rows)
Query 20230815_131646_00028_4suvq, FINISHED, 1 node
Splits: 353&lt; total, 353 done (100.00%)
2.77 [487M rows, 6.36GB] [176M rows/s, 2.3GB/s]

trino:default> selectcount(*) from nyctaxi;

_col0

—————–

899294289

(1row)

Query 20230815_132013_00029_4suvq, FINISHED, 1 node

Splits: 343 total,343 done (100.00%)

1.04 [899M rows, 3.56MB] [865M rows/s, 3.43MB/s]

trino:default> select * from nyctaxi WHERE tolls_amount > 100 order by tolls_amount desc< limit 10;

—————+————————————–+————————————–+————————–+———————–+——————+——————————+———————+———————+———————+——————–+———–+——

2 | 2017–07–03</ 17:15:15.000 | 2017–07–03 17:35:57.000 | 1 | 9.74 | 1 | | 138 | 107 | 2 | 27.5 | 1.0 |

2 | 2015–12–1200:17:39.000 | 2015–12–12 01:14:18.000 | 1 | 38.48 | 5 | | 68 | 265 | 2 | 200.0 | 0.0 |

1 | 2016–12–29 07:46:33.000 | 2016–12–29< 08:03:36.000 | 1 | 4.7 | 1 | | 151 | 223 | 2 | 16.5 | 0.0 |

1 | 2014–08–15 10:37:42.000 | 2014–08–15 11:07:21.000 | 1 | 10.0 | 1 | | 230 | 138 | 3 | 32.0 | 0.0 |

1 | 2015–01–06 18:27:38.000 | 2015–01–06 19:04:04.000 | | 18.5 | 2 | | 264 | 264 | 2 | 52.0 | 0.0 |

1 | 2016–03–14 15:16:32.000 | 2016–03–1416:27:11.000 | 1 | 18.5 | 2 | | 132 | 230 | 3 | 52.0 | 0.0 |

1 | 2016–05–02 09:10:21.000 | 2016–05–0209:58:00.000 | 1 | 18.0 | 2 | | 229 | 132 | 3 | 52.0 | 0.0 |

1 | 2017–11–28 23:24:24.000 | 2017–11–29 00:09:00.000 | 1 | 6.2 | 1 | | 246 | 112 | 3 | 26.5 | 0.5 |

1 | 2014–09–14 12:37:44.000 | 2014–09–14 12:49:49.000 | 2 | 1.4 |1 | | 186 | 170 | 3 | 9.5 | 0.0 |1 | 2015–11–20 21:35:41.000 | 2015–11–20 22:08:29.000 | 1 | 8.5 | 1 | | 264| 264 | 3 | 30.5 | 0.5< |

(10 rows)

Query 20230815_131646_00028_4suvq, FINISHED, 1 node

Splits: 353< total, 353 done (100.00%)

2.77 [487M rows, 6.36GB] [176M rows/s, 2.3GB/s]

Written By:

Pure Storage

Upskill Your Knowledge!

Looking to hone your coding skills? Check out some of our top coding blogs so you can go from try to DIY!

Check Out the Latest Articles

Blog Home

Trino S3 via hive-metastore Integration

Seamless Cloud Mobility

Upskill Your Knowledge!

Types of AI and Their Business Impact

AWS S3 vs. Azure Blob

RTO vs. RPO: What’s the Difference?

How to Ensure Business Continuity in the Age of Cyber Threats

Top Stories

Types of AI and Their Business Impact

AWS S3 vs. Azure Blob

RTO vs. RPO: What’s the Difference?

How to Ensure Business Continuity in the Age of Cyber Threats

Top Cybersecurity Infrastructure Standards to Know

Trino S3 via hive-metastore Integration

Seamless Cloud Mobility

Upskill Your Knowledge!

Related Stories

Top Stories