This blog on Trino S3 initially appeared on Medium. It was republished with the author’s credit and consent.
In this blog, I’ll go over how to use S3 storage on a Pure Storage® FlashBlade® with Trino, the fast distributed SQL query engine for big data.
I deploy Trino using the hive chart and provide a values.yaml file with the following configuration:
1 2 3 4 5 6 7 8 |
additionalCatalogs: lakehouse: | connector.name=hive hive.metastore.uri=thrift://hivems-hive-metastore.analytics.svc.local:9083 hive.s3.aws–access–key=PSFB....BJEIA hive.s3.aws–secret–key=A121....eJOEN hive.s3.endpoint=192.168.2.2 hive.s3.ssl.enabled=false |
This is pointing to my hive-metastore server. See this blog post for more information on setting that up. I then edit the Trino service to switch from ClusterIP to NodePort to facilitate external access.
As usual, I use the helm install command:
1 |
helm –n analytics install trino trino/trino –f values.yaml |
On a Linux client with the trino-cli installed, I use the following command to connect to my in Kubernetes running instance, and list the current catalogs available:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
$ ./trino–cli–422–executable.jar —server https://10.225.114.50:32140 trino> show catalogs; Catalog —————– hive< system tpcds tpch (4rows) Query 20230810_122520_00000_gb9kf, FINISHED, 2 nodes Splits: 68 total, 68 done (100.00%) 1.15 [0rows, 0B] [0rows/s, 0B/s] trino> |
I can then select my hive source and check the available tables:
1 2 3 4 5 6 7 8 9 10 11 |
trino> use hive.default; USE trino:default> show tables; Table ———————– nyctaxi_table (1row) Query 20230810_140256_00004_uapkp, FINISHED, 3 nodes Splits: 68 total, 68 done (100.00%) 0.45 [1 rows, 30B] [2 rows/s, 67B/s] trino:default |
Note that to see the available schemas, you can use:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
trino:default> describe nyctaxi; Column | | Extra | Comment ———————————–+———————+———–+————– vendori | bigint | | tpep_pickup_datetime | timestamp(3) | tpep_dropoff_datetime | timestamp(3) | | passenger_count | double | | trip_distance | double | | ratecodeid | double | | store_and_fwd_flag | varchar | | pulocationid | bigint | | dolocationid | bigint | | payment_type | bigint | | fare_amount | double | | extra | double | | mta_tax | double | ; | tip_amount | double || tolls_amount | double | | improvement_surcharge | double | | total_amount; | double | | (17rows) Query 20230810_163505_00104_a45jd, FINISHED, 3 nodes Splits: 68 total, 68 done (100.00%) 0.28 [17rows, 1.05KB] [60rows/s, 3.72KB/s] |
Modern Hybrid Cloud Solutions
Accelerate innovation with a modern data platform that unifies hybrid and multicloud ecosystems.
I can now run various queries on the data set. Please note this is from a very limited lab Kubernetes cluster with low resources and network connectivity, so performance was not the aim:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
trino:default> selectcount(*) from nyctaxi; _col0 —————– 899294289 (1row) Query 20230815_132013_00029_4suvq, FINISHED, 1 node Splits: 343 total,343 done (100.00%) 1.04 [899M rows, 3.56MB] [865M rows/s, 3.43MB/s] trino:default> select * from nyctaxi WHERE tolls_amount > 100 order by tolls_amount desc< limit 10; vendorid | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | ratecodeid | store_and_fwd_flag | pulocationid | dolocationid | payment_type | fare_amount | extra | mta —————+————————————–+————————————–+————————–+———————–+——————+——————————+———————+———————+———————+——————–+———–+—— 2 | 2017–07–03</ 17:15:15.000 | 2017–07–03 17:35:57.000 | 1 | 9.74 | 1 | | 138 | 107 | 2 | 27.5 | 1.0 | 2 | 2015–12–1200:17:39.000 | 2015–12–12 01:14:18.000 | 1 | 38.48 | 5 | | 68 | 265 | 2 | 200.0 | 0.0 | 1 | 2016–12–29 07:46:33.000 | 2016–12–29< 08:03:36.000 | 1 | 4.7 | 1 | | 151 | 223 | 2 | 16.5 | 0.0 | 1 | 2014–08–15 10:37:42.000 | 2014–08–15 11:07:21.000 | 1 | 10.0 | 1 | | 230 | 138 | 3 | 32.0 | 0.0 | 1 | 2015–01–06 18:27:38.000 | 2015–01–06 19:04:04.000 | | 18.5 | 2 | | 264 | 264 | 2 | 52.0 | 0.0 | 1 | 2016–03–14 15:16:32.000 | 2016–03–1416:27:11.000 | 1 | 18.5 | 2 | | 132 | 230 | 3 | 52.0 | 0.0 | 1 | 2016–05–02 09:10:21.000 | 2016–05–0209:58:00.000 | 1 | 18.0 | 2 | | 229 | 132 | 3 | 52.0 | 0.0 | 1 | 2017–11–28 23:24:24.000 | 2017–11–29 00:09:00.000 | 1 | 6.2 | 1 | | 246 | 112 | 3 | 26.5 | 0.5 | 1 | 2014–09–14 12:37:44.000 | 2014–09–14 12:49:49.000 | 2 | 1.4 |1 | | 186 | 170 | 3 | 9.5 | 0.0 |1 | 2015–11–20 21:35:41.000 | 2015–11–20 22:08:29.000 | 1 | 8.5 | 1 | | 264| 264 | 3 | 30.5 | 0.5< | (10 rows) Query 20230815_131646_00028_4suvq, FINISHED, 1 node Splits: 353< total, 353 done (100.00%) 2.77 [487M rows, 6.36GB] [176M rows/s, 2.3GB/s] |
Written By:
Upskill Your Knowledge!
Looking to hone your coding skills? Check out some of our top coding blogs so you can go from try to DIY!