This article on Airbyte S3 initially appeared on Medium. It was republished with the author’s credit and consent.
In this blog, I’ll show a simple implementation of Airbyte on Kubernetes with S3 integration on Pure Storage® FlashBlade®.
From our Kubernetes server with Helm installed, I first add the required helm repo for Airbyte:
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm repo update
Then I deploy the chart to the desired namespace, optionally passing a values.yaml with any overrides:
helm install s3airbyte airbyte/airbyte -n airbyte
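The optional values file can be handled in a small wrapper. The following is only a sketch: the function name `install_airbyte` and the values file path are hypothetical, and `--create-namespace` is added on the assumption that the namespace may not exist yet.

```shell
# Sketch: install the Airbyte chart, passing a values file only if one is supplied.
# install_airbyte is a hypothetical helper, not part of Helm or Airbyte.
install_airbyte() {
  release="${1:-s3airbyte}"      # release name used in this post
  namespace="${2:-airbyte}"      # namespace used in this post
  values_file="${3:-}"           # optional path to a values.yaml (assumption)

  # Build the argument list for helm step by step.
  set -- install "$release" airbyte/airbyte \
    --namespace "$namespace" --create-namespace

  # Append the values file only when it was given and exists on disk.
  if [ -n "$values_file" ] && [ -f "$values_file" ]; then
    set -- "$@" --values "$values_file"
  fi

  helm "$@"
}
```

Calling `install_airbyte` with no arguments installs with the chart defaults; `install_airbyte s3airbyte airbyte ./values.yaml` adds the overrides from that file.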
I edit the service/s3airbyte-airbyte-webapp-svc to change its type from ClusterIP to NodePort, so the web interface is quickly reachable on a node port.
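The same change can be made without an interactive edit. This is a minimal sketch that assumes the service and namespace names used above; the helper name `expose_webapp` is hypothetical.

```shell
# Sketch: switch the webapp service to NodePort, then print the assigned port.
# expose_webapp is a hypothetical helper; the defaults match the names in this post.
expose_webapp() {
  svc="${1:-s3airbyte-airbyte-webapp-svc}"
  namespace="${2:-airbyte}"

  # Patch only the service type; Kubernetes allocates the node port itself.
  kubectl patch service "$svc" --namespace "$namespace" \
    --patch '{"spec": {"type": "NodePort"}}'

  # Read back the node port that was allocated for the first service port.
  kubectl get service "$svc" --namespace "$namespace" \
    --output jsonpath='{.spec.ports[0].nodePort}'
}
```

The web interface should then answer on any node's address at the printed port.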
Airbyte interface.
Let’s now create a simple connector to pull data from an S3 bucket. I select Create Connector and choose S3 as the type. Then, I fill in the optional fields for the AWS access key, the secret key, and the endpoint. Since my demo environment has no SSL certificate, I specify a plain http endpoint URL.
Airbyte will test the source and then prompt for a destination to be created.
For the sake of this blog post, I’ll simply use a second, newly created S3 bucket on our FlashBlade as the destination and test a conversion from Parquet to Avro as the data transformation.
Again, Airbyte will test the destination, and after validation, I’m presented with the Configure Connection settings page. Change the settings to suit your needs; I’ll leave everything at the defaults:
After setup, I’m passed to the Connection Management pages, where I can see its status, job history, replication, transformation, and settings:
While this is in progress, I quickly check the objects in the source and destination buckets:
$ aws s3api list-objects-v2 --bucket nyctaxi --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.40.165
InsecureRequestWarning,
{
    "Contents": [
        {
            "LastModified": "2023-08-08T11:41:20.000Z",
            "ETag": "d2de0ffc4f9112b91c5fe3a407c07435",
            "StorageClass": "STANDARD",
            "Key": "yellow_tripdata_2023-01.parquet",
            "Size": 47673370
        }
    ]
}
$ aws s3api list-objects-v2 --bucket airbytedest --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.40.165
To see the progress of the job, select Job History > View logs:
The log shows the current count of records processed for my example S3 connector job:
When the job finishes, the Job History displays the details of the successful sync:
Let’s check the destination bucket contents. I now have four objects from the Parquet-to-Avro conversion:
$ aws s3api list-objects-v2 --bucket airbytedest --profile fbstaines03 --no-verify-ssl --endpoint-url=https://192.168.40.165
InsecureRequestWarning,
{
    "Contents": [
        {
            "LastModified": "2023-08-08T11:56:47.000Z",
            "ETag": "0a08d7a59360c0ec2f53cd572ca86127-20",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_0.avro",
            "Size": 209738278
        },
        {
            "LastModified": "2023-08-08T12:00:51.000Z",
            "ETag": "3c81c74aa12620ada225390959677eb7-20",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_1.avro",
            "Size": 209747517
        },
        {
            "LastModified": "2023-08-08T12:04:56.000Z",
            "ETag": "f62110ec469abac38d1b5b12a1ccf4f4-20",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_2.avro",
            "Size": 209748737
        },
        {
            "LastModified": "2023-08-08T12:07:30.000Z",
            "ETag": "fc6eae990030e9d6684c7035e65bfdca-13",
            "StorageClass": "STANDARD",
            "Key": "/nyctaxi/fbS3nyctaxi/2023_08_08_1691495559589_3.avro",
            "Size": 131695016
        }
    ]
}
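Rather than reading the JSON by eye, the object count and total size can be tallied from the same listing. The sketch below assumes the AWS CLI's `--query` and `--output text` options, which print one tab-separated key and size per line; the helper name `bucket_summary` is hypothetical.

```shell
# Sketch: count the objects in a bucket and sum their sizes.
# bucket_summary is a hypothetical helper; arguments are bucket, profile, endpoint URL.
bucket_summary() {
  bucket="$1"
  profile="$2"
  endpoint="$3"

  # Emit "key<TAB>size" per object, then total the size column with awk.
  # Lines without exactly two fields (such as "None" for an empty bucket) are skipped.
  aws s3api list-objects-v2 --bucket "$bucket" --profile "$profile" \
    --no-verify-ssl --endpoint-url "$endpoint" \
    --query 'Contents[].[Key, Size]' --output text |
    awk -F'\t' 'NF == 2 { n++; s += $2 }
                END { printf "%d objects, %.0f bytes\n", n, s }'
}
```

For example, `bucket_summary airbytedest fbstaines03 https://192.168.40.165` summarizes the destination bucket used in this post.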
Airbyte provides a simple platform to extract, transform, and load data between many sources and destinations thanks to its 300+ connectors.
Pure Storage FlashBlade’s S3 storage is simple to integrate and provides a fast, scalable S3 layer for Airbyte and analytics applications to leverage within the larger data pipeline picture.