This blog post is the second part of a two-part series, the first of which covers the use of Pure Service Orchestrator for providing persistence for a SQL Server 2019 big data cluster’s storage and data pools.
SQL Server 2019 big data clusters herald Microsoft’s vision of a future in which data virtualization does away with the need for complex and cumbersome ETL processes. Data virtualization is the ability to consume data directly from different data sources without having to perform any ETL. SQL Server 2019 big data clusters provide a unified data platform for data scientists and T-SQL users alike. We at Pure share the same vision of a future in which data does not live in numerous disparate data silos, but in a single aggregated pool, otherwise referred to as a “data hub”:
FlashBlade™ underpins Pure’s vision of the data hub architecture. FlashBlade is a scale-out file, object, and unstructured data platform designed for high IO throughput and concurrency-optimized data access at petabyte scale. FlashBlade has enjoyed considerable success in helping power some of the world’s largest deep learning GPU clusters, and it is also:
With the release-to-manufacturing (RTM) version of SQL Server 2019 just around the corner, FlashBlade can now help power a modern Microsoft analytics experience with SQL Server 2019 big data clusters.
What Is A SQL Server 2019 Big Data Cluster?
SQL Server 2019 Big Data Clusters facilitate the deployment of scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. All these components run in unison, thus enabling high-value relational and unstructured data to be processed on the same platform using Transact-SQL or Spark.
From an architectural standpoint, a big data cluster consists of four main components: the controller (control plane), the compute pool, the data pool, and the storage pool.
Connecting to Object Storage via HDFS Tiering
HDFS tiering virtualizes data stored in S3-compatible storage or Azure Data Lake Storage Gen2 behind the storage pool; the crux of this is a block map that maps data stored in the object store to HDFS.
A simple walkthrough illustrates this in action with FlashBlade. The starting point is the creation of an S3 bucket, which is as easy as the following four Purity commands:
The Purity graphical user interface or REST API can also be used to create the S3 bucket and access keys. Data now needs to be placed in the bucket; this example uses the New York taxi data set. For the sake of simplicity and expediency, s3cmd can be used for this purpose. Finally, mount the S3 bucket via HDFS tiering, starting by exporting the mount credentials:
export MOUNT_CREDENTIALS=fs.s3a.access.key=<access key>,\
fs.s3a.secret.key=<secret key>,\
fs.s3a.endpoint=<FlashBlade endpoint ip address>
The access key and secret key are the access/secret key pair for the bucket that contains the source data. The FlashBlade endpoint IP address is the data endpoint of the array that holds the S3 bucket.
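Note that azdata expects MOUNT_CREDENTIALS as a single comma-separated string of key=value pairs with no embedded spaces, so it can help to assemble it from variables. A minimal sketch, using placeholder values (the key strings and endpoint address below are assumptions, not real credentials):

```shell
# Assemble MOUNT_CREDENTIALS from variables to avoid stray spaces
# around the commas (placeholder values, not real credentials).
ACCESS_KEY="AKIAEXAMPLE"
SECRET_KEY="secretexample"
ENDPOINT="192.168.100.10"   # FlashBlade data VIP (assumption)

export MOUNT_CREDENTIALS="fs.s3a.access.key=${ACCESS_KEY},fs.s3a.secret.key=${SECRET_KEY},fs.s3a.endpoint=${ENDPOINT}"
echo "$MOUNT_CREDENTIALS"
```

Keeping the string free of whitespace matters: a space after a comma causes the credential parsing to fail when the mount is created.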
4. Mount the S3 bucket:
azdata bdc hdfs mount create --remote-uri=s3a://bdcdata/nyc_taxis \
--mount-path=<mount path>
5. Check that the bucket has mounted successfully; the mount operation should be near-instantaneous:
azdata bdc hdfs mount status
This screenshot represents what the New York taxi data set looks like in Azure Data Studio:
A whole world of analytics possibilities now opens up, all enhanced by the performance of FlashBlade, such as running notebooks:
or creating external tables for the consumption of the data via T-SQL:
Leverage SQL Server 2019 big data clusters in unison with the massive capacity, IO throughput, IO concurrency, and parallelism of a FlashBlade-powered data hub for the best possible on-premises analytics experience. And with Pure Service Orchestrator, the power of storage built for flash from the ground up can also be harnessed for the storage pool and the data pool.