This blog post is the second part of a two-part series, the first of which covers the use of Pure Service Orchestrator for providing persistence for a SQL Server 2019 big data cluster’s storage and data pools.
SQL Server 2019 big data clusters heralded Microsoft’s vision of a future in which data virtualization does away with the need for complex and cumbersome ETL processes. Data virtualization is the ability to consume data directly from different data sources without the requirement to perform any ETL. SQL Server 2019 big data clusters is a unified data platform for use by data scientists and T-SQL users alike. We at Pure share the same vision of a future in which data does not live in numerous disparate data silos, but in a single aggregated pool, otherwise referred to as a “data hub”:
FlashBlade™ underpins Pure’s vision of the data hub architecture. FlashBlade is a scale-out file, object, and unstructured data platform designed for high-throughput, high-concurrency data access at petabyte scale. FlashBlade has enjoyed considerable success in helping power some of the world’s largest deep learning GPU clusters, and it is also:
- helping usher in the era of the self-driving car at organizations such as Zenuity,
- accelerating the time to market for Man AHL, a pioneer in the field of systematic quantitative investing,
- reducing the time in which Searcher Seismic can provide seismic data to customers from weeks to minutes,
- enabling Chronopost, the French express shipping and delivery service, to exploit data in real-time and develop innovative customer services through R&D.
With the release to manufacturing version of SQL Server 2019 just around the corner, FlashBlade can now help power a modern Microsoft analytics experience with SQL Server 2019 big data clusters.
What Is A SQL Server 2019 Big Data Cluster?
SQL Server 2019 Big Data Clusters facilitate the deployment of scalable clusters of SQL Server, Spark, and HDFS containers running on Kubernetes. All these components run in unison, thus enabling high-value relational and unstructured data to be processed on the same platform using Transact-SQL or Spark.
From an architectural standpoint, a big data cluster consists of four main components:
- Controller
The controller provides secure connectivity and cluster management; it essentially acts as the control plane for the cluster. Apache Knox provides a single-point authentication gateway for the Spark services provided by the storage pool, and Apache Livy enables Spark job submission via a REST API. A master SQL Server instance provides a single point of access to the data pool, and the controller service acts as the connection point for managing the cluster.
- Compute Pool
The compute pool consists of containerized SQL Server instances that use their columnstore engines to shuffle and aggregate data from both the storage and data pools.
- Storage Pool
The storage pool is made up of pods, each consisting of a SQL Server container, a Spark instance, and an HDFS data node. The storage pool acts as the big data cluster’s engine for processing and storing unstructured data, and to users of the cluster, it looks very much like Apache Spark.
- Data Pool
The data pool handles the storage and processing of relational data. In essence, the data pool is a set of containerized SQL Server instances that data is spread across. Access to the data pool is via the master instance.
Connecting to Object Storage via HDFS Tiering
HDFS tiering virtualizes data stored in S3-compatible storage or Azure Data Lake Storage Gen2 behind the storage pool, the crux of which is a block map that maps data stored in the object store to HDFS.
A simple walkthrough will help illustrate this in action with FlashBlade. The starting point is the creation of an S3 bucket; this is as easy as the following four Purity commands:
- pureobjaccount create <account name>
- purebucket create --account <account name> <bucket name>
- pureobjuser create <account name/username>
- pureobjuser access-key create --user <account name/username>
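Filled in with hypothetical names, the sequence above might look like the following; the account, bucket, and user names here are illustrative placeholders, not prescriptive values:

```
# All names below are hypothetical examples.
pureobjaccount create bdc                          # create an object store account
purebucket create --account bdc bdcdata            # create a bucket in that account
pureobjuser create bdc/bdcuser                     # create an object user in the account
pureobjuser access-key create --user bdc/bdcuser   # generate the access/secret key pair
```

Make a note of the access key and secret key returned by the final command; they are required when mounting the bucket later in the walkthrough.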
The Purity graphical user interface or REST API can also be used to create the S3 bucket and access keys. Data now needs to be placed in the bucket; this example uses the New York Taxi data set. For the sake of simplicity and expediency, s3cmd can be used for this purpose. Finally, mount the S3 bucket via HDFS tiering:
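By way of illustration, an s3cmd upload of a source file to the bucket might look like this; the endpoint, credentials, file name, and bucket path are placeholders for your environment:

```
# Hypothetical endpoint, credentials, and paths; substitute your own values.
s3cmd --access_key=<access key> --secret_key=<secret key> \
      --host=<FlashBlade data endpoint ip address> \
      --host-bucket=<FlashBlade data endpoint ip address> \
      --no-ssl \
      put yellow_tripdata_2019-01.csv s3://bdcdata/nyc_taxis/
```

The --host and --host-bucket options point s3cmd at the FlashBlade data endpoint rather than the default Amazon S3 endpoints.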
1. For release candidate 1, set the CONTROLLER_USERNAME and CONTROLLER_PASSWORD environment variables.
2. Log in to the cluster controller service; this uses port 30080 by default unless otherwise changed via a custom configuration at cluster creation time:
azdata login -e https://<controller service ip address>:30080
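Putting the first two steps together, a non-interactive login might look like the following; the username, password, and controller IP address are illustrative values only:

```
# Hypothetical credentials and controller address; substitute your own.
export CONTROLLER_USERNAME=admin
export CONTROLLER_PASSWORD='<password>'
azdata login -e https://192.168.1.100:30080
```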
3. Set the MOUNT_CREDENTIALS environment variable:
export MOUNT_CREDENTIALS=fs.s3a.access.key=<access key>,\
fs.s3a.secret.key=<secret key>,\
fs.s3a.endpoint=<FlashBlade endpoint ip address>
The access key and secret key are the string values of the access/secret key pair for the bucket that contains the source data. The FlashBlade IP address is the data endpoint of the array that holds the S3 bucket.
4. Mount the S3 bucket:
azdata bdc hdfs mount create --remote-uri s3a://bdcdata/nyc_taxis \
--mount-path <HDFS mount path>
5. Check that the bucket has mounted successfully; the mount operation should be near-instantaneous:
azdata bdc hdfs mount status
This screenshot shows what the New York Taxi data set looks like in Azure Data Studio:
A whole world of analytics possibilities is now opened up, all enhanced by the performance of FlashBlade, such as running notebooks:
or creating external tables for the consumption of the data via T-SQL:
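As a sketch of what that T-SQL consumption might look like, the statements below declare an external table over the mounted taxi data. The mount path and column list are hypothetical assumptions for illustration, and the SqlStoragePool data source location follows the sqlhdfs convention big data clusters use for the storage pool:

```
-- Sketch only: mount path and column list are hypothetical.
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');

CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

CREATE EXTERNAL TABLE dbo.nyc_taxi_trips
(
    pickup_datetime  DATETIME2,
    trip_distance    FLOAT,
    fare_amount      FLOAT
)
WITH (
    DATA_SOURCE = SqlStoragePool,     -- the storage pool data source
    LOCATION    = '/<HDFS mount path>',
    FILE_FORMAT = csv_file
);

SELECT TOP 10 * FROM dbo.nyc_taxi_trips;
```

Once the external table exists, the mounted object store data can be queried from the master instance alongside relational data with ordinary T-SQL.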
Leverage SQL Server 2019 big data clusters in unison with the massive capacity, IO throughput, IO concurrency, and parallelism of a FlashBlade-powered data hub for the best possible on-premises analytics experience. And with Pure Service Orchestrator, the power of storage built for flash from the ground up can also be harnessed for the storage pool and the data pool.