SQL Server 2019 Big Data Clusters is a game-changing new feature for Microsoft data platform customers that provides:

  • A cloud-native hybrid scale-out data platform for both unstructured and structured data.
  • Data virtualization – a means of integrating data from different sources without having to write and maintain ETL packages.
  • The ability for data scientists and T-SQL practitioners alike to perform analytics on data sets containing sensitive data, without those data sets having to cross compliance boundaries.

What About Infrastructure?

Architecturally, a SQL Server 2019 big data cluster consists of four major components:

  • Control plane
  • Compute pool
  • Storage pool
  • Data pool

Each component has subtly different requirements; the compute pool, as its name suggests, is 100% compute-intensive. An infrastructure approach that allows compute and storage to be scaled independently of one another, commonly referred to as a disaggregated infrastructure, provides the greatest degree of flexibility. Similarly, a storage platform that is not tightly coupled to the compute resources serving the storage and data pools allows both storage capacity and IO bandwidth to be scaled more flexibly.

Big Data Cluster Storage Requirements

These requirements can be broken down into two areas: those of the Kubernetes cluster itself and those of the Big Data Cluster running on it.

Kubernetes Cluster

The state of the cluster is stored in etcd instances; etcd is a high-performance, lightweight key-value store.

The certificates that the cluster components use also constitute state. etcd therefore requires storage that is both durable and highly available.

SQL Server 2019 Big Data Cluster data itself is stored in two places: the storage pool and the data pool.

The storage pool uses HDFS, the Hadoop Distributed File System, together with Spark, whose default storage format is Parquet. Under HDFS, data is stored in blocks that are replicated, three times by default, for the purposes of availability.

The storage pool's default HDFS replication factor of 3 is specified in the big data cluster's deployment configuration.

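The exact excerpt varies with the deployment profile; a minimal sketch, assuming the hdfs-site.dfs.replication setting that the bdc.json deployment configuration exposes for the HDFS service, might look like this:

    "spec": {
      "services": {
        "hdfs": {
          "settings": {
            "hdfs-site.dfs.replication": "3"
          }
        }
      }
    }
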
Data protection is built into all Pure Storage® platforms using advanced RAID and erasure coding techniques tailored specifically for flash storage, so the HDFS replication factor can be set to 1. The data pool has the same storage requirements as a conventional SQL Server database.

Stateful Applications and Kubernetes

When Docker (and Docker-compatible) containers were first conceived, the focus was on stateless applications. Objects in a Kubernetes cluster are managed declaratively: pods, which encapsulate containers, are typically part of a replicaset or a statefulset, and Kubernetes aims to ensure that the number of pods specified for a replicaset or statefulset is always running. If for any reason a worker node is unable to service its workload, the Kubernetes control plane will reschedule the pods from that node onto other worker nodes.

For stateless applications, rescheduling pods in the event of a node failure is a trivial problem to solve.

The challenge becomes more nuanced with stateful applications, because wherever a pod is rescheduled to run, its associated state needs to ‘follow’ it. There are two approaches to solving this problem. The first is to replicate state between worker nodes.

In this scheme, if a pod needs to be rescheduled, it is placed on a worker node that its data has been replicated to. The second scheme uses a centralised storage platform.

All nodes in the cluster have an IO path to a single shared storage platform. When pods are rescheduled to run on a different worker node, their volumes are unmounted and then re-mounted on whichever node they move to.
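
To make the mechanics concrete, the sketch below shows the Kubernetes construct that lets state follow a pod: a statefulset whose volumeClaimTemplates give each pod its own persistent volume claim. The names used here (mssql-demo, the pure-block storage class and the sizes) are illustrative assumptions rather than anything taken from a big data cluster deployment:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: mssql-demo                     # hypothetical name for illustration
    spec:
      serviceName: mssql-demo
      replicas: 1
      selector:
        matchLabels:
          app: mssql-demo
      template:
        metadata:
          labels:
            app: mssql-demo
        spec:
          containers:
          - name: mssql
            image: mcr.microsoft.com/mssql/server:2019-latest
            volumeMounts:
            - name: data                   # the volume mounted into the container
              mountPath: /var/opt/mssql
      volumeClaimTemplates:                # one claim is created per pod and follows it
      - metadata:
          name: data
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: pure-block     # assumed storage class name
          resources:
            requests:
              storage: 10Gi

If the node running mssql-demo-0 fails, the pod is rescheduled elsewhere and the same data claim, and therefore the same volume, is re-attached on the new node; no application-level replication is required when the underlying storage platform is shared.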

Replicated Versus Shared Storage Platforms

Every single piece of data written to a replication-based infrastructure must be written in multiple places, which adds to the complexity and management overhead of such platforms. Modern centralised storage platforms instead rely on techniques such as erasure coding to make data highly available, making them more space- and hardware-efficient than replication.

Persistent Volumes

A big data cluster can use two types of storage: ephemeral storage or persistent volumes. With ephemeral storage, the state associated with a pod is lost the instant that pod is rescheduled to run on another worker node in the cluster. Ephemeral storage is therefore not recommended for production purposes; all production-grade SQL Server 2019 Big Data Clusters should use persistent volumes. The Kubernetes persistent volume storage ecosystem is based on three objects:

  • Volumes
  • Persistent volume claims
  • Persistent volumes

A volume is the touchpoint for storage consumption at the pod level and can be thought of in similar terms to a mount point. Storage must be associated with the volume, and this is where persistent volume claims come in: a persistent volume claim is a request for storage from a storage class. Each storage entity available to the cluster is represented by a storage class. Different (or the same) storage classes can be specified for the storage and data pools, allowing the storage platform that best suits the needs of each pool to be used. For example, due to the highly parallel nature of Spark, it may be preferable to use a storage class for the storage pool that is associated with a platform delivering high IO bandwidth and capable of servicing numerous IO requests concurrently.

Using a layered cake as an analogy, a volume is the topmost layer, followed by a persistent volume claim in the middle. The bottom-most layer is the actual persistent volume(s), which map directly to physical storage volumes or LUNs on the underlying storage platform. 
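
The same layering can be seen in manifest form. The sketch below uses hypothetical names (bdc-data-claim, demo-pod) and an assumed storage class called pure-block; the pod's volume references the claim, and the claim is satisfied by a persistent volume provisioned from the storage class:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: bdc-data-claim            # middle layer: a request for storage
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: pure-block    # assumed storage class name
      resources:
        requests:
          storage: 20Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: demo-pod
    spec:
      containers:
      - name: app
        image: busybox
        command: ["sleep", "3600"]
        volumeMounts:
        - name: data                  # top layer: the volume mounted into the container
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: bdc-data-claim   # binds the pod's volume to the claim; the persistent
                                      # volume underneath is the bottom layer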

Storage is provisioned in one of two ways:

  • Manual provisioning: Human intervention is required to create persistent volumes, from which storage is allocated to satisfy persistent volume claims (a sketch follows this list).
  • Dynamic provisioning: When a persistent volume claim is created, a persistent volume is created automatically and the two entities are ‘Bound’, meaning that the persistent volume claim is usable.
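
For manual provisioning, an administrator creates a PersistentVolume object up front. The sketch below is illustrative only; the driver name (pure-csi) and the volume handle are assumptions, and in practice these details come from the storage platform and its plugin documentation:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: manually-provisioned-pv     # created by an administrator ahead of time
    spec:
      capacity:
        storage: 50Gi
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: pure-csi                # assumed CSI driver name
        volumeHandle: example-volume-id # hypothetical identifier of the volume on the array

With dynamic provisioning no such object is written by hand; the provisioner named in the storage class creates an equivalent persistent volume automatically when a claim is submitted.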

What Does A Big Data Cluster Require?

The most basic requirement of a big data cluster is a storage platform that:

  1. Supports persistent volumes, for which a storage class can be created (a configuration sketch follows this list).
  2. Furnishes storage that is both durable and highly available.
  3. Allows a pod’s state to ‘follow’ the pod whenever it is rescheduled.
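
On the first point, the storage class a big data cluster should use is named in its deployment configuration. A minimal sketch, assuming the storage section exposed by the control.json deployment file and a hypothetical storage class named pure-block, might look like this:

    "storage": {
        "data": {
            "className": "pure-block",
            "accessMode": "ReadWriteOnce",
            "size": "15Gi"
        },
        "logs": {
            "className": "pure-block",
            "accessMode": "ReadWriteOnce",
            "size": "10Gi"
        }
    }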

Storage Plugins

Originally, all Kubernetes storage plugins were “in-tree,” meaning that storage vendors had to integrate their drivers directly into the Kubernetes source code. This type of tight coupling is incredibly inflexible, so the Kubernetes community came up with the Flex Volume architecture. Flex Volume drivers are out-of-tree; however, a major drawback is that driver files must be copied onto the root file system of each node in the cluster. The state of the art for Kubernetes storage driver integration is now the Container Storage Interface, often abbreviated to CSI. Not only are CSI storage plugins out-of-tree, but they are containerized and deployed using standard Kubernetes primitives.

All storage platforms that support persistent volumes and have an associated storage class can be used for a SQL Server 2019 big data cluster, irrespective of whether the plugin is in-tree, Flex Volume or CSI compliant. However, the only Kubernetes storage plugin interface specification with a future is the CSI standard; any storage platform with a roadmap that aligns with the Kubernetes community must support it. Pure Storage already has customers using CSI 1.0 plugins in production; we will very shortly be releasing a CSI 1.1-compliant plugin, and we are firmly committed to following future CSI developments.
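
To the cluster, a CSI plugin surfaces in the same way as any other storage integration: through a storage class whose provisioner field names the CSI driver. The sketch below is illustrative, with an assumed driver name of pure-csi and an assumed class name of pure-block:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: pure-block            # assumed storage class name
    provisioner: pure-csi         # the CSI driver responsible for provisioning volumes
    reclaimPolicy: Delete
    allowVolumeExpansion: true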

Introducing the Pure Storage Kubernetes Storage Solution 

Pure Storage provides both a Flex Volume and a CSI-compliant plugin in the form of Pure Service Orchestrator™ (PSO), which delivers three advanced capabilities.

In keeping with the Pure mantra of ease of use and simplicity, Pure Service Orchestrator is installed via Helm. Scaling Pure Service Orchestrator elastically across numerous storage arrays is seamless and simply requires:

  1. The IP address endpoint(s) of the array(s) and their access token(s) to be added to a YAML configuration file.
  2. The invocation of the helm upgrade command to incorporate the new array(s) into the infrastructure that PSO can use (a sketch of both steps follows this list).
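
A minimal sketch of what step 1 can look like is shown below. The endpoints and tokens are placeholders, and the exact keys should be checked against the PSO Helm chart's own values.yaml; the arrays section shown here follows the layout the chart documents:

    arrays:
      FlashArrays:
        - MgmtEndPoint: "10.0.1.10"           # placeholder FlashArray management endpoint
          APIToken: "<flasharray-api-token>"  # placeholder access token
      FlashBlades:
        - MgmtEndPoint: "10.0.1.20"           # placeholder FlashBlade management endpoint
          APIToken: "<flashblade-api-token>"
          NfsEndPoint: "10.0.1.21"            # data (NFS) endpoint for the FlashBlade

Step 2 is then an ordinary helm upgrade of the PSO release against this file, after which the new arrays join the pool of storage that PSO provisions from.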

SQL Server 2019 Big Data Clusters can be deployed to vanilla Kubernetes on-premises, with support for the OpenShift Container Platform set to follow. The good news is that whatever Kubernetes-based platform SQL Server 2019 Big Data Clusters may support in the future, the likelihood is that Pure Service Orchestrator already supports it.

Summary

  • Disaggregated architectures that separate compute from storage provide the most flexibility for scaling big data cluster compute and storage components independently of one another.
  • The storage pool HDFS replication factor can be set to one for all Pure Storage platforms – due to their use of advanced RAID and erasure coding techniques. 
  • Storage platforms that allow every worker node in the cluster to mount every volume, provide the greatest flexibility in terms of pod scheduling and help minimize data centre hardware footprints.
  • Prefer the use of storage platforms that come with Container Storage Interface compliant plugins, because all future Kubernetes storage innovation will take place around this standard.

What Matters Most?

Organizations looking to deploy SQL Server 2019 Big Data Clusters are likely to be driven by the need to extract as much value as possible from their data, not by a desire to add complexity and management overhead to their infrastructure. The good news is that we at Pure Storage make the challenge of SQL Server 2019 Big Data Cluster storage persistence an incredibly simple problem to solve. In doing so, we give data and analytics professionals more time to return as much data-driven value as possible to their organizations.

Learn best practices for managing data for Microsoft SQL Server, whether on-premises or using Cloud Block Store during the Power Up Copy Data Management for SQL Server with New Integrations Webinar, and dive deep into why and how you would want to deploy SQL Server DBs on Containers in the Streamline SQL Server Development with Containers Webinar.