HCI in healthcare: The Good, The Bad, The Burly…

Almost weekly, I get asked for my thoughts on Hyperconverged Infrastructure (HCI) in healthcare.

Before I begin, let me note that while I’ve used several VMware vSAN examples in this blog, it’s not because I’m picking on vSAN but rather because VMware’s vSAN documentation is excellent and easy to reference for a few of the points that I make below. Other HCI offerings use similar principles.

There are generally 4 things that folks like about HCI, namely: low cost, operational simplicity, reliable performance, and collapsed stacks. We’ll touch on all 4 of these items in this blog — it’s a lengthy read, but the TL;DR version is: if we’re talking about any of the mainstream healthcare applications running on HCI (such as VDI), we’re having the wrong conversation; HCI is simply not a good fit for healthcare applications.

Let me explain…

HCI Landscape & Adoption

Today’s major HCI vendor landscape includes folks like VMware (vSAN), Nutanix, Springpath, SimpliVity, etc. In the majority of conversations, the two that stand out are VMware vSAN and Nutanix — both solid products from admirable companies. That said, let’s take a quick look at the overall industry’s take on HCI adoption, now and with an eye to the future.

The following image from an ESG Research Report (The Cloud Computing Spectrum, from Private to Hybrid), illustrates a cross-industry / cross-vertical poll of IT organizations to get their take on on-premises infrastructure flavors.


Only 8% of the polled participants indicated that they were looking towards HCI for their infrastructure needs, while over half indicated that a piecemeal approach was more to their liking. Interesting. Also noteworthy is the large group (30%) that is considering CI (Converged Infrastructure) platforms.

Next, let’s take a look at why users would pick HCI over CI. The following data is part of the same ESG study referenced above.

Looking at these reasons, it’s not a far reach to consider the top items as part of an “operational simplicity” umbrella. Digging deeper, it’s often the case that the virtualization/compute teams are seeking to take charge of the storage layer too, and HCI gives them just the platform to do so. However, given the collapsed-domain nature of HCI, these solutions often require a nontrivial amount of storage expertise to correctly and safely provision capacity for workloads. That seems like the opposite of simplicity, and it harks back to the old days of needing deep storage expertise, turning the right knobs, and so on.

Next, let’s take a look at why users would pick CI over HCI. The following data is also part of the same ESG study referenced above.

Let’s focus on the #1, #2, and #4 items in this graphic, viz.: better performance, better reliability, and better suited to tier-1 workloads/mission critical workloads. Interesting…

Herein lies my point — in healthcare IT, it is incredibly hard, if not impossible, to find true non-tier-1/non-mission critical front-end workloads! Now, some might ask about back office workloads, such as billing and other non-clinical workloads. True, there’s some more flexibility to those applications, but the typical healthcare IT organization leverages shared infrastructure to provision all applications across the enterprise, and it’s usually very hard to separate these workloads from each other.

Therefore, I count VDI (Virtual Desktop Infrastructure) as one of these tier-1/mission-critical workloads because, quite frankly, the VDI deployment is the face of the franchise. If the VDI experience is slow, or worse, unreliable, the end-users’ experience is that the clinical application(s) are slow or unavailable. Remember, experience ≡ reality!

Balancing Cost, Availability, and Performance

In an ideal world, we’d strike a great balance between 3 somewhat opposing factors: cost, high levels of availability, and predictable performance. This is doable — traditional, bespoke deployments, as well as Converged Infrastructure, allow you to mix and match appropriate components to achieve just the right balance for your environment. HCI, on the other hand, struggles to provide a good balance across the 3 factors.

Here’s a somewhat simplified visual representation of the HCI tradeoffs:

It’s not always easy to balance these tradeoffs in an HCI environment. Let’s see why…

A term we should familiarize ourselves with is FTT, or Failures-to-Tolerate. FTT defines the number of concurrent faults a particular configuration can sustain, or tolerate, while continuing to remain online. Ergo, FTT=1 indicates the capability of a system to tolerate 1 failure, FTT=2 indicates 2 failures, and so on.

FTT is a key part of a successful HCI deployment because of the collapsed failure domain nature of this architecture. Once again, as an example, looking at VMware’s vSAN best practices on Fault Domains, we see that a basic formula for the minimum number of nodes based on FTT is 2n + 1, where n = the FTT factor; for FTT=2, for example, you’ll want to deploy 5 physical hosts (nodes) so that the cluster can rebuild storage components when a node fails or otherwise becomes unavailable. This is, again, extra “unusable” capacity which must be provisioned.
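To make the 2n + 1 rule concrete, here’s a tiny Python sketch. The function name and the printed table are purely illustrative (they’re not part of any vendor tooling); the rule itself is the one from the vSAN guidance quoted above.

    # A tiny sketch of the "2n + 1" node-count rule of thumb described above.
    def min_hosts_for_ftt(ftt: int) -> int:
        """Minimum number of physical hosts needed to tolerate `ftt` concurrent failures."""
        return 2 * ftt + 1

    for ftt in (1, 2, 3):
        print(f"FTT={ftt} -> at least {min_hosts_for_ftt(ftt)} hosts")
    # FTT=1 -> at least 3 hosts
    # FTT=2 -> at least 5 hosts
    # FTT=3 -> at least 7 hosts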

We should also familiarize ourselves with the term FTM, or Failure Tolerance Method (aka RAID / Erasure Coding), which is used to provide either fault tolerance or better performance. While we won’t go into the murky details of the various FTMs in this blog, it’s important to note that in HCI we need to consider both FTT and FTM when planning a particular deployment. Why? Again, because of the collapsed failure domain nature of HCI, a failure of a single subsystem inside a node results in a failure of that entire node.

Another thing to internalize is that a configuration with FTT=1 and data reduction enabled can result in data loss/data degradation of one or more VMware datastores on a node in the event of a single media error. In fact, Duncan Epping of VMware’s CTO Office recommends that FTT=2 should be your chosen configuration moving forward. This necessarily means further over-provisioning a deployment by intentionally reducing the amount of available capacity per node. Also, it’s important to note that every modern storage platform has provided resilience against dual concurrent faults for over a decade — from a storage vendor’s perspective, protecting customers’ data with only FTT=1 is not only antiquated but quite unheard of!

It’s also generally vendor-recommended best practice to maintain some amount of reserve (slack) free space per node (usually in the 10% to 30% range), so the overall per-node capacity needs to be further adjusted accordingly. And, finally, let’s not forget the per-node HCI hypervisor overhead (CPU), somewhere in the range of 10% – 30%.
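To see how these per-node overheads stack up, here’s a rough, hedged Python sketch. The 25% slack, 1.5x FTM multiplier (RAID6 erasure coding), and 15% hypervisor CPU overhead are example values I’ve picked from within the ranges above, not vendor-published constants.

    # A back-of-the-envelope sketch of per-node overheads; the default
    # percentages are assumed example values, not vendor sizing figures.
    def usable_capacity_tb(raw_tb: float, ftm_multiplier: float = 1.5,
                           slack_fraction: float = 0.25) -> float:
        """Raw per-node capacity -> capacity actually available to workloads."""
        after_slack = raw_tb * (1 - slack_fraction)  # hold back reserve/slack space
        return after_slack / ftm_multiplier          # divide out the protection (FTM) overhead

    def usable_cores(raw_cores: int, hypervisor_overhead: float = 0.15) -> float:
        """Physical cores -> cores left over for guest workloads."""
        return raw_cores * (1 - hypervisor_overhead)

    print(usable_capacity_tb(20.0))  # 20 TB raw  -> 10.0 TB usable
    print(usable_cores(32))          # 32 cores   -> 27.2 cores for workloads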

Tying it all together

Here’s where we really dive into the Burly aspect of HCI in healthcare. To achieve high levels of availability and consistently high (and predictable) performance, you’re going to spend a fortune, and even then you’ll end up with a configuration that is massively underutilized from a compute and throughput standpoint. You’ll need an {FTT=2, RAID6 Erasure Coding or RAID1+1 Mirroring} configuration per node (remember, each node will contain a minimum of 150% of the needed capacity — see the VMware vSAN Design and Sizing Guide, for example), and you’ll want to configure enough nodes so that no node is ever utilized beyond the point where your cluster can survive up to {X} node failures while keeping the application performance impact during rebalancing and datastore rebuilds within the acceptable SLAs for the application.

Cost-optimized: these HCI configurations typically deliver low performance and low availability; they enable data reduction, use all disk-based nodes, and, most importantly, accept FTT=1 (very low availability) and FTM = RAID5 Erasure Coding (low performance).

Availability-optimized (higher availability): these HCI configurations typically enable data reduction, use either hybrid flash/disk or fully disk-based nodes, and, most importantly, configure FTT=2 (higher availability) with Triple Mirroring (RAID1 + RAID1) or RAID6 Erasure Coding (low performance). This configuration includes several in-system spares to increase availability, but drives the cost up.

Performance-optimized (higher performance): these HCI configurations typically disable data reduction and use hybrid flash/disk nodes with FTT=1 (low availability) and RAID1 Mirroring (higher performance, but again low availability). The faster disks, along with no data reduction, lead to an expensive proposition that also has low availability.
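As a quick illustration of how differently these three profiles consume raw capacity, here’s a hedged Python sketch. The profile names, dictionary structure, and the 100 TB usable figure are mine and purely illustrative; the raw-capacity multipliers (roughly 1.33x for RAID5 erasure coding, 1.5x for RAID6, 2x for a 2-way mirror, 3x for a triple mirror) are the commonly published vSAN values.

    # Compare the raw capacity each illustrative profile consumes for a given
    # amount of usable capacity. Multipliers per the commonly published vSAN
    # figures; profile names and structure are illustrative only.
    PROFILES = {
        "cost-optimized":         {"ftt": 1, "ftm": "RAID5 erasure coding", "multiplier": 1.33},
        "availability-optimized": {"ftt": 2, "ftm": "RAID6 erasure coding", "multiplier": 1.5},  # ~3.0 with triple mirroring
        "performance-optimized":  {"ftt": 1, "ftm": "RAID1 mirroring",      "multiplier": 2.0},
    }

    usable_tb = 100  # usable capacity the workloads actually need
    for name, profile in PROFILES.items():
        raw_tb = usable_tb * profile["multiplier"]
        print(f"{name}: FTT={profile['ftt']}, {profile['ftm']}, "
              f"{raw_tb:.0f} TB raw for {usable_tb} TB usable")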

Additionally, while HCI vendors do a reasonably good job of publishing benchmark (performance and scalability) data, it’s very hard to find reliable best-practices for a balanced (high availability + high performance) deployment model.

What’s also important to recognize is that when rebalancing after a failure, HCI systems typically start to rebuild the datastore objects from the failed node(s) on the surviving node(s) — sometimes fairly aggressively — causing a further impact on the performance of the application(s) being served by the surviving node(s). To accommodate this, you’ll want to further reduce the per-node peak workload, necessitating more nodes in your cluster.

So, how do you address the requirements of applications that require high availability and consistent performance? You need to minimally address:

Per-node:

  • Hypervisor overhead — 10% – 30%, for storage and compute, each
  • FTM = Triple Mirroring or RAID6 Erasure Coding — raw capacity of at least 150% of the needed usable capacity
  • Slack space — 10% – 30% storage overhead

Cluster-wide:

  • Enough nodes in the cluster to accommodate FTT = 2 (minimum for fault-tolerant applications)
  • Enough nodes to accommodate cluster-wide rebalancing and rebuilding of resources for node failure(s) — planned or unplanned — so as to avoid the performance and availability impact for users during such rebuilding and rebalancing activities
  • Cluster-wide free space to maintain for failure — a good rule of thumb: add an extra {((N – (N – FTT)) / N)} fraction of nodes to the total (this simplifies to FTT / N), where N = the original number of nodes. For example, a nominally 10 node cluster with FTT = 2 would need an additional ((10 – (10 – 2)) / 10) = 20% nodes, or a total of 12 nodes (see the sketch below)
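Here’s the same rule of thumb expressed as a small Python sketch. The helper name total_nodes is my own and the rounding-up via ceil() is an assumption; this is illustrative, not a vendor sizing calculator.

    # A worked example of the cluster-wide rule of thumb above.
    import math

    def total_nodes(nominal_nodes: int, ftt: int) -> int:
        """Nominal node count plus extra nodes to absorb FTT node failures."""
        # ((N - (N - FTT)) / N) simplifies to FTT / N
        extra_fraction = (nominal_nodes - (nominal_nodes - ftt)) / nominal_nodes
        return nominal_nodes + math.ceil(nominal_nodes * extra_fraction)

    print(total_nodes(10, 2))  # 10 nominal nodes + 20% -> 12 total nodes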

So, the question isn’t how many HCI nodes is too many for guaranteed performance and availability, but rather, how many nodes is too few…

Thus, HCI is simply not a good fit for healthcare applications, as it cannot economically meet your technical and organizational demands.

I’d like to leave you with an awesome interview between Terri McClure, Senior Analyst @ Enterprise Strategy Group, and Vaughn Stewart, VP of Technology at Pure Storage (and a friend, colleague, and a thought leader in the HCI vs. CI space). It’s worth a watch!