Just recently, Rubrik announced their integration with the FlashArray to help back up virtual machines and avoid the common performance penalty incurred during VMware snapshot consolidation. See their announcement here.
First off, the problem. When you take a VM snapshot of a VMFS-based VM, there is a performance penalty. This is because when a snapshot is “created”, what really happens is that a delta VMDK is instantiated. All new writes from that virtual machine go to the delta virtual disk instead of the base VMDK, so the base VMDK preserves the point-in-time state of the virtual disk. This process is referred to as “redirect-on-write”.
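To make the redirect-on-write idea concrete, here is a minimal toy model in Python. It is only a sketch of the concept, not how ESXi implements it: the class name `SnapshottedDisk` and its methods are made up for illustration, and blocks are just dictionary entries.

```python
class SnapshottedDisk:
    """Toy model of a virtual disk with one redirect-on-write snapshot."""

    def __init__(self, base_blocks):
        self.base = dict(base_blocks)   # stands in for the base VMDK
        self.delta = None               # stands in for the delta VMDK

    def take_snapshot(self):
        # Creating the snapshot copies nothing; it only starts an empty delta.
        self.delta = {}

    def write(self, block, data):
        if self.delta is not None:
            self.delta[block] = data    # redirected, so the base stays frozen
        else:
            self.base[block] = data

    def read(self, block):
        # With a snapshot present, every read has to check the delta first.
        if self.delta is not None and block in self.delta:
            return self.delta[block]
        return self.base[block]

    def delete_snapshot(self):
        # Consolidation: fold everything written to the delta back into the base.
        self.base.update(self.delta or {})
        self.delta = None


disk = SnapshottedDisk({0: "A", 1: "B"})
disk.take_snapshot()
disk.write(1, "B2")
print(disk.read(1), disk.base[1])  # B2 B
disk.delete_snapshot()
print(disk.base[1])                # B2
```

Even in this toy version you can see where the extra work comes from: reads take an additional lookup while the delta exists, and deleting the snapshot means merging every redirected write back into the base.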
Here is a good KB for more detail:
It is well understood that keeping these snapshots around is not a good idea, mainly due to the performance impact of this redirection. Basically, the rule is to not keep them for more than 24 hours, but really, don’t keep them around any longer than you have to. If something is about to happen (like a patch), take the VMware snapshot, install the patch, ensure nothing is broken, then remove the snapshot.
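If you want to check for snapshots that have overstayed that 24-hour rule, the logic is just a walk over the snapshot tree. The sketch below is an assumption-heavy illustration: `stale_snapshots` is a hypothetical helper, the attribute names (`name`, `createTime`, `childSnapshotList`) are chosen to mirror the snapshot tree objects that pyVmomi exposes as far as I know, and the example uses stand-in objects with made-up dates rather than a real vCenter connection.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def stale_snapshots(snapshot_trees, now, max_age=timedelta(hours=24)):
    """Return the names of snapshots older than max_age, including children."""
    stale = []
    stack = list(snapshot_trees)
    while stack:
        node = stack.pop()
        if now - node.createTime > max_age:
            stale.append(node.name)
        stack.extend(node.childSnapshotList)
    return sorted(stale)


# Stand-in objects with illustrative timestamps (not from a real environment).
now = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
child = SimpleNamespace(
    name="post-patch",
    createTime=now - timedelta(hours=2),
    childSnapshotList=[],
)
root = SimpleNamespace(
    name="pre-patch",
    createTime=now - timedelta(hours=30),
    childSnapshotList=[child],
)

print(stale_snapshots([root], now))  # ['pre-patch']
```

Against a real environment you would feed this the root snapshot list of each VM instead of the stand-ins, after verifying the attribute names against the SDK you are using.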
There are three primary performance-impacting events here:
Let’s look at an example.
I kicked off a light workload in my VM using VDBench:
Doing 2,000 IOPS, 70% write, 4 KB I/O size. Nothing crazy, but enough to illustrate the concept.
The latency is looking pretty good, all sub-ms as it should be: 0.3 ms for reads and 0.4 ms for writes. Now I take a snapshot:
But moving forward, you can see the performance continues to suffer, as the latency is much higher than it was without the VMware snapshot. Read response time is around 1 ms and write latency is 3 to 5 ms or more. This latency is being introduced in the ESXi storage stack (by the snapshot); the FlashArray has no idea about the additional latency and still reports as fine, but the throughput and IOPS go way up due to the additional work being done with the snapshot:
So the VM experiences quite the penalty when this snapshot is present.
So back to Rubrik. I ran a normal backup on this VM (running the same workload as above) and at a high level this is what it does:
Two things to note here:
The VMware snapshot was created at 7:23 AM, and the backup was complete at 6:28 PM, which is when it started to delete the snapshot.
The backup took ~11 hours.
The VMware snapshot was around for ~11 hours.
The VM workload was impacted (high latency) for ~11 hours. Not fun.
NOTE: This example is a full backup, so subsequent incremental backups will very likely take less time, but that time can still be significant if the VM has a high change rate between scheduled backups.
The VM snapshot delete took time too (87 minutes), and this is somewhat unavoidable. But the bulk of the time was the backup (665 minutes) and this is what the integration seeks to shorten, which has the biggest impact. That being said, the reduction of the snapshot lifetime should also in turn shorten the snapshot delete operation.
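The numbers above come straight from the timestamps in the job log. As a quick sanity check, here is a small sketch of the arithmetic; `minutes_between` is a hypothetical helper that assumes both clock times fall on the same day.

```python
from datetime import datetime


def minutes_between(start, end, fmt="%I:%M %p"):
    """Whole minutes from start to end, both clock times on the same day."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


snapshot_lifetime = minutes_between("7:23 AM", "6:28 PM")
delete_minutes = 87  # snapshot delete duration quoted above

print(snapshot_lifetime)                   # 665
print(snapshot_lifetime + delete_minutes)  # 752
```

So the VM spends 665 minutes with the snapshot in place, plus another 87 while it is consolidated.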
So what does Rubrik do?
The goal here is to minimize the impact from a performance perspective to the VM targeted for backup. The most obvious way to do this is to reduce the duration that the VMware snapshot exists.
So what Rubrik does is leverage the FlashArray snapshot technology to offload the backup process from the source production VM to one recovered from a snapshot. The process is like so:
The benefit here is that, unlike the original process where the VMware snapshot exists for 11 hours, it now exists for only seconds, which really reduces how long the VM is impacted.
Furthermore, it is important to note that the FlashArray snapshots are:
Setup is pretty easy. Go into your Rubrik interface and first add your FlashArray. You just need its IP/FQDN and a username/password.
Then add your array:
Next, find your VM in the interface (mine is called Windows2012R2):
Then enable array integration for that VM:
If you do not see the “Enable Array Integration” option, make sure you added the right FlashArray. Also, find your vCenter in the Rubrik interface and choose “refresh”. That did the trick for me.
And that is it! From now on, any on-demand or scheduled backups of that VM will use the array to offload the process. Very simple!
So I will kick off an on-demand full backup (no backups previously exist). The process timing looks like so:
A couple of things to note here. First off, the overall backup time is pretty much the same. This is expected. What is different is how long the VMware snapshot exists on the source VM. The VMware snapshot is created at 8:04 AM and the deletion starts less than a minute later. So the “existence” of the VMware snapshot is down from 665 minutes to less than 1.
The deletion itself completes 44 minutes later at 8:49. So the deletion time is reduced as well, from 87 minutes to 44.
So the total impact time to the VM being backed up is reduced from 752 minutes (12.5 hrs) to 45 minutes. Not bad at all!
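To put the before and after side by side, here is a minimal sketch using the durations quoted above. `total_impact` is a hypothetical helper, and the 1 minute used for the array-integrated snapshot lifetime is an upper bound, since the actual figure was under a minute.

```python
def total_impact(snapshot_lifetime_min, delete_min):
    """Minutes the VM is affected: snapshot lifetime plus delete time."""
    return snapshot_lifetime_min + delete_min


before = total_impact(665, 87)  # standard backup
after = total_impact(1, 44)     # with array integration (lifetime rounded up)

reduction = 100 * (before - after) / before
print(before, after, round(reduction, 1))  # 752 45 94.0
```

That works out to roughly a 94% reduction in the time the VM spends running on, or consolidating, a VMware snapshot.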
For a quick overview, check out this video I did with my good friend Nitin Nagpal (Director of Product Management at Rubrik):