When a FlashArray™ or FlashBlade® device ships for the first time, it’s a proud and exciting moment for all of us in engineering. There are celebrations and fanfare, handshakes and shouts. And then, with (metaphorical) tears in our eyes, we bid the product a fond farewell, wishing it well like a dear friend about to head off on the journey of a lifetime.
Of course, this customer shipment is far from the end of that box’s story. Each FlashArray or FlashBlade has its own adventure ahead of it, full of possible dangers and difficulties: heavy workloads, outsider attacks, environmental extremes, and other unforeseen challenges. As the designers who first prepared these products for the world outside, we hope that every array will have a quiet and fulfilling life. However, we know that isn’t always the case. So, even before our systems make their way to the customer, we work together to ensure that they get all of the strict training, testing, and preparation that would befit the most exciting of origin stories.
Years of work have gone into getting that daring young box ready for the data center floor. We’ve gone through numerous conversations on architecture and planning, several design iterations, and untold cycles of stringent stress testing. Throughout this training, we make improvements and adjustments to ensure our young hero is ready for any threat it might face. Our customers can rest assured knowing that every FlashArray or FlashBlade is as well-prepared as possible for its responsibilities in the data center. But once our protagonist is out in the world, we also have an extra trick up our sleeves. It’s kind of a fairy godmother-esque power to make sure that no box is completely alone on its adventures.
That trick? Data analytics.
We talk so much about data here at Pure Storage that we’d be remiss if we didn’t use it ourselves. Every 30 seconds, FlashArray and FlashBlade devices in the field phone home information about performance metrics, age, health, and errors. The data includes anything that indicates how well the devices are doing their jobs and how close they could be to any kind of failure. Additional logs from these customer systems come back to us every hour. Dashboards consume all of this logged data to capture larger trends, flag outliers or errors, and provide insight into how products age and change over time.
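To make the idea of flagging outliers a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of the actual dashboards: the array IDs, the latency metric, the `flag_outliers` name, and the 3.5 threshold are all hypothetical. It compares each device's reading to the fleet median using the median absolute deviation (MAD), a common robust statistic that a single misbehaving device cannot easily skew.

```python
from statistics import median


def flag_outliers(readings, threshold=3.5):
    """Return the IDs whose reading sits far from the fleet median.

    readings: dict mapping a (hypothetical) array ID to one metric value.
    threshold: cutoff on the modified z-score; 3.5 is a common rule of
    thumb, chosen here purely for illustration.
    """
    values = list(readings.values())
    center = median(values)
    # Median absolute deviation: a spread estimate that resists outliers.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No measurable spread, so the score cannot be scaled.
        # This sketch simply reports nothing in that case.
        return []
    return sorted(
        array_id
        for array_id, value in readings.items()
        # 0.6745 rescales MAD so the score is comparable to a z-score.
        if 0.6745 * abs(value - center) / mad > threshold
    )


# Hypothetical latency readings (milliseconds) from five arrays.
fleet = {
    "array-a": 0.50,
    "array-b": 0.52,
    "array-c": 0.49,
    "array-d": 0.51,
    "array-e": 2.40,
}
print(flag_outliers(fleet))
```

A median-based score is used here instead of a mean and standard deviation because one badly behaving device would otherwise inflate the very spread it is being measured against.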
By flagging outliers, we can remotely check for, and sometimes even anticipate, failures in customer systems, which lets us act on them as soon as possible. An array may physically be on a data center floor, but we’re still actively monitoring it back here at Pure. Just last month, one of our dashboards caught a component with a potential quality issue. We notified the customer before they even noticed something was amiss. Some heroes must tiptoe blindly through a cave to avoid waking the sleeping dragon. Our hero can stride confidently. If danger is lurking nearby, we’ll be the first to send a warning and pull it away.
We also use this data to improve our future designs and internal processes. While we’re following seasoned veterans in the field, we’re also training the next generation of young protagonists. Our dashboards provide invaluable information for refining projects still in development. For example, monitoring the performance of each system component helps us identify the specific parts that are limiting the full assembly. We can add resources in these critical areas and address all the main bottlenecks.
Dashboard data also helps us identify the root causes of issues. It lets us better differentiate bugs that come from materials, like components or NAND, from bugs rooted in code. We can also differentiate between random failures, which can happen at any time, and wear-out failures, which normally happen at a product’s end of life. Once we understand these problems, we can make more informed decisions about how to respond: for example, using a specific code patch instead of replacement hardware. And in the spirit of constant improvement, we can also integrate smarter sensing into our software and internal quality process to better catch similar issues in the future.
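One textbook way to tell random failures from wear-out failures is to fit a Weibull distribution to the ages at which units failed and look at its shape parameter β. A β near 1 suggests failures that arrive at a roughly constant rate (random), a β well above 1 suggests a failure rate that climbs with age (wear-out), and a β below 1 suggests early-life failures. The Python sketch below estimates β with median-rank regression. It is an assumption-laden illustration, not the method used on the dashboards: the function name and the sample ages are hypothetical, and it ignores units that are still running (censored data), which a real analysis would need to handle.

```python
import math


def weibull_shape(failure_ages):
    """Estimate the Weibull shape parameter from observed failure ages.

    failure_ages: positive ages (e.g. hours in service) at which units
    failed. Requires at least two distinct values.
    """
    ages = sorted(failure_ages)
    n = len(ages)
    if n < 2 or ages[0] <= 0 or ages[0] == ages[-1]:
        raise ValueError("need at least two distinct positive failure ages")

    xs, ys = [], []
    for rank, age in enumerate(ages, start=1):
        # Bernard's approximation of the median rank for this failure.
        f = (rank - 0.3) / (n + 0.4)
        # Linearized Weibull CDF: ln(-ln(1 - F)) = beta*ln(t) - beta*ln(eta)
        xs.append(math.log(age))
        ys.append(math.log(-math.log(1.0 - f)))

    # Ordinary least-squares slope of y on x is the shape estimate.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical data: failures clustered late in life vs. spread widely.
clustered = [900, 950, 1000, 1050, 1100]
scattered = [1, 10, 100, 1000, 10000]
print(weibull_shape(clustered) > 1)  # failure rate rising with age
print(weibull_shape(scattered) < 1)  # failure rate falling with age
```

The regression approach is chosen here only because it fits in a few lines of standard-library code. Maximum-likelihood fitting is the more common choice once censored units are in the picture.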
Perhaps one of the best examples of our data at work came up recently with our new QLC drive in FlashArray//C. When a customer’s array phoned home unusual behavior, we used our dashboards to confirm the abnormality and hypothesize a possible cause before we got to work on a fix. Additional dashboards monitoring our internal systems helped us evaluate our initial hypothesis (that there was a dependence on temperature) while we performed controlled testing. After analyzing our test data, we found that temperature didn’t play as much of a part as we suspected, but we did see that a firmware change effectively and consistently mitigated the issue when we reproduced it in our own testbeds. Once we shipped the new code, the dashboards helped confirm that we’d resolved the issue, and that no other customers with this new firmware had seen it either. Another successful intervention from the fairy godmother: complete.
In my time at Pure so far, I’ve had the chance to prepare and watch over more than a few young heroes that are now out in the field. At times, I’ve felt as though I’m on my own adventure, just like those very systems I build. My dangers revolve around difficult design decisions, stubborn hardware bugs, and other such obstacles. And also like our boxes out in the field, I’m far from alone. I work on a team of seasoned designers who have seen many products off on their quests, and new engineers whose journeys have just begun. As we’ve all trained and improved new products, we’ve also helped and mentored each other. Just like our heroic little arrays, we’re ready to take on any challenge that we might encounter.
So, before our customers close their books, they can breathe a satisfied sigh, knowing that we’re all back here, working together, so that every data center can have its happily ever after.