Last week was a great one for Pure Storage. For the first time, we attended the AI Summit at the legendary Regency Center in San Francisco. The conference served as a platform to show the world what FlashBlade™ can do for AI, and what our customers are doing with FlashBlade. We shared customer stories, which featured marquee innovators like autonomous automaker Zenuity, and even got to celebrate when we took home the award for Best Innovation in AI Hardware at the AI Summit. But most importantly, we learned – a lot – from some of the best and brightest in the industry. Here are eight observations from the event and the AI market as a whole.
Given the event location I wasn’t shocked to see so many large Silicon Valley-based tech companies, but that was augmented by surprisingly strong attendance from the Global 2000. During the two-day event, I spoke with serious attendees from major retailers, telcos, auto manufacturers, health care providers, drug and medical device firms, airlines and insurers. And next year’s event is expected to be twice the size.
Statistical regression, correlation, scoring and traditional forms of analytics can be classified as AI, but they are distinctly different from the super-trend of using deep learning neural networks. There are differences in the software tools, the infrastructure and the skill sets of the data scientists applying them. They will both exist in parallel, but the biggest, most radical innovations will come from neural networks. The time is now because the right tech “ingredients” are all coming together simultaneously. On the SW side, it’s new frameworks like TensorFlow, Caffe2, PyTorch, and MXNet. On the infrastructure side, it’s a combination of GPU-driven compute, faster networks and flash storage.
A lot of the initial hype around AI focuses on the training of neural networks. As humans, we are both fascinated and sometimes scared by the idea of a computer with the ability to “learn.” If you are trying to build a successful AI practice it’s important to understand that most of the hard work is done well before the training phase. Most of the iceberg is below the waterline.
The collection, extraction and transformation of data are the first stages in this new data pipeline. Thankfully there is a lot of skill and myriad tools from decades of data warehousing, but as any data architect knows, it’s not easy to do well. At Pure Storage, we’ve been involved in a number of deep learning projects, and we constantly hear about challenges around data tagging and debugging the initial model. Handling these two stages effectively will be the biggest challenge, and a competitive advantage for those that can do it well. Once all this is complete, a great deal of GPU, network and storage horsepower is still needed to complete the training itself, but a lot of the hard work is over. As a storage company, we also know that fast shared storage will be of great value in moving data effectively through this new data pipeline. Manually copying large amounts of data to local SSDs for each stage doesn’t scale and will be an inhibitor to rapid progress.
The data pipeline concept is important, but the age-old computer adage of “garbage in, garbage out” still applies. I saw a pithy quote from a researcher at one of the big Silicon Valley internet giants that said: “We don’t have better algorithms, we just have more data”. I think what they meant to say was, “We don’t have better algorithms, we just have more high-quality data.” Feeding poor data into a neural network isn’t going to get people where they want to go, and this is another reason to really map out data sources and spend due effort on the collection, extraction, and transformation of data. I like bacon more than oil, so I’ll say good Data is the new Bacon.
Much of the effort in the new data pipeline is going to be undertaken by programmers and data scientists. All the standard best practices for SW development apply. Version the schema you use to extract and transform data. It will change over time. Keep a copy of the raw data to go back to and re-process with a new schema. Save processed data again so you can go back and tag iteratively to attempt to gain even more value from it. Check all the models into Git or a source code control system. If you work with great SW developers and ensure they have the processes to make them efficient, good results are more likely.
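As a rough illustration of the versioning advice above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `SCHEMA_VERSION` label, the `transform` and `process` helpers, the JSON record layout and the directory names are all hypothetical, not part of any Pure Storage tooling. It only shows the general idea: raw data is read but never modified, and processed output lands in a folder named after the schema version, so a new schema can be applied to the same raw data later.

```python
import json
import tempfile
from pathlib import Path

SCHEMA_VERSION = "v2"  # hypothetical version label; bump it on every schema change


def transform(record: dict) -> dict:
    """Hypothetical transform: normalize the label and stamp the schema version."""
    return {
        "schema_version": SCHEMA_VERSION,
        "image": record["image"],
        "label": record["label"].strip().lower(),
    }


def process(raw_dir: Path, out_root: Path) -> Path:
    """Transform every raw JSON file into a per-version output folder.

    Raw files are only read, never rewritten, so they can be re-processed
    later with a newer schema.
    """
    out_dir = out_root / SCHEMA_VERSION
    out_dir.mkdir(parents=True, exist_ok=True)
    for raw_file in sorted(raw_dir.glob("*.json")):
        record = json.loads(raw_file.read_text())
        (out_dir / raw_file.name).write_text(json.dumps(transform(record)))
    return out_dir


# Demo on a throwaway directory with one made-up record.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    raw.mkdir()
    (raw / "0001.json").write_text(json.dumps({"image": "cat.png", "label": " Cat "}))
    out = process(raw, Path(tmp) / "processed")
    result = json.loads((out / "0001.json").read_text())
    raw_after = json.loads((raw / "0001.json").read_text())
```

The transform code itself would live in Git alongside the models, so any processed data set can be traced back to the exact schema version that produced it.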
I was often asked by attendees if they should build their new AI system in a local “on-prem” data center or in the cloud. My response: it is much easier to move processing to the data than vice versa. Data has gravity. If you are recording HD video, every five cameras recording at 1080p 30fps can generate around a terabyte of data per day. Ten genetic sequencers collecting data can generate somewhere around 30 petabytes in a year. Moving large data sets is costly and complex, both in network bandwidth and in time. If all of your data is generated from a cloud app, run the AI there as well. If your data is generated or centralized in an existing data center, run your AI program there.
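To put rough numbers on data gravity, the following Python sketch works through the arithmetic. The data volumes (1 TB per day for five cameras, 30 PB per year for ten sequencers) come from the figures above. The 10 Gb/s link speed, the decimal units and the assumption of a fully utilized link are mine, chosen only for illustration.

```python
# Back-of-the-envelope data gravity arithmetic.
# Data volumes come from the text; the link speed is an assumption.

TB = 10**12  # decimal terabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60


def implied_mbps(bytes_per_day: float) -> float:
    """Average bitrate in megabits per second that yields this much data per day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def transfer_days(num_bytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to move num_bytes over a link of link_gbps.

    efficiency is the fraction of the link actually usable (1.0 = ideal).
    """
    seconds = num_bytes * 8 / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


# Five cameras producing about 1 TB per day in total.
per_camera_mbps = implied_mbps(1 * TB / 5)

# A year of output from ten sequencers, moved over an assumed 10 Gb/s link.
days_to_move = transfer_days(30 * PB, link_gbps=10)

print(f"Per-camera bitrate: {per_camera_mbps:.1f} Mb/s")
print(f"Days to move 30 PB at 10 Gb/s: {days_to_move:.0f}")
```

Under these assumptions, each camera averages about 18.5 Mb/s, and moving 30 PB over a dedicated, fully utilized 10 Gb/s link would take roughly 278 days. Real links rarely sustain full utilization, so the practical figure would be higher.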
To implement a successful deep learning pipeline you need people, software, and infrastructure. Among these three, the software layer is going to experience the highest rate of change. I’d expect new SW capabilities will roll in all the time, so you will likely need to add to or change your pipeline every 3, 6 or 9 months. You won’t change your people every 3, 6 or 9 months, and you certainly don’t want to rebuy your infrastructure or try to move to, from or between cloud providers because of the data gravity issue. Because of these differences in rates of change, it is exceedingly important to get people and infrastructure that are adaptable to change. As the saying often attributed to Charles Darwin goes, it isn’t the fastest or the strongest that survive, but the most adaptable.
I’ve attended about a bazillion trade shows over my career, and I can tell you the average IQ at the AI Summit was in the top 5 percent of all the events I’ve been to in the past 15 years. Hiring smart people is always a good plan, and given the opportunity and rate of change in AI it’s true here as well. Given the rapid rate of innovation in this space no one knows everything, so it’s also important to surround yourself with consultants and vendors that can be your trusted advisors. Cultivate a list of peers and colleagues to bounce ideas off of, and plan to attend next year’s AI Summit. Hope to see you there.
In the meantime, if you’re already deep into your AI project – or just thinking about what AI can do for your business – we’d love to discuss it with you. And whatever your ambitions, we believe FlashBlade should be the data platform you consider to help accelerate your AI and modern analytics strategy. We make it fast, and we make it easy.