This blog post was co-authored by David Gonzales, CEO of Ziff.ai
Building a data pipeline for AI projects is hard, but it gets even harder when you move toward the later stages of an AI project that include things like Continuous Training and Augmented Learning.
This post describes some common ways that application-level and infrastructure-level complexities can balloon as an AI project matures. Don’t take the decision to “implement AI” lightly because, for now, AI solutions aren’t plug-and-play. If you’re starting to build an AI pipeline today, it’s good to at least be aware of what the more advanced AI stages look like so you don’t build roadblocks into your own path.
At Pure, we’ve helped our customers successfully deploy AI solutions from beginning to end. There are some common ‘gotchas’ that you should think about early on.
From first AI consideration to cutting-edge augmented learning, look for ways to improve data management efficiency. Application-level and infrastructure-level complexity can create pain that will only be magnified the deeper you get into AI.
1: AI consideration
You have decided to look into an AI strategy: You brainstorm possible ways AI could add business value to your company, identify the data set to explore and metrics to solve for, and then evaluate whether a simple heuristic would suffice instead of machine learning. It’s common to explore AI tools using open source software applications (e.g. TensorFlow) and open source data sets (e.g. ImageNet).
Common pain points:
- Training data availability – you may not already collect data for the metric you’d like to build a model around.
- AI hype can make it hard to evaluate whether a simpler solution would suffice. Machine learning is a significant investment, so let yourself off the hook if there’s a viable alternative.
2: First ML production deployment
Today, you may be labeling training data with human inferences or acquiring pre-labeled data. You’ll need to examine the training dataset to verify its consistency and check for outliers, empty values, and erroneous values.
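As a rough illustration of that examination step, here is a minimal sketch in plain Python. It assumes the training data is a list of dictionaries and that the field being checked should be numeric. The function name `audit_field`, the record layout, and the outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) are all assumptions made for this example, not part of any particular toolchain.

```python
import math
import statistics


def audit_field(records, field, cutoff=3.5):
    """Return the indices of empty, erroneous, and outlier values for one field.

    Assumption: the field is expected to hold a finite number.
    """
    empty, erroneous, numeric = [], [], {}
    for index, record in enumerate(records):
        raw = record.get(field)
        if raw is None or raw == "":
            empty.append(index)
            continue
        try:
            value = float(raw)
        except (TypeError, ValueError):
            erroneous.append(index)
            continue
        if math.isnan(value) or math.isinf(value):
            erroneous.append(index)
        else:
            numeric[index] = value

    outliers = []
    if numeric:
        median = statistics.median(numeric.values())
        # Median absolute deviation: a spread estimate that a few extreme
        # values cannot inflate the way they inflate a standard deviation.
        mad = statistics.median(abs(v - median) for v in numeric.values())
        if mad > 0:
            outliers = [
                i for i, v in numeric.items()
                if 0.6745 * abs(v - median) / mad > cutoff
            ]
    return {"empty": empty, "erroneous": erroneous, "outliers": outliers}
```

Calling `audit_field(records, "age")` returns three lists of record indices that someone can review before training starts. Real image or text data sets need domain-specific checks on top of this.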
Common pain points during data preparation:
- Manual data labeling is labor intensive and error prone – it can take up to thousands of labelers manually assessing and documenting the content of each piece of training data.
- Training data sets must be representative in order to produce accurate results. If every “sad” example in your training set happens to show a green t-shirt, the model may learn to identify green t-shirts instead of sadness.
- Data management and provenance can be time consuming and inefficient – tens of copies of a training data set may be needed across various data formats and data science projects.
Use your favorite training application (e.g. Caffe2, TensorFlow, PyTorch) to train a neural network. You can validate the neural network against a pre-reserved dataset and iterate on the model to increase accuracy, working toward production readiness. Once you reach an acceptable accuracy level with a model, you’re ready to start running inference. Deploy the neural network into a stream of new data to analyze on the fly.
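The validate-and-iterate loop described above can be sketched without committing to any one framework. The snippet below is a minimal sketch that assumes labeled examples are `(features, label)` pairs and that the trained model is exposed as a `predict` callable. The function names and the 0.95 accuracy target are placeholders for this example, not recommendations.

```python
import random


def holdout_split(examples, holdout_fraction=0.2, seed=0):
    """Shuffle once and reserve a fraction of examples for validation only."""
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(predict, labeled_examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    if not labeled_examples:
        raise ValueError("validation set is empty")
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)


def ready_for_inference(predict, validation_set, target_accuracy=0.95):
    """Gate deployment on accuracy measured against the reserved data."""
    return accuracy(predict, validation_set) >= target_accuracy
```

The important design choice is that the validation set is carved out before any training happens and is never fed to the training job, so the accuracy number reflects data the model has not seen.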
Common pain points during training:
- Training iterations are often slowed by the complexity of hyperparameter tuning, slow storage performance, and repetitive data movement. If you have to stage data and move it between silos of infrastructure, you are doing it wrong. To minimize data management effort, keep the data on a single, scalable storage platform.
- Debugging neural networks requires extensive adjustment of the training software’s tuning knobs and training data itself, often taking several months’ worth of iterations to reach production-level accuracy.
Keep an AI pipeline contained to a centralized storage hub to decrease time-to-first-pipeline.
3: Scope expansion
AI projects usually expand to involve more than one model, which increases the completeness of the information you can gather during inference. For example, you might build a neural network for each factor of facial recognition: age + gender + emotion. AI teams should ensure that their infrastructure can support multiple data scientists or data science teams making use of the same training data concurrently.
The pain points are the same as in Phase 2 but are amplified by the number of models being worked on.
4: Continuous training
During inference, you can detect anomalies and feed them back into your pipeline to re-train the model, an approach sometimes called “active learning.” You may also decide to adjust pipelines to take advantage of improvements offered by the ongoing explosion of network designs (e.g. convolutional networks, GANs) or data sources (e.g. synthetic training data generation). Successful teams often apply DevOps-like tactics to deploy models on the fly and maintain a continuous feedback loop.
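One simple way to decide which inference results to feed back is to pick the predictions the model was least sure about and queue them for labeling. The sketch below assumes each inference result is an `(item_id, predicted_label, confidence)` tuple with confidence between 0 and 1. The function name, the 0.6 confidence floor, and the labeling budget are hypothetical values chosen for illustration.

```python
def select_for_relabeling(predictions, confidence_floor=0.6, budget=100):
    """Pick the least-confident predictions to send back for labeling.

    predictions: iterable of (item_id, predicted_label, confidence) tuples.
    Returns at most `budget` item ids, least confident first.
    """
    uncertain = [p for p in predictions if p[2] < confidence_floor]
    uncertain.sort(key=lambda p: p[2])
    return [item_id for item_id, _label, _confidence in uncertain[:budget]]
```

The selected items would then be labeled and appended to the training set for the next training run. Capping the batch with `budget` keeps the labeling workload predictable.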
Common pain points:
- Inflexible storage or network infrastructure can limit AI teams when it is unable to keep up with the evolving performance demands of pipeline changes.
- Model performance monitoring. Models can drift as the data flowing through them changes. Spot checks or, ideally, automated ground-truth performance checks can catch costly or annoying model drift early.
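An automated ground-truth check can be as small as a rolling accuracy window compared against the accuracy measured at deployment time. The following is a minimal sketch under that assumption. The class name `DriftMonitor`, the window size, and the tolerance are illustrative choices, and it presumes that ground-truth labels eventually become available for at least a sample of production predictions.

```python
from collections import deque


class DriftMonitor:
    """Flag drift when rolling accuracy falls below a baseline minus a tolerance."""

    def __init__(self, baseline_accuracy, window=500, tolerance=0.05, min_samples=50):
        self.baseline_accuracy = baseline_accuracy
        self.tolerance = tolerance
        self.min_samples = min_samples
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, ground_truth):
        """Store whether one production prediction matched its ground truth."""
        self.outcomes.append(predicted == ground_truth)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        """True once enough samples exist and accuracy has dropped too far."""
        if len(self.outcomes) < self.min_samples:
            return False
        return self.rolling_accuracy() < self.baseline_accuracy - self.tolerance
```

A monitoring job would call `record` as labels arrive and alert, or trigger retraining, when `drifted()` returns true.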
5: Augmented Learning
Neural networks move from working in effective silos to being integral to each other’s development. There are numerous ways to leverage existing neural networks. For example, you can jumpstart new networks via transfer learning, either by substituting training data or by applying existing models to adjacent problem sets.
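To make the transfer learning idea concrete, here is a toy sketch of one common variant: keep an existing model’s learned features frozen and train only a small new “head” for the adjacent problem. Everything here is a stand-in. `pretrained_features` is a hypothetical placeholder for the layers of a real trained network, and the head is a plain logistic regression fitted with stochastic gradient descent. In practice you would do this inside your training framework.

```python
import math


def pretrained_features(x):
    """Hypothetical stand-in for a frozen, already-trained feature extractor."""
    return [x[0] + x[1], x[0] - x[1]]


def train_head(frozen_features, examples, epochs=200, learning_rate=0.5):
    """Fit a logistic-regression head on frozen features for a new binary task.

    examples: list of (input, label) pairs with label 0 or 1.
    Only the head's weights are updated; the extractor is never modified.
    """
    featurized = [(frozen_features(x), y) for x, y in examples]
    weights = [0.0] * len(featurized[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in featurized:
            z = sum(w * f for w, f in zip(weights, features)) + bias
            z = max(-60.0, min(60.0, z))  # keep math.exp in a safe range
            error = 1.0 / (1.0 + math.exp(-z)) - label
            weights = [w - learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias -= learning_rate * error

    def predict(x):
        features = frozen_features(x)
        z = sum(w * f for w, f in zip(weights, features)) + bias
        return 1 if z >= 0 else 0

    return predict
```

Because only the head is trained, the new task needs far less labeled data and compute than training a network from scratch, which is the “jumpstart” described above.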
The pain at each stage of human interaction with data is multiplied exponentially.
Inefficiencies and pain points from earlier phases of AI development are compounded and cascaded from the development of one neural network to each downstream project.
AI teams can move through development phases faster if armed with fast infrastructure.
Today, teams frequently have infrastructure silos for each stage of their AI pipeline, which is less flexible and more time-consuming than having a single, centralized storage hub connecting the entire pipeline.
Instead of running stages of pipelines across multiple storage locations, give valuable time back to your data science team by eliminating the waiting, complexity, and risk of copying, managing, and deleting multiple copies of your data. Direct-attached storage simply won’t scale for AI. At Pure, we built the ultimate storage hub for AI, FlashBlade™, engineered to accelerate every stage of the data pipeline.
We come across many companies that wonder if AI is real for their business. It is. Whether AI is central to your company’s core competency (you’re an analytics company) or not (you’re an insurance company), AI is a tool you should be using to bring efficiency and accuracy to your data-heavy projects. So, regardless of how far up the pyramid you plan to take your AI strategy, it’s critical to have infrastructure that supports both massive data ingest and rapid analytics evolution.
We’ve done it. We’ve helped people succeed, and we’re here to help with any questions you have about GPU+storage architectures, software toolchains, or how all the puzzle pieces can come together. Reach out to us if you want to learn how to accelerate the time to get insight from your data.