AI projects have a tendency to go far beyond budget and deadlines. Application- and infrastructure-level complexities often balloon as an AI project matures, and AI solutions definitely aren’t at the plug-and-play level yet.
If you’re starting to build an AI pipeline today, it’s good to at least be aware of what the more advanced AI stages look like so you don’t build yourself into a corner later.
Application- and infrastructure-level complexity can create pain that will only be magnified the deeper you get into AI. From first AI consideration to cutting-edge augmented learning, you’ll need to look for ways to improve data management efficiency.
At Pure, we’ve helped our customers successfully deploy AI solutions from beginning to end. Here are some of the most common “gotchas” to watch out for early on and what to do about them.
1. Tool Choice as It Relates to Model Goals
You’ve decided to invest in an AI strategy: You brainstorm possible ways AI could add business value to your company, identify the data set to explore and problems to solve for, and then evaluate whether a simple heuristic would suffice instead of machine learning.
These days, serious efforts rarely start from scratch with open source toolkits, even though those toolkits are still in play. Newer versions of PyTorch and TensorFlow are still relevant, but they’re used to refine pretrained models with the customer’s data. Most teams will take models from NVIDIA NGC and then use NVIDIA’s TAO framework to retrain them and keep them up to date with the latest gathered data.
However, be mindful of project scope, and avoid going all in on the wrong tool just because others use it. Training data availability matters too: you may not be collecting the data needed for the goal you’d like to build a model around. What are the goals that you want your models to achieve? There are two related problems: (1) not collecting the right data needed to provide sufficient signal for the AI models to make the right decisions, and (2) collecting too much data that isn’t useful for the model and just ends up taking up resources.
Solution: Hype can make it hard to evaluate whether a simpler solution would suffice. Machine learning is a significant investment, so let yourself off the hook if there’s a viable alternative, and give every candidate tool and solution a fair evaluation.
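One way to give a simple heuristic a fair evaluation is to score it on labeled examples before committing to a model. The sketch below is only an illustration: the spam-style rule, the sample messages, and the 0.9 accuracy target are all hypothetical stand-ins for your own problem and success criteria.

```python
def heuristic_is_spam(message):
    """Hypothetical hand-written rule standing in for 'a simple heuristic'."""
    keywords = ("free money", "click here", "winner")
    text = message.lower()
    return any(keyword in text for keyword in keywords)


def accuracy(rule, labeled_examples):
    """Fraction of labeled examples the rule gets right."""
    correct = sum(1 for text, label in labeled_examples if rule(text) == label)
    return correct / len(labeled_examples)


# Illustrative labeled sample (assumed data, not from a real project).
labeled = [
    ("Click here to claim your prize", True),
    ("Meeting moved to 3pm", False),
    ("You are a WINNER, free money inside", True),
    ("Quarterly report attached", False),
    ("Limited offer just for you", True),
]

target_accuracy = 0.9  # assumed business requirement
score = accuracy(heuristic_is_spam, labeled)
heuristic_suffices = score >= target_accuracy

print(f"heuristic accuracy: {score:.2f}, suffices: {heuristic_suffices}")
```

If the heuristic clears the bar, the machine learning investment may not be needed yet. If it falls short, its score becomes the baseline any model has to beat.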
2. Data Preparation
Today, you may be labeling training data with human inferences or acquiring pre-labeled data. You’ll need to examine the training data set to verify its consistency and to evaluate outliers and any empty or erroneous values.
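As a rough illustration of that kind of check, the sketch below audits a single numeric column for empty entries, values that fail to parse, and outliers. The function name, the sample `ages` column, and the choice of a median-based (modified z-score) outlier test with a 3.5 cutoff are assumptions made for this example; real data sets will need checks suited to their own formats.

```python
from statistics import median


def audit_column(values, mad_threshold=3.5):
    """Return row indexes of empty, unparseable, and outlying entries."""
    empty, erroneous, numeric = [], [], []
    for index, value in enumerate(values):
        if value is None or (isinstance(value, str) and value.strip() == ""):
            empty.append(index)
            continue
        try:
            numeric.append((index, float(value)))
        except (TypeError, ValueError):
            erroneous.append(index)

    outliers = []
    if numeric:
        numbers = [number for _, number in numeric]
        center = median(numbers)
        # Median absolute deviation: a spread estimate that extreme values
        # cannot easily distort.
        mad = median(abs(number - center) for number in numbers)
        if mad > 0:
            outliers = [
                index
                for index, number in numeric
                if 0.6745 * abs(number - center) / mad > mad_threshold
            ]
    return {"empty": empty, "erroneous": erroneous, "outliers": outliers}


# Hypothetical "age" column with typical defects.
ages = [34, 29, "", 41, "n/a", 38, 420, None, 31]
report = audit_column(ages)
print(report)
```

Flagged rows are candidates for review rather than automatic deletion, since an apparent outlier can also be a legitimate rare case the model needs to see.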
Common pain points during data preparation:
- Manual data labeling is labor-intensive and error-prone—up to thousands of labelers manually assessing and documenting the content of each piece of training data.
- Training data sets must be representative to produce accurate results (what if, instead of identifying happiness, it’s identifying green T-shirts?).
- Data management and provenance can be time-consuming and inefficient—tens of copies of a training data set may be needed across various data formats and data science projects.
Solution: Use your favorite training application (e.g., Caffe2, TensorFlow, PyTorch) to train a neural network. You can validate the neural network against a pre-reserved data set and iterate on the model to increase accuracy, working toward production readiness. Once you reach an acceptable accuracy level with a model, you’re ready to start running inference. Deploy the neural network into a stream of new data to analyze on the fly.
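The validation step can be sketched without any particular framework. In the example below, the threshold "model", the toy data, and the 20% holdout fraction are assumptions chosen to keep the code self-contained; in practice the model would be a neural network trained in the framework of your choice.

```python
import random


def split_holdout(examples, holdout_fraction=0.2, seed=0):
    """Shuffle once, then reserve a fraction that training never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_threshold(train):
    """Stand-in 'training': midpoint between the two class means."""
    zeros = [features[0] for features, label in train if label == 0]
    ones = [features[0] for features, label in train if label == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(predict, labeled_examples):
    """Fraction of labeled examples the predictor gets right."""
    correct = sum(
        1 for features, label in labeled_examples if predict(features) == label
    )
    return correct / len(labeled_examples)


# Toy data set: one feature, label is 1 when the feature is at least 50.
data = [((x,), int(x >= 50)) for x in range(100)]
train, holdout = split_holdout(data)

threshold = fit_threshold(train)
holdout_accuracy = accuracy(lambda f: int(f[0] >= threshold), holdout)
print(f"threshold={threshold:.1f}, holdout accuracy={holdout_accuracy:.2f}")
```

The important design choice is that the holdout set is split off before any training happens and is reused unchanged across iterations, so accuracy numbers from different model versions stay comparable.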
3. Model Training
Common pain points during training:
- Training iterations are often slowed by the complexity of hyperparameter tuning, slow storage performance, and repetitive data movement. If you have to stage data and move it around between silos of infrastructure, you’re doing it wrong. To minimize data management efforts, keep the data on a high-performance, consolidated, and scalable storage platform.
- Debugging neural networks requires extensive adjustment of the training software’s tuning knobs and training data itself, often taking several months’ worth of iterations to reach production-level accuracy.
Solution: Keep an AI pipeline contained to a centralized storage platform to decrease your time to first pipeline.
4. Scope Creep
AI projects usually expand to involve more than one model, which increases the completeness of the information you can gather during inference.
For example, you might build a neural network for each factor of facial recognition: age + gender + emotion. AI teams should ensure that their infrastructure can support multiple data scientists or data science teams making use of the same training data concurrently.
The pain at each stage of human interaction with data multiplies with every model you add. Inefficiencies and pain points from earlier phases of AI development compound and cascade from the development of one neural network to each downstream project.
Solution: Consolidate data sets into a single storage platform with flexible expandability so that you can continue to scale performance even with an increasing number of concurrent jobs accessing data.
5. Continuous Training
During the inference stage, you can detect anomalies and feed them back into your pipeline to re-train the model, sometimes called “active learning.”
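A minimal sketch of that feedback step follows. It uses uncertainty sampling, one common way to choose which examples to send back for labeling, rather than a full anomaly detector. The item IDs, class probabilities, confidence floor, and labeling budget are all hypothetical.

```python
def select_for_relabeling(predictions, confidence_floor=0.6, budget=2):
    """Pick the least confident predictions, up to a labeling budget.

    predictions: list of (item_id, {label: probability}) pairs.
    """
    uncertain = []
    for item_id, probabilities in predictions:
        top_probability = max(probabilities.values())
        if top_probability < confidence_floor:
            uncertain.append((top_probability, item_id))
    uncertain.sort()  # least confident first
    return [item_id for _, item_id in uncertain[:budget]]


# Hypothetical inference output from an emotion classifier.
predictions = [
    ("img-001", {"happy": 0.97, "neutral": 0.03}),
    ("img-002", {"happy": 0.52, "neutral": 0.48}),
    ("img-003", {"happy": 0.41, "neutral": 0.59}),
    ("img-004", {"happy": 0.10, "neutral": 0.90}),
    ("img-005", {"happy": 0.55, "neutral": 0.45}),
]

relabel_queue = select_for_relabeling(predictions)
print(relabel_queue)  # ['img-002', 'img-005']
```

Items in the queue would go to human labelers and then into the next training run, which closes the loop between inference and training.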
You may also decide to adjust pipelines to utilize improvements offered by the ongoing explosion of network design (e.g., convolutional networks, GANs) or data sources (e.g., synthetic training data generation). Successful teams often apply DevOps-like tactics to deploy models on the fly and maintain a continuous feedback loop.
Common pain points for this part of the process include:
- Inflexible storage or network infrastructure unable to keep up with evolving performance demands of pipeline changes can limit AI teams.
- Model performance monitoring can be a challenge. Models can drift as the data flowing through them changes.
Solution: Spot checks or, ideally, automated ground-truth performance checks can avoid costly or annoying model drift.
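An automated check of this kind can be as simple as tracking accuracy over a rolling window of ground-truth spot checks and flagging a drop below the accuracy measured at deployment. The class name, window size, and tolerance below are assumptions for illustration, not a prescribed design.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when rolling accuracy falls below baseline minus tolerance."""

    def __init__(self, baseline_accuracy, window=50, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.results = deque(maxlen=window)  # oldest checks fall off automatically

    def record(self, predicted, actual):
        """Store whether one spot-checked prediction matched ground truth."""
        self.results.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def drifted(self):
        # Wait for a full window so a few early misses do not raise an alarm.
        if len(self.results) < self.results.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance


monitor = DriftMonitor(baseline_accuracy=0.9, window=10)

for _ in range(10):  # ten correct spot checks
    monitor.record("happy", "happy")
healthy = monitor.drifted()

for _ in range(4):  # incoming data shifts; four checks in a row are wrong
    monitor.record("happy", "neutral")
degraded = monitor.drifted()

print(healthy, degraded, monitor.rolling_accuracy())
```

A drift flag would typically trigger an alert or a retraining job, with the spot-checked examples themselves feeding the next round of training data.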
The Ultimate Solution: FlashBlade//S to Manage All of These AI Project Challenges at Once
The bottom line: AI teams can move through development phases faster if armed with fast, scalable infrastructure.
Today, teams frequently have infrastructure silos for each stage of their AI pipeline, which is less flexible and more time-consuming than having a single, centralized storage platform connecting the entire pipeline.
Instead of running stages of pipelines across multiple storage silos, give valuable time back to your data science team by eliminating the waiting, complexity, and risk of copying, managing, and deleting multiple copies of your data. Direct-attached storage simply won’t scale for AI. At Pure, we built the ultimate storage platform for AI, FlashBlade//S®, engineered to accelerate every stage of the data pipeline.
We come across many companies that wonder if AI is real for their business. It is. Whether or not AI is central to your company’s core competency, AI is a tool you should be using to bring efficiency and accuracy to your data-heavy projects. So, regardless of how far up the pyramid you plan to take your AI strategy, it’s critical to have infrastructure that supports massive data ingest, predictive analytics, and automated decision-making.
We’ve done it. We’ve helped people succeed, and we’re here to help with any questions you have about GPU and storage architectures, software toolchains, or how all the puzzle pieces can come together.
Read Why Storage Matters in this Business White Paper: FlashBlade//S Storage Built for AI.
Get to know AIRI//S. AI Ready Infrastructure from Pure Storage and NVIDIA.