This blog post was co-authored by David Gonzales, CEO of Ziff.ai
Building a data pipeline for AI projects is hard, but it gets even harder in the later stages of an AI project, which include practices like continuous training and augmented learning.
This post describes some common ways that application-level and infrastructure-level complexity can balloon as an AI project matures. Don’t take the decision to “implement AI” lightly because, for now, AI solutions aren’t plug-and-play. If you’re starting to build an AI pipeline today, it’s worth knowing what the more advanced AI stages look like so you don’t build in roadblocks you’ll hit later.
At Pure, we’ve helped our customers successfully deploy AI solutions from beginning to end. There are some common ‘gotchas’ that you should think about early on.
From first AI consideration to cutting-edge augmented learning, look for ways to improve data management efficiencies. Application- and infrastructure-level complexity can create pain that will only be magnified the deeper you get into AI.
1: AI consideration
You have decided to look into an AI strategy: You brainstorm possible ways AI could add business value to your company, identify the data set to explore and metrics to solve for, and then evaluate whether a simple heuristic would suffice instead of machine learning. It’s common to explore AI tools using open source software applications (e.g. TensorFlow) and open source data sets (e.g. ImageNet).
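That heuristic-versus-ML check can be as simple as measuring how well a hand-written rule performs on a labeled sample. A minimal sketch, where the sample data and the threshold rule are purely illustrative:

```python
# Compare a hand-written heuristic against labeled examples before
# deciding whether machine learning is warranted. The (value, label)
# pairs and the threshold rule are illustrative placeholders.
samples = [(0.2, 0), (0.9, 1), (0.6, 0), (0.8, 1), (0.7, 1), (0.1, 0)]

def heuristic(value, threshold=0.5):
    """A simple rule: flag anything above the threshold."""
    return 1 if value > threshold else 0

correct = sum(1 for v, label in samples if heuristic(v) == label)
accuracy = correct / len(samples)
print(f"heuristic accuracy: {accuracy:.2f}")
```

If the rule’s accuracy already meets the business metric you identified, you may not need a neural network at all.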
Common pain points:
2: First ML production deployment
Today, you may be labeling training data with human inferences or acquiring pre-labeled data. You’ll need to examine the training dataset to verify its consistency and to evaluate outliers and empty or erroneous values.
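That sanity pass can start as a simple scan over the records. A rough sketch in plain Python; the field names, labels, and records here are hypothetical:

```python
# Scan a labeled dataset for empty and erroneous values before training.
# The records and the set of valid labels are illustrative placeholders.
records = [
    {"image": "img_001.jpg", "label": "cat"},
    {"image": "img_002.jpg", "label": ""},     # empty label
    {"image": "img_003.jpg", "label": "dgo"},  # likely typo: erroneous value
    {"image": "img_004.jpg", "label": "dog"},
]
valid_labels = {"cat", "dog"}

empty = [r for r in records if not r["label"]]
erroneous = [r for r in records
             if r["label"] and r["label"] not in valid_labels]

print(f"{len(empty)} empty, {len(erroneous)} erroneous "
      f"out of {len(records)} records")
```

In practice this grows into schema validation and statistical outlier checks, but the principle is the same: catch bad records before they poison training.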
Common pain points during data preparation:
Use your favorite training framework (e.g. Caffe2, TensorFlow, PyTorch) to train a neural network. Validate the network against a held-out dataset and iterate on the model to increase accuracy, working toward production readiness. Once you reach an acceptable accuracy level with a model, you’re ready to start running inference: deploy the neural network against a stream of new data to analyze on the fly.
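The train-then-validate pattern is the same regardless of framework. A framework-free sketch using NumPy logistic regression on synthetic data (all numbers here are illustrative, standing in for a real model and dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, linearly separable data standing in for real training data.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Reserve a validation split before training, as described above.
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

w = np.zeros(2)
for _ in range(500):                    # gradient-descent training loop
    p = 1 / (1 + np.exp(-X_train @ w))  # sigmoid predictions
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

# Validate against the held-out split; iterate on the model until
# accuracy is acceptable, then move on to inference.
val_acc = ((1 / (1 + np.exp(-X_val @ w)) > 0.5) == y_val).mean()
print(f"validation accuracy: {val_acc:.2f}")
```

A real pipeline swaps the toy model for a neural network, but the loop — train, score on held-out data, iterate — is what gets repeated at scale.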
Common pain points during training:
Keep an AI pipeline contained to a centralized storage hub to decrease time-to-first-pipeline.
3: Scope expansion
AI projects usually expand to involve more than one model, which increases the completeness of the information you can gather during inference. For example, you might build a neural network for each factor of facial recognition: age + gender + emotion. AI teams should ensure that their infrastructure can support multiple data scientists or data science teams making use of the same training data concurrently.
The pain points are the same as in Phase 2 but are amplified with the number of models being worked on.
4: Continuous training
During inference, you can detect anomalies and feed them back into your pipeline to re-train the model, a practice sometimes called “active learning.” You may also decide to adjust pipelines to utilize improvements offered by the ongoing explosion of network design (e.g. convolutional networks, GANs) or data sources (e.g. synthetic training data generation). Successful teams often apply DevOps-like tactics to deploy models on the fly and maintain a continuous feedback loop.
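One way to picture that feedback loop: inference flags low-confidence predictions, which are queued for labeling and folded into the next training run. A hypothetical sketch — the stub `predict` function, the sample stream, and the confidence threshold are all assumptions for illustration:

```python
# Sketch of an active-learning feedback loop: low-confidence inferences
# are routed back to a labeling queue for the next retraining cycle.
# predict() is a stand-in for a real model; samples are illustrative.

def predict(sample):
    """Stub model returning (label, confidence in [0, 1])."""
    label = "anomaly" if sample > 0.8 else "normal"
    confidence = abs(sample - 0.5) * 2
    return label, confidence

stream = [0.1, 0.55, 0.95, 0.48, 0.82]
labeling_queue = []

for sample in stream:
    label, confidence = predict(sample)
    if confidence < 0.3:  # uncertain prediction: send back for labeling
        labeling_queue.append(sample)

print(f"{len(labeling_queue)} samples queued for re-labeling")
```

The DevOps-like part is automating this loop end to end, so newly labeled samples trigger retraining and redeployment without manual copying between stages.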
Common pain points:
5: Augmented learning
Neural networks move from working in effective silos to being integral to each other’s development. There are numerous ways to leverage existing neural networks. For example, you can jumpstart new networks via transfer learning, by substituting training data or applying existing models to adjacent problem sets.
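The transfer-learning idea can be sketched as: keep a trained feature extractor frozen and fit only a small new head on the adjacent problem. A toy NumPy version, where the random projection stands in for a genuinely pre-trained network:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend this projection is a pre-trained network's frozen feature
# extractor; in practice it would have been learned on the original task.
W_frozen = rng.normal(size=(4, 16))

def features(x):
    return np.maximum(x @ W_frozen, 0)  # frozen ReLU features

# The adjacent task's (small) labeled dataset -- synthetic here.
X_new = rng.normal(size=(100, 4))
y_new = (X_new[:, 0] > 0).astype(float)

# Train only a lightweight logistic head on top of the frozen features.
F = features(X_new)
w_head = np.zeros(16)
for _ in range(300):
    p = 1 / (1 + np.exp(-F @ w_head))
    w_head -= 0.1 * F.T @ (p - y_new) / len(y_new)

acc = ((1 / (1 + np.exp(-F @ w_head)) > 0.5) == y_new).mean()
print(f"head-only training accuracy: {acc:.2f}")
```

Because only the small head is trained, the new task needs far less labeled data and compute than training from scratch — which is exactly why downstream projects inherit both the strengths and the inefficiencies of the networks they build on.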
The pain at each point of human interaction with the data is multiplied, and inefficiencies from earlier phases of AI development compound, cascading from the development of one neural network to every downstream project.
AI teams can move through development phases faster if armed with fast infrastructure.
Today, teams frequently have infrastructure silos for each stage of their AI pipeline, which is less flexible and more time-consuming than having a single, centralized storage hub connecting the entire pipeline.
Instead of running stages of pipelines across multiple storage locations, give valuable time back to your data science team by eliminating the waiting, complexity, and risk of copying, managing, and deleting multiple copies of your data. Direct-attached storage simply won’t scale for AI. At Pure, we built the ultimate storage hub for AI, FlashBlade™, engineered to accelerate every stage of the data pipeline.
We come across many companies who wonder if AI is real for their business. It is. No matter whether AI is central to your company’s core competency (you’re an analytics company) or not (you’re an insurance company), AI is a tool you should be using to bring efficiency and accuracy to your data-heavy projects. So, regardless of how far up the pyramid you plan to take your AI strategy, it’s critical to have infrastructure that supports both massive data ingest and rapid analytics evolution.
We’ve done it. We’ve helped people succeed, and we’re here to help with any questions you have about GPU+storage architectures, software toolchains, or how all the puzzle pieces come together. Reach out to us if you want to learn how to accelerate time-to-insight from your data.