From Bytes to AI: Why It’s All About the Data Lifecycle

Advances in deep neural networks have ignited a new wave of algorithms and tools for data scientists to tap into their data with artificial intelligence (AI). With improved algorithms, larger data sets, and frameworks such as TensorFlow, data scientists are tackling new use cases like autonomous vehicles and natural language processing.

Data is at the heart of modern deep learning algorithms. Before training can even begin, the hard problem is collecting the labeled data that is crucial for training an accurate AI model. A full-scale AI deployment must then continuously collect, clean, transform, label, and store ever larger amounts of data. Adding more high-quality data points translates directly into more accurate models and better insights.

The goal of this blog is to describe and make sense of all of the different ways that data engineers and data scientists ingest, process, and use data in a deep learning system, and then to focus on how a data architect can design the storage infrastructure to power a production AI pipeline.

Lifecycle of Data

Data samples undergo a series of processing steps:

  • Ingest the data from an external source into the training system. Each data point is often a file or object, and inference may already have been run on it. After the ingest step, the data is stored in raw form and is often also backed up in that raw form. Any associated labels (ground truth) may arrive with the data or in a separate ingest stream.
  • Clean and transform the data and save it in a format convenient for training, including linking each data sample to its associated label (a short sketch follows this list). This second copy of the data is not backed up because it can be recomputed if needed.
  • Explore parameters and models, quickly test with a smaller dataset, and iterate to converge on the most promising models to push into the production cluster.
  • Training selects random batches of input data, including both new and older samples, and feeds them to production GPU servers for the computation that updates model parameters.
  • Evaluation uses a holdout portion of the data that was not used in training to measure model accuracy.
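
To make the clean-and-transform step concrete, here is a minimal sketch that packs raw image files and their ground-truth labels into TFRecord files on a shared mount. The directory layout, label mapping, and record format are illustrative assumptions, not a prescribed pipeline.

```python
# Hedged sketch: transform raw samples plus labels into TFRecords on shared storage.
# The paths, the label mapping, and the folder-per-class layout are hypothetical.
import glob
import os
import tensorflow as tf

RAW_DIR = "/mnt/flashblade/raw/images"   # raw, ingested (and backed-up) data
OUT_DIR = "/mnt/flashblade/processed"    # training-ready copy; recomputable, not backed up
LABELS = {"cat": 0, "dog": 1}            # example ground-truth mapping

def make_example(image_bytes, label):
    # Link each data sample with its associated label in a single record.
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

os.makedirs(OUT_DIR, exist_ok=True)
with tf.io.TFRecordWriter(os.path.join(OUT_DIR, "train-0000.tfrecord")) as writer:
    for path in glob.glob(os.path.join(RAW_DIR, "*", "*.jpg")):
        label = LABELS[os.path.basename(os.path.dirname(path))]  # label from folder name
        with open(path, "rb") as f:
            writer.write(make_example(f.read(), label).SerializeToString())
```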

This lifecycle applies to any type of parallelized machine learning, not just neural networks or deep learning. For example, standard machine learning frameworks such as Spark MLlib rely on CPUs instead of GPUs, but the data ingest and training workflows are the same.

As described above, each stage in the AI data pipeline places different requirements on the underlying storage architecture. To innovate and improve AI algorithms, storage must deliver uncompromised performance for all manner of access patterns, from small to large files, from random to sequential access, from low to high concurrency, along with the ability to scale capacity and performance linearly and non-disruptively. For legacy storage systems, this is an impossible design point to meet, forcing data architects to introduce complexity that slows down the pace of development. FlashBlade is the ideal AI data hub because it was purpose-built from the ground up for modern, unstructured workloads.

In the first stage, data is ideally ingested and stored onto the same data hub so that subsequent stages do not require excess data copying. The next two steps can be done on a standard compute server that optionally includes a GPU, and then in the fourth stage, full production training jobs are run on powerful GPU-accelerated servers like the DGX-1. Often, a production pipeline runs alongside an experimental pipeline operating on the same dataset. Further, the DGX-1 GPUs can be used independently for different models or joined together to train on one larger model, even spanning multiple DGX-1 systems for distributed training.
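
As a hedged sketch of the training stage, the pipeline below reads the TFRecords written earlier directly from the shared mount, shuffles them into random batches, and trains across all of the GPUs in a single server with tf.distribute.MirroredStrategy. The model choice, image size, batch size, and paths are placeholders, not a recommended configuration.

```python
import tensorflow as tf

def parse(record):
    # Decode one serialized sample back into an (image, label) pair.
    feats = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.resize(tf.io.decode_jpeg(feats["image"], channels=3), [224, 224])
    return image / 255.0, feats["label"]

files = tf.data.Dataset.list_files("/mnt/flashblade/processed/*.tfrecord")
ds = (tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
      .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
      .shuffle(10_000)                 # random batches drawn over new and old samples
      .batch(64)
      .prefetch(tf.data.AUTOTUNE))

strategy = tf.distribute.MirroredStrategy()   # all local GPUs, e.g. one DGX-1
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None, classes=2)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
model.fit(ds, epochs=10)
```

Spanning multiple DGX-1 systems would swap MirroredStrategy for a multi-worker strategy, but the input pipeline against the shared data hub stays the same.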

A single shared storage data hub creates a coordination point throughout the lifecycle without the need for extra data copies among the ingest, preprocessing, and training stages. Rarely is the ingested data used for only one purpose, and shared storage gives the flexibility to interpret the data in different ways, train multiple models, or apply traditional analytics to the data.

If the shared storage tier is slow, then data must be copied to local storage for each phase, wasting time staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to data held in system RAM while also having the simplicity and performance for all pipeline stages to operate concurrently.

The Data Scientist Workflow

A data scientist works to improve the usefulness of the trained model through a wide variety of approaches: more data, better data, smarter training, and deeper models. In many cases, there will be teams of data scientists sharing the same datasets and working in parallel to produce new and improved training models.

The day-to-day workflow of data scientists and data engineers includes:

  1. Collating, cleaning, filtering, processing and transforming the training data into a form consumable by the model training.
  2. Experimenting with, testing and debugging a model on a small subset of the training data.
  3. Training the model with the full set of training data for longer periods of time.

This workflow is iterative, cycling between development, experimentation, and debugging. The key development tool is a deep-learning framework like TensorFlow, Caffe2, or CNTK. These frameworks provide utilities for processing data and building models that are optimized for execution on distributed GPU hardware.
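
For example, step 2 above might amount to nothing more than capping the shared dataset and training briefly before committing to a full run; a minimal sketch, assuming the `ds` pipeline and `model` from the earlier training example:

```python
# Hedged sketch: iterate quickly on a small slice of the shared dataset.
# Assumes the `ds` dataset and `model` defined in the earlier training sketch.
small_ds = ds.take(50)           # roughly 50 batches, enough for a sanity check
model.fit(small_ds, epochs=1)    # fast experiment / debug cycle

# Once the model behaves on the subset, the same code runs the full-scale job
# against the complete dataset without re-staging any data locally.
model.fit(ds, epochs=90)
```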

Often, there is a team of data scientists working within these phases concurrently on the same shared datasets. Multiple concurrent workloads of data processing, experimentation, and full-scale training place very different access-pattern demands on the storage tier. In other words, the storage cannot just satisfy large-file reads; it must contend with a mix of large and small file reads and writes.

Finally, with multiple data scientists exploring the datasets and models, it is critical to store data in its native format to provide flexibility for each user to transform, clean, and use the data in a unique way. Ultimately, it is the experimentation and iteration of this workflow that yield more powerful models.

FlashBlade provides a natural shared storage home for the dataset, with data protection redundancy (using RAID6) and the performance necessary to be a common access point for multiple developers and multiple experiments. Using FlashBlade avoids the need to carefully copy subsets of the data for local work, saving both engineering time and DGX-1 time. These copies become a growing tax as the raw dataset and the desired transformations continually change.

Scalable Datasets

A fundamental reason why deep learning has seen a surge in success is the continued improvement of models with larger data set sizes. In contrast, classical machine learning algorithms, like logistic regression, stop improving in accuracy at smaller data set sizes.

A recent quote from a leading AI researcher highlights the need:

“As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples.” – Ian Goodfellow, 2016

Recent research from Google has shown the advantages of increasing dataset size, demonstrating a logarithmic increase in performance on vision tasks as the dataset grows to 300 million images. Even further, this research suggests that higher-capacity models require proportionally larger datasets.

Separation of compute (DGX-1) and storage (FlashBlade) also allows independent scaling of each tier, avoiding many of the complexities of managing both together. As the dataset grows or new datasets are considered, a scale-out storage system must be able to expand easily. Similarly, if more concurrent training is required, additional GPUs or DGX-1 servers can be added without concern for their internal storage.

Why FlashBlade

A centralized data hub in a deep learning architecture increases the productivity of data scientists and makes scaling and operating simpler and more agile for the data architect. FlashBlade specifically makes building, operating, and growing an AI system easier for the following reasons.

  • Performance: With over 15GB/s of random read bandwidth per chassis and up to 75GB/s total, the FlashBlade can support the concurrent requirements of an end-to-end AI workflow.
  • Small-file handling: The ability to randomly read small files (50KB) at 10GB/s from a single FlashBlade chassis (50GB/s with 75 blades) means that no extra effort is required to aggregate individual data points into larger, storage-friendly files.
  • Scalability: Start off with a small system and then add a blade to increase capacity and performance as either the dataset grows or the throughput requirements grow.
  • Native object support (S3): Input data can be stored as either files or objects (a short access sketch follows this list).
  • Simple administration: No need to tune performance for large or small files and no need to provision filesystems.
  • Non-Disruptive Upgrade (NDU) everything: Software upgrades and hardware expansion can happen anytime, even during production model training.
  • Ease of management: Pure1, our cloud-based management and support platform, allows users to monitor storage from any device and provides predictive support to identify and fix issues before they become impactful. With Pure1, users can focus on understanding data and not on administering storage.
  • Built for the future: Purpose-built for flash to easily leverage new generations of NAND technology: density, cost, and speed.
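
As a hedged illustration of the object path, the snippet below pulls raw samples over S3 with boto3; the endpoint, bucket, prefix, and credentials are placeholders.

```python
# Minimal sketch of reading training samples stored as S3 objects.
# Endpoint, bucket, prefix, and credentials are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.flashblade.example.com",  # assumed data-hub S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

for obj in s3.list_objects_v2(Bucket="training-data", Prefix="raw/images/").get("Contents", []):
    body = s3.get_object(Bucket="training-data", Key=obj["Key"])["Body"].read()
    # `body` now holds the raw bytes of one sample, ready for preprocessing.
```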

Small-file performance of the storage tier is critical because many types of inputs, including text, audio, and images, are natively stored as small files. If the storage tier does not handle small files well, an extra step is required to pre-process and group samples into larger files. Most legacy scale-out storage systems are not built for small-file performance.
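
When the storage tier does serve small files at full speed, that aggregation step can be skipped and the raw files read directly; a minimal sketch, assuming a directory of small JPEGs on the shared mount:

```python
import tensorflow as tf

# Hedged sketch: stream many small image files straight from shared storage,
# with no intermediate step that packs them into larger container files.
# The path, image size, and batch size are illustrative assumptions.
def load(path):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224]) / 255.0

files = tf.data.Dataset.list_files("/mnt/flashblade/raw/images/*/*.jpg", shuffle=True)
small_file_ds = (files
                 .map(load, num_parallel_calls=tf.data.AUTOTUNE)
                 .batch(64)
                 .prefetch(tf.data.AUTOTUNE))
```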

Storage systems built on spinning disks that rely on SSDs as a caching tier fall short of the performance needed. Because training with random input batches results in more accurate models, the entire dataset must be accessible with full performance. SSD caches provide high performance only for a small subset of the data and are ineffective at hiding the latency of spinning drives.

Ultimately, the performance and concurrency of FlashBlade mean that a data scientist can quickly transition between phases of work without wasting time copying data. FlashBlade also enables running multiple different experiments on the same data simultaneously.

A follow-up post will go into detail about the hardware and software components needed to realize a production AI pipeline with DGX-1 and FlashBlade and present benchmarks.