What infrastructure teams can expect from AI projects.

Do you find yourself wishing that data sets loaded and copied faster? That you didn’t have to think about where to put your code or Docker images? How about adding hassle-free persistent storage for your Kubernetes services? 

As a data scientist, I’ve come to appreciate that if the storage underneath AI projects is good, then you don’t have to waste time thinking about those kinds of issues. 

Here are eight ways that better storage can help solve data science problems—some obvious and others not.  

#1 Training a model means reading a data set over and over again. 

Training models is the clearest interaction between developers and storage. As training jobs read data sets again and again, they drive a constant random read workload. GPUs process data quickly, so you want your data delivered at least as quickly: your storage needs to sustain a minimum level of non-sequential (random) read throughput.
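
As a rough sketch of that access pattern, here's what a typical PyTorch input pipeline looks like (the data set path and hyperparameters below are placeholders, not anything specific): with shuffle=True, every epoch re-reads the whole data set from storage in a new random order.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Placeholder path on a shared storage mount (assumption for this sketch).
DATA_ROOT = "/mnt/shared-storage/datasets/train"

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder lazily opens individual image files as they're requested.
dataset = datasets.ImageFolder(DATA_ROOT, transform=transform)

# shuffle=True means each epoch reads the files in a different random order,
# so storage sees many small, non-sequential reads rather than one big scan.
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=8)

for epoch in range(10):
    for images, labels in loader:
        pass  # forward/backward pass would go here; every batch is a fresh read
```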

#2 Read throughput is taxed during other activities as well. 

Before training starts, it’s important to examine the contents of the data set: look at the format and the content distribution. Data sets frequently need to be wrangled. For example, take a scenario where the data contains a major class imbalance. It might take a series of experiments to ensure that every class is detected properly, ranging from label merging to undersampling to adjusting the focal loss. Those experiments add to the read throughput burden on storage.
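
For instance, a quick pass like the sketch below (assuming a hypothetical labels.csv with filename,label columns) is often how an imbalance is discovered and a rebalanced subset is carved out; each subsequent experiment over a different subset re-reads the underlying data.

```python
import csv
import random
from collections import Counter, defaultdict

# Hypothetical annotation file: one "filename,label" row per data set item.
LABELS_CSV = "/mnt/shared-storage/datasets/labels.csv"

with open(LABELS_CSV) as f:
    rows = list(csv.DictReader(f))

counts = Counter(row["label"] for row in rows)
print("Class distribution:", counts)

# Naive undersampling: cap every class at the size of the rarest class.
smallest = min(counts.values())
by_class = defaultdict(list)
for row in rows:
    by_class[row["label"]].append(row)

balanced = []
for label, items in by_class.items():
    balanced.extend(random.sample(items, smallest))

print(f"Balanced subset: {len(balanced)} of {len(rows)} items")
```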

Even when the data is “clean,” data scientists often adjust training data sets, including or excluding specific data points as they iterate toward a final model. You need to plan for these repeated re-reads of the entire data set. For example, does your IT budget cover the resulting data egress charges?

#3 AI workloads aren’t reading only large image files.

Many data sets have labels or annotations, like bounding boxes, that aren’t part of the main training data. They’re often in a separate directory or bucket, and they’re stored in a range of file formats. Data scientists may have to write a translation layer between the original annotation files and a format the algorithm they’re using can accept.
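
A common example of that translation is converting Pascal VOC-style XML bounding boxes into the normalized text format that YOLO-family models expect. The sketch below assumes one XML file per image and a made-up class list.

```python
import xml.etree.ElementTree as ET

# Assumed class list for the hypothetical data set.
CLASSES = ["car", "pedestrian", "bicycle"]

def voc_to_yolo(xml_path: str) -> list[str]:
    """Convert one Pascal VOC annotation file to YOLO-style lines."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)

    lines = []
    for obj in root.findall("object"):
        cls_id = CLASSES.index(obj.find("name").text)  # raises if class unknown
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO expects normalized center x/y plus width/height.
        x_c = (xmin + xmax) / 2 / img_w
        y_c = (ymin + ymax) / 2 / img_h
        w = (xmax - xmin) / img_w
        h = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return lines
```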

Surprisingly, some storage is tuned for specific file sizes and access patterns, because it’s often easier to deliver throughput on large files than on millions of tiny ones. AI workloads, however, depend on performant random read access across a range of file types and sizes. Storage that performs well for only one access pattern is likely to become a bottleneck during some parts of the AI workflow.

#4 You’ll probably end up creating additional data sets.

While academic projects in AI focus mainly on GPU work, real-world deployments must also consider the data-loading work (input pipeline). Real-world data is often larger than academic data. For example: 

  •  ResNet-50’s input size is 224 x 224 pixels.
  •  The majority of ImageNet files are smaller than 500 x 500 pixels.
  •  A digital pathology image can be 100,000 x 60,000 pixels.

It’s often impractical to resize images on the fly: it takes too long and ties up the GPU server’s CPUs. Many teams instead permanently save “chipped” versions of the data set and reuse them as the new training data set.
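
Here’s a minimal chipping sketch using Pillow. It assumes the source image fits in memory, which truly huge pathology slides usually don’t; those typically need a tiled reader such as OpenSlide. The paths are placeholders.

```python
from pathlib import Path
from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # allow very large images (trusted data only)

def chip_image(src_path: str, out_dir: str, chip_size: int = 512) -> None:
    """Cut one large image into fixed-size chips and save them to out_dir."""
    img = Image.open(src_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    stem = Path(src_path).stem
    for top in range(0, img.height, chip_size):
        for left in range(0, img.width, chip_size):
            box = (left, top, min(left + chip_size, img.width),
                   min(top + chip_size, img.height))
            chip = img.crop(box)
            # Each chip becomes a new file in the derived training data set.
            chip.save(out / f"{stem}_{top}_{left}.png")

# Hypothetical paths on shared storage:
# chip_image("/mnt/shared-storage/raw/slide_001.png",
#            "/mnt/shared-storage/chipped/slide_001")
```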

When data scientists have the storage space to build more deep learning-friendly versions of data sets, it dramatically shortens their time to results.

#5 Write throughput for model checkpoints can add up at scale. 

While reads from storage dominate storage traffic by volume, writes can also overwhelm storage if there are many concurrent projects. During a training job, most scripts write out a checkpoint of the model file back to storage. Can your storage handle many developers writing their model files simultaneously? 
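
For a sense of what those writes look like, a typical training loop saves something like the following after every epoch (a PyTorch-style sketch; the path is a placeholder), and checkpoints for large models can easily run to hundreds of megabytes each.

```python
import torch

# Placeholder location on shared storage (assumption for this sketch).
CKPT_DIR = "/mnt/shared-storage/checkpoints/project-a"

def save_checkpoint(model, optimizer, epoch):
    """Write the full training state back to shared storage."""
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        f"{CKPT_DIR}/epoch_{epoch:04d}.pt",
    )
```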

#6 Often, every single data set item is listed at the start of each run.

Figure 1. Listing all the files in the data set takes time

Training jobs randomize the order in which they process data so that models train more stably and generalize across all classes. Before randomizing the file order, the training job lists every item in the data set so it knows the full set of files to shuffle.

Many DL data sets are structured as directories with a large number of subdirectories. Files from each subdirectory are listed (often via os.walk or a recursive ls) before shuffling. When data sets contain millions of items, listing every file can take several minutes. (I’ve seen it take up to 20 minutes!)
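
In code, that start-of-job step usually looks something like this sketch; with millions of files spread across subdirectories, the walk itself is what eats the minutes, because each directory listing is a separate metadata request.

```python
import os
import random

DATA_ROOT = "/mnt/shared-storage/datasets/train"  # placeholder path

# Walk every subdirectory and record every file -- this is the slow part.
all_files = []
for dirpath, _dirnames, filenames in os.walk(DATA_ROOT):
    for name in filenames:
        all_files.append(os.path.join(dirpath, name))

# Only after the full listing can the job shuffle the file order.
random.shuffle(all_files)
print(f"Found {len(all_files)} files to train on")
```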

That step has to happen at the start of every single training job, which is repetitive and wastes time. Storage can help solve this problem in two ways: 

  •  Commands like ls execute serially by default. If your storage supports a parallel version of ls, it can speed up file listing by as much as 100x, greatly reducing the time spent on shuffling at the start of each job. (For example: “parallel ls” for AI workloads.)
  •  Alternatively, developers can save out a manifest of the data set’s contents, especially for static data sets. That single file provides everything needed to shuffle the data set (i.e., read it in a random order) without walking the whole file tree at the start of each training job; see the sketch after this list.
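
A manifest can be as simple as a text file with one path per line, written once when the data set is created and reused by every job afterward (a sketch assuming a static data set; the manifest path is a placeholder).

```python
import random

MANIFEST = "/mnt/shared-storage/datasets/train_manifest.txt"  # placeholder path

def write_manifest(all_files):
    """One-time step when the data set is created or updated:
    record every file path in a single manifest file."""
    with open(MANIFEST, "w") as f:
        f.write("\n".join(all_files))

def load_shuffled_file_list():
    """Every training job reads one small file instead of walking the tree."""
    with open(MANIFEST) as f:
        files = f.read().splitlines()
    random.shuffle(files)
    return files
```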

#7 Data scientists follow the easiest path to Jupyter notebooks.

Without team-wide planning, data scientists tend to work in scattered, unofficial environments. If IT teams provided a centralized dev platform for data scientists, it would be easier to keep dev work secure and backed up, minimize data set duplication, and onboard new users.

Figure 2: Example of centralized IDEs for data scientists using JupyterHub inside a Kubernetes cluster.

Put all those data flavors together on a central storage server to simplify infrastructure management. Ideally, that central data hub would also support data from other pieces of the platform, including Docker images and monitoring/logging tools, with high performance.

#8 A production AI pipeline brings a whole variety of storage needs. 

Once a model is ready for production, it needs to be hosted somewhere. It needs monitoring and a way to set up pipelines that continuously retrain the model with new data. Incoming data needs a place to land, and inference results need to be preserved and shared. Storage that supports the entire pipeline can minimize data copy time and enable cross-pipeline monitoring, alerts, and security. 
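
As a loose sketch of those data flows (the paths, polling loop, and model call below are all placeholders, not a prescribed design), an inference worker might watch a landing area on shared storage and write its predictions back to the same storage for monitoring and future retraining:

```python
import json
import time
from pathlib import Path

# Hypothetical locations on the shared storage backing the pipeline.
LANDING_DIR = Path("/mnt/shared-storage/inference/incoming")
RESULTS_DIR = Path("/mnt/shared-storage/inference/results")
PROCESSED_DIR = Path("/mnt/shared-storage/inference/processed")

def run_model(path: Path) -> dict:
    """Placeholder for the real model call."""
    return {"file": path.name, "label": "unknown", "score": 0.0}

def poll_once() -> None:
    """Score whatever has landed, persist results, and archive the inputs."""
    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    for item in LANDING_DIR.glob("*"):
        if not item.is_file():
            continue
        prediction = run_model(item)
        (RESULTS_DIR / f"{item.stem}.json").write_text(json.dumps(prediction))
        # Keep raw inputs so they can feed a future retraining run.
        item.rename(PROCESSED_DIR / item.name)

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(30)
```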

Figure 3: Example components of an inference pipeline running inside a Kubernetes cluster. Each piece needs storage that meets specific data requirements.

Takeaway

Building deep-learning models is a journey that drives a wide range of I/O patterns on the underlying storage. At first glance, it seems storage should be optimized for random-access read throughput of large files, but a holistic view of the development process shows that this is only part of the story. The workloads will also tax storage performance for small files, metadata, and writes. 

When there’s a tool that you depend on all the time, it’s worth investing in a reliable, high-quality version of it. If you’ll be doing deep learning at scale, see if you can simplify both developer and IT team processes by using a fast, versatile storage solution that meets all of the diverse AI needs.