Data drives every aspect of high-performance computing. It is especially important in analytics, AI, machine learning and deep learning applications, where more data can help generate more accurate results. Sensors are everywhere and are creating enormous amounts of data: photos and videos on mobile phones, vibration sensors and cameras on production lines, and LIDAR, GPS and video data on self-driving automobiles. Efficient data handling is critical in all of these scenarios in order to enable the downstream applications that need the data. Most applications in HPC and AI run on many compute nodes in parallel, and the levels of concurrency and parallelism needed are increasing as the problems we are trying to solve get bigger and more complex. However, if the data needed by these CPUs and GPUs is not available, cycles and energy are wasted waiting and turnaround time goes up. When deep learning training times are measured in days and weeks, time lost waiting for data can be very significant.
Applications like autonomous driving require large amounts of varied data, and often subsets of that data as well. For example, autonomous data acquisition vehicles may have driven the roads of San Francisco for 12 months, but there may be a need to improve a model’s accuracy during foggy and twilight conditions. This would require searching through the data and generating a subset based on date and time or some other metadata identifier (perhaps even cross-referenced with a weather database). From a data storage and searching perspective, this requires a data layer that can efficiently manage data of every size, from small to large, and handle fast metadata searches.
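As a rough illustration of the kind of metadata-driven subsetting described above, here is a minimal Python sketch. Everything in it is hypothetical: the `Recording` record, the `weather_by_hour` lookup (standing in for a weather database), the file names, the dates and the twilight window are all assumptions made up for the example, not part of any real dataset or product API.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass(frozen=True)
class Recording:
    """Hypothetical metadata record for one captured drive segment."""
    path: str
    captured_at: datetime


def select_subset(recordings, weather_by_hour, condition, start, end):
    """Return recordings captured in [start, end) on hours with the given weather.

    weather_by_hour is an assumed lookup: (date, hour) -> condition string.
    """
    selected = []
    for rec in recordings:
        key = (rec.captured_at.date(), rec.captured_at.hour)
        in_window = start <= rec.captured_at.time() < end
        if in_window and weather_by_hour.get(key) == condition:
            selected.append(rec)
    return selected


# Illustrative data only.
recordings = [
    Recording("drive_001.bin", datetime(2019, 1, 5, 17, 30)),
    Recording("drive_002.bin", datetime(2019, 1, 5, 12, 0)),
    Recording("drive_003.bin", datetime(2019, 1, 6, 17, 45)),
]
weather_by_hour = {
    (date(2019, 1, 5), 17): "fog",
    (date(2019, 1, 5), 12): "fog",
    (date(2019, 1, 6), 17): "clear",
}

# Assumed twilight window for the example: 17:00 to 18:30 local time.
foggy_twilight = select_subset(
    recordings, weather_by_hour, "fog", time(17, 0), time(18, 30)
)
print([rec.path for rec in foggy_twilight])
```

At real scale the scan over metadata, rather than the filter logic itself, tends to dominate the cost, which is why fast metadata access matters for this kind of query.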
Datasets in AI and HPC applications are large and growing. In a multi-node compute environment it is tempting to copy data to a local machine and operate on it directly (analogous to a pre-populated local cache) to avoid ongoing network traffic. In some cases, this may be possible if the data fits on the available local storage and the price for copying data locally is only paid once or twice. As a quick example, a 10 TB dataset copied over a 10 Gb/s link will take over 2 hours under ideal conditions. But what if the data does not fit, or the local cache needs to be refreshed over and over again? In this case, many more hours can be lost simply copying data. In these instances, it is more prudent to bypass the cache completely and operate directly on the central data store.
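The copy-time figure above follows from simple arithmetic, sketched here in Python. The function name and the `efficiency` parameter are illustrative assumptions; the calculation uses decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and ignores protocol overhead, storage throughput limits and contention unless `efficiency` is lowered.

```python
TB = 10**12    # bytes in a decimal terabyte
GBIT = 10**9   # bits per second in one gigabit per second


def transfer_hours(dataset_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to copy a dataset over a link.

    efficiency is an assumed fraction (0, 1] of the nominal link rate
    that is actually achieved; 1.0 models ideal conditions.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = dataset_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


ideal = transfer_hours(10 * TB, 10 * GBIT)
print(f"{ideal:.2f} hours")  # prints "2.22 hours"
```

Multiplying the result by the number of times the local copy has to be refreshed gives a sense of how quickly repeated copying adds up.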
FlashBlade from Pure Storage is the ideal data storage and management solution for HPC, AI and analytics applications. It delivers exceptional performance across different types of data and access patterns while being extremely simple, compact and power-efficient. In addition, the RapidFile toolkit is available for extremely high-speed operations in high-file-count environments, providing up to 60X run time improvement over standard UNIX tools like ls and find.
Stop by our booth and meet with our experts to learn more about Pure’s AI and HPC solutions.