This RapidFile Toolkit v2.0 blog post was co-authored by Keshav Attrey and Calvin Nieh.
In 1985, my mother got tired of waiting for my father to finish his PhD and finally convinced him to stop banging out his dissertation on a typewriter and go buy a PC: an AT&T 6300 with one floppy drive, 256KB of RAM, and a 5MB hard drive. Sure enough, with the right tools, his productivity went up and he finished his dissertation. Just two years earlier, AT&T had released Unix System V, which shipped with ls, a file-listing tool that dates back to the earliest versions of Unix. For its time, ls was a great utility.
Fast forward four decades and Pure Storage® has developed FlashBlade//S™: a next-generation, Evergreen, unified fast file and object (UFFO) platform that supports multi-petabyte, billion-file workloads with ease. Yet many customers I speak with continue to manage large-scale file systems using single-threaded tools designed back when 5 megabyte disk drives were the norm and 100 megabytes was massive. This presents a huge barrier to data scientists and IT professionals in modern organizations that need to maximize the speed at which they get results.
Customers can shatter these barriers by combining FlashBlade’s massively parallel, scale-out architecture with Pure Storage RapidFile Toolkit v2.0, a suite of ultrafast rewrites of traditional Linux file system tools that accelerate common operations on Linux by 20x or more.
RapidFile Toolkit serves as a high-performance, drop-in replacement for Linux commands in many common scenarios. By switching to RapidFile Toolkit, customers can increase employee efficiency, application performance, and business productivity using file management commands, scripts, and workflows similar to what they’ve used for decades. Let’s dive into how, with minimal effort, RapidFile Toolkit can transform file management performance within AI/ML, analytics, system administration, and electronic design automation (EDA) workflows.
RapidFile Toolkit was originally developed to accelerate a machine learning workflow of one of Pure’s largest customers. As it turned out, listing files is a common bottleneck in analytics and machine learning pipelines, because processing millions of files usually requires you to first list the file names. When your data set exceeds 10 million files, listing them can take hours.
By incorporating RapidFile Toolkit, we helped this customer eliminate the file listing bottleneck in their PyTorch-based ML workflow, dramatically accelerating their time to insight. RapidFile Toolkit can also be used in Jupyter notebooks and in scripts that randomly shuffle data sets. With version 2.0, RapidFile Toolkit now supports JSON Lines output, enabling data scientists to easily parse output from RapidFile Toolkit’s pls command directly in Python.
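As a rough sketch of how JSON Lines output might be consumed in Python, the snippet below parses one JSON object per line and shuffles the resulting paths with a fixed seed. The field names `path` and `size` and the sample records are assumptions made for illustration only; the pls documentation is the place to confirm which option enables JSON Lines output and which fields it actually emits.

```python
import io
import json
import random


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def shuffled_paths(records, seed=0):
    """Return each record's path in a reproducible random order."""
    # "path" is a hypothetical field name, not a documented one.
    paths = [rec["path"] for rec in records]
    random.Random(seed).shuffle(paths)
    return paths


# Hypothetical sample standing in for captured file-listing output.
SAMPLE = io.StringIO(
    '{"path": "/data/train/a.jpg", "size": 1024}\n'
    '{"path": "/data/train/b.jpg", "size": 2048}\n'
    '\n'
    '{"path": "/data/train/c.jpg", "size": 512}\n'
)

records = list(read_jsonl(SAMPLE))
paths = shuffled_paths(records, seed=42)
```

In a real pipeline, the stream would more likely be `sys.stdin` or the standard output of a subprocess than an in-memory string.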
For data scientists and system administrators, everyday tasks and scripts can be sped up simply by replacing Linux tools with the corresponding RapidFile Toolkit command. For example, updating permissions and ownership across tens of millions of files can take hours. Such tasks can often be done in minutes by replacing the Linux chmod and chown commands with RapidFile Toolkit’s pchmod and pchown commands. With version 2.0, RapidFile Toolkit now supports local file systems, which enables you to use the same accelerated file management scripts both on and off FlashBlade®.
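One way a script could adopt the accelerated commands while still running on hosts where the toolkit is not installed is to pick the tool at run time. The sketch below is only an illustration: the helper names are hypothetical, the directory path is made up, and it assumes pchmod accepts the same arguments as chmod for the case shown, which should be checked against the toolkit documentation.

```python
import shutil


def pick_tool(preferred, fallback, which=shutil.which):
    """Return `preferred` if it is found on PATH, otherwise `fallback`."""
    return preferred if which(preferred) else fallback


def build_command(preferred, fallback, args, which=shutil.which):
    """Build an argv list, assuming both tools accept the same arguments."""
    return [pick_tool(preferred, fallback, which), *args]


# Hypothetical example: recursive permission change on a scratch directory.
# Whether pchmod accepts exactly these arguments is an assumption.
cmd = build_command("pchmod", "chmod", ["-R", "750", "/mnt/scratch/project"])
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; the `which` parameter exists only so the lookup can be replaced in tests.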
In EDA, shortening the design-to-production time frame enables competitive advantage and faster time to revenue. With high-file-count environments such as scratch space used for EDA, it isn’t unusual to have hundreds of millions of files in a single file system. In these environments, space management represents a significant bottleneck in the design process. We’ve found that workflows to clean up aged files see massive improvements in runtime by switching to RapidFile Toolkit. With version 2.0, RapidFile Toolkit’s prm command now supports piped input without xargs, making it easy to parallelize file deletion for scripts that already find old files.
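For scripts that already find old files, the "find" half of such a cleanup might look like the minimal sketch below, which yields regular files whose modification time is older than a cutoff. The function names are hypothetical, and writing one path per line is an assumption about what a downstream deletion command expects; consult the prm documentation for the input format it actually accepts.

```python
import os
import stat
import time


def find_aged_files(root, max_age_days, now=None):
    """Yield regular files under `root` not modified in `max_age_days` days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file disappeared while we were walking
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                yield path


def write_paths(paths, out):
    """Write one path per line (an assumed format for a piped consumer)."""
    for path in paths:
        out.write(path + "\n")
```

Calling `write_paths(find_aged_files("/mnt/scratch", 30), sys.stdout)` from a small script would print the candidates so they can be reviewed or piped onward.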
In DevOps environments, even small optimizations of everyday developer tasks can result in major increases in business productivity when applied across an engineering organization. RapidFile Toolkit provides an accelerated version of Linux’s cp command called pcopy, which copies directory trees up to 20 times faster than standard Linux cp. Our customers have used RapidFile Toolkit’s pcopy command to reduce the time to set up large client workspaces from hours to minutes. This enables your developers to start working right away instead of waiting hours for millions of files to finish downloading from the source code repository.
Lastly, with version 2.0, RapidFile Toolkit is introducing support for copying between local directories and FlashBlade file systems. This enables you to use RapidFile Toolkit throughout your Linux environment to rapidly copy data sets into sandboxes on FlashBlade and local hosts, accelerate script-based backup workflows to and from FlashBlade, and rapidly restore specific subdirectories using FlashBlade file systems’ .snapshot directory.
RapidFile Toolkit v2.0 is available to all Pure Storage FlashBlade customers. To download it, go to the RapidFile Toolkit documentation page and log in with your Pure1® customer login.
Related Resources:
- Learn more about FlashBlade//S, the only scale-out storage platform that efficiently powers your modern unstructured data needs, delivering cutting-edge capabilities without complexity.
- Join us virtually this week (September 19-22, 2022) at NVIDIA’s GTC 2022 online event, where Pure is a Platinum sponsor. Read our GTC announcement and watch our on-demand session, “Unlock the Value of Data and Accelerate Your AI/ML Initiatives,” presented by Miroslav Klivansky, Principal Data Architect for AI and Analytics at Pure.