Data Tiering in Azure Blob Storage with Fpsync and FlashBlade

Fpsync is a host-based tool that enables you to move data from an NFS share on FlashBlade to Azure Blob storage.


Moving data from an on-premises FlashBlade® system to a cloud-adjacent location such as Equinix is an important step in cloud bursting: it stages the data on the Equinix-hosted FlashBlade so that compute resources in Azure Cloud can use it. Our recent post, Data Mobility for HPC and EDA Workloads from On-premises to Azure Cloud, describes how to move data between an on-premises FlashBlade and an Equinix-hosted FlashBlade using file system replication.

However, at the end of a project, you can choose to move the data back to the on-premises FlashBlade or to move it from the FlashBlade in Equinix to a cold storage solution like Azure Blob storage for long-term retention. Array-level replication can move data from the Equinix-hosted FlashBlade back to the local data center, while host-based tools like fpsync are useful for moving data from an NFS share on FlashBlade to blob storage in Azure.

Figure 1: Data Migration – Long term retention in Azure Blob storage

What Is Fpsync?

Fpsync is a powerful open source migration tool that uses “fpart” and “rsync” to migrate small and large files across heterogeneous storage endpoints and data formats. Fpart splits the source file tree into partitions, and fpsync runs one rsync job per partition, several at a time, to copy the data from the source to the target location. Because the copy jobs run in parallel, fpsync typically transfers data faster than the standard UNIX “cp” and other available open source tools, regardless of individual file size or total data volume.
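To make the partition-then-copy idea concrete, here is a minimal conceptual sketch in Python. It is not how fpsync is implemented: fpsync calls fpart and rsync, whereas this sketch swaps in a simple file-count split, `shutil.copy2`, and a thread pool. The function names and default values are illustrative assumptions.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def partition_files(src_root, files_per_part):
    """Walk src_root and group relative file paths into fixed-size partitions."""
    rel_paths = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            rel_paths.append(os.path.relpath(os.path.join(dirpath, name), src_root))
    rel_paths.sort()
    return [rel_paths[i:i + files_per_part]
            for i in range(0, len(rel_paths), files_per_part)]


def copy_partition(src_root, dst_root, rel_paths):
    """Copy one partition; copy2 also carries over mode bits and timestamps."""
    for rel in rel_paths:
        target = os.path.join(dst_root, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(src_root, rel), target)
    return len(rel_paths)


def parallel_copy(src_root, dst_root, files_per_part=2000, jobs=4):
    """Copy all partitions concurrently and return the number of files copied."""
    parts = partition_files(src_root, files_per_part)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        counts = pool.map(lambda part: copy_partition(src_root, dst_root, part), parts)
        return sum(counts)
```

Splitting by file count keeps each job roughly the same size when the tree is dominated by small files, which is the situation this post is concerned with.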

A typical semiconductor chip design environment has high file counts, deep directory structures, and millions of small files with soft and hard links. Fpsync is a very effective host-based tool to migrate design and simulation data across heterogeneous data platforms.

As shown in the diagram above, there may be a lot of residual data on FlashBlade from the design and simulation workflow in the Equinix location that may not need to be retained on primary high-performance storage. Organizations may decide to retain the data in cheap and deep blob storage in Azure Cloud.

How to Use NFS with Azure Blob Storage

Azure Data Lake Storage Gen2 provides a hierarchical namespace for Azure Blob storage, which must be enabled on a valid storage account. This feature offers admins and end users two main advantages:

  • Arranges the objects in the blob container as files and directories.
  • Allows the blob container to be mounted on the compute host as an NFS share with a mount path.
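As a rough illustration of the NFS mount, the sketch below builds the mount command for a blob container. The export path format and mount options follow the pattern in Microsoft's NFS 3.0 documentation for Blob storage as best we recall it; verify them against the current documentation. The storage account, container, and mount point names are hypothetical, and the actual mount requires root privileges and a storage account with NFS 3.0 support enabled.

```python
import shlex


def blob_nfs_mount_command(account, container, mount_point):
    """Return the argv for mounting a blob container over NFS 3.0."""
    # Export path pattern: <account>.blob.core.windows.net:/<account>/<container>
    export = f"{account}.blob.core.windows.net:/{account}/{container}"
    options = "sec=sys,vers=3,nolock,proto=tcp"
    return ["mount", "-t", "nfs", "-o", options, export, mount_point]


# Hypothetical names, for illustration only.
cmd = blob_nfs_mount_command("examplestorageacct", "archive", "/mnt/blob-archive")
print(shlex.join(cmd))
```

Once the container is mounted, the blob storage appears to tools like cp and fpsync as an ordinary directory.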

The following table shows the NFS mount path of the Azure Blob storage on the Linux compute host. It also lists the mount path of the NFS share from FlashBlade in the Equinix data center, which is mounted on the same Linux host.

Microsoft has documented a known issue in which metadata is lost when data is copied from a source NFS share to a blob container mounted over NFS using the standard UNIX “cp” command. The following test validates the problem.

How Fpsync Preserves Metadata Between FlashBlade and Azure Blob Storage

The files listed under the source subdirectory have “azhpcadmin” as the user and “packer” as the group owner. After these files are copied from the NFS share on FlashBlade with the standard UNIX “cp” command, the user and group ownership on the target share, which has the blob storage on the back end, are changed to “root.”
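The ownership comparison described above can also be scripted. The following is a minimal sketch, not the test used in this post: it walks a source tree and reports every entry whose owner, group, or permission bits differ on the target. The function name and the reason labels are our own.

```python
import os
import stat


def metadata_mismatches(src_root, dst_root):
    """Return (relative path, reason) pairs where the target differs from the source."""
    mismatches = []
    for dirpath, dirnames, filenames in os.walk(src_root):
        for name in sorted(dirnames + filenames):
            src_path = os.path.join(dirpath, name)
            rel = os.path.relpath(src_path, src_root)
            dst_path = os.path.join(dst_root, rel)
            if not os.path.lexists(dst_path):
                mismatches.append((rel, "missing"))
                continue
            # lstat inspects symlinks themselves rather than their targets.
            s, d = os.lstat(src_path), os.lstat(dst_path)
            if (s.st_uid, s.st_gid) != (d.st_uid, d.st_gid):
                mismatches.append((rel, "owner"))
            if stat.S_IMODE(s.st_mode) != stat.S_IMODE(d.st_mode):
                mismatches.append((rel, "mode"))
    return mismatches
```

Running it against the FlashBlade mount path and the blob mount path after a copy should return an empty list when metadata was preserved, and "owner" entries when ownership was reset.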

Our tests confirmed the issue reported by Microsoft. We then ran a similar test using fpsync. Fpsync was not only faster than the standard UNIX “cp” command but also preserved the file and directory metadata at the target location.
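The exact fpsync invocation used in the test is not shown here, so the following is only a sketch of what such a run might look like. It uses fpsync's -n (concurrent sync jobs), -f (maximum files per job), and -o (rsync options) flags; the rsync options -lptgoD and --numeric-ids ask rsync to keep symlinks, permissions, times, group, owner, and device files. The paths, job count, and files-per-job values are hypothetical.

```python
import shlex


def fpsync_command(src_dir, dst_dir, jobs=8, files_per_job=2000,
                   rsync_opts="-lptgoD --numeric-ids"):
    """Return the argv for an fpsync run that keeps ownership, modes, and times."""
    for path in (src_dir, dst_dir):
        # fpsync's documentation asks for absolute source and destination paths.
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path, got {path!r}")
    return ["fpsync", "-n", str(jobs), "-f", str(files_per_job),
            "-o", rsync_opts, src_dir, dst_dir]


# Hypothetical mount paths: FlashBlade NFS share to blob storage over NFS.
cmd = fpsync_command("/mnt/flashblade/project/", "/mnt/blob-archive/project/")
print(shlex.join(cmd))
```

Preserving ownership generally requires running the copy as root. Swapping the two path arguments gives the restore direction, from blob storage back to FlashBlade.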

Data tiering from FlashBlade in Equinix into blob storage in Azure Cloud allows organizations to archive data for long-term retention. With fpsync, data can be moved in both directions: from FlashBlade to Azure Blob storage for archiving, and back to FlashBlade on demand. Reading from blob storage is straightforward with the hierarchical namespace and the NFS mount path, because the objects are arranged as files and directories. Fpsync provides data continuity between files and objects across heterogeneous data platforms.

Read Connected Cloud with FlashBlade and Microsoft Azure HPC for EDA Workloads to learn more about the value of cloud-connected storage in partnership with Microsoft Azure.
