Speed Analysis in Genomics Pipelines with elPrep and FlashBlade

The ability to speed up genomics pipelines can fuel breakthroughs. Having a smart omics data and infrastructure strategy is essential for success.

Genomic sequencing is moving quickly from research into clinical practice at scale, especially in oncology and in rare and infectious diseases. The ability to trace the root causes of disease to tiny changes in the genome can bring clinical decision-making one step closer to personalized medicine.

Advances in sequencing technology have dramatically reduced the cost of sequencing a human genome. It now costs about $600, down from close to $300 million for the first human genome. As a result, sequencing efforts have exploded globally. The amount of omics data generated has shot up, and so have the computational requirements to turn that data into insights. For example, in clinical oncology, processing data for 1.7 million new patients per year in the US would require between 8 and 34 million hours of compute time.
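To put the oncology figures in per-patient terms, here is a minimal Python sketch that divides the quoted compute range by the quoted patient count. The variable names are ours, and the result is a simple average that assumes the total is spread evenly across patients.

```python
# Figures quoted in the paragraph above.
new_patients_per_year = 1_700_000
total_hours_low = 8_000_000
total_hours_high = 34_000_000

# Average compute time per patient, assuming an even spread (our assumption).
hours_per_patient_low = total_hours_low / new_patients_per_year
hours_per_patient_high = total_hours_high / new_patients_per_year

print(f"{hours_per_patient_low:.1f} to {hours_per_patient_high:.1f} compute hours per patient")
# prints "4.7 to 20.0 compute hours per patient"
```

In other words, the range works out to roughly 5 to 20 hours of compute for every new patient.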

Given the scale, IT leaders at life sciences organizations face multiple challenges when implementing enterprise-focused omics data strategies. Genomics analysis pipelines can be inefficient, complex, and labor-intensive, with lots of data-staging operations and direct-storage capacity bottlenecks. On top of that, running a very CPU-intensive workflow during a worldwide shortage of CPUs and GPUs can be particularly challenging.

What Is elPrep and Why Is It So Fast?

Built by imec, elPrep is designed to run genomics analyses much like established tools such as SAMtools, Picard, and GATK4. Using a smart software architecture, elPrep can process a whole-genome sequencing sample in less than six hours, whereas standard tools could take up to four days. elPrep achieves this speedup of up to nearly 16 times by running multiple prep steps in parallel, optimizing memory management, and minimizing the number of I/O operations in the process. With elPrep, researchers have a single, ultra-fast solution based purely on software optimization, without the need for GPU or FPGA accelerators.
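To illustrate why fewer I/O operations matter, here is a minimal Python sketch of the general idea of applying several prep steps in one traversal of the data, instead of reading and writing the full dataset once per step. This is not elPrep's code or API. The record layout, the step functions `drop_unmapped` and `tag_sample`, and the helper `run_single_pass` are all hypothetical names made up for this example.

```python
from typing import Callable, Iterable, Iterator, List, Optional

# A record is a plain dict here; real alignment records are much richer.
Record = dict
Step = Callable[[Record], Optional[Record]]  # a step returns None to drop a record


def drop_unmapped(rec: Record) -> Optional[Record]:
    """Hypothetical step: discard reads that did not align."""
    return rec if rec["mapped"] else None


def tag_sample(rec: Record) -> Optional[Record]:
    """Hypothetical step: attach a sample label to every read."""
    return {**rec, "sample": "S1"}


def run_single_pass(records: Iterable[Record], steps: List[Step]) -> Iterator[Record]:
    """Apply every step to each record during one traversal of the input."""
    for rec in records:
        current: Optional[Record] = rec
        for step in steps:
            current = step(current)
            if current is None:
                break  # record was filtered out; skip the remaining steps
        if current is not None:
            yield current


reads = [
    {"name": "r1", "mapped": True},
    {"name": "r2", "mapped": False},
    {"name": "r3", "mapped": True},
]
result = list(run_single_pass(reads, [drop_unmapped, tag_sample]))
print([r["name"] for r in result])  # prints ['r1', 'r3']
```

With this shape, adding another step adds work per record but no extra read or write of the whole dataset, which is the kind of saving the paragraph above describes.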

elPrep on Pure FlashBlade Delivers Speed and Scale

Going beyond the software layer, omics workflows can be further optimized at the platform layer. We tested whether running elPrep on Pure FlashBlade would deliver additional benefits to pipelines at scale. We performed the test at the Pure Customer Solutions Center and used a single physical server to run the elPrep workload.

Our testing showed that FlashBlade was as performant as direct-attached flash storage, using a standard Ethernet connection between the servers and FlashBlade shared storage. In particular, using elPrep and FlashBlade together eliminated the need for manual, time-consuming data-staging activities (see Figure 1). This is primarily because FlashBlade supports both fast SMB and fast NFS. As a result, omics data on the shared storage platform can be accessed by both SMB servers for primary genomics analysis and NFS servers for secondary genomics analysis, without any manual data-copy operations. In addition, thanks to its scale-out design, FlashBlade can easily support growth in storage demand for genomics analysis from 70 terabytes to petabytes.

“The match between imec and Pure Storage is logical because elPrep and FlashBlade share the same DNA: to simplify and accelerate high-performance workloads by building intelligent accelerators on top of open industry standards,” says Yves Mahieu, EMEA healthcare and life sciences director at Pure Storage. “It strengthens the position of Pure FlashBlade for genomics to create a data hub that enables personalized medicine and clinical decision support.”

FlashBlade shared storage scales out and simplifies elPrep genomic sequencing by eliminating all complex and labor-intensive data-staging processes, without any loss of performance versus direct-attached flash storage. In other tests and reports from Pure’s genomics customers, FlashBlade has also shown performance improvements of up to 24x compared to traditional infrastructures for high-performance computing.

“I am thrilled that an innovative company like Pure Storage sees the benefit of elPrep to make genomics processing faster and cost-efficient, a necessary step in bringing genomics to clinical practice,” says Roel Wuyts, principal scientist at imec.

The use cases are vast, ranging from pharma companies mining omics data to find their next drug target, to diagnostic and sequencing labs looking to speed up the customer experience, to hospitals wanting to implement personalized medicine for their patients. For any use case, though, establishing a successful omics data and infrastructure strategy is a must. With elPrep running on FlashBlade, life sciences and healthcare organizations can build a simple, efficient, CPU-optimized genomics platform to meet the growing demands of genomic sequencing.
