The COVID-19 virus has been mutating. Variants are circulating. The world is waiting to see how the vaccines will fare against this morphing enemy. 

Tracking the virus variants relies on fast-paced genomic sequencing from hundreds of thousands of cases. The process decodes the genetic code of the virus and identifies changes. Public health agencies across the globe are hunting for answers to questions like: How fast is the virus changing? What within the virus is changing? Are the changes making it more contagious, more severe, or both? How widespread are the viral variants? How are these variations affecting vaccine efficacy?

To get answers to these questions quickly, real-time genomic surveillance efforts are at an all-time high. The COVID-19 Genomics UK Consortium is leading the charge by sequencing close to 20,000 viral genomes per week. In the US, the Centers for Disease Control and Prevention (CDC) is processing up to 9,000 genomes per week. Plus, diagnostic labs across the world are helping scale the effort.

Source: Centers for Disease Control and Prevention, Covid Data Tracker

What does that mean for data gathering, analytics, and sharing? Next-generation sequencing platforms from Illumina, IonTorrent, and Oxford Nanopore can generate up to 100 gigabytes of data per sample. At the sequencing rates above, that's as much as one to two petabytes of raw data per week for any kind of meaningful surveillance. Most of this data is routinely uploaded to public repositories. However, most labs still store sequences locally, and the cost of the storage architecture required for this amount of data can add up.
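As a rough check on that figure, the weekly volume can be estimated from the sequencing rates quoted earlier. The sketch below assumes one sample per genome, every sample at the 100-gigabyte upper bound, and decimal units (1 PB = 1,000,000 GB). The function name and labels are illustrative only, and real per-sample sizes will often be smaller.

```python
# Back-of-the-envelope estimate of weekly raw sequencing data.
# Assumptions (not from any lab's actual numbers): one sample per genome,
# each sample at the 100 GB upper bound, decimal units.
GB_PER_SAMPLE = 100
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15


def weekly_raw_petabytes(genomes_per_week, gb_per_sample=GB_PER_SAMPLE):
    """Return the estimated raw data volume per week, in petabytes."""
    return genomes_per_week * gb_per_sample * BYTES_PER_GB / BYTES_PER_PB


# Weekly sequencing rates quoted in the article.
weekly_rates = {"COG-UK": 20_000, "CDC": 9_000}

for name, rate in weekly_rates.items():
    print(f"{name}: about {weekly_raw_petabytes(rate):.1f} PB of raw data per week")
```

Under these assumptions, the two programs land at roughly 2 PB and 0.9 PB per week, which is where the one-to-two-petabyte range comes from.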


Storage aside, effective surveillance requires rapid processing of raw reads through sequencing pipelines to get to actionable insights. The kind of high-concurrency analytics, low latency, and high IOPS that such workflows require can test any legacy infrastructure. As a result, public health agencies and diagnostic labs need to update their infrastructure to ensure efficient workflows as they go from sample to published sequence.

FlashBlade: Built for High-Performance, Next-Generation Sequencing

Pure Storage® FlashBlade® is an all-flash, scale-out storage architecture perfectly suited to the genomic sequencing and analysis workflows that can power COVID-19 surveillance. It can also support the fight against future public-health threats. 

FlashBlade is easy to deploy, scale, and manage. It delivers high performance with concurrency and parallelism. FlashBlade can accelerate primary analysis steps, such as converting raw BCL files into FASTQ format, by up to 24x. Plus, Pure's architecture can also speed secondary and tertiary analytics processes, including genome alignment and variant calling. This helps increase lab productivity. Unlike other storage architectures, FlashBlade scales linearly, reporting no increase in runtime as loads increase.
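For readers unfamiliar with the FASTQ format mentioned above, each read is stored as four text lines: an "@" header, the base sequence, a "+" separator, and a quality string of the same length as the sequence. The sketch below is a minimal, illustrative reader that counts reads and bases. It is not Pure's or any sequencer vendor's tooling, the function names are hypothetical, and it assumes uncompressed, unwrapped four-line records.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].strip(), sequence, quality


def summarize(handle):
    """Return (number of reads, total number of bases) for a FASTQ stream."""
    reads = 0
    bases = 0
    for _read_id, sequence, _quality in read_fastq(handle):
        reads += 1
        bases += len(sequence)
    return reads, bases


# Tiny in-memory example with two made-up reads.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCCA\n+\nIIIII\n")
print(summarize(example))
```

Downstream steps such as alignment and variant calling read many such files at once, which is why concurrent read throughput matters for these pipelines.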

McMaster University Leverages Pure for Genomics

More than 20 leading sequencing labs across the world leverage Pure FlashBlade. For instance, the McArthur Lab at McMaster University quickly implemented FlashBlade to run a tool it developed for the international community. The tool helps researchers identify how the virus is spreading and evolving. Researchers isolate the virus from biological samples and study it through next-gen sequencing, with near real-time processing. In fact, the McArthur Lab team was part of the group that isolated the live virus. This was critical to understanding how the virus infects people and to testing therapeutics. With Pure, the research team reduced time to insights from two days to three hours.

“We needed a modern-day infrastructure to underpin our efforts to combat the superbug crisis, which is increasing in both magnitude and severity,” says Dr. Andrew McArthur, who runs the lab. “Pure provides large, rapid, and nimble data storage capacity that can scale as gene sequencing technology advances. It’s twice as fast at one-third of the cost.”

Labs like McArthur's will continue to fight against COVID-19 as well as keep tabs on other emerging threats to public health. At the moment, genomic surveillance of COVID-19 in the US is still low. The number of US genomes shared on public databases is less than 0.3% of the total number of infections in the country. That compares with nearly 5% for the UK, 12% for Denmark, and almost 60% for Australia.

US efforts to scale surveillance will surely continue as state governments and the CDC tap more labs to join the charter. Sequencing labs will need to modernize their data infrastructure to continue to unlock answers and fend off our invisible enemy.