Supercomputer mines genetic data to discover new viruses

Dr Artem Babaian, a former post-doctoral student in medical genetics at the University of British Columbia (UBC) in Canada and now a Banting Fellow at the University of Cambridge in the United Kingdom, together with his mountain-climbing partner, Jeff Taylor, a UBC engineering student, knew that scientists had isolated 15,000 viruses that could infect humans, including those that cause the common cold, influenza, Ebola and, most recently, COVID-19.

But what if there was a way to do more?

The plan for the project, which produced a supercomputer that identified a nine-fold increase in the number of viruses that could potentially sicken humans, was sketched out “on the back of a napkin” in early March 2020.

Designed with the help of UBC’s Cloud Innovation Centre (CIC), the supercomputer has the equivalent power of 22,250 central processing units. Its speed surprised even Babaian, who saw some of the results he’d hoped for starting to flash up on his laptop screen shortly after the supercomputer began processing the data.

Two thousand years of computing time later, or 11 days in human time, Project Serratus – named for the rugged mountain (50 miles north of Vancouver in British Columbia) which Babaian and Taylor could see from the mountain they were on – had identified 132,000 novel RNA viruses that could prove pathogenic for humans, all for the cost of US$24,000.

An ‘innovative vision’

“Artem approached us with an innovative vision,” said Marianne Schroeder, director of the CIC, which was launched in January 2020, shortly before the COVID-19 pandemic took hold in North America.

The CIC, which supports innovative programmes that focus on health and well-being, is a public-private partnership between UBC and Amazon Web Services (AWS). “We paired our in-house innovation and technology expert teams from UBC and AWS,” she said.

Though referred to as a supercomputer, in reality Project Serratus takes advantage of cloud computing.

“Effectively, the computer's algorithms form a filter that allows you to pass the entire global libraries of nucleotide [the basic building blocks of RNA and DNA] sequences through the computer to scan for the signature sequence of an RNA virus,” said Jeffrey Joy, a UBC professor of infectious diseases associated with the Serratus Project. The sequence is similar to a fingerprint indicating that the larger sequence is derived from an RNA virus.
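As a toy illustration of the filtering idea Joy describes, the short Python sketch below passes a handful of sequencing reads through a check for a signature sequence and keeps only the reads that contain it. It is not the project's actual pipeline: the signature, the read names and the function `filter_reads` are invented for the example, and real searches rely on tolerant sequence alignment rather than exact matching.

```python
# Toy illustration only: the signature and reads are made up, and real
# pipelines rely on tolerant sequence alignment rather than exact matching.
from typing import Iterable, Iterator, Tuple

SIGNATURE = "GGAUACC"  # hypothetical stand-in for a viral signature sequence


def filter_reads(
    reads: Iterable[Tuple[str, str]], signature: str = SIGNATURE
) -> Iterator[Tuple[str, int]]:
    """Yield (read_name, position) for each read that contains the signature."""
    for name, sequence in reads:
        # Normalise case and treat DNA-style T as RNA-style U before matching.
        normalised = sequence.upper().replace("T", "U")
        position = normalised.find(signature)
        if position != -1:
            yield name, position


example_reads = {
    "read_1": "AUGGCUAGCUAGGAUACCUUAG",
    "read_2": "AUGCCCGGGAAAUUUCCCGGGA",
    "read_3": "ggataccATGCATGC",  # DNA-style and lower case
}

hits = dict(filter_reads(example_reads.items()))
print(hits)  # {'read_1': 11, 'read_3': 0}
```

Because the function is a generator, reads are examined one at a time and never all held in memory, which is the property that lets a filter of this kind be spread across many machines.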

Babaian compared the process of developing the supercomputer to the struggle and thrill of mountain climbing.

“It was probably the most exciting scientific period of my life. There are two types of fun: type one is smiling and fun; type two is when you are miserable while doing it but the memory shines, like rock climbing. In many ways Serratus is type two fun. You just kind of have to believe it’s going to work out,” he said.

An ‘extraordinary’ amount of data

The Project Serratus supercomputer searched a database of 20 million gigabytes that contained the gene sequences found in 5.7 million biological samples from around the world. (For comparison, an average computer has between four and eight gigabytes of memory.) The gene sequences can be thought of as millions upon millions of lines of computer code that must be searched to find the sequence that indicates the presence of an RNA virus.

Prior to Project Serratus, said Joy, the time it would have taken to examine this extraordinary amount of data “would be measurable in years. And it would have cost in the order of hundreds of thousands of dollars. So, Project Serratus is extremely [cost] effective.”

According to Schroeder, “Serratus can analyse 1,000,000+ sequencing samples per day for under one cent per sample with a dynamic-scaling EC2 cluster reaching 22,250 simultaneous vCPU.”
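The figures quoted so far can be cross-checked with simple arithmetic. The sketch below uses only numbers reported above (US$24,000, 5.7 million samples, 11 days and 20 million gigabytes); the variable names are illustrative, and the results are rough averages over the whole run rather than figures published by the project.

```python
# Back-of-the-envelope check using only figures quoted in the article.
total_cost_usd = 24_000
samples = 5_700_000
run_days = 11
database_gb = 20_000_000

cost_per_sample_usd = total_cost_usd / samples
average_samples_per_day = samples / run_days
database_pb = database_gb / 1_000_000  # decimal units: 1 PB = 1,000,000 GB

print(f"Cost per sample: about {cost_per_sample_usd * 100:.2f} US cents")
print(f"Average throughput: about {average_samples_per_day:,.0f} samples per day")
print(f"Database size: {database_pb:.0f} petabytes")
```

On these averages the cost works out to roughly 0.42 US cents per sample, consistent with the “under one cent” figure. The average of about 518,000 samples per day is a whole-run mean and is not directly comparable to the quoted peak capacity.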

The needle the computer sought in this haystack of genetic code was the sequence that indicates the presence of an RNA virus. The databank of gene sequences, which is publicly available, is the product of more than a dozen years of work by scientists who have examined “everything from ice-core samples to animal dung”, according to a UBC media release. One set of samples included anal swabs from penguins.

As Babaian, Taylor and the 13 researchers in the Serratus Project wrote in an article about the project in the journal Nature at the end of January, an “important limitation for these analyses is that the nucleic acid reads do not prove that the viral infection has occurred in the nominal host species”.

“For example, we identified five libraries in which a porcine, avian or bat coronavirus was found in plant samples.” In one case, said Joy, they suspect that the bat RNA virus came from droppings on corn that was then sequenced without the researchers noticing it.

“But the Serratus Project found it,” he said.

Not every RNA virus is pathogenic to humans, Joy told University World News. For that determination to be made, there would have to be clinical data showing that the novel RNA virus is dangerous for humans.

Identifying virus spillover into humans

What the Serratus Project does is “help pave the way to rapidly identify virus spillover into humans”, said the UBC statement. Its huge database of RNA viruses can then be used to identify those that affect livestock, crops and animal species, including endangered species.

Babaian, who was conducting genetic research into cancer before turning to this project at the beginning of the COVID pandemic, believes that the findings of Project Serratus will revolutionise clinical practice. If, for example, a patient presents with a fever of unknown origin, the patient’s blood can be sequenced. In about two minutes, doctors in a city such as St Louis, Missouri, will be able to connect the virus affecting the patient to “say, a camel in Sub-Saharan Africa sampled in 2012,” he said.

Accordingly, noted Joy, Project Serratus can help humanity avoid another pandemic.

“This kind of work is incredibly important for understanding viral diversity in general and for setting up surveillance systems that will help us avoid other pandemics in the future. It will help us understand both the diversity of viruses out there and the hosts that they infect. And as new viruses show up, even if we are not specifically looking for them, projects like Serratus will highlight them so that we are aware of them before they infect us.”

For his part, Babaian says Project Serratus is at the cutting edge of our understanding of the interplay between genetic and spatial diversity in nature and how a variety of animals interface with these viruses.

“The hope is we’re not caught off guard if something like SARS-CoV-2 – the novel coronavirus that causes COVID-19 – emerges again. These viruses can be recognised more easily and their natural reservoirs can be found faster” so that they never spread widely enough to become a pandemic.

The value of open source – and one regret

Project Serratus could not have been organised if the gene sequencing of the 5.7 million biological samples had been proprietary data. In other words, if the sequences were not “open source”, the vital information contained within the sequence of, say, a novel RNA virus found on bat guano in Chile would be nothing more than a collection of bits and bytes locked away on a hard drive in Santiago.

According to Joy, Project Serratus makes an eloquent case for open data: making data publicly available so that researchers can use it for multiple purposes.

Babaian does have one regret about how Project Serratus has unfolded: he didn’t keep the napkin.