AI acceleration of vaccine research hinges on data quality

Artificial intelligence is accelerating progress on screening, identifying and researching promising vaccine candidates against COVID-19. But the usefulness of AI tools hinges on the quality of data at a time when the need for speed in developing vaccines can jeopardise that quality, experts say.

“Data quality is a huge issue in COVID-19 because it is such a new disease. People collect data very, very differently because it’s not a disease that everybody understands,” said Phaik Yeong Cheah, an associate professor at the University of Oxford and head of the department of bioethics and engagement at the Bangkok-based Mahidol Oxford Tropical Medicine Research Unit.

“For COVID-19, data collection might not be as standardised as you would hope, so combining data sets will be difficult because the format is different, the unit might be different, the instrument might not be calibrated the way you would do it for a disease that you understand,” she said.
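
The unit mismatch Cheah describes can be illustrated with a minimal Python sketch. The two sites, their records, and the field names (`temp`, `temp_unit`, `temp_c`) are hypothetical and chosen only for illustration; real clinical datasets would need far more careful handling.

```python
# Hypothetical records from two sites that report body temperature differently.
site_a = [
    {"patient_id": "A1", "temp": 38.5, "temp_unit": "C"},
    {"patient_id": "A2", "temp": 37.0, "temp_unit": "C"},
]
site_b = [
    {"patient_id": "B1", "temp": 101.3, "temp_unit": "F"},
    {"patient_id": "B2", "temp": None, "temp_unit": "F"},  # missing measurement
]


def to_celsius(value, unit):
    """Convert a temperature to degrees Celsius; pass missing values through."""
    if value is None:
        return None
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32) * 5 / 9
    # Refuse to guess: an unknown unit is reported rather than silently merged.
    raise ValueError(f"unknown temperature unit: {unit!r}")


def harmonise(records):
    """Return new records with temperature in Celsius under one field name."""
    return [
        {"patient_id": r["patient_id"], "temp_c": to_celsius(r["temp"], r["temp_unit"])}
        for r in records
    ]


# Only after harmonising are the two sites' records safe to combine.
combined = harmonise(site_a) + harmonise(site_b)

# Missing values are kept and flagged instead of being filled in.
missing = [r["patient_id"] for r in combined if r["temp_c"] is None]
```

Raising an error on an unrecognised unit, and keeping missing values as missing, reflects the point that combining data without checking these details can quietly corrupt the merged set.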

Quality issues can be worse in low-resource settings where data may not be collected properly and data management resources – such as advanced software and experts – may not be available.

Cheah acknowledged that most data entry is subject to human error, causing missing or biased data and making it difficult to combine the different data sets.

For vaccine development, large datasets are needed because immune responses, which are dictated by genetics, vary among different populations and groups.

To train an AI system, “first you need a lot of data. Then you need to label the data” from which the system learns rules or “ground truths” – answers the system should learn to predict in new, unlabelled data, said Su Dan, AI researcher at Hong Kong University of Science and Technology (HKUST), who led a team that devised CAiRE-COVID, an AI tool to mine scientific literature.

“If you provide a lot of misleading data and labels, the machine may learn something incorrect,” she added.
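
A toy Python sketch can show how mislabelled training data leads a system to learn the wrong rule. The "learner" here is a deliberately simple majority-vote counter, not the method used by any team in this article, and the fragment categories and labels are invented for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn, for each feature value, the label seen most often with it."""
    counts = defaultdict(Counter)
    for feature, label in examples:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}


def accuracy(model, examples):
    """Share of examples whose label the model predicts correctly."""
    correct = sum(1 for feature, label in examples if model.get(feature) == label)
    return correct / len(examples)


# Hypothetical data: a made-up fragment category and whether it triggered a response.
clean_train = [("type_x", "responds")] * 8 + [("type_y", "no_response")] * 8

# The same data, but most "type_x" labels were entered wrongly.
noisy_train = (
    [("type_x", "no_response")] * 6
    + [("type_x", "responds")] * 2
    + [("type_y", "no_response")] * 8
)

# Correctly labelled data held back for checking.
test_set = [("type_x", "responds")] * 5 + [("type_y", "no_response")] * 5

clean_acc = accuracy(train(clean_train), test_set)  # 1.0
noisy_acc = accuracy(train(noisy_train), test_set)  # 0.5
```

With clean labels the toy model gets every held-back example right; with the mislabelled set it learns that "type_x" never responds and is wrong on half of them.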

A need to combine and share data

Data-sharing platforms are a key element in the global effort towards a vaccine. Inevitable knowledge gaps make it necessary to share data, both so that AI tools can generate useful results and so that non-viable vaccine candidates can be ruled out.

If one research group finds that a candidate does not trigger an immune response, “they should share the data so that everybody else does not go down that route,” explained Calvin Ho, professor of law at the University of Hong Kong.

The United States National Institutes of Health alone lists 34 open-access resources for COVID-19. The World Health Organization and the global health network International Severe Acute Respiratory and emerging Infection Consortium also host data sets from COVID-19 patients.

Using AI tools for both types of adaptive immunity

Vaccines aim to activate the adaptive immune system. Ideally, both types of adaptations occur. B-cells produce antibodies, which bind and deactivate viruses outside of cells but cannot stop already infected cells from releasing loads of new virus. T-cells, on the other hand, evolve to spot and kill infected cells.

After an infection, “the majority of the T-cells evolved to fight that infection will be cleared from the body, but a few of these cells will remain as memory cells which can mount a quick and robust response against the same virus upon reinfection. This is the basis of vaccination,” said Ahmed Abdul Quadeer, a senior scientist at HKUST.

In the race towards vaccines against SARS-CoV-2, which causes COVID-19, machine learning algorithms are being used to look for specific antigens to trigger both B-cell and T-cell responses.

Using data from patients recovered from COVID-19, AI machine learning (ML) tools are taught to identify common traits among those fragments of virus proteins that activated the immune system, then apply these rules to updated data. But this requires a huge amount of data. “People are still building datasets, it’s ongoing,” said Su.

Because it is a novel virus, datasets for SARS-CoV-2 are limited in quantity and quality, but it has some genetic similarity with the virus that caused the 2003 SARS pandemic – the two viruses share some antigens. A shared antigen that triggers immunity against SARS could trigger immunity against COVID-19 and is worth investigating further, according to scientists.

Beginning in January, a team at Flinders University in Australia and the Australian company Vaxine leveraged the similarities between limited data on SARS-CoV-2 and extensive data for the 2003 SARS virus. They developed ML tools to “characterise the key viral attachment molecule called the spike protein” as a potential antigen that would induce B-cells to produce antibodies, the company’s director Nikolai Petrovsky said on the Flinders University webpage.

He noted that the strength of ML tools is that they enable researchers “to run computer simulations on the virus before it is even fully characterised”.

Vaxine’s candidate vaccine designed by these methods has entered clinical trials.

Vaccines and T-cells

Vaccines that can induce T-cell based immunity are needed – although developing them is more complex – because antibodies against COVID-19 may not last.

In Hong Kong, the first confirmed case of COVID-19 reinfection last month showed that antibodies can be cleared from the body after an infection.

A study in Beijing reported that 21 out of 23 patients who recovered from SARS no longer carried the antibodies six years after the infection, but a Singapore-based collaboration found that memory T-cells against SARS persisted up to 11 years post-infection.

For T-cell vaccines, in addition to the virus genomes and records from recovered patients, AI tools need patient-specific details about a protein called MHC – or major histocompatibility complex – found on the surface of all human cells and which displays the viral antigens.

“If the antigens are not displayed by MHCs, the T cell will not be able to see it and respond to it,” explained Matthew McKay, professor of electronic and computer engineering and chemical and biological engineering at HKUST.

A vaccine that induces T-cell mediated immunity against COVID-19 must mimic specific antigen-MHC combinations, which vary from person to person. AI methods can reduce the number of candidate combinations “from millions to a few hundreds”, Quadeer said.

Furthermore, predicting what portion of the population could benefit depends on knowledge of the prevalence of MHC variants in the population. “For most of the identified T-cell [antigens], only partial [MHC] information or no information is available,” the HKUST team acknowledged in their recent publication where they identified 12 promising antigen-MHC combinations that, collectively, could induce T-cell responses against SARS-CoV-2 in almost 100% of the global population.
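
As a rough illustration of why MHC prevalence data matters for such estimates, the Python sketch below computes the share of a population carrying at least one of a set of MHC variants. It rests on a strong simplifying assumption, that variants are carried independently of one another, and the carrier fractions are hypothetical. It is not the HKUST team's method, which the article does not describe.

```python
def population_coverage(carrier_fractions):
    """Fraction of people carrying at least one of the given MHC variants.

    Assumes each variant is carried independently of the others, which is a
    simplification; real estimates account for how variants are inherited.
    """
    uncovered = 1.0
    for fraction in carrier_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("carrier fraction must be between 0 and 1")
        # Probability that a person carries none of the variants seen so far.
        uncovered *= 1.0 - fraction
    return 1.0 - uncovered


# Hypothetical carrier fractions for three MHC variants (illustrative only).
example_fractions = [0.40, 0.25, 0.10]
coverage = population_coverage(example_fractions)  # 1 - 0.60 * 0.75 * 0.90 = 0.595
```

Under this simplification each extra variant shrinks the uncovered share, which is how a small set of well-chosen combinations can approach full coverage. It also shows why missing prevalence information leaves the estimate uncertain.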

In common with other research around the world, the team began with limited data from COVID-19 patients and relied heavily on the similarities between what is known so far from COVID-19 and the SARS virus.

Within nine months the pandemic has spread to register over 31.5 million cases worldwide and claim almost 970,000 lives as of 22 September. Given this need for speed, experts stress the importance of real-time data.

The HKUST team has set up open-access sites like COVIDep, frequently updated with newly discovered antigens and worldwide patient data, to guide COVID-19 vaccine design.