Variant-driven early warning via unsupervised machine learning analysis of spike protein mutations for COVID-19

Adele de Hoffer, Shahram Vatani, Corentin Cot, Giacomo Cacciapaglia, Maria Luisa Chiusano, Andrea Cimarelli, Francesco Conventi, Antonio Giannini, Stefan Hohenegger, Francesco Sannino*


Publication: Contribution to journal, Journal article, Research, peer-reviewed



Never before has such a vast amount of data, including genome sequencing, been collected for any viral pandemic as for the current case of COVID-19. This offers the possibility to trace the evolution of the virus and to assess, in real time, the role mutations play in its spread within the population. To this end, we focused on the Spike protein for its central role in mediating viral outbreak and replication in host cells. Employing the Levenshtein distance on the Spike protein sequences, we designed a machine learning algorithm yielding a temporal clustering of the available dataset. From this, we were able to identify and define emerging persistent variants that are in agreement with known evidence. Our novel algorithm allowed us to define persistent variants as chains that remain stable over time and to highlight emerging variants of epidemiological interest as branching events that occur over time. Hence, we determined the relationship and temporal connection between variants of interest and the ensuing passage to dominance of the current variants of concern. Remarkably, the analysis and the relevant tools introduced in our work serve as an early warning for the emergence of new persistent variants once the associated cluster reaches 1% of the time-binned sequence data. We validated our approach and its effectiveness on the onset of the Alpha variant of concern. We further predict that the recently identified lineage AY.4.2 ('Delta plus') is causing a new emerging variant. Comparing our findings with the epidemiological data, we demonstrated that each new wave is dominated by a new emerging variant, thus confirming the hypothesis of a strong correlation between the birth of variants and the multi-wave temporal pattern of the pandemic. The above allows us to introduce the epidemiology of variants, which we describe via the Mutation epidemiological Renormalisation Group framework.
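Two ingredients named in the abstract, a Levenshtein distance between Spike sequences and a 1% share threshold per time bin, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `levenshtein` and `flag_emerging`, the toy cluster labels and the week labels are hypothetical, and the clustering step itself is omitted (each sequence is assumed to already carry a cluster label).

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def flag_emerging(bins, threshold=0.01):
    """Hypothetical early-warning check.

    bins maps a time-bin label to the list of cluster labels of the
    sequences collected in that bin. Returns, for each cluster, the first
    bin in which its share of the bin reaches the threshold. Bin labels
    are assumed to sort chronologically as plain strings.
    """
    first_flagged = {}
    for label in sorted(bins):
        counts = Counter(bins[label])
        total = sum(counts.values())
        for cluster, n in counts.items():
            if total and n / total >= threshold and cluster not in first_flagged:
                first_flagged[cluster] = label
    return first_flagged


# Toy data: cluster "B" grows from 0.5% to 2% of the weekly sequences.
toy_bins = {
    "2020-W40": ["A"] * 995 + ["B"] * 5,
    "2020-W41": ["A"] * 980 + ["B"] * 20,
}
flags = flag_emerging(toy_bins)
```

In this toy input, cluster "B" is flagged only in the second week, when its share crosses the 1% threshold quoted in the abstract. How clusters are built from the pairwise distances is left open here.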

Journal: Scientific Reports
Status: Published - 3 Jun 2022

Bibliographical note

Funding Information:
We acknowledge with gratitude the authors, originating and submitting laboratories of the genetic sequence and metadata made available through GISAID. A full listing of all authors and laboratories is available on the GISAID website.

Publisher Copyright:
© 2022, The Author(s).

