How advances in genome sequencing paved the way to understand COVID-19 and its spread
By Dr Lara Marks, Visiting Research Fellow, Department of Medicine, University of Cambridge and Managing editor of WhatisBiotechnology.org
Published 16 June 2020, and updated 13 March 2021
Scientists are currently working as fast as possible to understand the biological behaviour of the Severe Acute Respiratory Syndrome coronavirus 2 (SARS-CoV-2), the pathogen responsible for COVID-19. All of this work would be much more difficult and slower to do without genome sequencing. The genome is the complete set of genetic instructions an organism needs to function and maintain itself. Being able to sequence the genome makes it much easier to identify a particular organism, and understand how it functions, its origins and how it is evolving. These insights are essential for the development of appropriate vaccines and diagnostic tests. Genomic sequencing is also vital to being able to determine how the SARS-CoV-2 virus is moving at the local and international level. With no effective vaccine or treatment yet widely available, such information is crucial for public health planning, clinical care strategies and establishing which measures are most effective at controlling the spread of the virus.
Genome sequencing grew out of a technique developed by Fred Sanger. First published in December 1977, his method provided the first means to sequence DNA, or deoxyribonucleic acid, a long chain-like molecule made up of nucleotides which provides all the information necessary for an organism to develop, live and reproduce. Known as the dideoxy sequencing system, Sanger's technique involves duplicating and tagging DNA fragments and then placing them on a gel in different lanes. An electric current is then applied to the gel which causes the longer DNA fragments to move more slowly than the shorter ones. The resulting pattern of the fragments is captured with an x-ray film placed on the gel at the end which provides a means to work out the nucleotide sequence. Many different fragments have to be subjected to this process to assemble the whole DNA sequence. Sanger's method could also be used to sequence the nucleotides of RNA (ribonucleic acid), a molecule that is important to protein synthesis and carries the genetic information of many viruses.
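The logic of reading a sequence from fragment lengths can be illustrated with a toy sketch. It assumes an idealised case in which there is exactly one tagged fragment ending at every position of the strand; the function names are invented for illustration and nothing here models the real chemistry.

```python
import random


def terminated_fragments(strand: str) -> list[tuple[int, str]]:
    """Model the tagged fragments: one per position, as (length, final base)."""
    return [(position + 1, base) for position, base in enumerate(strand)]


def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Read bases in order of fragment length, shortest first.

    Shorter fragments travel furthest through the gel, so ordering the bands
    by length recovers the order of the bases along the strand.
    """
    return "".join(base for _, base in sorted(fragments))


fragments = terminated_fragments("GATTACA")
random.shuffle(fragments)  # the fragments start out mixed together
print(read_gel(fragments))  # prints GATTACA
```

Sorting by length stands in for the gel: the separation step, not the tagging, is what puts the bases back in order.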
Photograph of Fred Sanger, credit Laboratory of Molecular Biology. Sanger, a British biochemist, was one of the few scientists to be awarded the Nobel Prize for Chemistry twice. The first was for his sequencing of the first protein, insulin, and the second for his method of sequencing DNA.
Diagram of Sanger's dideoxy sequencing method.
Originally a slow and time-consuming manual process, Sanger's technique was soon automated and hooked up to computers to help assemble, read and store the sequence data. By the late 1980s a number of new sequencing machines appeared on the market, helped by advances in microfluidics. These radically improved the speed of sequencing and opened the way to contemplate the sequencing of the first human genome in 1990.
It took 13 years and approximately $2.7 billion to completely sequence the 3.2 billion pairs of nucleotides (bases) in the human genome. More innovations followed, inspired by the launch of the Revolutionary Genome Sequencing Technologies project in 2003, which made $50 million available in grants to improve methods and helped drive the cost down to $1,000 per human genome in 2014. Many further improvements have since reduced the cost again. Today various companies can sequence the human genome for $600-700. The target is now to bring the cost down to just $100 (Anon, 12 March 2020).
Graph showing the dramatic drop in the cost of genome sequencing. Source: National Human Genome Research Institute.
Sequencing has become significantly cheaper and faster in recent years as a result of nanopore sequencing. First sketched out as a concept in 1989 by David Deamer, nanopore sequencing took many years to develop and was dependent on research advances made by many different scientists. The technique is radically different from the Sanger method in that it directly reads a sequence rather than having to assemble the sequence from different fragments. This is less time-consuming and particularly helpful in the context of pandemic diseases because it allows for the genetic code of infectious agents to be deciphered directly from biological samples in real time (Marx).
Photograph of David Deamer. Credit: Wikipedia. An American biomolecular engineer at the University of California, Santa Cruz, Deamer was one of the first to suggest that it might be possible to sequence a single strand of DNA by drawing it through a nanopore. He came up with the idea in June 1989 while driving from Oregon to California, but did not discuss it with anyone until he was visited by Daniel Branton from Harvard University two years later.
Photograph of Daniel Branton taken by Marion Cave. Image courtesy of the University and Jepson Herbaria Archives, University of California, Berkeley. In 1991, with Deamer's agreement, Branton began pursuing his concept for nanopore sequencing and it was agreed that Harvard University, where Branton was based, would take the lead in taking out patents on the technique. Branton and Deamer first published a description of the technique in 1996.
Nanopore sequencing exploits an innate biological property of cellular membranes, which are embedded with proteins that act as selective channels, facilitating the diffusion of molecules in and out of the cell. Current methods of nanopore sequencing use bespoke proteins that form very tiny holes, one billionth of a metre in diameter, in a biological membrane. Negatively charged, single-stranded DNA or RNA chains are pulled through the pores down their electrical gradient. An electric current is able to flow through the membrane and across the pore. However, this current is disrupted when the pore is blocked by the nucleotides of the RNA or DNA. The resultant change in current can be measured and matched to a specific nucleotide, and thus the sequence can be determined.
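The matching of current changes to nucleotides can be sketched in a few lines. This is a deliberately simplified illustration: the current levels below are invented, each measurement is assumed to correspond to exactly one nucleotide, and real basecalling software is considerably more sophisticated than a nearest-value lookup.

```python
# Hypothetical current levels (in picoamps) for each nucleotide.
# These values are invented for illustration, not measured ones.
LEVELS = {"A": 50.0, "C": 60.0, "G": 70.0, "T": 80.0}


def call_base(current: float) -> str:
    """Return the nucleotide whose reference level is closest to the reading."""
    return min(LEVELS, key=lambda base: abs(LEVELS[base] - current))


def basecall(readings: list[float]) -> str:
    """Turn a series of current readings into a nucleotide sequence."""
    return "".join(call_base(current) for current in readings)


# Noisy readings, one per nucleotide passing through the pore.
readings = [71.2, 49.1, 79.5, 81.0, 50.8, 58.9, 50.2]
print(basecall(readings))  # prints GATTACA
```

Because each reading is matched independently, the sketch tolerates small amounts of noise but would fail if two nucleotides produced overlapping current levels.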
The first commercial sequencer using nanopore technology, developed by Oxford Nanopore Technologies, a company based in Oxford, came on to the market in 2014. Since then many improvements have been made to the technology, leaving it uniquely well placed for sequencing the SARS-CoV-2 virus (Marx). Able to fit inside a person's pocket, nanopore sequencers are highly portable. By comparison, older sequencers can be as large as a fridge. Relatively cheap to purchase, the new nanopore sequencers can be plugged into a laptop USB port, allowing data to be sent immediately across the world for rapid analysis. This means that sequencing can be carried out just as easily in remote areas with limited laboratory facilities and no appropriately trained staff as in a well-equipped hospital with experts on hand. Nanopore sequencers recently proved invaluable, for example, for the real-time surveillance of the Ebola virus in parts of West Africa and of the Zika virus in the hard-to-reach Amazon jungle (Yong; Deamer, Akeson, Branton; O'Carroll).
Diagram illustrating the process of nanopore sequencing. Credit: Daniel Power.
The first genome sequence of the SARS-CoV-2 virus was drafted on 5 January 2020 by a group of Chinese scientists led by Yong-zhen Zhang at the Shanghai Public Health Clinical Centre and School of Public Health, Fudan University. Their work showed the virus to be a close relative of the human SARS coronavirus that emerged in 2002 and of a bat SARS-like coronavirus (Zhu et al). The genome, a single strand of RNA, was found to be ~30,000 nucleotides long. This is among the longest genomes of any known RNA virus (Marian; Hassamin). The first sequence was generated remarkably quickly, just a week after the first patient was hospitalised with unusual pneumonia at Wuhan Central Hospital. In comparison, the genome of the coronavirus behind the first outbreak of SARS in 2002 took a few months to sequence (CDC).
Based on the rapid sequencing of the SARS-CoV-2 genome, scientists were able to quickly identify a number of genes that encode important viral proteins. Critical among these are the genes that carry the instructions for the spike (S) protein, which helps the virus invade human cells. This protein is now a key target for the development of vaccines and treatments. Scientists have also found other genes that allow the virus to make copies of itself, notably those encoding the NSP7 and NSP8 proteins. Further work is needed to fully characterise the function of other proteins, some of which may help the virus to suppress the host's immune response (Corum, Zimmer; Kupferschmift; Highfield).
The first genome of the SARS-CoV-2 virus was sequenced from a clinical sample taken from a 41-year-old man admitted to Wuhan Central Hospital on 26 December 2019. He worked at the local seafood market where the first cluster of cases of unusual pneumonia was identified (Wu et al). However, the first sequence could only answer the very basic question of what was causing the new disease. Little could be gauged about where the virus came from, how it was evolving and how fast it was moving. These questions could only be answered by comparing viral genome sequences isolated from samples collected from as many patients, animals and places as possible. Efforts quickly got underway to sequence the genome of the virus from as many sources as possible (Kupferschmift).
It is possible to work out the path of a virus from sequencing. This is because the virus' replication machinery is not perfect and errors get incorporated into the genome. As such, each virus has a unique genome that can be determined and lineages tracked using sequencing. This enables scientists to follow the journey of the virus through a population and study its main paths of transmission. They can work out whether the virus is likely to have come from a particular person, geographical hotspot or whether it was encouraged by certain cultural norms or specific workplace practices. Identifying such points is important for working out which modes of transmission need to be targeted in terms of public health control strategies (Genomics Education Programme).
Many scientists across the world are now racing to sequence the genome of SARS-CoV-2 from as many patient samples as possible. Data generated from this work is vital for epidemiologists and public health authorities to understand how the virus is spreading and to evaluate the effectiveness of different interventions. Significantly, genome sequencing can help determine how many cases of infection have been imported and how many come from a local source. For example, if two samples have exactly the same sequence then they are likely to be part of the same transmission chain. Alternatively, if a viral genome from a patient in the UK is identical to one from another country apart from an additional mutation, it is likely that the virus was introduced into the UK from that country. This kind of tracking can also be done on a local scale, allowing for an assessment of transmission in local communities, hospitals and care settings. It is also possible to work out whether two clusters of cases that appear in the same location are linked to each other or originate from two distinct and independent chains of transmission with separate, earlier origins (UKRI). Circulation of the virus in a country can also be estimated from the number of mutations in the local population (Gudbjartsson et al).
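The comparisons described above can be sketched as follows. This is a minimal illustration that assumes the genomes have already been aligned to the same length; the sample names and the ten-letter sequences are invented, whereas a real SARS-CoV-2 genome is around 30,000 nucleotides long.

```python
def differences(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """List the positions (counted from 1) where two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position + 1, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


def same_chain_candidate(seq_a: str, seq_b: str) -> bool:
    """Identical sequences suggest, but do not prove, a shared transmission chain."""
    return not differences(seq_a, seq_b)


# Hypothetical toy samples.
samples = {
    "patient_1": "ACGTACGTAC",
    "patient_2": "ACGTACGTAC",
    "patient_3": "ACGTACGTAT",  # same as the others plus one extra mutation
}

print(same_chain_candidate(samples["patient_1"], samples["patient_2"]))  # prints True
print(differences(samples["patient_1"], samples["patient_3"]))  # prints [(10, 'C', 'T')]
```

A single extra mutation, as in the third sample, is the kind of difference that lets researchers infer which genome is likely to descend from which.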
Current SARS-CoV-2 genomic data suggests that the virus is mutating slowly, acquiring an average of one or two mutations a month (Kupferschmift). This is not surprising because coronaviruses generally have very low mutation rates. At present researchers have only found a small number of circulating variants, reflecting the fact that SARS-CoV-2 was only introduced into the human population in late 2019 (Rivett et al). More genetic diversity will emerge as the pandemic unfolds. Organisations like Nextstrain are working to map the evolutionary journey of the virus, depicting it as a phylogenetic tree. Like a family tree, a phylogenetic tree charts the evolution of different lineages of the virus. The tree is drawn with the earliest-known version of the genome, or common ancestor, as the root and then branches out to represent every mutation away from that ancestor (Jarvis). Each lineage represents an individual chain of transmission. The tree is important for making connections between different COVID-19 cases and determining where any undetected transmission of the virus might have occurred. It can also help to highlight where the virus is getting more dangerous (Bedford).
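To put the mutation rate in perspective, the figure of one or two mutations a month can be combined with the genome length of roughly 30,000 nucleotides given earlier. The sketch below is back-of-the-envelope arithmetic only; it assumes a constant rate, and the function names are invented for illustration.

```python
GENOME_LENGTH = 30_000  # approximate number of nucleotides, as quoted in the article


def rate_per_site_per_year(mutations_per_month: float,
                           genome_length: int = GENOME_LENGTH) -> float:
    """Convert mutations per genome per month into mutations per nucleotide per year."""
    return mutations_per_month * 12 / genome_length


def expected_mutations(months: float, mutations_per_month: float) -> float:
    """Expected mutations along one lineage, assuming a constant rate."""
    return months * mutations_per_month


for per_month in (1, 2):
    rate = rate_per_site_per_year(per_month)
    print(f"{per_month} per month is about {rate:.4f} per nucleotide per year")
```

On these assumptions the rate works out at roughly 4 to 8 mutations per 10,000 nucleotides per year, and a lineage that has circulated for six months would be expected to carry somewhere between 6 and 12 mutations.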
Diagram from Nextstrain of a phylogenetic tree at an early stage. On the right hand side we can see different genome sequences with mutations shown as coloured circles. The longer the line, the more mutations. Where the coloured circles, linked together by a vertical line, are identical the sequences are considered identical. Sequences can be grouped together, as in the case of A and B. Both A and B share a mutation (coloured green) not found in the other sequences.
Diagram from Nextstrain showing a more fully sampled phylogenetic tree with samples from different locations denoted by orange and blue.
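The grouping shown in these diagrams can be sketched by treating each genome as the set of mutations separating it from the root. The sample names and mutation labels below are invented for illustration, and the sketch only groups sequences; it does not build a full tree in the way Nextstrain's software does.

```python
from collections import defaultdict

# Hypothetical genomes, each described by its mutations relative to the root.
genomes = {
    "A": {"C5T", "G11A"},
    "B": {"C5T", "A20G"},
    "C": {"T8C"},
    "D": {"T8C"},
}


def branch_length(mutations: set[str]) -> int:
    """The more mutations away from the root, the longer the line."""
    return len(mutations)


def identical_groups(samples: dict[str, set[str]]) -> list[list[str]]:
    """Group samples whose sets of mutations are exactly the same."""
    groups = defaultdict(list)
    for name, mutations in samples.items():
        groups[frozenset(mutations)].append(name)
    return sorted(sorted(names) for names in groups.values())


def shared_mutations(samples: dict[str, set[str]], names: list[str]) -> set[str]:
    """Mutations carried by every one of the named samples."""
    return set.intersection(*(set(samples[name]) for name in names))


print(identical_groups(genomes))               # prints [['A'], ['B'], ['C', 'D']]
print(shared_mutations(genomes, ["A", "B"]))   # prints {'C5T'}
print(branch_length(genomes["A"]))             # prints 2
```

Here C and D are identical and so sit together, while A and B share one mutation and would branch from a common point before diverging.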
Undertaking the rapid genome sequencing of SARS-CoV-2 worldwide demands a high degree of collaboration and cooperation between different scientists. In the United Kingdom this work is being coordinated by the COG-UK Consortium, a partnership set up in March 2020, of NHS organisations, four Public Health Agencies of the UK, the Wellcome Sanger Institute and more than 12 academic institutions. Led by Professor Sharon Peacock of the University of Cambridge and Director of Science at Public Health England, the Consortium has established a network of laboratories around the UK to sequence samples collected from patients positively diagnosed with COVID-19 by NHS hospital laboratories, public health laboratories and national coronavirus testing centres.
The work involves painstakingly converting and amplifying the RNA material from different samples so that it can be sequenced. Such work is very repetitive. As Professor Ian Goodfellow, who heads up the Cambridge team, says, 'Put very simply, it's a long series of steps, part of which involves repeatedly sticking genetic material to tiny beads, and then washing all the [unwanted material] off with ethanol – all the RNA from other bugs up your nose or in your throat. But do it too slowly or too fast and the virus RNA diminishes… We get samples from Public Health England about 10am, then finish about 7pm, by which time my brain's a bit fried from concentrating very hard on carefully moving around miniscule quantities of liquid'. To relieve the tedium the team's scientists usually work in pairs, which also enables them to double-check each other's work. Laboratories receive positive virus samples within a day of being taken from a patient. These take between one and eight hours to prepare and sequence. One laboratory can sequence between 24 and 70 virus samples a day. This work is carried out in small black hydroponic gazebos (about the size of a suitcase) which were originally used in Africa during the Ebola outbreak. Easy to clean and fold down, the gazebos act as self-contained units, thereby helping to prevent the spread of the virus (Lewsey).
The Consortium is managing to sequence and analyse SARS-CoV-2 genomes from many different sources incredibly quickly. By 28 May 2020, seventeen of the Consortium's centres had sequenced and analysed 20,637 SARS-CoV-2 genomes from positive COVID-19 samples within 11 weeks. This accounted for more than 56 per cent of the global total number of SARS-CoV-2 genomes (COG-UK Report #7). The Consortium estimates there are about 40 lineages of the virus now circulating in the UK. In the early days the UK appears to have imported a number of different lineages of a single strain of the virus from abroad, including from European countries like Spain, Italy and France. More recent data indicates that most of the new cases in the UK are now arising from local spread rather than entering the UK from other countries (COG-UK Analysis).
Genomic sequencing information on different SARS-CoV-2 variants is especially useful for determining why some diagnostic tests work better than others. The UK Consortium showed, for example, that the failure of one diagnostic test to detect the virus was linked to the fact that it targets a variant (C26340T) which is very rare both in the UK and globally. In mid-May the variant showed up in just 19 of the 23,000 genomes sequenced, or 0.08 per cent (COG-UK Report #6).
Data generated by the Consortium has also proven invaluable to teams reviewing the robustness of infection control procedures and patient safety. Addenbrooke's Hospital in Cambridge, for example, was able to work out where the virus had been transmitted between wards by healthcare workers and through shared patient transport. It was also able to identify links between healthcare workers and a care home that had not been detected by clinicians or the infection control team. The data also helped rule out transmission assumed to have taken place between the dialysis unit and inpatient renal wards (COG-UK Report #7).
Number of SARS-CoV-2 genomes sequenced by the COG-UK Consortium up to 13 May 2020. Credit: COG-UK Report #6, table 1.
The United Kingdom is just one of many countries involved in sequencing the SARS-CoV-2 genome. In early March 2020 scientists from the company deCODE genetics, Iceland's Directorate of Health and the National University Hospital launched a study to sequence the viral genome in members of the Icelandic population who tested positive for the virus. The aim of the study was to evaluate how the virus mutates and spreads in Iceland, which was one of the first countries to implement aggressive testing, tracking and isolation policies to curb the pandemic (Gudbjartsson et al).
Results from the study were published in April 2020. The study included data from 1,221 residents screened between January and March 2020 who had returned from countries or regions with high numbers of infections or had been in direct contact with infected people. Another 2,797 people were included from Reykjavik's general population who were either symptom-free or had mild symptoms of the common cold. All participants had their noses and throats swabbed for the study. At the start of the study most of the virus genomes sequenced in the Icelandic community appeared to come from northern Italy or Austria, where many of the study's early participants had travelled. Over time more cases began to emerge originating from the United Kingdom. Relatively few came from Asia or the Western United States. By the end of the study the infection rate had become very low and stable. This was probably due to the aggressive testing, contact tracing and isolation measures taken in Iceland. The Icelandic team cautioned that the virus would only remain contained in the country if such measures continued (Gudbjartsson et al).
Rapid genomic sequencing in this pandemic is not just the preserve of countries like the UK and Iceland which have well-equipped laboratories. It is also possible in places with minimal resources. This is thanks not only to the arrival of nanopore sequencing, but also to the development of standardised methods and protocols drawn up by the ARTIC network. Set up in 2017 with funds from the Wellcome Trust, the ARTIC network is made up of an international group of scientists from the Universities of Edinburgh, Birmingham, Cambridge and Oxford, KU Leuven, UCLA and the Fred Hutchinson Cancer Research Center. The network has developed a set of laboratory and bioinformatics protocols for the control of infectious diseases. These protocols were built on the experience of using nanopore sequencers to sequence the Ebola virus during the epidemic in West Africa between 2013 and 2016. They are currently being used by local scientists out in the field working on the ongoing Ebola outbreak in the Democratic Republic of the Congo, the surveillance of the polio virus in Pakistan and of measles in Rwanda. The ARTIC method makes it possible to sequence the SARS-CoV-2 genome within eight hours. It was first used by scientists from Hangzhou Centre for Disease Control in China (Anon, 12 Feb 2020).
The current COVID-19 pandemic is not only notable for the speed with which the SARS-CoV-2 genome has been sequenced. What is also remarkable is the degree to which sequencing data and analysis is being shared. Part of this can be attributed to lessons learnt from previous outbreaks. One of the reasons it took months to sequence the genome for the first SARS outbreak in 2002-03 was that there was an information blackout in the first few months. Another problem arose during the outbreak of Asian avian influenza A (H5N1), which posed a global threat to both animal and human health during 2004. Part way through the outbreak Indonesia, which was hit particularly hard by the flu, refused to continue sharing its H5N1 virus samples with the World Health Organisation on the grounds that samples provided freely by developing countries could be used by companies in wealthy countries to produce vaccines and treatments too costly for developing countries to purchase (Roos). Similar concerns were also raised by Thailand. The dispute brought to the fore the problems that intellectual property rights pose for protecting global public health (Fidler).
Indonesia's action caused great alarm among the global health experts striving to build multilateral strategies to combat the bird flu and prepare for other types of pandemic influenza (Fidler). One good thing to come out of the incident was the formation of the Global Initiative on Sharing Avian Influenza Data (GISAID) consortium. It was founded in 2006 after 70 leading flu scientists, including six Nobel Laureates, published a letter in Nature committing themselves to share flu data more quickly and openly (Bodger et al). Scientists participating in GISAID agree to share their sequence data, jointly analyse the findings and publish their results collaboratively. All data is released through GenBank and other public databases. From the start one of the objectives of the scheme was to provide a model for the 'rapid dissemination of data from outbreaks of future emerging diseases' to help speed up on-the-ground responses and help affected countries 'build comprehensive and sustained disease-surveillance programmes'. GISAID was also intended from the outset to help scientists in low to middle-income countries who lack the means to carry out sequencing to participate in the analysis and interpretation of data as soon as it gets shared (Anon, 20 Aug 2006; Bodger et al).
Hosted by the German Federal Ministry of Agriculture and Consumer Protection since 2010, GISAID has proven enormously beneficial for the sharing of information during the current COVID-19 pandemic. GISAID became a central coronavirus repository in December 2019 and since then it has been a key place for laboratories from around the world to deposit their genomic sequencing data for the SARS-CoV-2 virus. As can be seen from the graph below GISAID has seen an explosion in the number of submissions of genomic sequencing data from around the world, especially since September 2020.
Graph showing SARS-CoV-2 genomic sequencing data submitted to GISAID up to 12 March 2021.
Number of SARS-CoV-2 genome sequences reported in either MRC CLIMB or GISAID up to 27 May 2020. Source: COG-UK Report #7, figure 1.
The volume of genomic sequencing data for SARS-CoV-2 is increasing rapidly. It is being shared not only through GISAID but also through many other open access platforms. Such information will be no less important as the pandemic progresses, as it will show how many transmission chains are still circulating and help to identify which interventions are most effective at curbing the spread of the virus. Linking the data to different patients' DNA and health records could also help provide clues as to why some people are hit harder than others by the virus. Findings from this research will be important to scientists trying to develop effective vaccines and treatments.
I would like to thank both Stephen Baker and Daniel Power for their careful reading of the first draft of this article.
Anon (12 March 2020), 'Genomics took a long time to fulfil its promise', The Economist.
Bedford, T (2 March 2020), 'Cryptic transmission of novel coronavirus revealed by genomic epidemiology'.
Bodger, P et al (30 Aug 2006), 'A global initiative on sharing avian flu data', Nature, 442, 981, https://www.nature.com/articles/442981a.
CDC (14 April 2003), 'SARS-Associated Coronavirus (SARS-CoV) Sequencing'.
Corum, J, Zimmer C (3 April 2020), 'Bad news wrapped in protein: Inside the coronavirus genome', The New York Times.
Deamer, D, Akeson, M, Branton, D (6 May 2016), 'Three decades of nanopore sequencing', Nature Biotechnology, 34/5, 518–24, doi:10.1038/nbt.3423.
Fidler, D (28 Feb 2007), 'Indonesia's Decision to Withhold Influenza Virus Samples from the World Health Organization: Implications for International Law', American Society of International Law, 11/4.
Genomics Education Programme (9 April 2020), 'New initiative harnesses genomics in coronavirus fight'.
Gudbjartsson, GF, et al (14 April 2020), 'Spread of SARS-CoV-2 in the Icelandic population', New England Journal of Medicine.
Hassamin, A (24 March 2020), 'Coronavirus could be a "chimera" of two different viruses, genome analysis suggests', The Conversation.
Highfield, R (27 April 2020), 'Coronavirus: Hunting down Covid-19', Science Museum Blog.
Jarvis, C (23 April 2020), 'How genomic epidemiology is tracking the spread of COVID-19 locally and globally', C&EN, 98/17.
Kupferschmift, K (9 March 2020), 'Mutations can reveal how the coronavirus moves—but they're easy to overinterpret', Science.
Marian, AJ (Jan-Mar 2014), 'Sequencing your genome: What does it mean?', Methodist DeBakey Cardiovascular Journal, 10/1, 3–6.
Marx, V (29 Oct 2015), 'Nanopores: a sequencer in your backpack', Nature Methods, 12, 1015–18.
O'Carroll, L (13 Feb 2016), 'From Ebola to Zika, tiny mobile lab gives real-time DNA data on outbreaks', The Guardian.
UKRI (25 March 2020), 'How does virus genome sequencing help the response to COVID-19?'.
Wu, OF et al (3 Feb 2020), 'A new coronavirus associated with human respiratory disease in China', Nature, https://www.nature.com/articles/s41586-020-2008-3.
Zamiska, N (31 Aug 2006), 'A Nonscientist Pushes Sharing Bird-Flu Data', Wall Street Journal, B.1.
Zhu, N et al (24 Jan 2020), 'A Novel Coronavirus from Patients with Pneumonia in China, 2019', New England Journal of Medicine.