The challenge of data linkage, analysis and visualisation

Drinking from the firehose

Faced with a rapidly growing number of clinical samples from people with suspected COVID-19, one of the formidable tasks for COG-UK was handling the increasing volume of genomic data that these samples generated. In addition, the genome of each sample needed to be linked to information about the patient it had been taken from. This was not straightforward. Patient information had to be anonymised in such a way that it did not breach patient confidentiality and could be shared across the four nations of the UK and with researchers.

There was also the question of how to integrate patient and genome data so that it could indicate how the virus was evolving as well as its geographical patterns of spread over time. Creating data that was immediately actionable was critical to helping inform the public health response both at the national and local level. This rested on the development of software tools to analyse the data and visualise this in a user-friendly format. It was also important to use a standard nomenclature system to classify and name the different lineages identified by sequencing. Having such a classification system was critical to researchers worldwide being able 'to better understand the patterns and determinants driving the local, regional and global spread of SARS-CoV-2 and to track new variants as they emerge' (Marjanovic Final Report).

Patient metadata

Dr Anthony Underwood, who contributed to COG-UK's bioinformatics effort from the Centre for Genomic Pathogen Surveillance in Oxford, likens patient metadata to the 'family silver'. Without metadata, Underwood argues, 'there was really no point in carrying out the sequencing'. Furthermore, access to more detailed patient information (for example, vaccination status or reinfection) allowed greater insights into the relationship between the virus and host. Underwood points out 'there is a set of minimal metadata required which, when paired with the genome sequence from the sample enables epidemiologists and modellers to examine the emergence of particular clones and lineages to determine if these are linked to a particular patient demographic or location. In addition to minimal metadata it's really critical to collect as much data as possible so that even more powerful studies are possible' (Underwood transcript).

Just how important such data was is also highlighted by Dr Catherine Ludden, who gave the generic example that 'we could say that on the 12th of October we found this new variant in this region of England', which could trigger public health actions. Based on its importance, Ludden and Beth Blane spent a lot of time at COG-UK's hub contacting laboratories who submitted samples for sequencing minus patient information, and finding solutions to overcome any barriers involved (Ludden and Blane transcript).

Collecting patient data

According to Dr Michael Chapman, who helped set up COG-UK data infrastructure, many people in the early days had 'a fairly miserable time having to extract data from the clinical system in their labs, get it into a spreadsheet template, and then upload it'. But over time, 'more and more people started to code scripts to extract data from their clinical systems so that they weren't having to cut and paste data into spreadsheets' (Chapman transcript).

Some idea of what the work entailed is revealed by Dr Beatrix Kele, who was involved in sequencing at Barts Health NHS Trust. She and her team spent a lot of time collating patient metadata into spreadsheets. The task was made easier by the fact that samples were requested using an electronic patient records system, but it still required a lot of painstaking work. Kele had to have 'the patient data records open on one screen, and the Excel sheet open on the other screen' to manually type the metadata into the spreadsheet. This also had to be done in such a way that no confidential patient information was transferred across to research databases (Kele and Cutino transcript).

Dr James Shepherd also provides some insights into how metadata was collected at the Centre for Virus Research in Glasgow. He was one of five clinical research fellows that Professor Emma Thomson called upon to help with the process. As clinicians they had the advantage that they could straddle the interface between the NHS and research setting. Shepherd was tasked with finding samples that had tested positive for SARS-CoV-2 to take to the sequencing facility and then link these samples to the patient data. For Shepherd, the real challenge in this process was 'having to go through case notes to find out what happened to people', for example, how sick they had become and whether they needed to have oxygen or be cared for in an intensive care unit. Shepherd points out that every sample sequenced required both him and his colleagues 'going through the notes and trying to pull out all that information and categorise it.' He highlights the fact that 'there weren't many cases initially, but as the pandemic exploded we were looking at thousands of cases' (Shepherd transcript).

Fortunately, Shepherd could call on ten medical students who volunteered to help. He says they 'were all very keen because all of their rotations had stopped'. Once Shepherd gave them a list they would then look through the records to pull out the relevant cases. Thomson and her team quickly set up a system locally to store patient data securely. The data collected included 'age, sex, where the patient lived, where they travelled to, what their clinical outcome was, various other clinical information', which they then sought to associate with findings from the sequence data (for example, to ask whether some variants of concern caused more severe disease). With this information, the Glasgow team were able to investigate some of the first introductions of COVID-19 into Scotland (Shepherd transcript).

Another person who helped gather patient metadata was Dr William Hamilton, an academic clinician who helped support Goodfellow and his sequencing team in Cambridge. Having affiliations with both Cambridge University Hospitals and the Sanger Institute, he was in an ideal position because he had knowledge of how genomics works and could readily get access to clinical data. Based on this, he took on the role of extracting the patient metadata required by COG-UK from the clinical servers at Addenbrooke's hospital. With Addenbrooke's then being the testing centre for 'basically the whole of the East of England', Hamilton was able to access many samples. To help extract the data, Hamilton wrote computer code to pull out every COVID-19 test done in the previous three days from Addenbrooke's computer patient record system and then remove any data that could not be publicly released, such as the patient's name and date of birth to ensure that it 'was appropriate for information governance.' Working very long hours, Hamilton recalls there was 'lots of data processing' required to turn the data 'into a format that's analysable, anonymised, and appropriate for COG-UK usage' (Hamilton transcript).
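The kind of extraction Hamilton describes can be illustrated with a minimal sketch. It is not his code: the field names, the record layout and the function name are all assumptions made for illustration, since the transcript does not describe the Addenbrooke's system in that detail. The sketch keeps only tests from the previous three days and drops fields that identify the patient.

```python
from datetime import date, timedelta

# Hypothetical field names; the real clinical schema is not described in the source.
IDENTIFYING_FIELDS = {"patient_name", "date_of_birth"}


def recent_deidentified_tests(records, today, window_days=3):
    """Return tests from the last `window_days` days with identifying fields removed."""
    cutoff = today - timedelta(days=window_days)
    cleaned = []
    for record in records:
        if not (cutoff <= record["test_date"] <= today):
            continue  # outside the extraction window
        # Copy every field except those that must not be released.
        cleaned.append({k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS})
    return cleaned


# Illustrative records only; these are invented, not real data.
records = [
    {"sample_id": "S1", "patient_name": "A. Example", "date_of_birth": date(1950, 1, 1),
     "test_date": date(2020, 4, 9), "result": "positive"},
    {"sample_id": "S2", "patient_name": "B. Example", "date_of_birth": date(1980, 6, 2),
     "test_date": date(2020, 4, 1), "result": "positive"},
]
released = recent_deidentified_tests(records, today=date(2020, 4, 10))
```

Removing named fields is only one part of what information governance requires; the sketch does not attempt the further checks a real release would need.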

Data sharing

Sharing patient metadata is a sensitive issue and handling it must be right every time. Some of the anxieties it raised are captured in the interview with Professor Matthew Holden, a whole genome sequencing advisor to Public Health Scotland. He flags up the 'tension between the academic and the research need for data' and the need of the NHS and public health agencies, where the data resides, to protect it. One of the issues that cropped up in Scotland was that, because of its small population and geography, there was potentially more risk of individuals being identified from any shared data, even when anonymised. He says there was also the added complexity that 'data within Scotland is under the responsibility of the NHS boards and agencies of Public Health Scotland which have their own processes in place.' That meant that when someone wanted to exchange data at the UK level, they had to meet the local requirements. By contrast, Holden points out, 'the rest of the UK and Wales come under the same umbrella so they can make the decision about NHS data' (Holden transcript).

Because Public Health Scotland recognised the need to share data with academic partners, it arranged to do this through its National Safe Haven which existed before the arrival of COVID-19. Hosted by the University of Edinburgh, the National Safe Haven contains 'various research data sets that's held within Scotland' to which academic partners can apply for access. Holden explains, 'Because it's operating in what is in effect a safe haven you can query the data and effectively conduct analysis and then export the non-identifiable amalgamated analysis that then allows you to provide the results. So, it's a way in which data can be held secure and controlled but allow researchers access to it' (Holden transcript).

COG-UK took several steps to ensure that no confidential patient information was accidentally leaked. It made sure that each sample was given an anonymous COG-UK ID and was not accompanied by identifiable information. It also sent out protocols to the laboratories about how to send the samples and a spreadsheet template with a minimum list of the metadata that needed to be filled in to go with the samples (Ludden and Blane transcript). Chapman argues that the use of spreadsheets was not a perfect system, but it meant they did not have to wait 'weeks or months of computer development to try and get an NHS system to talk to an external academic system in an automated fashion'. Each centre was asked to keep a record of the ID assigned to a sample so that where needed it could be 'traced back to an individual's NHS number or other unique identifier' (Chapman transcript).

For the first few months, COG-UK was heavily reliant 'upon the information that was being provided directly into CLIMB by the labs, capturing the samples, and getting the health records' it wanted to carry out any data analysis. Chapman remembers this 'turned out to be really important because it took quite a long time to get engaged with public health agencies in a way that then allowed us to get that data back out'. At this point, he says, it was a question of capturing enough information on the sample that would allow COG-UK 'to backfill that for PHE [Public Health England] and other agencies' (Chapman transcript).

COG-UK subsequently managed to build a data system with assistance from PHE to support the COVID-19 sequencing work. Launched in October 2020, the system took several months to construct. The task involved finding a way to bring together the sequencing data stored in CLIMB with the patient data that only the public health agencies could hold. The new system provided 'a flow of sample identifiers, coming into PHE, and then a core dataset going out into the CLIMB system' for analysis (Chapman transcript).

Data linkage and cleaning

Looking back, Sharon Peacock reflects that one of the hardest tasks COG-UK faced was data linkage across the healthcare system and associated with testing sites, especially in the early days of the pandemic (Peacock transcript). This made it more challenging to connect the sample with patient metadata (Harrison transcript). Academic sites were permitted to know that the sample came from a particular time and particular place, but everything else was anonymised. Peacock remembers data linkage was further complicated by the fact that in the very early days of testing in the community, information was recorded manually (Peacock transcript).

From Chapman's perspective, data linkage was complicated by the fact that COG-UK aimed to provide 'a collective pooling of data across different institutions'. His recollection is that 'We had to have mechanisms for everybody to sign up to and for new people coming in to be able to add their data to that pool, but then be able to access everybody else's data that they had contributed to, to analyse it.' Then, he adds, there was 'the complexity of how the NHS and academia are set up. If you're trying to talk to Imperial College for example, you've got Imperial College, but then you've got Imperial College Healthcare Trust, a different organisation. And then, they're probably sending all their samples to a lab that sits at University College London, but is that University College, or is it the hospitals?' (Chapman transcript).

Another obstacle that prevented smooth data linkage was the fact that each hospital and NHS Trust had different laboratory management information systems. Dr Ian Harrison, who was embedded in the SARS-CoV-2 genomics analysis cell at PHE, explains, 'Essentially, if you're in the Royal Free in London, they set up a system so that they can label all of their tubes and know exactly which tube's which, and they will buy a system to do that. And all of their systems are based on their healthcare trust. When you have a pandemic like this, and you're pulling in samples from the Royal Free [Hospital] and from all across the country, all 130-odd diagnostic labs, all these sample IDs which are unique within each trust actually are not often then unique when you look at them … across the country.' This meant that one hospital system could give a very similar barcode to another. He argues the system is fine when a hospital or Trust is 'working in its own bubble', but 'when you suddenly try and pull data together at a national level very quickly, it will cause real problems' (Harrison transcript).
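The problem Harrison describes, identifiers that are unique within a trust but not across the country, can be sketched in a few lines. This is an illustration only: the site codes, the ID format and both function names are assumptions, and it is not a description of how PHE or COG-UK resolved the issue. It shows how clashes might be detected and how prefixing a site code makes a local ID unambiguous in a pooled dataset.

```python
from collections import defaultdict


def find_colliding_ids(samples):
    """Map each local sample ID used by more than one site to the sites that use it."""
    sites_by_id = defaultdict(set)
    for site, local_id in samples:
        sites_by_id[local_id].add(site)
    return {
        local_id: sorted(sites)
        for local_id, sites in sites_by_id.items()
        if len(sites) > 1
    }


def national_key(site, local_id):
    """Combine a (hypothetical) site code with the local ID so the pair is unique."""
    return f"{site}:{local_id}"


# Invented examples: two sites happen to issue the same local barcode.
samples = [
    ("TRUST_A", "20-000123"),
    ("TRUST_B", "20-000123"),
    ("TRUST_B", "20-000456"),
]
collisions = find_colliding_ids(samples)
keys = {national_key(site, local_id) for site, local_id in samples}
```

Here `collisions` reports that `20-000123` was issued by both sites, while `keys` still contains three distinct entries.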

Another issue Harrison recollects that COG-UK confronted was the fact that 'people can have different NHS numbers, depending on when they were registered in … and they might have two NHS numbers'. This was not so much of a problem in Wales because there they have a 'national based system, it's all joined up'. Likewise, in Scotland the NHS number is unique to each patient. But in England this is not the case. Harrison highlights that this made it difficult to know who had been 'tested multiple times or whatever' (Harrison transcript).

Some of the data issues, such as duplicates, could be easily ironed out using simple coding (Harrison transcript). But not everything could be resolved programmatically because sometimes, as Dr Nabil-Fareed Alikhan, a bioinformatician at the Quadram Institute, says, there would be 'a new exception that would break it'. According to him, a lot of the challenges came from the fact that 'hospitals or clinical partners weren't used to giving standardised information about samples'. With the Quadram Institute getting spreadsheets from a number of different hospitals, Alikhan recalls that they often got 'disjointed tables of information' (Alikhan transcript).

Dr Matthew Bashton, who helped the COG-UK work at Northumbria University, faced similar challenges. His 'overriding memory at least for the early parts of the pandemic was spending many late evenings in Excel trying to fix other people's wonky dates and make sure the sexes were coded correctly and formatting this stuff up late into the night to get it submitted the same day the sequencing came off.' The work was very painstaking, but he was very conscious that 'it needed to be done'. Not wanting to be the source of the delay, he says, 'I basically wouldn't go to bed until this thing was formatted and uploaded' (Bashton transcript).
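The clean-up that Bashton and Alikhan describe doing by hand can be sketched as follows. This is not the code any site used: the accepted date formats, the sex codes and the function names are assumptions chosen for illustration. Ambiguous dates are read day-first, which is the UK convention, and anything unrecognised is returned as `None` so that a person can review it rather than have the script guess.

```python
from datetime import datetime

# Assumed input formats; day-first is assumed for ambiguous dates such as 12/10/2020.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")

# Assumed mapping from free-text entries to a single-letter code.
SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalise_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


def normalise_sex(text):
    """Return 'M' or 'F', or None if the entry is not recognised."""
    return SEX_CODES.get(text.strip().lower())
```

As Alikhan notes, a new exception could always turn up, which is why the sketch flags unknown values instead of forcing them into a category.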

CLIMB and Majora

From the outset of its operations, COG-UK stored and shared its data using the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) compute facility. Built initially as a digital infrastructure for UK microbiologists to upload, store and share their data, CLIMB was ideal for this purpose in terms of functionality. Hosted at the Universities of Birmingham and Cardiff, the CLIMB database is powered by 3,000 and 4,000 Central Processing Units at each university. One of the original architects of CLIMB, Professor Thomas Connor, describes it as 'probably the largest dedicated system for microbiology of its type in the world'. CLIMB was designed to support research using open source software, including OpenStack for cloud computing and Ceph for storage (Mathieson).

CLIMB had previously been used successfully for pathogen outbreak analysis and had no affiliation with one country or public health agency (Nicholls). Offering a trusted environment for the diverse network of COG-UK participants to upload and share their data for analysis, efforts quickly got underway to build a 'walled garden' within the existing CLIMB infrastructure for this purpose. Known as CLIMB-COVID, this was assembled in just two weeks through the hard work of several bioinformaticians at different institutions co-ordinated by Dr Sam Nicholls based at the University of Birmingham. CLIMB-COVID uses only a small fraction of CLIMB's capacity (Mathieson).

Figure 10.1: Tweets put out by Dr Sam Nicholls on 10 April 2020 about the work he put into helping update the CLIMB database and capture metadata about the samples to support the COG-UK effort. This effort was facilitated by funding from COG-UK which the Universities of Birmingham and Cardiff used to buy solid-state drives to increase the speed of CLIMB, bringing its storage capacity to 1.5PB of SSD and 2.8PB of disk (Mathieson).

Figure 10.2: Tweets put out by Dr Sam Nicholls, 15 March 2021, to mark the first anniversary of CLIMB-COVID being set up.

Data held in CLIMB-COVID has one of three access control levels: public, consortium and restricted. The first, public information, was designed to be highly transferable and open to be deposited in databases. By contrast, the second, consortium-level data, was only available to consortium members or public health agencies and had to be analysed inside CLIMB-COVID. The last level, restricted information, could only be accessed by researchers granted specific approval (Nicholls; Bradley transcript).

Figure 10.3: Tweets put out by Nicholls concerning the development of Majora.

As well as building the CLIMB-COVID infrastructure, Nicholls and his colleagues developed a bespoke web-based application called Majora for the validation and submission of metadata. They designed it so that it could be used by any consortium member whilst making sure that access to the metadata was tightly controlled to satisfy governance requirements. Majora made it possible to upload both sequencing and metadata which were then 'paired together, processed and published' in such a way that they were available to everyone in the consortium wishing to 'perform analyses'. The tool provided 'a collection of application programming interfaces (APIs) to avoid any human intervention delaying the validating, processing or querying of metadata' (Nicholls).

Among those who worked with Nicholls were Underwood and his Oxford colleague Dr Khalil Abudahab, a software engineer. Together, they developed a Metadata Uploader with a web-based interface. This was important because originally, access to the CLIMB server was limited to people with access to specific programmes. To simplify the process, Abudahab explains that they 'created a website where there is a form that you could type in the metadata and click submit'. That allowed users to download an Excel file template, fill in the metadata, drag it back to the website and then click 'submit' (Abudahab transcript).

Figure 10.4: Tweets put out by Dr Sam Nicholls, 15 March 2021 about Majora and Elan tools used to submit and verify data submitted to CLIMB. The graphs show the steady increase in the number of API requests to add or retrieve sample information and number of SARS-CoV-2 virus samples sequenced by COG-UK sites between April 2020 and March 2021.

Figure 10.5: Diagram showing how sequencing and metadata data from COG-UK sites flowed into and was combined within CLIMB. Credit: Fig 1, Nicholls.

According to Underwood, the metadata uploader was incredibly useful to COG-UK because 'At the time the primary sequencers were hospitals all around the country and they didn't typically have someone who could use command line tools to perform data entry. What the metadata uploader enabled them to do was to take a standardised spreadsheet, which we used as a proforma, which we sent out, enter the data into that, and then to upload that. The website would then interact with the API, doing the command line stuff behind the scenes, and push that data in or warn the user that there were some fields which didn't meet the standards of the database, or didn't have the criteria necessary' (Underwood transcript). According to Bashton, 'That system was vital for preventing headaches because it validated everything, all of the fields. And once you've done it a few times you become razor-sharp at identifying things and what won't work' (Bashton transcript).
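The behaviour Underwood and Bashton describe, accepting rows that meet the required standard and warning the user about those that do not, can be sketched as below. This is not the Majora API or the uploader's actual logic: the list of required fields and both function names are hypothetical, chosen only to show the pattern of checking each spreadsheet row and reporting problems back by row number.

```python
# Hypothetical minimal field set; the real COG-UK template is not reproduced here.
REQUIRED_FIELDS = ("sample_id", "collection_date", "location")


def validate_row(row):
    """Return a list of readable problems; an empty list means the row passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing required field: {field}")
    return problems


def validate_sheet(rows):
    """Split rows into those accepted and a report of rejected rows keyed by row number."""
    accepted, rejected = [], {}
    for index, row in enumerate(rows, start=1):
        problems = validate_row(row)
        if problems:
            rejected[index] = problems
        else:
            accepted.append(row)
    return accepted, rejected


# Invented rows: the second has a blank collection date and no location.
rows = [
    {"sample_id": "X-001", "collection_date": "2020-04-09", "location": "UK-ENG"},
    {"sample_id": "X-002", "collection_date": " "},
]
accepted, rejected = validate_sheet(rows)
```

Reporting every problem in a row at once, rather than stopping at the first, is a design choice that lets a submitter fix a spreadsheet in a single pass.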

Metadata Fridays

According to Goodfellow, the CLIMB framework and workflow of Majora was pivotal to the success of COG-UK because it made the task of collecting and submitting sequencing and metadata as well as analysis much easier (Goodfellow transcript). But it did not completely eliminate the challenges involved in the process. One of the key times that stuck out in the memory of many COG-UK participants was “metadata Fridays”. This was when they were originally expected to upload all their metadata to CLIMB. Its importance is underlined by Alikhan: 'All of the samples, all the work done during the week had to be up on the system by midday Friday otherwise you had to wait for the next Friday. …For a lot of us that metadata Friday was this stressful thing of getting your homework in on time, getting your assignment submitted before they cut you out of the system. You were just furiously trying to make sure all your information is up in the correct format so that they can pull that in and report back to the government' (Alikhan transcript). Similar feelings were expressed by Shepherd who recalls, 'There was an 11 o'clock cut-off or something so it was always a big rush to get everything put together. That was the stressful part of the week, getting everything up. It was sort of like a little bit of a race to see who could do the more sequences between the different sites. So, there was always a pressure to get everything you'd done uploaded that week. I think it was stressful because of the volume, the systems were a little bit clunky. It now seems a bit of a blur, but it was all just a bit of a rush' (Shepherd transcript).

The pressure of metadata Fridays is also mirrored in the experience of Dr Sam Robson, the COG-UK Principal Investigator at Portsmouth University. He says, 'I'll never forget the idea of “metadata Fridays”. The way that it used to work was that every Friday was when the pipeline would run, so all the samples that have been uploaded over the last week would be sucked up into the machine and processed at that time. If you didn't get it in by a certain time, it wouldn't get processed. And for those first few months, it was very much people desperately trying to get all the work that had been done over the week, and getting it all uploaded, and inevitably it all gets left to the last minute, because obviously, you're trying to do everything for the previous week on that day.' Robson remembers 'We'd all be chatting on Slack about the issues that we're having, “…it's not working, I can't access the internet, my internet's gone down, it's going slow”' (Robson and Beckett transcript).

For Robson, one of the pivotal memories was when 'very dramatic music' was discovered to be playing whilst data got uploaded. 'There was a website for uploading the data. You would submit your data file, and if it runs into any problems, it would spit it back at you and say, “I can't upload that because of this problem”. While it's doing that, to stop it from refreshing the page, the person who made the webpage just stuck in a piece of music that they chose as this kind of “Mission Impossible” style music in the background but put it on silent so it wouldn't affect anybody. The idea being that if there's something happening on the screen in the background, it won't refresh itself' (Robson and Beckett transcript).

Why the music was used is explained by Dr Khalil Abudahab who worked alongside Underwood in Oxford. 'When you are in the web browsers and have so many tabs open what happens is some of them start to suspend to save your battery. Also, let's say there is a website that you visit and it's running something in the background which actually you're not using, then what happens is the active tab that you're on gets suspended. What used to happen is that somebody would come and drag a CSV file with say 200 samples, metadata, and then the website would start to submit them one by one to the server. That's fine if you keep your browser, if that's the active tab. But, if let's say you left that running and you went to check your Gmail, then that website, that tab is not active anymore, then the browser will cut it off. It will suspend it, and it's not uploading anymore. Anthony Underwood found that one way round is to play music or play audio because that does not get suspended… The audio has to be at a very low volume, just in the background. It's very, very subtle. It was just a way to stop the browser from suspending the uploads. Because what people do in the morning, they drag files into a website and then they would want to leave that running in the background and come after one or two hours, because sometimes the queue was very long and took a few hours. So you just want to leave that running and come back to it later' (Abudahab transcript).

Nobody knew about the music until one day, Robson says, Matt Loose happened to be 'using the system while his headphones were in, at which point it played the music. After that, everybody was chatting about this, it was a huge, big thing, and it just made metadata Fridays even more tense because you're listening to this tense music as 500 samples are all uploading and you've got 30 seconds left to get them all uploaded before it all kicks off' (Robson and Beckett transcript). The music is 'Adventures of Flying Jack' by Alexander Nakarada, licensed under a Creative Commons Attribution 4.0 License.

Figure 10.6: Tweets put out by Dr Sam Nicholls, 15 March 2021, about the Elan tool used to submit and verify data submitted to CLIMB, showing the steady increase in the number of API requests to add or retrieve sample information.

Not getting the metadata uploaded on a Friday meant that it could not be uploaded until the following week which Robson points out 'had a big impact on its usefulness' (Robson and Beckett transcript). The pressure was not only immense for the people submitting the data, but also for those analysing it. Dr Áine O'Toole, a postdoctoral research associate at the University of Edinburgh, remembers that 'when the data flowed in once a week' it was then 'a race against the clock to try and get the analysis done and the information out to people as soon as possible' (O'Toole transcript). Over time the stress of uploading lessened once the process switched to twice weekly and then daily (Jermy and Harrison transcript). This shift was aided by the development of several tools which automated the analysis of the data, outlined below.

Pangolin lineage system - naming lineages

Faced with the prospect of SARS-CoV-2 evolving over time combined with the growing volume of genomes sequenced, it quickly became apparent to bioinformaticians that they needed to develop a standardised nomenclature scheme for naming and discussing different viral lineages. Like all viruses, SARS-CoV-2 accumulates changes in its genetic code as it replicates. Where genome changes show a shared pattern, these are assigned to particular lineages which can be plotted on what is called a phylogenetic tree. Akin to a family tree, branches that appear in the phylogenetic tree represent all the different lineages and their evolutionary relationships (Wellcome Sanger Institute May 2020). Having an agreed set of labels for the different lineages circulating around the world was critical to facilitating real-time identification and communication of where there might be similarities or differences between COVID-19 cases. Such information was vital to understanding the dynamics of the pandemic, conducting outbreak investigations and implementing measures to prevent the further spread of the disease.

Dr Áine O'Toole, who worked alongside Professor Andrew Rambaut at the University of Edinburgh, remembers that in the early stages of the pandemic, 'publications and reports that came out had a sort of ad hoc naming system, or were inconsistent with other publications, or just avoided naming things at all. That made communicating between labs or between countries far more difficult, because you could be talking about one set of sequences and calling it X cluster, and I could be talking about the exact same thing and calling it something else.' For this reason, she points out it was really important to have a 'consistent naming system that everybody could access which [meant] that we were all on the same page around the world... When everything has a lineage then it means we're all speaking the same language' (O'Toole transcript).

The task of developing a suitable nomenclature was led by Rambaut and his team at the University of Edinburgh, which they did in collaboration with Professor Oliver Pybus and colleagues at the University of Oxford, Edward Holmes at the University of Sydney and Christopher Ruis at the University of Cambridge. Verity Hill, who was then a PhD student in Rambaut's group, says that the system they developed originated out of them 'just trying to work out different ways of cutting up the epidemic into manageable chunks to analyse it better, and to keep track of what's going on without building phylogenetic trees all the time' (Hill transcript).

This was not easy because the system needed to 'be capable of handling tens to hundreds of thousands of virus genomes sampled longitudinally and densely through time.' In addition, to be practical, it needed to 'have no more than 100 or 200 active lineage labels since any more would obfuscate rather than clarify discussion and would be difficult to conceptualize.' Not intended to represent every evolutionary change in the virus, they needed to design a system that could 'focus on those that have exhibited onward spread in the population, particularly those that have seeded an epidemic in a new location.' It also needed to be dynamic enough that it could incorporate new virus diversity and 'both the birth and death of viral lineages through time' (Rambaut).

O'Toole explains, 'The way that nomenclature system works is we started out with A and B, which were the two original haplotypes that had been identified at the Wuhan market. These were the very early sequences that came out at the beginning of 2020. Essentially, there are only two mutations apart between A and B, and from there any descendants that looked lineage-like, for instance, if they had epidemiological metadata associated with them, if we have a phylogenetic tree of A and B sub-clusters of diversity that looked like lineages…, they get a name in the order that we identify them. So B.1 was the first sub-lineage of B diversity that we found. And then as the virus continues to spread and continues to diversify, these clusters within them get sub-clusters. So, say we have B.1, a sub-lineage of that was identified as B.1.1. Then in December 2020, we labelled a sub-lineage of B.1.1 that had a number of concerning mutations, which later became known as the Alpha variant, but was B.1.1.7. So it was the seventh B.1.1 sub-lineage we'd seen. It's a sort of hierarchical system that encodes the ancestry of that particular virus sample in its name' (O'Toole transcript).
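O'Toole's point that a lineage name 'encodes the ancestry' of a sample can be shown with a short sketch that reads the ancestors directly off a dotted name. The function names are illustrative and this is not code from the pangolin tool. The sketch assumes plain dotted names such as B.1.1.7 and does not handle the alias prefixes that the real scheme uses to shorten long names.

```python
def lineage_ancestors(name):
    """Return the ancestors of a dotted lineage name, from the root downwards."""
    parts = name.split(".")
    # Each shorter prefix of the name is an ancestor: B, then B.1, then B.1.1, ...
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(child, ancestor):
    """True if `child` sits below `ancestor` in the naming hierarchy."""
    # The trailing dot stops B.10 from being mistaken for a descendant of B.1.
    return child.startswith(ancestor + ".")


alpha_ancestry = lineage_ancestors("B.1.1.7")
```

Here `alpha_ancestry` is `['B', 'B.1', 'B.1.1']`, matching the path O'Toole describes from B through B.1 and B.1.1 to the lineage later known as Alpha.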

Figure 10.7: Photograph of Dr Áine O'Toole. Credit: Bethany Lavin Photography. Born in Ireland, O'Toole developed a passion for science at a young age. Encouraged by her parents to value education and cherish learning, O'Toole was the first person in her family to go through university. After school she did an undergraduate degree in genetics and a master's degree in genetics and molecular evolution at Trinity College Dublin. Wanting to become more involved in public health, she then took a PhD in Host, Pathogens and Global Health at the University of Edinburgh for which she gained Wellcome Trust funding. As part of this doctorate she was developing software for virus surveillance and clinical outbreaks. Still in the midst of completing her doctorate when the pandemic hit, she was able to quickly pivot to using her skills for COVID-19. She is now a postdoctoral researcher in Rambaut's laboratory (COG-UK Jan 2023).

Quickly adopted as the standard method used worldwide for identifying and tracking SARS-CoV-2, the new system was called PANGOLIN, which stood for 'Phylogenetic Assignment of Named Global Outbreak Lineages'. First published in early 2020, the system aimed 'to define an epidemiologically relevant phylogenetic cluster, for instance an introduction into a distinct geographic area with evidence of onward transmission, or recrudescence of a previously observed lineage, or rapid growth of a lineage with notable phenotypes, etc' (O'Toole, Pybus, Abram).

O'Toole designed a software tool of the same name that could automatically assign lineages. Explaining why she developed it, she remembers, 'The very first couple of times we defined lineages, Andrew sat down with me and Verity and the three of us worked manually through the phylogenetic tree. Afterwards, I took on the job of maintaining the work on the lineages. But it was quite time consuming and it only got more and more time consuming as more and more data came in.' In addition, 'there was a disconnect between what I was doing, which was creating lineages, and people needing to be able to actually access that information as soon as they've generated their genomes' (O'Toole, Pybus, Abram).

Figure 10.8: Logo for the Pangolin system. Rambaut and his group have a long tradition of finding names for tools they have developed. O'Toole recalls that 'Pangolin came about when we were discussing origin theories and it seemed very likely that the virus originated from a wet market and contact with some sort of animal, whether it be a bat or raccoon dog or whatever other secondary animal that was present in the market. There were some closely related virus sequences that they'd isolated from pangolins that are not the most closely related, but we had this pool of animal names in our minds, like raccoon dog and pangolin. So, pangolin was a nod to the wet-market origins' (O'Toole transcript).

Figure 10.9: Tweets put out by Áine O'Toole and Anthony Underwood highlighting the importance of Pangolin for assigning lineages to the SARS-CoV-2 virus.

Figure 10.10: Lineages assigned to SARS-CoV-2 virus in the UK using Pangolin software. The map shows the proportion of different lineages in each location sequenced by COG-UK centres, for 22 May 2020. Credit: Wellcome Sanger Institute, May 2020.

The Pangolin system, designed to help discussions about epidemiological lineages, functioned separately from the Greek letter system subsequently adopted for naming variants of concern. O'Toole points out that having those two systems helped draw a distinction between 'what's a variant of concern and what's just another lineage'. She remembers sitting in on some of the early Zoom discussions with people from WHO, Nextstrain and GISAID who helped formulate the Greek letter system. She says that one of the reasons they came up with it was that it was easier for the media and public to remember, as they at first struggled to grasp the concept of the virus mutating and evolving and could get alarmed by it (O'Toole transcript).

Microreact - visualising data

Having the Pangolin nomenclature system to hand was enormously beneficial to building out the phylogenetic tree of the SARS-CoV-2 virus. The tree provides a diagrammatic representation of the evolutionary pathways and relationships between the different lineages of the virus. Processing and getting a sense of the phylogenetic tree was made much easier by a free interactive web visualisation tool called Microreact, which was developed before the pandemic. Microreact was originally designed in 2016 by Professor David Aanensen and colleagues, then at Imperial College, who describe it as combining 'clustering, geographical and temporal data into an interactive visualization with trees, maps, timeline and tables' (Argimón).

Initially, Microreact had been developed to visualise specific patterns of infectious diseases and subsequently evolved to also incorporate genomic data (Holden transcript). Underwood describes Microreact as essentially a dashboard that displays 'the place, time and genetic relationship between the genomes' sequenced (Underwood transcript). Previously used for bacterial pathogens, Microreact needed to be updated for COVID-19. This involved significant work to provide an appropriate visualisation, and required the software to be totally rewritten. Led by Abudahab, the rewrite set out to make it easier to see how lineages were spreading and to draw larger trees from the COG-UK data. The aim was to find a way for the computer 'to spend less time drawing in the tree and become more responsive.' Initially he did this by building on another library called Phylocanvas. This gave them the ability to upscale from visualising 10,000 to 100,000 samples. But it then started to slow down again under the volume of data (Abudahab transcript).

Figure 10.11: Visualisation of SARS-CoV-2 genome data in the UK using Microreact, Dec 2020. Credit: Wellcome Sanger Institute, Feb 2021.

To resolve the issue, Abudahab looked at various options and settled on Web Graphics Library (WebGL), an open-source JavaScript API, which provides a means to draw 2D and 3D graphics very quickly. WebGL is commonly used in the gaming industry, but Abudahab's inspiration for using it came from Uber, the taxi company, which deploys it to visualise trips made by its taxis. For him, WebGL offered a tool to 'make a phylogenetic tree viewer' using Web Geographic Information Systems (Web GIS), a cloud-based mapping platform that makes it possible to make and share maps easily. This he believed 'could help us get over the struggle we were facing with 10,080 to 200,000 data points in a fraction of a second.' Realising it could be a 'game changer', he put a lot of effort into getting it to work. Writing the code involved a lot of trial and error. At one point, he had to go back to his old geometry books to help him. Based on this, he and his team managed to develop 'the world's first web GL viewer for genetic data for the web' which they released as open source (Abudahab transcript).

Figure 10.12: Photograph of Jacinda Ardern, Prime Minister of New Zealand, looking at the Microreact dashboard showing the situation with SARS-CoV-2. Underwood remembers how pleased he was to see this tweet as it proved how useful Microreact was as a communication tool. He also learnt that it was used in places like Colombia, Portugal and Spain. Credit: Tweet sent out by The Institute of Environmental Science and Research on 20 August 2020 (Underwood transcript).

Microreact was especially valuable for helping people with no expertise in genomics to understand how the SARS-CoV-2 virus was spreading at national, regional and local levels. It enabled public health officials to communicate highly visual messages through the public health system and provided a level of detail and geographical resolution previously unavailable (Centre for Genomic Pathogen Surveillance). Just how useful it could be as a communication tool is illustrated by a medical director whose attention it captured when he happened to come across Holden using Microreact at Public Health Scotland to understand some transmission patterns between care homes in Scotland. Once he could see a cluster of cases and how the data linked together, Holden remembers, the medical director immediately grabbed the incident director to also come over and look. For Holden, having Microreact was invaluable, because it was easy to convey a complex story visually. He argues, 'If I was to reflect on some of the key events that allowed us to do what we did in Scotland, it was ending up in personal meetings where I was able just to drag a couple of influential people over to show them Microreact.' He was also struck by how quickly the First Minister in Scotland, who has no background in infectious diseases, was able to grasp the information she saw with Microreact (Holden transcript). Microreact did not only make an impression on policy makers in Scotland. It also caught the attention of politicians elsewhere, such as Jacinda Ardern, the Prime Minister of New Zealand (Underwood transcript).


Microreact was not the only tool that helped the timely sharing of genomic data. Another software tool that proved invaluable was civet, which stands for Cluster Investigation and Virus Epidemiology Tool. It was designed to automatically produce summaries of the latest genetic diversity of the virus for use by infection prevention and control teams. Able to extract relevant data from CLIMB and GISAID, civet places this in the context of a specific locality in the UK, the entire UK, or the world. Developed at the University of Edinburgh, civet was built on the back of another project O'Toole had been working on before the pandemic, in which she had been developing a real-time reporting system for outbreaks of norovirus or respiratory syncytial virus. At the start of the pandemic, she was given fresh impetus to push the idea further when she began receiving regular requests from the Royal Infirmary in Edinburgh to provide them with reports on samples they were sequencing. Initially doing the analysis and reports by hand, O'Toole says she rapidly felt like she was 'the weakest link in the real-time reporting chain, because I would do the report, and then they'd email again, but I was also working on the lineage system at the time, and these reports were not being done immediately, because there was just so much else going on at the time' (O'Toole transcript).

O'Toole developed civet together with Hill using Python code. Describing the process, she says 'We had this piece of software that could essentially run the standard analysis pipelines and produce a report automatically, so that when … the hospital had another sample to add in, you could just run the analysis again with the new sample in and it would produce the same report but updated that can be used for outbreak investigations'. The value of civet was that it empowered 'the researchers at the hospital to run this analysis themselves and get the report out' which meant they were no longer dependent on O'Toole to do it. It also had the advantage that it could produce different reports at once ranging from infection prevention to phylogenetics reports (O'Toole transcript).
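The idea of a report that can simply be regenerated whenever a new sample arrives can be sketched in a few lines of Python. This is not civet's code or its interface: the `build_report` function, the record layout and the sample identifiers are all hypothetical, and a real report would draw on sequence data and phylogenies rather than a simple tally.

```python
from collections import Counter


def build_report(samples: list[dict]) -> str:
    """Summarise how many samples fall into each lineage."""
    counts = Counter(sample["lineage"] for sample in samples)
    lines = [f"Samples analysed: {len(samples)}"]
    for lineage, n in sorted(counts.items()):
        lines.append(f"{lineage}: {n}")
    return "\n".join(lines)


# Hypothetical, de-identified records standing in for sequenced samples.
samples = [
    {"id": "sample_1", "lineage": "B.1.1"},
    {"id": "sample_2", "lineage": "B.1.418"},
]
print(build_report(samples))

# A new sample arrives: add it and run exactly the same analysis again.
samples.append({"id": "sample_3", "lineage": "B.1.1"})
print(build_report(samples))
```

Because the report is a pure function of the input records, whoever holds the data can rerun it without waiting for the person who wrote the analysis.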

Figure 10.13: Diagram showing how civet works. Background data generation pipeline (a-c) and how a civet query is defined (d-f). Credit: Fig 1, O'Toole, Hill, Jackson.

Over time, O'Toole and Hill refined civet to include more features and modes to aid routine analysis and facilitate virus outbreak investigations and surveillance at the local level (O'Toole, Hill, Jackson). Ultimately, Hill explains, 'the point with civet was to try and make it so that people could pull out the information they needed in a local context, such as in a hospital or an NHS Trust. It was partly because we thought that would be useful, and also because in a big national consortium you really need the buy-in from, say, local doctors. If they can see the actual, useful results of something transmitted from ward A to ward B, then ideally, that means they're more on-board with the whole process. I don't know if it's specifically in the UK, but historically there's been some resistance of medical personnel to sequencing because it's viewed as academic questions rather than public health ones. So, we were thinking how to make sequencing useful on a small scale, fast' (Hill transcript).

As part of the development of civet, Hill spent a great deal of time 'getting all of the spatial data very clean and coherent' so that it came out as 'usable data at the end'. This was because she says 'it turns out all the geographical systems in the UK are a complete mess just because it's a thousand years of history of stuff being built on top of each other.' Describing the UK as having a 'weird county system', Hill remembers that it turned out that a lot of places 'you think are counties aren't or that they're not Admin 2 regions, which is that level in the UK.' According to her, 'I'd be trying to do something else, and then I'd spend another week cleaning geographical metadata.' She continues, 'I'm a complete spatial nerd so I really love making maps for everything, but I think that data has been extremely useful to map the current spread' (Hill transcript).
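The kind of cleaning Hill describes can be pictured as mapping the many ways a place is written onto one agreed region label. The sketch below is only an illustration of that idea: the lookup entries and the `clean_region` function are hypothetical, and they have not been checked against any official boundary list or against the tables Hill actually built.

```python
# Hypothetical lookup from normalised free-text place names to a single
# cleaned region label. The entries are illustrative only.
REGION_LOOKUP = {
    "edinburgh": "Edinburgh",
    "city of edinburgh": "Edinburgh",
    "newcastle": "Tyne and Wear",
    "newcastle upon tyne": "Tyne and Wear",
}


def clean_region(raw: str):
    """Return the cleaned region label, or None if the name is unrecognised."""
    # Lower-case, turn hyphens into spaces and collapse repeated whitespace.
    key = " ".join(raw.strip().lower().replace("-", " ").split())
    return REGION_LOOKUP.get(key)


print(clean_region("  City-of  Edinburgh "))  # Edinburgh
print(clean_region("Atlantis"))               # None
```

Returning `None` for unknown names, rather than guessing, leaves those records to be flagged for the sort of manual review that Hill says took up so much of her time.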

The extent to which the hard work paid off can be seen from the experience of Professor Darren Smith, the COG-UK Principal Investigator at Northumbria University. He recalls that when the civet tool came online 'it was a game changing moment for reporting data.' Critically, 'it made it possible to look at all samples that have been sequenced over a time period and how your test viral genome fits within that data taxonomically.' He argues that because the 'output offered a rapid route to high resolution data, it made it a lot easier for us to talk to clinicians about infection control using data and images generated by this analysis tool.' Its high degree of resolution meant that they could 'almost see where there had been transmission within a hospital or residential home, and as time progressed public health and hospital infection control teams became more open to looking at the data to support active control.' In addition, historical samples could be linked in, which meant they could go and see where infection control had worked (Smith transcript).

Figure 10.14: Civet report with 'Schema of clinical outbreak investigation June 2020, colour of cases indicate lineage revealed by genome sequencing (B.1.1 or B.1.418) (a-b) and components of a civet report generated for the outbreak investigation (c-j). a) The outbreak occurred across three wards and involved six members of staff, four patients and one household contact of a staff member. b) Timeline of sample collection dates across wards A, B and C. c) The metadata of all query sequences is summarised in an interactive table, with sortable columns that can be toggled on and off. d) Each catchment is summarised in full, regardless of down-sampling. Number of queries and the countries and lineages within the catchment are indicated. e) The catchment phylogenies are displayed initially in compact form, but can be expanded vertically using the Expansion slider. By default tip nodes are coloured by whether a tip is a query taxa or not, but the dropdown menu allows the user to colour tip nodes by any trait specified in --tree-annotations. f) Tip nodes can be selected to show the metadata associated with that particular sequence and clades can be collapsed to a single node by selecting the parent branch. g-h) snipit graphs highlight nucleotide differences from the reference genome. i-j) A timeline summarises any query date information provided. Note: all metadata has been de-identified for data protection purposes.' Credit: Fig 3, O'Toole, Hill, Jackson.

Figure 10.15: Sample of figures from a civet report demonstrating its use for community surveillance in the UK. Credit: Fig 4, O'Toole, Hill, Jackson.


Argimón, S, Abudahab, K, Goater, RJ, et al (30 Nov 2020) 'Microreact: visualizing and sharing data for genomic epidemiology and phylogeography', Microbial Genomics, 2/11.

Centre for Genomic Pathogen Surveillance (July 2020) 'Written Evidence (C190090)', UK Parliament Committees.

COG-UK Women (Jan 2023) Snapshots of Women in COG: Scientific excellence during the COVID-19 pandemic.

Marjanovic, S, Romanelli, R, Claire-Ali, et al (2022) 'Evaluation of the COVID-19 Genomics UK (COG-UK) Consortium, Final Report', RAND Europe.

Mathieson, SA (12 April 2021) 'How the Covid-19 Genomics UK Consortium sequenced SARS-CoV-2', Computer Weekly.

Nicholls, S, Poplawski, R, Bull, MJ, et al (1 July 2021) 'CLIMB-COVID: continuous integration supporting decentralised sequencing for SARS-CoV-2 genomic surveillance', Genome Biology, 22/196.

O'Toole, Á, Hill, V, Jackson, B, et al (9 Dec 2022) 'Genomics-informed outbreak investigations of SARS-CoV-2 using civet', PLOS Global Public Health.

O'Toole, Á, Pybus, OG, Abram ME (11 Feb 2022) 'Pango lineage designation and assignment using SARS-CoV-2 spike gene nucleotide sequences', BMC Genomics, 23.

Rambaut, A, Holmes, EC, O'Toole, Á, et al (15 July 2020) 'A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology', Nature Microbiology, 5, 1403-07.

Wellcome Sanger Institute (22 May 2020) 'Analysis of COVID-19 Genomes reveals large numbers of introductions to the UK in March'.

Wellcome Sanger Institute (5 Feb 2021) 'Sequencing COVID: Our latest stats'.

Interview transcripts

Note: The position listed by the people below is the one that they held when interviewed and may have subsequently changed.

Interview with Dr Khalil Abudahab, Senior Software Engineer for data visualisation and integration, Centre for Genomic Pathogen Surveillance.

Interview with Dr Nabil-Fareed Alikhan, Bioinformatics Scientific Programmer at the Quadram Institute.

Interview with Dr Matthew Bashton, Computational Biologist, Northumbria University.

Interview with Dr Declan Bradley, COG-UK Principal Investigator at the Public Health Agency, Northern Ireland.

Interview with Dr Michael Chapman, Director of Health Informatics, Health Data Research UK Cambridge.

Interview with Professor Ian Goodfellow, University of Cambridge (interviewed 15 Dec 2021, unpublished transcript).

Interview with Dr William Hamilton, academic clinician working in infectious disease and microbiology at Cambridge University Hospitals NHS Foundation Trust.

Interview with Dr Ian Harrison, SARS-CoV-2 genomics analysis cell, Public Health England (UKHSA).

Interview with Verity Hill, PhD student at University of Edinburgh.

Interview with Professor Matthew Holden, Director of Impact at the University of St Andrews and COG-UK Principal Investigator at Public Health Scotland.

Interview with Dr Ewan Harrison (Deputy Director COG-UK and UKRI Innovation Fellow, Wellcome Sanger Institute, Senior Research Associate, Department of Medicine, University of Cambridge) and Dr Andrew Jermy (External Communications Advisor COG-UK).

Interview with Dr Beatrix Kele (Clinical Scientist), Dr Maria Teresa Cutino-Moguel (Virology Clinical Lead), Barts Health NHS Trust.

Interview with Dr Catherine Ludden, Director of Operations, COG-UK and Beth Blane, Logistics Manager for COG-UK, Research Assistant in the Department of Medicine, University of Cambridge.

Interview with Dr Áine O'Toole, Postdoctoral research associate, University of Edinburgh.

Interview with Sharon Peacock, Professor of Public Health and Microbiology in the Department of Medicine, Cambridge University and Executive Director of the COVID-19 Genomics UK (COG-UK) Consortium.

Interview with Dr Sam Robson, Principal Research Fellow (Bioinformatics), Angela Beckett, Specialist Technician (Research), Faculty of Science & Health, School of Biological Sciences, Centre for Enzyme Innovation, University of Portsmouth.

Interview with Dr James Shepherd, Specialty Registrar in Infectious Diseases and Medical Microbiology, Clinical research fellow MRC Centre for Virus Research, University of Glasgow.

Interview with Professor Darren Smith, Personal Chair in Bacteriophage Biology, Northumbria University.

Interview with Dr Anthony Underwood, Head of Translational and Operational Bioinformatics, Centre for Genomic Pathogen Surveillance, Oxford.
