The next frontier: Human viruses

In 1981 Sanger and his colleagues published the DNA sequence of the lambda bacteriophage. At 48,000 nucleotides, this was the longest sequence completed so far. The project had taken three years from start to finish. Soon afterwards Sanger and his team began looking for their next challenge. Up to this time, with the exception of the human mitochondrial DNA project, they had concentrated their efforts on sequencing DNA taken from bacteriophages, viruses that infect bacteria. This had been largely driven by the fact that DNA molecules in bacteriophages tend to be small, which made them ideal for testing the effectiveness of Sanger's sequencing techniques. Having successfully demonstrated the method with bacteriophage DNA, the team now began looking for more challenging models with longer sequences. This was to take them into new territory: human viruses.

The work on human viruses was not only aimed at testing the viability of the techniques on longer sequences, but also fulfilled Sanger's long-term ambition to show that DNA sequencing could be used to improve knowledge about the pathways of disease, thus laying the foundation for better treatment. He was inspired in part by his father's work as a doctor, but his aspiration also reflected the growing concern of LMB researchers in the early 1970s to focus on projects that might benefit human health.

Sequencing the genome of the influenza virus

One of the first human viruses to be sequenced with Sanger's dideoxy method was the one responsible for influenza, an infectious disease which every year causes severe illness in three to five million people around the world and between 250,000 and 500,000 deaths. There are three types of influenza virus: A, B and C. Type A is the dominant type and causes the most severe disease. The viruses are further sub-typed according to two types of glycoproteins, called antigens, found on the surface of the spherical shell of the virion, the virus particle. The first, known as haemagglutinin (H), enables the virus to enter host cells, and the second, called neuraminidase (N), facilitates the release of new virions from infected host cells.

This diagram shows an influenza virus particle with 8 RNA segments inside a spherical shell covered with surface proteins and the process of replication.

The first project to sequence the influenza virus was launched in the late 1970s by George Brownlee, Sanger's former doctoral student. It focused on the H1 subtype of the human influenza A strain. One of the aims in sequencing the virus was to understand the genetic mechanism that underlies the outbreak of new influenza pandemics. The influenza genome is unusual in that it is made up of several RNA segments rather than one continuous genome. The H1A virus was known to contain eight single-stranded RNA segments, totalling 14,000 nucleotides. It was these segments that Brownlee’s team set out to sequence. A key objective of the project was to establish the degree to which the exchange of RNA segments, which naturally occurs between different strains of influenza, contributes to the initiation of new influenza pandemics. They wanted to find out whether this simple segment exchange was the main mechanism facilitating the emergence of new pandemic strains.

Gregory Winter. Credit: Trinity College Science Society. The influenza project marked a major turning point in Winter's career. Importantly, he learnt how to do DNA sequencing: as he said, he became 'quite an expert at it and enjoyed it in a masochist way'. Nonetheless, he soon realised the influenza virus sequence alone could not tell him anything about how proteins worked or what part of the sequence could be changed to stop the influenza virus from becoming pathogenic. Soon after this he learned another technique known as site-directed mutagenesis (SDM) which made it possible to change, in a very precise and specific way, part of an organism’s DNA. This, together with DNA sequencing, inspired him to begin creating proteins from scratch, and this created the foundation for the development of safer monoclonal antibody drugs (Winter, 2011).

Much of the work on the influenza virus was conducted by Stanley Fields, one of Brownlee’s PhD students, and Greg Winter, a protein chemist. Winter had recently completed a doctorate under Brian Hartley, which he had begun at the LMB and completed at Imperial College when Hartley was appointed to a professorship there. Part of Winter's doctoral research involved sequencing amino acids in serine proteases, enzymes that cleave peptide bonds in proteins. Winter first learnt about DNA sequencing at the end of 1976, when Sanger visited Hartley at Imperial College and gave a lecture on his plus and minus technique. Impressed by how fast Sanger's method could sequence nucleic acids and excited about its possibilities for future protein research, Winter applied for a postdoctoral research fellowship with Sanger. Not having space in his laboratory, Sanger arranged for Winter to help Brownlee in the sequencing of the influenza virus (Winter, 2011).

Stanley Fields. Credit: Fields. In 1981 Fields completed his doctorate at Cambridge University. He would later co-pioneer a method for identifying protein to protein interactions and protein to DNA interactions.

Soon after launching the influenza project, Brownlee spent a year in Australia where he began testing the Maxam-Gilbert method for sequencing the virus. Winter and Fields decided, however, to continue using the dideoxy method then being promoted by Sanger and his team within the LMB. Still rather new, the dideoxy technique initially proved temperamental. As Winter recalled, it worked 'beautifully on some days' and then collapsed suddenly for no obvious reason. This could have been caused by any number of factors, such as a batch of enzymes going off or impurities in the DNA. Sometimes the method failed to work for weeks at a time. Nor were Winter and Fields the only ones to face this problem: researchers in Sanger's laboratory encountered similar difficulties. Whenever the system collapsed, Winter remembered, everyone went 'around trying each other's batch of enzymes, rushing from one conclusion to another. The minute anyone got it working we would watch what it was that they were doing different[ly].' As he pointed out, in fact everyone was doing the same thing, and it was more a question of 'psychology than anything [else], trying to outguess what type of Juju [had] been put on your work'. Over time, however, the method began to work more consistently, aided by improvements in the quality of enzyme batches (Winter, 2011).

Influenza viruses (blue) attaching to cells of the upper respiratory tract. Credit: R. Gourmandism, Wellcome Images.

Initially, the team attempted to sequence the RNA directly, but they soon switched to cloning the RNA in the bacteriophage M13, and then sequencing the product with the dideoxy technique used by Sanger and his team for sequencing the human mitochondrial genome. By 1981 the team had successfully sequenced the gene in the influenza virus type A that coded for the neuraminidase protein found on its surface. The gene was 1,413 nucleotides long (Fields, Winter, Brownlee, 1981).

The genome segment sequenced by Brownlee’s group was a major achievement. It was the first complete sequence of the neuraminidase gene of the influenza virus, and the encoded protein later became the target for development of drugs such as Relenza used to treat influenza infections. However, the neuraminidase segment represented only one of the eight segments of the complete genome of the human influenza A virus, which totals about 14,000 nucleotides. In due course Brownlee, Winter and Fields completed the sequence of all the segments of the same strain, and thereby its whole genome. Further work largely focused on sequencing short fragments from the haemagglutinin or neuraminidase segments of different strains, which were thought to play a role in the virus's antigenic variation. Little of this work provided complete sequences of the virus, however. In part this reflected the technical difficulty of developing an efficient sequencing pipeline for an RNA-based organism. This situation changed with the setting up of the Influenza Genome Project (IGP) in 2005 by an international consortium of scientists with funding from the U.S. National Institutes of Health. The IGP continues to this day, and sequences and analyses many different types of influenza viruses. By 2015 the IGP had sequenced over 6,000 genomes for human influenza virus A, and just under 2,000 for the human virus B. The sequencing data is being used to understand the rate of mutation underlying the evolution of the virus and to monitor the effectiveness of vaccines (Ghedin, Sengamalay, Shumway et al, 2005; J Craig Venter Institute).

The Epstein-Barr Virus

While Brownlee and his team were sequencing the influenza virus, Sanger and his team began working on the Epstein-Barr virus (EBV), also known as human herpes virus 4 (HHV-4). It is one of eight viruses in the herpes family. The virus only infects humans. It is carried by more than 90 per cent of the world's population. The virus causes few major symptoms when picked up in childhood, but adolescents who are infected run the risk of developing mononucleosis (glandular fever). It is also associated with several forms of cancer. Every year, worldwide, EBV causes approximately 84,000 cases of gastric cancer, 28,000 cases of Hodgkin lymphoma, 78,000 cases of nasopharyngeal carcinoma, cancer of the throat and nose (most prevalent in Southern China, North Africa and in Inuit populations in the Arctic), and 6,000 cases of Burkitt lymphoma, a blood cancer (most common in equatorial Africa and New Guinea). It can also cause lymphomas in patients who receive organ or stem cell transplants. Overall EBV-related cancers are estimated to affect up to 1.5% of humans worldwide. There is also some evidence that an EBV infection raises the risk of autoimmune diseases, including multiple sclerosis (Cohen, Mocarski, Raab-Traub, et al, 2013; Balfour, 2014; Palser, Grayson, White et al, 2015).

This shows normal human B lymphocyte cells multiplying in response to an EBV infection. Credit: Paul Farrell.

The possible existence of such a virus was first suggested in the early 1960s by Denis Parsons Burkitt, an Irish surgeon working in a mission hospital in Kampala, Uganda. Burkitt had begun thinking about such a virus when he began investigating why so many children with lymphoma, a blood cancer, were coming to his hospital. These children had dental problems and swollen faces and necks with strange lesions. He found it particularly striking as most of these children came to the mission hospital from the north and east of Uganda, the least populous areas of the country with limited transport to Kampala. He soon discovered, from the hospital's previous records and some drawings done by a previous mission doctor, that patients had been coming to the hospital with similar symptoms since 1902. Soon after this he happened to speak to George Oettle, a visiting pathologist and cancer epidemiologist from Johannesburg, who commented that he had never come across such cases in South Africa (Magrath, 2009; Crawford, Rickinson, Johannessen, 2014).

Curious to know whether the disease was confined to one specific geographical region, Burkitt set out to map out its distribution across Africa. In 1961 he began by sending out 1,000 questionnaires to government and mission hospitals throughout Africa, and followed this up by paying personal visits to 57 hospitals in eight countries in equatorial Africa, a round trip of ten thousand miles. Based on this research, he was able to establish that the disease, later known as Burkitt Lymphoma, was most prevalent in tropical Africa. The lymphoma appeared to be most common in areas with some of the highest annual rainfalls and temperatures. As the disease appeared to be most common in environments commonly associated with malaria, Burkitt and his colleagues hypothesised that its transmission might be linked to mosquitoes or a similar insect vector (Magrath, 2009; Crawford, Rickinson, Johannessen, 2014).

Denis Burkitt with a map of Africa plotting the distribution of lymphoma. Credit: Davis Coakley. Burkitt originally started training as an engineer, but switched to medicine after joining a Christian Evangelical Group at Trinity College, Dublin. Following his degree he began working as a doctor in West Africa as part of the Colonial Medical Service. In 1946 he transferred to the Royal Army Medical Corps and began working as a surgeon in Uganda. His first post was in a 100-bed district hospital some 275 miles away from the nearest x-ray facility. Thereafter he began working as a surgeon at Mulago Hospital in Kampala where he remained for the rest of his career. It was there that he began working on lymphoma.

Ugandan children, 1967. Photograph from Denis Parsons Burkitt's papers. Credit: Wellcome Library.

In March 1961, soon after mapping the distribution of the lymphoma in Africa, Burkitt took a leave of absence and went to London where he presented his findings to an audience at Middlesex Hospital. By chance, one of those listening to his talk was Anthony Epstein, a British pathologist then researching a virus that causes tumours in chickens, known as the Rous sarcoma virus. Greatly inspired by Burkitt's presentation and eager to be the first person to identify a virus that causes cancer in humans, Epstein invited Burkitt to send him some tumour cell samples from his Ugandan patients so that he could examine them for virus particles with one of the few newly-developed electron microscopes that he had in his London laboratory.

It was three years before Epstein managed to isolate the virus in the tumour cells from Uganda. The breakthrough came in December 1963 when he received some tumour cells from a nine-year-old patient. Initially he had not expected to find anything of value in that particular sample, because it appeared contaminated on arrival. This he put down to the fog that had delayed the flight carrying the specimen. To his amazement, however, he managed to detect a virus in the sample. The virus would later be named after him and Yvonne Barr, the doctoral student who had participated in his research.

Anthony Epstein and Yvonne Dawson. Credit: Anthony Epstein.

The detection of the virus was significant, but Epstein had yet to prove its link with Burkitt's lymphoma cases. This was not easy. One of the difficulties was the fact that the electron microscope only detected the presence of the EBV in one out of every 100 cells from the lymphoma tumours Burkitt sent him. Epstein soon joined forces with the husband and wife virologists, Werner and Gertrude Henle, as well as other researchers at the Children's Hospital in Philadelphia, and by 1967 they had established how infected B lymphocytes, a type of white blood cell, transmitted the virus to uninfected B lymphocytes which then became cancerous. It was the first time a virus had been shown to be the cause of cancer in humans (Henle, Diehl, Kohn, 1967).

Sequencing the Epstein-Barr Virus

The possibility of sequencing the EBV was first suggested to Sanger in the early 1980s by Beverly Griffin, one of his former PhD students based at the Imperial Cancer Research Fund (ICRF) Laboratories in London. Sanger found this a highly attractive proposition. Not only did EBV have a suitably long sequence, but Griffin and some of her colleagues were willing to share some cloned DNA fragments of the EBV that they had recently prepared. They had made them from a virus strain labelled B95.8. The B95.8 cell line had been created in 1972 by American scientists based at Yale and Harvard Universities, who had cultivated it by infecting B lymphocytes in marmosets with the virus taken from a patient with glandular fever. This cell line was unique in producing sufficient quantities of virus to extract DNA. Prior to the development of the B95.8 strain, scientists had struggled to obtain large amounts of the virus because only a small proportion of B lymphocytes spontaneously support virus production (Miller, Shope, Lisco, et al, 1972).

Beverly Griffin. Credit: Cold Spring Harbor Laboratory Archives. Born in the Southern United States, Griffin first came to England to study chemistry at Cambridge University. She went on to become Sanger's doctoral student, during which time she became involved in his protein and DNA sequencing efforts. Following her doctorate, Griffin began working with researchers at the ICRF Laboratories on small DNA viruses known to cause cancer in animals. She soon developed an interest in EBV, inspired in part by the work of her Swedish husband Tomas Lindahl, a specialist in DNA damage and cancer.

What Griffin and her colleagues were offering to send Sanger was a set of EBV DNA fragments they had generated using restriction enzymes and cloning procedures. These fragments varied in length, some being up to 17,000 nucleotides long. Almost nothing was known about their sequences, although there was some information on how the fragments fitted together from restriction enzyme site mapping (Arrand, Rymo, Walsh, 1981; Farrell, 2015).

Sequencing the EBV was a major challenge. Virtually nothing was known about its genetic structure, so any prediction of its genetic content depended crucially on the accuracy of the sequencing. This differed from the situation Sanger and his group had faced with bacteriophages, where they had identified genes with the help of a number of genetic maps already constructed by other researchers. Without such maps for the EBV, it was much harder to work out the sequence of the virus and pinpoint which of its regions coded for the particular amino acids that made up its proteins (Farrell, 2015; Crawford, Rickinson, Johannessen, 2014).

Sanger delegated responsibility for the EBV sequencing project to Barrell, who launched the project in 1980 with the help of several researchers. One of the team's first tasks was to extend the DNA fragments supplied by Griffin's group. By this time, advances had been made which increased the number of nucleotides that could be read from each sequence reaction. These longer sequences greatly simplified the assembly of the total sequence. Once the DNA fragments had been extended, members of the group were assigned different portions of the EBV's genome to sequence. Much of this sequencing was done by doctoral students who were then expected to analyse the results (Farrell, 2015).

This shows Paul Farrell, a key member of the EBV sequencing project. Credit: Farrell. Prior to arriving at the Laboratory of Molecular Biology (LMB), in August 1980, Farrell had completed a doctorate at Cambridge University and then been a postdoctoral fellow at Yale University. At the time that he joined the LMB, Farrell had originally intended to work with John Gurdon, but he soon switched to Barrell's project because sequencing the EBV offered him an exciting opportunity to work on gene expression and structure which was a long-standing research interest of his. Farrell was a welcome addition because he had specialised expertise in virology which no one else on the team possessed.

The sequence took three years to complete, with an average of eight people working on it at any one time. Published in Nature in 1984, the sequence contained 172,282 nucleotides, which was more than three times longer than the DNA sequence of the lambda bacteriophage. This was a significant achievement. Some idea of how large an undertaking it had been can be gauged from the fact that in 1984 when the EBV sequence entered an international collaborative DNA sequencing database run by the European Molecular Biology Laboratory, it comprised 10 per cent of the entire accumulated database (Baer, Bankier, Biggin et al, 1984; Farrell, 2015).

The significance of the EBV sequencing project

Completion of the EBV project marked a significant moment in DNA sequencing. It was remarkable not only because of the size of the sequence, but because it had been determined without any prior knowledge about its genetic profile. The success of the EBV project rested on the development of computer-assisted methods for searching the nucleotide sequence for features known to be relevant to the gene structure. These methods combed the sequence for long stretches lacking particular sets of three nucleotides, known as codons, which function as breaks in the coding process. Such codons do not code for any amino acids. In a random sequence these stop codons occur on average about once in every 21 codons, since three of the 64 possible codons are stops. Once the stop codons were identified, the sequence stretches between them were translated into protein sequences using the genetic code. This code defines the codons that specify each of the 20 amino acids naturally found in proteins. The longest of these open reading frames in each region of the genome was usually found to represent the actual gene product (Barrell, Farrell, 1986; Farrell, 2015).
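The search described above can be illustrated with a short Python sketch. It is not the software used by Barrell's group, only a minimal modern reconstruction of the idea under stated assumptions: the function names (reverse_complement, open_reading_frames, longest_orf) and the toy sequence are hypothetical, the standard genetic code is assumed, and an open reading frame is taken simply as a stop-free stretch between two stop codons, without requiring a start codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
# '*' marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def open_reading_frames(seq, min_codons=1):
    """Yield (strand, frame, protein) for every stop-free stretch.

    All six reading frames are scanned: three offsets on the given
    strand ('+') and three on its reverse complement ('-').
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            codons = [s[i:i + 3] for i in range(frame, len(s) - 2, 3)]
            run = []
            # A sentinel stop codon flushes the final stretch.
            for codon in codons + ["TAA"]:
                if codon in STOPS:
                    if len(run) >= min_codons:
                        protein = "".join(CODON_TABLE[c] for c in run)
                        yield strand, frame, protein
                    run = []
                else:
                    run.append(codon)


def longest_orf(seq):
    """Return the (strand, frame, protein) with the longest translation."""
    return max(open_reading_frames(seq), key=lambda orf: len(orf[2]))


if __name__ == "__main__":
    toy = "ATGAAATTTGGGTAACCATGA"  # hypothetical toy sequence, not EBV data
    for orf in open_reading_frames(toy, min_codons=3):
        print(orf)
    print("stop codons:", sorted(STOPS), "=", len(STOPS), "of", len(CODON_TABLE))
```

Because three of the 64 codons are stops, a random sequence is interrupted roughly once every 21 codons, so a stop-free stretch of several hundred codons is unlikely to arise by chance. That is why the longest open reading frame in a region was a reasonable first guess for a gene.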

What is remarkable is how accurate the EBV sequence turned out to be, as was the genetic map that was predicted on the back of it. Barrell’s group estimated that the sequence encoded approximately 84 genes. The team were initially nervous about putting their predictions into the public domain in case they were wrong. Farrell remembers spending a whole year with Barrell reading through computer printouts spread out over a desk, painstakingly checking the sequence manually for errors before they finally published the information. In subsequent years, only three mistakes were found in the original EBV sequence, an error rate of 1/50,000, which was the same as in the phage lambda sequence. The analytical methods developed during the EBV project became the standard technique for DNA sequencing until the next generation high throughput techniques were invented (Farrell, 2015).

It would take the team many years to construct the genetic map of the virus and work out all the gene coding regions in the sequence. One of those at the forefront of this effort was Farrell, who built up part of his career around the work (Baer, Bankier, Biggin et al, 1984; Farrell, 2015).

Initially, scientists hoped the sequencing data collected from the project would lead to better prevention of EBV infections. Many attempts have been made since the 1980s to develop a vaccine against the virus, but progress has been slow. One of the key challenges has been the difficulty of conducting meaningful animal studies, because only humans are affected by the virus. Without such animal testing it is difficult to launch human trials. Furthermore, many pharmaceutical companies have been reluctant to invest in the area because the market is too small. At present it looks as though a vaccine would be able to prevent glandular fever but not infection in general. Some progress, however, is being made in the development of screening tests to identify those at a high risk of developing nasopharyngeal carcinoma. Such tests hinge on knowledge of the EBV sequence, which could not have been unravelled without DNA sequencing. Efforts to track the transmission of the disease and its mutation worldwide are also highly dependent on DNA sequencing (Baer, Bankier, Biggin et al, 1984; Balfour, 2014; Farrell, 2015; Palser, Grayson, White et al, 2015).


Completion of the EBV sequence defined the beginning of a new phase of research on human herpes viruses (Farrell, 2015). In 1984, Barrell and his team began sequencing cytomegalovirus, another virus in the herpes family. Cytomegalovirus is a common virus that is spread through bodily fluids, such as saliva and urine. It is often passed on through the changing of nappies in young children. Other transmission routes are kissing, unprotected sex and the transplantation of an infected organ. The virus is largely harmless, and most infected people do not experience any symptoms, although some have a sore throat, swollen glands and a high temperature. The virus stays in the body for the rest of a person's life. In most cases, however, the virus remains inactive and does not cause further problems. It can become a problem when it is reactivated in people with weakened immune systems, such as HIV sufferers or those on immunosuppressant drugs to prevent the rejection of a transplanted organ.

The cytomegalovirus worked on by Barrell and his colleagues was sourced from the adenoids of a seven-year-old girl, and it took them five years to determine its complete sequence. They published this in 1990. Consisting of 229,354 nucleotides, it was the largest sequence completed so far. Following its completion, Barrell's team continued to work on the virus, investigating its coding capacities (Chee, Bankier, Beck et al, 1990; Bankier, Beck, Bohni et al, 1991; Rawlinson, Farrell, Barrell, 1996).

Research on the human viruses was just the start of the adoption of DNA sequencing in many different areas of human health. Such work was to be greatly aided by the rise of the computer and the development of automated sequencers.


Arrand, J, Rymo, L, Walsh, J E, Bjorck, Lindahl, T, Griffin, B E (1981) 'Molecular cloning of the complete Epstein-Barr virus genome as a set of overlapping restriction endonuclease fragments', Nucleic Acids Research, 9/13: 2999-3014.

Baer, R, Bankier, A T, Biggin, M D, Deininger, P L, Farrell, P J, Gibson, T J, Hatfull, G, Hudson, G S, Satchwell, S C, Seguin, C, Tufnell, P S, Barrell, B G (1984) 'DNA sequence and expression of the B95-8 Epstein-Barr virus genome', Nature, 310: 207-11.

Balfour, H H (2014) 'Progress, prospects, and problems in Epstein-Barr virus vaccine development', Current Opinion in Virology, 6: 1-5.

Bankier, Beck, Bohni et al (1991) 'The DNA sequence of the human cytomegalovirus genome', DNA Sequence, 2/1: 1-12.

Barrell, B G, Farrell, P (1986) 'Using nucleotide sequence determination to understand viruses', A.L. Notkins et al, eds, Concepts in viral pathogenesis II.

Chee, M, A, Bankier, A T, Beck, S, et al (1990) 'An analysis of the protein coding content of the sequence of human cytomegalovirus strain AD 169', Current Topics Microbiology Immunology.

Cohen, J I, Mocarski, Raab-Traub, Corey, L, Nabel, G J (2013) 'The need and challenges for development of an Epstein-Barr virus vaccine', Vaccine, 31S: B194-96.

Crawford, D H, Rickinson, A, Johannessen, I (2014), Cancer Virus: The story of the Epstein-Barr Virus, Oxford.

Farrell, P (2015), Interview with Lara Marks, 28 April and 14 May 2016, notes.

Fields, S, Winter, G, Brownlee, G G (1981) 'Structure of the neuraminidase gene in human influenza virus A/PR/8/34', Nature, 290: 213-17.

Ghedin, E, Sengamalay, Shumway et al (2005), 'Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution', Nature, 437: 1162-66.

Henle, W, Diehl, V, Kohn, G, zur Hausen, H, Henle, G (1967) 'Herpes-type virus and chromosome marker in normal leukocytes after growth with irradiated Burkitt cells', Science, 157: 1064-5.

J Craig Venter Institute, Influenza Genome Project.

Magrath, I (2009) 'Denis Burkitt and the African lymphoma', INCTR Newsletter, eCancer.

Miller, G, Shope, T, Lisco, H, Stitt, Lipman, M (1972) 'Epstein-Barr Virus: Transformation, cytopathic changes, and viral antigens in squirrel monkey and marmoset leukocytes', PNAS, 69/2: 383-87.

Palser, A L, Grayson, N E, White, R E, et al (2015) 'Genome diversity of Epstein-Barr Virus from multiple tumor types and normal infection', Journal of Virology, 89/10: 5222-37.

Rawlinson, W D, Farrell, H E, Barrell, B G (1996) 'Analysis of complete DNA sequence of murine cytomegalovirus', Journal of Virology, 70: 8833-49.

Winter, G, Fields, S (1980) 'Cloning of influenza cDNA into M13: The sequence of the RNA segment encoding the A/PR/8/34 matrix protein', Nucleic Acids Research, 8/9: 1965-74.

Winter, G, Interview with Lara Marks, 31 Aug 2011, notes.
