Computers, Databases and Automation
The introduction of computers
Much of the sequencing of the human viruses could not have been done without the aid of computers. In the very early days, most of the sequences were copied out by hand, with different workers recording particular portions of the molecule they were working on in their own notebooks. Writing them out by hand, however, became increasingly prone to error as the sequences grew longer. Amalgamating the data from the different notebooks was also very laborious. Some idea of the scale of the effort can be seen from the phi X174 bacteriophage project, where nine different workers were involved in recording the original sequencing data (Hutchison, 2007).
Mike Smith. Credit: Laboratory of Molecular Biology (LMB). Smith was a British biochemist based at the University of British Columbia in Vancouver who visited the LMB in the 1970s. In 1993 he shared the Nobel Prize in Chemistry with Kary Mullis; Smith's share recognised his development of site-directed mutagenesis.
For many LMB researchers, computers offered an ideal solution to sequence collection, storage and analysis. LMB researchers had been using computers since the 1950s. In addition, computer databases were being developed for protein and RNA sequences from the early 1970s. Sanger, however, was initially resistant to using computers for his DNA sequencing work. Indeed, as he himself recalled, whenever the idea was suggested he could be 'quite rude' in his reply. This was, in part, because he felt he was not generating enough data to warrant it. But equally important, he feared computers would diminish his enjoyment in the work. Unlike some of his colleagues he greatly liked taking the autoradiographs home with him, and looked 'forward to the pleasure of reading them in the peace of the evening'. For him reading autoradiographs 'was always a delight' because it reminded him of 'the way we used to do sequences one residue at a time by painstaking partial hydrolysis, fractionation, and analysis'. Part of the fun for Sanger was 'looking at the autoradiographs and figuring out what sequence it was' (Sanger, 1988; Sanger, 1992).
By the early 1970s Sanger began to change his mind about the use of computers, but neither he nor anyone else in his team had computing expertise. Michael Smith, a British-Canadian visitor to the laboratory, soon came to the rescue, and contacted his brother-in-law, Duncan McCallum, to help. McCallum was using computers routinely for processing administrative data as part of his job in the management division of Ciba-Geigy, a multinational chemical company (Garcia-Sancho, 2012).
Together, McCallum and Smith developed a computer programme for DNA sequencing which could compile and number the complete sequence and facilitate the editing of data. It also included other important elements, such as the ability to translate the data into protein sequences and search for specific short sequences, or families of sequences (Garcia-Sancho, 2012).
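The three operations described above (numbering a compiled sequence, translating it into protein, and searching for short sequences) can be illustrated with a minimal modern sketch in Python. This is not a reconstruction of McCallum and Smith's programme: the function names, the line width of 60 and the use of 'X' for unreadable codons are all assumptions made for illustration. Only the standard genetic code table is taken as given.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def numbered_lines(dna, width=60):
    """Split a sequence into lines, each prefixed with its 1-based start position.

    The width of 60 is an arbitrary choice for this sketch.
    """
    return [f"{i + 1:>6} {dna[i:i + width]}" for i in range(0, len(dna), width)]


def translate(dna, frame=0):
    """Translate DNA into one-letter protein code ('*' = stop, 'X' = unknown codon)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_all(dna, motif):
    """Return the 1-based position of every occurrence of motif, overlaps included."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = dna.find(motif, start + 1)
    return positions
```

For example, `translate("ATGGCCTAA")` gives `"MA*"` (methionine, alanine, stop), and `find_all("GATATATC", "ATA")` gives `[2, 4]`.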
McCallum and Smith's system was first used to store sequences generated with the plus and minus system for the phi X174 DNA project. Everyone in Sanger's group was expected to transcribe their manually deduced sequences onto paper forms, which were then entered on to punched cards in blocks of 60. The punched cards were then sent off to be processed by mainframe operators at a remote computing centre. Once this was done a printout of the data was sent back to the laboratory. Such printouts, while invaluable, were unsuitable for publication, so Sanger's secretary, Margaret Dowding, had to transcribe the sequence manually on her typewriter (Hutchison, 2007).
The presentation of the sequence data began to change with the introduction of the computer. In the early days, when working with insulin and RNA, Sanger had sought to preserve the spatial configuration of the molecular structure when writing out the sequences. By the time of the phi X174 project, the DNA sequence was written out as successive characters on consecutive lines which bore no resemblance to the original shape of the molecule. Increasingly, from this time, DNA sequencing data was presented in a form that bore no relationship to the biochemical structure of the template molecule (Garcia-Sancho, 2012).
In 1977 Smith returned to British Columbia and Sanger turned to the computing expertise of Rodger Staden who was based in LMB’s Structural Division. Staden was already well versed in the writing of computer software, having done it for the repetitive processing of data gathered by the x-ray crystallographers in his division. A mathematical physicist by training, Staden had a strong interest in sequencing and eagerly turned his attention to helping Sanger and his team (Garcia-Sancho, 2012).
Rodger Staden. Credit: LMB. Staden devised the first DNA sequencing software. His programming software would go on to be acknowledged internationally as one of the best to use for managing data from sequencing projects.
Over the next three years, Staden wrote a number of computer programmes designed to help with the assembly and analysis of DNA sequences. His system ran the same operations as those developed by McCallum and Smith, but differed in that his programme defined sequences as 'character strings'. These could be entered, assembled and edited directly by the researchers themselves. The programme also made it possible to search for any overlaps in the sequences. Later editions of the programme, developed in collaboration with the researchers working on the EBV sequencing project, improved the assembly of DNA sequences, enabled them to be translated into protein sequences and added the ability to search for sequence patterns that marked the beginning and end of genes. These features were invaluable for sequencing EBV, because the team had no prior genetic map to work from when starting the project (Garcia-Sancho, 2012; Farrell, 2015).
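To give a sense of what searching for overlaps between sequences held as character strings involves, here is a minimal Python sketch. It is not Staden's algorithm: it only checks whether the end of one gel reading exactly matches the start of another, and the function names and the minimum overlap of 3 bases are assumptions chosen for illustration. Real assembly software must also cope with reading errors and reverse-complement strands.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two readings into one string if they overlap, otherwise return None."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]
```

With the readings `"ACGTTGCA"` and `"TGCAGGTC"`, the shared `"TGCA"` gives an overlap of 4, and `merge_reads` returns `"ACGTTGCAGGTC"`.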
Staden's software had the further advantage that it eliminated the need for a mainframe computer, as it was written and designed to be run on minicomputers, which were becoming more common in the laboratory. Those at the forefront of the sequencing process could now input and edit their own data. All they needed to do was read the sequence from an autoradiograph and type it on a keyboard attached to a minicomputer. All the data was saved on to magnetic disks in files measured in kilobytes. The minicomputers were attached to a large processor shared with other LMB scientists, which was located in a specific room. Partial sequences were printed out on a continuous string of paper by the processor (Garcia-Sancho, 2012).
Along with the introduction of computers, Sanger's group continued to improve and speed up the sequencing and analytical process. A number of improvements were made incrementally to the method during the EBV sequencing project. These included switching to the use of the 35S radioactive label instead of 32P. One of the advantages of this new label was that it emitted less powerful radiation than the 32P label and so formed sharper bands on the autoradiograph. The team also experimented with new methods of pouring the gel mix, and explored ways of getting higher concentrations of the acrylamide gels at the bottom rather than the top of each column. In the end they had created what would be known as gradient gels, i.e. gels which displayed more bands in a sequence. Together the changes substantially increased both the accuracy and quality of the autoradiograph images which now showed a greater number of bands in sharper detail, allowing for longer and more accurate readings of the sequences. This greatly sped up the sequencing process. By the time of the cytomegalovirus project, the average speed of sequencing was 1,000 base pairs per week. Completing the sequence of the virus was estimated to have been a 12-year workload for one person (Farrell, 2015; Sijmons, Van Ranst, Maes, 2014).
The EMBL Database
In addition to the introduction of computers and technical changes, Sanger and his colleagues became involved in the development of an international database for the storage of DNA sequences through the European Molecular Biology Laboratory (EMBL). Two of the key people involved in its development were Kenneth Murray, who had been involved in Sanger's early DNA sequencing work, and Rodger Staden (see above). The EMBL database was managed by an advisory committee with scientists from 14 European states, and its main aim was to amalgamate protein and nucleic acid repositories that had been set up around Europe. Part of the rationale for the involvement of Sanger and his colleagues in the project arose from their strong conviction that they should make their data publicly available because their work was publicly funded (Garcia-Sancho, 2012; Farrell, 2015).
The EMBL database was attractive to Sanger and other scientists because it was a centralised and international resource maintained by specialised staff and supported with EMBL funds. This was seen as a major step forward for biological researchers, freeing them to pursue their research without the hassle of raising funds to develop and support databases. Prior to the establishment of the EMBL database individual researchers had experienced many difficulties in securing funding to maintain and run their database collections (Garcia-Sancho, 2012).
Computers and databases remained the only mechanised part of the sequencing process for many years. Much of the work involved in sequencing remained manual, with each researcher performing sequencing reactions in test tubes, carefully preparing gels and using electrophoresis to separate the radioactively labelled DNA fragments. The final stage of the process, producing and interpreting the autoradiographs, also remained manual. All of the sequencing work in the EBV project, which was completed in 1984, was carried out in this way, with the computer being used only in the final stage of the process.
The work demanded a great deal of skill and patience. Just how labour-intensive and technically demanding DNA sequencing was in the early period can be seen from the steps needed to prepare gels. In the first instance, a gel mould needed to be made by taping two glass plates together with a 2mm spacer to keep them apart. Once this was done a comb device was inserted at the top. The comb was important because its teeth formed grooves in the gel. These grooves provided wells into which the DNA samples could be loaded. In addition to creating a mould, the gel had to be prepared. This required mixing together different ingredients. Following preparation, the gel was poured between the two plates of the mould. Pouring could be very fiddly because of the very narrow gap between the two plates, and the need to guard against the development of any air bubbles in the solution. After this the gel was left to solidify. The comb was taken out once the gel had hardened and the DNA samples were then loaded into the wells it had left behind. Loading the DNA samples was also awkward because the wells were very small. In the early days the only way researchers could load a DNA sample was to suck it up by mouth into a capillary pipette and then blow it into a well. This was the method Barrell's team used for sequencing EBV. On average they could load 80 samples on to the gels at any one time. This took 30 minutes. Not surprisingly the work was physically demanding (Farrell, 2015).
Diagram showing set up for gel ready for sequencing DNA. Credit: Regents Genetics Technology, Wikispaces.
Once the DNA samples had been loaded, an electric current was applied to separate the DNA fragments in each sample by size. The glass plates were then carefully levered apart so as not to distort the gel but leave it attached to just one of the plates. The plate with the remaining gel was then lowered into a trough with acid so as to fix the DNA into position on the gel. It was then lifted out and carefully covered with a 3MM sheet, a type of blotting paper. The objective was to transfer the gel from the plate to the paper. When this was done, the paper was carefully turned over so that it faced gel side up, and was covered with cling film. This had to be done very carefully to avoid trapping air bubbles. The paper was then put through a suction gel dryer to heat the gel so that it stuck to its surface. By the end of the process the paper had a thin 'plastic' film of gel on one side. Finally the paper was covered with an x-ray film to generate an autoradiograph. The whole process was very repetitive and time-consuming, demanding a great deal of dexterity and skill (Farrell, 2015).
Reading an autoradiograph also remained a manual process for a long time. Indeed, many were reluctant for this to become automated. Scientists, for example, who met in April 1980 at a workshop to discuss the initiation of the EMBL database, expressed some anxiety about this possibility. They were highly reluctant to relinquish to a machine the task of deciding whether an adenine nucleotide came before or after a thymine in a sequence. They certainly did not want to see their work being transformed into the mere accumulation of data. Such a move, they feared, would diminish its status and intellectual value (Garcia-Sancho, 2012).
From the early 1980s, however, researchers at the LMB and elsewhere began to develop some instruments to automate the reading of the sequence of nucleotides directly from autoradiographs. By 1984 Sanger's team had devised a hand-held scanning pen device for directly reading sequences from autoradiographs into the computer. Eliminating the need to type in sequences, the device made the transcription process much faster and less prone to error. The scanning pen was connected to a particular computer programme, which kept track of which lane it was following, thus avoiding a common source of error with a pencil and paper reading (Staden, 1984; Farrell, 2015).
By the late 1980s, a number of new sequencing machines began to emerge which radically improved the speed of the sequencing process. The first automated DNA sequencer, AB370, was developed by a group of scientists at the California Institute of Technology (Caltech) led by the immunologist Leroy Hood. A key driver in its development was the team's ambition to isolate and clone immune genes faster than its rivals. Unlike Sanger and many European biological researchers who took pride in their sequencing skills, Hood and his colleagues were frustrated by the sequencing process which they regarded as too slow, tedious and fraught with error. What they particularly disliked was the amount of time the process took away from their immunological research (Garcia-Sancho, 2012).
Lloyd M Smith. Credit: Smith.
First automated DNA sequencing machine. Credit: Lloyd Smith.
The project to build the first automated DNA sequencer was headed by Lloyd M Smith, with the collaboration of two brothers, Michael and Timothy Hunkapiller. They were supported by private money, mainly from a start-up company, Applied Biosystems, set up by Hood. The machine took five years to construct. It was closely modelled on Sanger's original, manual sequencing process. A technician needed to pour gel into the space between two glass plates set less than a millimetre apart. Once this gel had set, the technician loaded the DNA on to slabs of gel in a number of lanes ready for separation and analysis by the machine (Garcia-Sancho, 2012; Genome News Network).
The new machine depended on a small modification to Sanger's original dideoxy sequencing approach. Instead of using radioactive tags in each of the four dideoxy reactions it used fluorescent dyes. This produced a series of overlapping DNA fragments with distinct fluorescent tagged endings which corresponded to the four different DNA bases. The adoption of fluorescent tags made it possible to separate the fragments in a single gel lane and pick out individual nucleotides by means of a laser. The end readout was a linear colour pattern corresponding with the DNA sequence. This data was automatically sent to a computer to determine the sequence (Garcia-Sancho, 2012).
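The final step, in which the computer turns the colour pattern into a sequence, can be sketched very roughly in Python. This is an illustrative simplification rather than the method used in the original machine: it assumes the signal has already been reduced to one set of four intensities per peak, keyed directly by base, and it calls the brightest channel at each peak. The function name, the input layout and the `min_ratio` threshold are all hypothetical.

```python
def call_bases(peaks, min_ratio=1.5):
    """Call one base per peak from four fluorescence intensities.

    `peaks` is a list of dicts mapping each base ('A', 'C', 'G', 'T') to the
    intensity of its dye at that peak. The brightest channel is called only if
    it is at least `min_ratio` times the second brightest; otherwise the
    position is reported as 'N' (ambiguous).
    """
    calls = []
    for intensities in peaks:
        ranked = sorted(intensities.items(), key=lambda item: item[1], reverse=True)
        best_base, best = ranked[0]
        second = ranked[1][1]
        if best > 0 and best >= min_ratio * second:
            calls.append(best_base)
        else:
            calls.append("N")
    return "".join(calls)
```

Given three peaks where the first is dominated by the A dye, the second by the G dye, and the third has nearly equal A and C signals, the function returns `"AGN"`.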
The introduction of fluorescent labelling and lasers greatly accelerated the sequencing process. Now it was possible to sequence 96 bases all at once, 500 kilobases per day, and to read DNA stretches of up to 600 bases. Until the late 1980s it took a whole year for an individual to determine a sequence of 20,000 to 50,000 bases; with the new sequencer it could be read in a matter of hours. Further sequencing machines followed, with important design improvements. This included the development of capillary sequencers, which made it possible to run DNA through an array of 96 gel-filled glass tubes the width of a human hair rather than through a slab of gel (Genome News Network).
These revolutionary innovations laid the foundation for the Human Genome Project, launched in 1990. In the early 1980s Farrell had estimated, based on the progress in sequencing EBV, that it would take nearly 25,000 years to sequence all 3 billion base pairs in the whole human genome. Yet the whole human genome sequence was determined in just 13 years, by 2003 (Crawford, Rickinson, Johannessen, 2014).
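The scale of the change can be checked with some back-of-the-envelope arithmetic using only the figures quoted in the text. The short Python sketch below is a rough illustration, not a model of any real project: it assumes a single pass over the genome with no repeat reads, no assembly time and a machine running every day of the year, and the function and constant names are invented for the example.

```python
GENOME_BP = 3_000_000_000        # base pairs in the human genome, as quoted
FARRELL_ESTIMATE_YEARS = 25_000  # estimate from the early 1980s, as quoted
AUTOMATED_BP_PER_DAY = 500_000   # 500 kilobases per day, as quoted


def implied_rate_per_year(genome_bp, years):
    """Yearly throughput implied by finishing `genome_bp` in `years`."""
    return genome_bp / years


def days_for_single_pass(genome_bp, bp_per_day):
    """Days needed to read every base once at a constant daily rate."""
    return genome_bp / bp_per_day


manual_rate = implied_rate_per_year(GENOME_BP, FARRELL_ESTIMATE_YEARS)
machine_days = days_for_single_pass(GENOME_BP, AUTOMATED_BP_PER_DAY)

print(f"Implied manual rate: {manual_rate:,.0f} bases per year")
print(f"One automated machine, single pass: {machine_days:,.0f} days "
      f"(about {machine_days / 365:.1f} years)")
```

The 25,000-year estimate implies a rate of 120,000 bases per year, whereas one machine at 500 kilobases per day would need 6,000 days, or roughly 16 years, for a single pass. In practice each region has to be read many times over, which is one reason the work was spread across many machines.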
Automation and competition also helped reduce the cost of the work. In 2002 the cost of sequencing an average human genome was $10,000. From 2005 onwards companies entered a race to reduce the cost of sequencing down to $1,000 so that it could be used routinely in clinics. This led to the introduction of a new generation of sequencers from 2007. By 2013 the cost had dropped to $5,000 (Check Hayden, 2014). Just five years later commercial companies were offering to carry out whole genome sequencing for less than $200. Private companies are currently racing to reduce the cost to only $100.
Sample read outs from a manually produced autoradiograph (left) and automatic sequencer (right). Note how the four lanes that appear in the left hand have been condensed down into one lane. This was made possible by the adoption of fluorescent tagging. Credit: Abizar Lakdawalla.
Check Hayden, E (2014) 'Technology: The $1,000 genome', Nature, 507: 294-5.
Crawford, D H, Rickinson, A, Johannessen, I (2014), Cancer Virus: The story of the Epstein-Barr Virus, Oxford.
Farrell, P (2015), Interview with Lara Marks, 28 April and 14 May 2016, notes.
Garcia-Sancho, M (2012) Biology, Computing and the History of Molecular Sequencing: From Proteins to DNA, 1945-2000, Basingstoke.
Genome News Network, 'How does DNA sequencing work', Genome News Network.
Hutchison, C (2007) 'DNA sequencing: Bench to bedside and beyond', Nucleic Acids Research, 35/18: 6227-37.
Sanger, F (1988) 'Sequences, sequences, sequences', Annual Review of Biochemistry, 57: 1-29.
Sanger, F (1992) 'A life of research on the sequences of proteins and nucleic acids: Dr Fred Sanger in conversation with George Brownlee', Biochemistry Society archives.
Sijmons, S, Van Ranst, Maes, R (2014) 'Genomic and functional characteristics of human cytomegalovirus revealed by next-generation sequencing', Viruses, 6: 1049-72.
Staden, R (1984) 'A computer program to enter DNA gel reading data into a computer', Nucleic Acids Research, 12/1: 499-503.