Computing Science is in the middle of a major paradigm shift, driven by Molecular Biology. With his breath-taking paper, Adleman announced the arrival of computers based on biochemical operations and showed that a large class of computationally hard problems may be best solved not by pushing electrons through wires in a computing laboratory, but by mixing solutions in test tubes in a molecular biology laboratory. Because computationally hard problems are the stumbling blocks for contemporary von Neumann computers, DNA-based computation is poised to play a growing role in computing. This paper discusses this novel idea of DNA-based computation.
Today’s computers are millions of times more powerful than their crude ancestors of the 1940s and 50s. Roughly every two years, computers have become twice as fast while their components have shrunk to half the space. However, integrated-circuit technology is now running up against its limits, and it has been hotly debated whether computers of an entirely new kind, such as quantum-mechanical computers or computers based on Molecular Biology, are in the offing. One recently introduced unconventional paradigm that promises to have a tremendous influence on the theoretical and practical progress of computer science is DNA computing, which under some circumstances might be an elegant alternative to the classical Turing/von Neumann notion of computing. The union of two of science’s most fertile fields, molecular biology and computer science, is sure to produce some remarkable offspring.
In 1994, Leonard M. Adleman solved an unremarkable computational problem with a remarkable technique. It was a problem that a person could solve in a few moments, or that an average desktop machine could dispatch in the blink of an eye; it took Adleman seven days. He solved it with an in vitro DNA-recombination assay, performed experimentally using hybridization, several agarose-gel separations, and PCR, all by manipulating DNA sequences in a test tube. Why then was this work exceptional? Because he solved the problem with DNA: it was a landmark demonstration of computing at the molecular level. (Before the experiment is discussed in detail, a short overview of DNA molecules, which pave the way for this sort of innovative computing model, is given below.)
Nevertheless, his work is significant for a number of reasons.
- It illustrates the possibilities of using DNA to solve a class of problems that is difficult or impossible to solve using traditional computing methods.
- It’s an example of computation at a molecular level, potentially a size limit that may never be reached by the semiconductor industry.
- It demonstrates unique aspects of DNA as a data structure.
- It demonstrates that computing with DNA can work in a massively parallel fashion.
The Structure and Manipulation of DNA
DNA (deoxyribonucleic acid) encodes the genetic information of cellular organisms. It consists of polymer chains, commonly referred to as DNA strands. DNA is like two strings twisted together in a long spiral.
DNA is found in all cells as base pairs made of four different nucleotides. Each base pair is formed from two complementary nucleotides bonded together. The four bases in DNA’s alphabet are adenine (A), thymine (T), cytosine (C), and guanine (G).
Adenine and thymine always bond together as a pair, and cytosine and guanine bond together as a pair. The pairs link together like rungs in a ladder.
Base pairs in DNA bond together to form a ladder-like structure. Because bonding occurs at angles between the bases, the whole structure twists into a helix.
In an E. coli bacterium, this ladder is about 4 million base pairs long. The two ends link together to form a ring, and then the ring gets wadded up to fit inside the cell. The entire ring is known as the genome, and scientists have completely decoded it. That is, scientists know all 4 million of the base pairs needed to form an E. coli bacterium’s DNA exactly. The human genome project is in the process of finding all 3 billion or so of the base pairs in a typical human’s DNA.
DNA: A Unique Data Structure
The amount of information gathered on the molecular biology of DNA over the last 40 years is almost overwhelming in scope. So instead of getting bogged down in biochemical and biological details of DNA, we’ll concentrate on only the information relevant to DNA computing.
The data density of DNA is impressive. Just as a string of binary data is encoded with ones and zeros, a strand of DNA is encoded with four bases, represented by the letters A, T, C, and G. The bases (also known as nucleotides) are spaced every 0.35 nanometers along the DNA molecule, giving DNA a remarkable data density of nearly 18 Mbytes per inch. In two dimensions, if you assume one base per square nanometer, the data density is over one million Gbits per square inch. Compare this to the data density of a typical high-performance hard drive, about 7 Gbits per square inch: a factor of over 100,000 smaller.
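These figures follow from simple arithmetic; here is a quick sanity check of the linear density (a sketch using only the 0.35 nm base spacing and 2 bits per base quoted above):

```python
# Back-of-the-envelope check of DNA's linear data density.
BASE_SPACING_NM = 0.35        # spacing between bases along the strand
BITS_PER_BASE = 2             # four bases -> log2(4) = 2 bits each
NM_PER_INCH = 2.54e7          # 1 inch = 2.54 cm = 2.54e7 nm

bases_per_inch = NM_PER_INCH / BASE_SPACING_NM
megabytes_per_inch = bases_per_inch * BITS_PER_BASE / 8 / 1e6
# comes out to roughly 18 Mbytes per linear inch, as quoted above
```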
Another important property of DNA is its double stranded nature. The bases A and T, and C and G, can bind together, forming base pairs. Therefore, every DNA sequence has a natural complement.
For example, if sequence S is ATTACGTCG, its complement, S’, is TAATGCAGC. Both S and S’ will come together (or hybridize) to form double stranded DNA. This complementarity makes DNA a unique data structure for computation and can be exploited in many ways. Error correction is one example. Errors in DNA happen due to many factors.
Occasionally, DNA enzymes simply make mistakes, cutting where they shouldn’t, or inserting a T for a G. DNA can also be damaged by thermal energy and by UV radiation from the sun. If the error occurs in one of the strands of double-stranded DNA, repair enzymes can restore the proper DNA sequence by using the complement strand as a reference. In this sense, double-stranded DNA is similar to a RAID 1 array, where data is mirrored on two drives, allowing data to be recovered from the second drive if errors occur on the first. In biological systems, this facility for error correction means that the error rate can be quite low. For example, in DNA replication there is one error for every 10^9 copied bases, in other words an error rate of 10^-9. (In comparison, hard drives achieve read error rates of 10^-13, with the aid of Reed-Solomon error correction.)
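The complement relationship, and the idea of restoring one strand from the other, can be sketched in a few lines (illustrative only; real repair enzymes are far more involved, and the `?` marker for a damaged base is purely a hypothetical convention for this sketch):

```python
# Watson-Crick pairing: A<->T, C<->G.
PAIR = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def complement(strand):
    """Return the complementary strand, position by position, as in the text."""
    return ''.join(PAIR[b] for b in strand)

def repair(damaged, reference):
    """Restore a damaged strand using its intact complement as a reference.
    A '?' marks a damaged position; it is replaced by pairing off the reference."""
    return ''.join(PAIR[r] if d == '?' else d
                   for d, r in zip(damaged, reference))

s = 'ATTACGTCG'
s_comp = complement(s)               # 'TAATGCAGC', matching the example above
fixed = repair('ATTA?GTCG', s_comp)  # recovers the original sequence S
```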
Operations in Parallel
In the cell, DNA is modified biochemically by a variety of enzymes, tiny protein machines that read and process DNA according to nature’s design. There is a wide variety of these “operational” proteins, which manipulate DNA on the molecular level. Just as a CPU has a basic suite of operations such as addition, bit-shifting, and logical operators (AND, OR, NOT, NOR) that allow it to perform even the most complex calculations, DNA has enzymes for cutting, copying, pasting, repairing, and many other operations. And note that in the test tube, enzymes do not function sequentially, working on one DNA molecule at a time. Rather, many copies of an enzyme can work on many DNA molecules simultaneously. This is the power of DNA computing: it can work in a massively parallel fashion.
DNA vs. Silicon
DNA, with its unique data structure and ability to perform many parallel operations, allows you to look at a computational problem from a different point of view. Transistor-based computers typically handle operations in a sequential manner. Of course, there are multi-processor computers, and modern CPUs incorporate some parallel processing, but in general, in the basic von Neumann architecture, instructions are handled sequentially. A von Neumann machine, which is what all modern CPUs are, basically repeats the same “fetch and execute” cycle over and over again: it fetches an instruction and the appropriate data from main memory, and then it executes the instruction. It does this many, many times in a row, really, really fast. The great Richard Feynman, in his Lectures on Computation, summed up von Neumann computers by saying, “the inside of a computer is as dumb as hell, but it goes like mad!” DNA computers, however, are non-von Neumann, stochastic machines that approach computation in a different way from ordinary computers for the purpose of solving a different class of problems.
Typically, increasing the performance of silicon computing means faster clock cycles (and larger data paths), where the emphasis is on the speed of the CPU rather than the size of the memory. For example, will doubling the clock speed or doubling your RAM give you better performance? For DNA computing, though, the power comes from memory capacity and parallel processing. If forced to behave sequentially, DNA loses its appeal. For example, let’s look at the read and write rate of DNA. In bacteria, DNA can be replicated at a rate of about 500 base pairs a second. Biologically this is quite fast (10 times faster than in human cells) and, considering the low error rates, an impressive achievement. But this is only 1000 bits/sec, a snail’s pace compared to the data throughput of an average hard drive. Now look what happens if you allow many copies of the replication enzymes to work on DNA in parallel. First of all, the replication enzymes can start on the second replicated strand of DNA even before they’ve finished copying the first one, so the data rate already jumps to 2000 bits/sec. Then, after each replication finishes, the number of DNA strands increases exponentially (2^n after n iterations). With each additional strand, the data rate increases by 1000 bits/sec. So after 10 iterations, the DNA is being replicated at a rate of about 1 Mbit/sec; after 30 iterations it increases to 1000 Gbits/sec, beyond the sustained data rates of the fastest hard drives.
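The throughput argument above reduces to a one-line formula; here is a sketch (idealized: it assumes every strand is copied concurrently at the full per-enzyme rate):

```python
# Exponential growth of aggregate replication throughput.
RATE_BITS_PER_STRAND = 1000   # ~500 bp/s * 2 bits per base, as in the text

def aggregate_rate(iterations):
    """Total copying rate once 2**iterations strands are being replicated
    in parallel, in bits per second."""
    return (2 ** iterations) * RATE_BITS_PER_STRAND

after_10 = aggregate_rate(10)   # ~1 Mbit/sec
after_30 = aggregate_rate(30)   # ~1000 Gbits/sec
```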
Now let’s consider how you would solve a nontrivial example of the traveling salesman problem (# of cities > 10) with silicon vs. DNA. With a von Neumann computer, one naive method would be to set up a search tree, measure each complete branch sequentially, and keep the shortest one. Improvements could be made with better search algorithms, such as pruning the search tree when one of the branches you are measuring is already longer than the best candidate. A method you certainly would not use would be to first generate all possible paths and then search the entire list. Why? Well, consider that the entire list of routes for a 20 city problem could theoretically take 45 million GBytes of memory (18! routes with 7 byte words)! Also for a 100 MIPS computer, it would take two years just to generate all paths (assuming one instruction cycle to generate each city in every path). However, using DNA computing, this method becomes feasible! 10^15 is just a nanomole of material, a relatively small number for biochemistry. Also, routes no longer have to be searched through sequentially. Operations can be done all in parallel.
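The 45-million-GByte figure can be checked directly from the numbers in the text (18! routes at 7 bytes each):

```python
import math

# Storage needed to hold every route of a 20-city tour explicitly:
# with the start and end cities fixed, the 18 intermediate cities can be
# ordered in 18! ways, and the text assumes a 7-byte word per route entry.
routes = math.factorial(18)
bytes_total = routes * 7
gigabytes = bytes_total / 1e9   # ~4.5e7 GB, i.e. about 45 million GBytes
```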
The Hamilton Path Problem
The Hamiltonian Path Problem (HPP) may be phrased as follows: given a set of n cities connected by one-way and two-way roads, does there exist a path through this network, starting at a designated first city and ending at a designated last city, that visits each city once and only once?
The Adleman experiment
There is no better way to understand how something works than by going through an example step by step. So let’s solve our own directed Hamiltonian Path problem, using the DNA methods demonstrated by Adleman. The concepts are the same but the example has been simplified to make it easier to follow and present.
Suppose that I live in LA and need to visit four cities: Dallas, Chicago, Miami, and NY, with NY being my final destination. The airline I’m taking has a specific set of connecting flights that restrict which routes I can take (i.e. there is a flight from L.A. to Chicago, but no flight from Miami to Chicago). What should my itinerary be if I want to visit each city only once?
It should take you only a moment to see that there is only one route. Starting from L.A. you need to fly to Chicago, Dallas, Miami and then to N.Y. Any other choice of cities will force you to miss a destination, visit a city twice, or not make it to N.Y.
For this example, you obviously don’t need the help of a computer to find a solution. For six, seven, or even eight cities, the problem is still manageable. However, as the number of cities increases, the problem quickly gets out of hand. Assuming a random distribution of connecting routes, the number of itineraries you need to check increases exponentially. Pretty soon you will run out of pen and paper listing all the possible routes, and it becomes a problem for a computer or perhaps DNA. The method Adleman used to solve this problem is basically the shotgun approach mentioned previously. He first generated all the possible itineraries and then selected the correct itinerary. This is the advantage of DNA. It’s small and there are combinatorial techniques that can quickly generate many different data strings. Since the enzymes work on many DNA molecules at once, the selection process is massively parallel.
Specifically, the method based on Adleman’s experiment would be as follows:
- Generate all possible routes.
- Select itineraries that start with the proper city and end with the final city.
- Select itineraries with the correct number of cities.
- Select itineraries that contain each city only once.
All of the above steps can be accomplished with standard molecular biology techniques.
Part I: Generate all possible routes
Strategy: Encode city names in short DNA sequences. Encode itineraries by connecting the city sequences for which routes exist.
DNA can simply be treated as a string of data. For example, each city can be represented by a “word” of six bases:
- L.A.: GCTACG
- Chicago: CTAGTA
- Dallas: TCGTAC
- Miami: CTACGG
- N.Y.: ATGCCG
The entire itinerary can be encoded by simply stringing together these DNA sequences that represent specific cities. For example, the route L.A. -> Chicago -> Dallas -> Miami -> New York would simply be GCTACGCTAGTATCGTACCTACGGATGCCG, or equivalently it could be represented in double-stranded form with its complement sequence.
So how do we generate this? Synthesizing short single-stranded DNA is now a routine process, so encoding the city names is straightforward. The molecules can be made by a machine called a DNA synthesizer or even custom-ordered from a third party. Itineraries can then be produced from the city encodings by linking them together in the proper order. To accomplish this, you can take advantage of the fact that DNA hybridizes with its complementary sequence. For example, you can encode the route between two cities with the complement of the second half (last three letters) of the departure city followed by the first half (first three letters) of the arrival city. For example, the route between Miami (CTACGG) and NY (ATGCCG) can be made by taking the second half of the coding for Miami (CGG) and the first half of the coding for NY (ATG). This gives CGGATG. By taking the complement of this you get GCCTAC, which not only uniquely represents the route from Miami to NY, but will connect the DNA representing Miami and NY by hybridizing with the second half of the code representing Miami (…CGG) and the first half of the code representing NY (ATG…).
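The splint construction just described is easy to mirror in code. Here is a sketch, with the six-base city words taken from the example (the Chicago and Dallas assignments are inferred from the itinerary string above):

```python
# Six-base city encodings from the example itinerary.
CITY = {
    'LA':      'GCTACG',
    'Chicago': 'CTAGTA',
    'Dallas':  'TCGTAC',
    'Miami':   'CTACGG',
    'NY':      'ATGCCG',
}
PAIR = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def linker(depart, arrive):
    """Encode a route: the complement of (last 3 bases of the departure city
    + first 3 bases of the arrival city), so it bridges the two city strands."""
    splint = CITY[depart][3:] + CITY[arrive][:3]
    return ''.join(PAIR[b] for b in splint)

miami_to_ny = linker('Miami', 'NY')   # 'GCCTAC', as derived in the text
```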
Random itineraries can be made by mixing city encodings with the route encodings. Finally, the DNA strands can be connected together by an enzyme called ligase. What we are left with are strands of DNA representing itineraries with a random number of cities and a random set of routes.
We can be confident that we have all possible combinations, including the correct one, by using an excess of DNA encodings, say 10^13 copies of each city and of each route between cities. Remember, DNA is a highly compact data format, so the numbers are on our side.
Part II: Select itineraries that start and end with the correct cities
Strategy: Selectively copy and amplify only the section of the DNA that starts with LA and ends with NY by using the Polymerase Chain Reaction.
After Part I, we now have a test tube full of DNA strands of various lengths that encode possible routes between cities. What we want are routes that start with LA and end with NY. To accomplish this we can use a technique called the Polymerase Chain Reaction (PCR), which allows you to produce many copies of a specific sequence of DNA. PCR is an iterative process that cycles through a series of copying events using an enzyme called polymerase. Polymerase will copy a section of single-stranded DNA starting at the position of a primer, a short piece of DNA complementary to one end of the section of DNA that you’re interested in. By selecting primers that flank the section of DNA you want to amplify, the polymerase preferentially amplifies the DNA between these primers, doubling the amount of DNA containing this sequence. After many iterations of PCR, the DNA you’re working on is amplified exponentially. So to selectively amplify the itineraries that start and stop with our cities of interest, we use primers that are complementary to LA and NY. What we end up with after PCR is a test tube full of double-stranded DNA of various lengths, encoding itineraries that start with LA and end with NY.
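Logically, this step keeps only strands flanked by the two primer sites. A toy model (real PCR exponentially amplifies the matching strands rather than filtering a list, but the surviving population is the same):

```python
LA, NY = 'GCTACG', 'ATGCCG'   # city encodings from the example

def pcr_select(strands, start, end):
    """Toy model of PCR selection: keep (conceptually, amplify) only the
    strands that begin with the start primer site and end with the end one."""
    return [s for s in strands if s.startswith(start) and s.endswith(end)]

pool = ['GCTACGCTAGTAATGCCG',              # LA -> Chicago -> NY
        'CTAGTATCGTACCTACGG',              # Chicago -> Dallas -> Miami
        'GCTACGCTAGTATCGTACCTACGGATGCCG']  # LA -> ... -> NY, the full route
survivors = pcr_select(pool, LA, NY)       # first and third strands remain
```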
Part III: Select itineraries that contain the correct number of cities.
Strategy: Sort the DNA by length and select the DNA whose length corresponds to 5 cities.
Our test tube is now filled with DNA-encoded itineraries that start with LA and end with NY, where the number of cities in between varies. We now want to select those itineraries that are five cities long. To accomplish this, we can use a technique called Gel Electrophoresis, a common procedure used to resolve the size of DNA. The basic principle behind Gel Electrophoresis is to force DNA through a gel matrix using an electric field. DNA is a negatively charged molecule under most conditions, so if placed in an electric field it will be attracted to the positive potential. However, since the charge density of DNA is constant (charge per unit length), long pieces of DNA move as fast as short pieces when suspended in a fluid. This is why you use a gel matrix: the gel is made up of a polymer that forms a meshwork of linked strands, and the DNA is forced to thread its way through the tiny spaces between them, which slows down the DNA at a rate that depends on its length. What we typically end up with after running a gel is a series of DNA bands, each band corresponding to a certain length. We can then simply cut out the band of interest to isolate DNA of a specific length. Since we know that each city is encoded with 6 base pairs of DNA, knowing the length of the itinerary gives us the number of cities. In this case we would isolate the DNA that is 30 base pairs long (5 cities times 6 base pairs).
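In computational terms, cutting out one band is simply a filter on strand length; a toy sketch using the 6-bases-per-city encoding:

```python
def gel_select(strands, n_cities, bases_per_city=6):
    """Toy model of cutting one band out of a gel: keep only the strands
    whose length corresponds to exactly n_cities."""
    return [s for s in strands if len(s) == n_cities * bases_per_city]

pool = ['GCTACGCTAGTAATGCCG',              # 3 cities (18 bases), wrong band
        'GCTACGCTAGTATCGTACCTACGGATGCCG']  # 5 cities (30 bases), our band
five_city = gel_select(pool, 5)            # only the 30-base strand remains
```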
Part IV: Select itineraries that have a complete set of cities
Strategy: Successively filter the DNA molecules by city, one city at a time. Since the DNA we start with contains five cities, we will be left with strands that encode each city once.
DNA containing a specific sequence can be purified from a sample of mixed DNA by a technique called affinity purification. This is accomplished by attaching the complement of the sequence in question to a substrate such as a magnetic bead. The beads are then mixed with the DNA. DNA that contains the sequence you’re after then hybridizes with the complement sequence on the beads. These beads can then be retrieved and the DNA isolated.
So, we now affinity purify five times, using a different city complement for each run. For example, for the first run we use L.A.’-beads (where the ’ indicates the complement strand) to fish out DNA sequences that contain the encoding for L.A. (which should be all the DNA, because of the PCR selection in Part II), the next run we use Dallas’-beads, then Chicago’-beads, Miami’-beads, and finally NY’-beads. The order isn’t important. If an itinerary is missing a city, it will not be “fished out” during one of the runs and will be removed from the candidate pool. What we are left with are itineraries that start in LA, visit each city once, and end in NY. This is exactly what we are looking for. If the answer exists, we would retrieve it at this step.
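The successive bead runs amount to repeated substring filters; a toy sketch using the example encodings (the first pool entry is a hypothetical 30-base strand that visits Dallas twice and skips Miami, so it survives the length cut but fails the Miami run):

```python
CITY = {'LA': 'GCTACG', 'Chicago': 'CTAGTA', 'Dallas': 'TCGTAC',
        'Miami': 'CTACGG', 'NY': 'ATGCCG'}  # encodings from the example

def affinity_purify(strands, city_seq):
    """Toy model of one bead run: keep only strands containing the city word."""
    return [s for s in strands if city_seq in s]

pool = ['GCTACGCTAGTATCGTACTCGTACATGCCG',   # LA-Chicago-Dallas-Dallas-NY
        'GCTACGCTAGTATCGTACCTACGGATGCCG']   # the correct itinerary
for city in CITY.values():                  # the order doesn't matter
    pool = affinity_purify(pool, city)
# only the itinerary containing every city survives all five runs
```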
Reading out the answer
One possible way to read out the result would be simply to sequence the DNA strands. However, since we already know the city encodings, we can use an alternate method called graduated PCR. Here we do a series of PCR amplifications using the primer corresponding to L.A. together with a different city primer in each run. By measuring the length of the DNA in each PCR product we can piece together the final sequence of cities in our itinerary. For example, we know that the DNA itinerary starts with LA and is 30 base pairs long, so if the PCR product for the LA and Miami primers is 24 base pairs long, Miami must be the fourth city in the itinerary (24 divided by 6).
Finally, if we were careful in our DNA manipulations, the only DNA left in our test tube should be the itinerary encoding LA, Chicago, Dallas, Miami, and NY. So if the succession of primer pairs used is LA & Chicago, LA & Dallas, LA & Miami, and LA & NY, we would get PCR products with lengths 12, 18, 24, and 30 base pairs respectively.
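The readout logic is just arithmetic on product lengths; a sketch (the lengths below are the hypothetical measurements for the route found earlier, L.A. -> Chicago -> Dallas -> Miami -> N.Y.):

```python
def position_from_length(product_length, bases_per_city=6):
    """Graduated PCR readout: a product of length L places the second
    primer's city at position L / bases_per_city in the itinerary."""
    return product_length // bases_per_city

# Hypothetical measured product lengths for primer pairs (LA, X).
lengths = {'Chicago': 12, 'Dallas': 18, 'Miami': 24, 'NY': 30}
order = sorted(lengths, key=lambda c: lengths[c])
itinerary = ['LA'] + order   # reconstructs the full route
```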
Applications of DNA Computing
The unique properties of DNA make it a fundamental building block in the fields of supramolecular chemistry, nanotechnology, nano-circuits, molecular switches, molecular devices, and molecular computing. In a recently introduced autonomous molecular automaton, DNA molecules serve as input, output, and software, and the hardware consists of DNA restriction and ligation enzymes using ATP as fuel. In addition to information, DNA stores energy, available on hybridization of complementary strands or hydrolysis of its phosphodiester backbone. A single DNA molecule can provide both the input data and all of the necessary fuel for a molecular automaton. Each computational step of the automaton consists of a reversible software-molecule/input-molecule hybridization followed by an irreversible software-directed cleavage of the input molecule, which drives the computation forward by increasing entropy and releasing heat. The cleavage uses a previously unknown capability of the restriction enzyme FokI, which serves as the hardware, to operate on a noncovalent software/input hybrid. In the earlier automaton, software/input ligation consumed one software molecule and two ATP molecules per step. As ligation is not performed in this automaton, a fixed amount of software and hardware molecules can, in principle, process any input molecule of any length without an external energy supply. Experiments have demonstrated 3 × 10^12 automata per µl performing 6.6 × 10^10 transitions per second per µl with a transition fidelity of 99.9%, dissipating about 5 × 10^-9 W/µl as heat at ambient temperature.
Advantages of DNA Computing
The possible advantages of DNA-based computer architecture became immediately apparent:
Computing with DNA offers the advantage of massive degrees of miniaturization and parallelism over conventional silicon-based machines. For example, a square centimeter of silicon can currently support around a million transistors, whereas current manipulation techniques can handle on the order of 10^20 strands of DNA.
- Size: the information density could go up to 1 bit/nm^3.
- High parallelism: every molecule could act as a small nano-scale processor, and the number of such processors per unit volume would be potentially enormous. In an in vitro assay we could easily handle about 10^18 processors working in parallel.
- Speed: although the elementary operations (electrophoretic separation, ligation, PCR amplification) would be slow compared to those of electronic computers, their parallelism would strongly prevail, so that in certain models the number of operations could be of the order of 10^18 per second, at least 100,000 times faster than the fastest supercomputers existing today.
- Energy efficiency: 10^19 operations per Joule. This is about a billion times more energy efficient than today’s electronic devices.
Pitfalls of DNA Computing
The idea of using DNA to solve computational problems is certainly intriguing and elegant, and DNA does provide a massive parallelism far beyond what is available on existing silicon-based computers. However, there are many technological hurdles to overcome. We describe below one of the huge fundamental problems to be solved to attain the goal of designing a universally programmable molecular computer.
The fundamental problem is that the function 2^n is exponential, whether it counts time or molecules. It has been estimated that Adleman’s Hamiltonian path problem, if scaled up to 50 or 100 cities, would require tons of DNA. Lipton’s SAT method requires a minimum of a few grams of DNA for 70 variables; if this is increased to 100 variables, the minimum DNA requirement grows to millions of kilograms.
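The blow-up is easy to quantify: each extra variable doubles the required molecule pool, so going from 70 to 100 variables multiplies the mass by 2^30. A sketch (the “few grams” starting mass is an assumed value):

```python
# Exponential blow-up of brute-force DNA volume, using the figures quoted
# above: a few grams of DNA for 70 SAT variables.
grams_for_70 = 5.0                      # "a few grams" (assumed value)
scale = 2 ** (100 - 70)                 # each extra variable doubles the pool
kilograms_for_100 = grams_for_70 * scale / 1000
# ~5.4 million kilograms: brute force does not survive the exponent
```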
Thus, the raw idea of brute-force enumeration is not going to work beyond modest problem sizes, and it is imperative to bring forth revolutionary new ideas to make DNA-based computing work realistically. Only time and investment will tell where the initial ideas for DNA computing will lead. Many enhancements have been published, but all of them suffer from this fundamental problem. Hopefully, future molecular computation methods will bring forth revolutionary new ideas to overcome this fundamental and significant hurdle.
Future of DNA Computing
The significance of this research is two-fold: it is the first demonstrated use of DNA molecules for representing information, and the first attempt to deal with an NP-complete problem. Still, much more work needs to be done to develop error-resistant and scalable laboratory computations. Designing experiments that are likely to succeed in the laboratory, and algorithms that proceed through polynomial-sized volumes of DNA, is the need of the hour. It is unlikely that DNA computers will be used for tasks like word processing, but they may ultimately find a niche in solving large-scale intractable combinatorial problems. The goal of automating, miniaturizing, and integrating them into a general-purpose desktop DNA computer may take much longer.
So, considering all the attention that DNA has garnered, it isn’t too hard to imagine that one day we might have the tools and talent to produce a small integrated desktop machine that uses DNA, or a DNA-like biopolymer, as a computing substrate, along with a set of designer enzymes. Perhaps it won’t be used to play Quake IV or surf the web (things traditional computers are good at), but it certainly might be used in the study of logic, encryption, genetic programming and algorithms, automata, language systems, and lots of other interesting things that haven’t even been invented yet.