Tag Archives: computational biology

Shakespearian Sonnets Now Available on DNA

Shakespeare, meet thy DNA. The most famous literary figure in the English language had a recent rendezvous with that most famous and studied of molecules. Together, chemists, cell biologists, geneticists and computer scientists are doing some amazing things: storing information in the sequence of nucleotide bases on the DNA molecule.

[div class=attrib]From ars technica:[end-div]

It’s easy to get excited about the idea of encoding information in single molecules, which seems to be the ultimate end of the miniaturization that has been driving the electronics industry. But it’s also easy to forget that we’ve been beaten there—by a few billion years. The chemical information present in biomolecules was critical to the origin of life and probably dates back to whatever interesting chemical reactions preceded it.

It’s only within the past few decades, however, that humans have learned to speak DNA. Even then, it took a while to develop the technology needed to synthesize and determine the sequence of large populations of molecules. But we’re there now, and people have started experimenting with putting binary data in biological form. Now, a new study has confirmed the flexibility of the approach by encoding everything from an MP3 to the decoding algorithm into fragments of DNA. The cost analysis done by the authors suggests that the technology may soon be suitable for decade-scale storage, provided current trends continue.

Trinary encoding

Computer data is in binary, while each location in a DNA molecule can hold any one of four bases (A, T, C, and G). Rather than using all that extra information capacity, however, the authors used it to avoid a technical problem. Stretches of a single type of base (say, TTTTT) are often not sequenced properly by current techniques—in fact, this was the biggest source of errors in the previous DNA data storage effort. So for this new encoding, they used one of the bases to break up long runs of any of the other three.

(To explain how this works practically, let’s say A, T, and C encode information, while G represents “more of the same.” If you had a run of four A’s, you could represent it as AAGA. But since the G doesn’t encode anything in particular, TTGT can be used to represent four T’s. The only thing that matters is that there are no more than two identical bases in a row.)
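To make that concrete, here is a minimal Python sketch of the run-breaking idea as described above. It is an illustration only, not the authors’ implementation: it assumes A, T and C carry the information and that G simply means “repeat the previous base.”

```python
# Minimal sketch of the run-breaking trick described above (illustrative only).
# Assumes the input sequence uses only the information-carrying bases A, T, C;
# G is reserved as the "more of the same" marker.

def break_runs(seq: str) -> str:
    """Replace any base that would create a run of three identical bases with 'G'."""
    out = []
    for base in seq:
        if len(out) >= 2 and out[-1] == out[-2] == base:
            out.append("G")          # "more of the same" marker
        else:
            out.append(base)
    return "".join(out)

def restore_runs(seq: str) -> str:
    """Undo break_runs: each 'G' stands for the base immediately before it.

    Assumes the sequence never starts with G, which the encoding guarantees.
    """
    out = []
    for base in seq:
        out.append(out[-1] if base == "G" else base)
    return "".join(out)

assert break_runs("AAAA") == "AAGA"
assert restore_runs("TTGT") == "TTTT"
```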

That leaves three bases to encode information, so the authors converted their information into trinary. In all, they encoded a large number of works: all 154 Shakespeare sonnets, a PDF of a scientific paper, a photograph of the lab some of them work in, and an MP3 of part of Martin Luther King’s “I have a dream” speech. For good measure, they also threw in the algorithm they use for converting binary data into trinary.
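For readers who want to see what the binary-to-trinary step might look like, the sketch below re-expresses a byte string as base-3 digits and maps them onto the three information-carrying bases. The digit-to-base table (0 to A, 1 to T, 2 to C) is an assumption made here for illustration, not the authors’ published mapping; G stays reserved for the run-breaking pass shown above.

```python
# Hypothetical illustration of the binary -> trinary -> DNA step.
# The digit-to-base mapping below is an assumption for this sketch only.

DIGIT_TO_BASE = {0: "A", 1: "T", 2: "C"}

def bytes_to_trits(data: bytes) -> list[int]:
    """Re-express a byte string as a list of base-3 digits (trits)."""
    number = int.from_bytes(data, "big")
    trits = []
    while number:
        number, trit = divmod(number, 3)
        trits.append(trit)
    return list(reversed(trits)) or [0]

def trits_to_dna(trits: list[int]) -> str:
    """Map each trit onto one of the three information-carrying bases."""
    return "".join(DIGIT_TO_BASE[t] for t in trits)

# e.g. the opening of Sonnet 18, ready for the run-breaking pass shown earlier
dna = trits_to_dna(bytes_to_trits(b"Shall I compare thee to a summer's day?"))
```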

Once in trinary, the results were encoded into the error-avoiding DNA code described above. The resulting sequence was then broken into chunks that were easy to synthesize. Each chunk came with parity information (for error correction), a short file ID, and some data that indicates the offset within the file (so, for example, that the sequence holds digits 500-600). To provide an added level of data security, the 100-base-long DNA inserts were staggered by 25 bases so that consecutive fragments had a 75-base overlap. Thus, many sections of the file were carried by four different DNA molecules.
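The overlap arithmetic is easier to see in code. The sketch below is a rough, hypothetical layout of those chunks: the 100-base length and 25-base stagger come from the description above, while the field names and the toy parity value are stand-ins, not the authors’ actual indexing or error-correction scheme.

```python
# Hypothetical sketch of the chunk layout described above. 100-base payloads
# start every 25 bases, so away from the file's ends each position is covered
# by four different molecules. Field names and the parity value are stand-ins.

def make_segments(dna: str, length: int = 100, step: int = 25) -> list[dict]:
    """Split an encoded DNA string into overlapping, self-describing chunks."""
    segments = []
    for start in range(0, max(len(dna) - length, 0) + 1, step):
        payload = dna[start:start + length]
        segments.append({
            "file_id": "F1",                          # short identifier for the source file
            "offset": start,                          # where this payload sits in the file
            "payload": payload,                       # 100 information-carrying bases
            "parity": sum(map(ord, payload)) % 97,    # toy checksum, not the real scheme
        })
    return segments

chunks = make_segments("ACT" * 200)   # 600-base example string
assert len(chunks[0]["payload"]) == 100
```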

And it all worked brilliantly—mostly. For most of the files, the authors’ sequencing and analysis protocol could reconstruct an error-free version of the file without any intervention. One, however, ended up with two 25-base-long gaps, presumably resulting from a particular sequence that is very difficult to synthesize. Based on parity and other data, they were able to reconstruct the contents of the gaps, but understanding why things went wrong in the first place would be critical to understanding how well suited this method is to long-term archiving of data.

[div class=attrib]Read the entire article following the jump.[end-div]

[div class=attrib]Image: Title page of Shakespeare’s Sonnets (1609). Courtesy of Wikipedia / Public Domain.[end-div]

Living Organism as Software

For the first time scientists have built a computer software model of an entire organism from its molecular building blocks. This allows the model to predict previously unobserved cellular biological processes and behaviors. While the organism in question is a simple bacterium, this represents another huge advance in computational biology.

[div class=attrib]From the New York Times:[end-div]

Scientists at Stanford University and the J. Craig Venter Institute have developed the first software simulation of an entire organism, a humble single-cell bacterium that lives in the human genital and respiratory tracts.

The scientists and other experts said the work was a giant step toward developing computerized laboratories that could carry out complete experiments without the need for traditional instruments.

For medical researchers and drug designers, cellular models will be able to supplant experiments during the early stages of screening for new compounds. And for molecular biologists, models that are of sufficient accuracy will yield new understanding of basic biological principles.

The simulation of the complete life cycle of the pathogen, Mycoplasma genitalium, was presented on Friday in the journal Cell. The scientists called it a “first draft” but added that the effort was the first time an entire organism had been modeled in such detail — in this case, all of its 525 genes.

“Where I think our work is different is that we explicitly include all of the genes and every known gene function,” the team’s leader, Markus W. Covert, an assistant professor of bioengineering at Stanford, wrote in an e-mail. “There’s no one else out there who has been able to include more than a handful of functions or more than, say, one-third of the genes.”

The simulation, which runs on a cluster of 128 computers, models the complete life span of the cell at the molecular level, charting the interactions of 28 categories of molecules — including DNA, RNA, proteins and small molecules known as metabolites that are generated by cell processes.

“The model presented by the authors is the first truly integrated effort to simulate the workings of a free-living microbe, and it should be commended for its audacity alone,” wrote the Columbia scientists Peter L. Freddolino and Saeed Tavazoie in a commentary that accompanied the article. “This is a tremendous task, involving the interpretation and integration of a massive amount of data.”

They called the simulation an important advance in the new field of computational biology, which has recently yielded such achievements as the creation of a synthetic life form — an entire bacterial genome created by a team led by the genome pioneer J. Craig Venter. The scientists used it to take over an existing cell.

For their computer simulation, the researchers had the advantage of extensive scientific literature on the bacterium. They were able to use data taken from more than 900 scientific papers to validate the accuracy of their software model.

Still, they said that the model of the simplest biological system was pushing the limits of their computers.

“Right now, running a simulation for a single cell to divide only one time takes around 10 hours and generates half a gigabyte of data,” Dr. Covert wrote. “I find this fact completely fascinating, because I don’t know that anyone has ever asked how much data a living thing truly holds. We often think of the DNA as the storage medium, but clearly there is more to it than that.”

In designing their model, the scientists chose an approach that parallels the design of modern software systems, known as object-oriented programming. Software designers organize their programs in modules, which communicate with one another by passing data and instructions back and forth.

Similarly, the simulated bacterium is a series of modules that mimic the different functions of the cell.

“The major modeling insight we had a few years ago was to break up the functionality of the cell into subgroups which we could model individually, each with its own mathematics, and then to integrate these sub-models together into a whole,” Dr. Covert said. “It turned out to be a very exciting idea.”
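As a rough illustration of that modular idea, the toy sketch below (my own example, not the Covert lab’s code) gives each cellular process its own sub-model with its own update rule, and a small driver integrates them over the cell cycle, time step by time step.

```python
# Toy illustration of the modular whole-cell design described above.
# Each process is a separate sub-model with its own update rule; a driver
# advances the shared cell state one time step at a time.

from abc import ABC, abstractmethod

class Process(ABC):
    """One functional sub-model of the cell (e.g. metabolism or transcription)."""
    @abstractmethod
    def step(self, state: dict, dt: float) -> None:
        ...

class Metabolism(Process):
    def step(self, state, dt):
        # toy rule: metabolites accumulate at a constant rate
        state["metabolites"] += 10.0 * dt

class Transcription(Process):
    def step(self, state, dt):
        # toy rule: RNA production consumes metabolites
        made = min(state["metabolites"], 2.0 * dt)
        state["metabolites"] -= made
        state["rna"] += made

def simulate(processes, state, dt=1.0, steps=100):
    """Integrate all sub-models together, passing the shared state between them."""
    for _ in range(steps):
        for process in processes:
            process.step(state, dt)
    return state

final = simulate([Metabolism(), Transcription()],
                 {"metabolites": 0.0, "rna": 0.0})
```

The toy driver is only meant to show the integration step Dr. Covert describes: each sub-model advances the shared cell state with its own rule, and whole-cell behavior emerges from running them all together.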

[div class=attrib]Read the entire article after the jump.[end-div]

[div class=attrib]Image: A Whole-Cell Computational Model Predicts Phenotype from Genotype. Courtesy of Cell / Elsevier Inc.[end-div]