A Sequence Like Poison

By evolgen on January 3, 2007.

New Scientist reports on research to identify DNA sequences that cannot be found in any nucleotide database. These sequences are short -- so as to decrease the probability that they are missing due to chance alone -- and the researchers, from Boise State University, have identified over 60,000 15-nucleotide stretches of DNA that are not present in any known sequenced region from any species. They also found 746 sequences of 5 amino acids that are not present in any known polypeptide. The article does not indicate whether the scientists utilized any hook and ladder or Statue of Liberty approaches in their analysis.

The researchers postulate that some of the sequences cannot be tolerated by living organisms. Many of them are probably missing simply due to chance -- and who knows how many will be found as our DNA sequence databases continue to expand exponentially. But these guys are specifically interested in those sequences which may act like genomic poison. They plan to test some of the amino acid sequences in bacteria to see if they can be tolerated.
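For readers who want a feel for what such a search involves, here is a minimal sketch in Python. It is not the Boise State group's method, which the article does not describe; the function name `absent_kmers` and the toy "database" are made up for illustration, and the sketch ignores reverse complements and real database scale.

```python
from itertools import product


def absent_kmers(sequences, k, alphabet="ACGT"):
    """Return the sorted k-mers over `alphabet` that occur in none of `sequences`."""
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            # Skip windows containing ambiguous symbols such as N.
            if all(ch in alphabet for ch in window):
                seen.add(window)
    # Enumerate every possible k-mer and keep the ones never observed.
    every = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(kmer for kmer in every if kmer not in seen)


# Hypothetical two-sequence "database"; real searches scan entire sequence archives.
toy_database = ["ACGTACGTTAGC", "TTGACCA"]
missing = absent_kmers(toy_database, k=2)
print(missing)
```

Enumerating every k-mer is fine for toy values of k, but at k = 15 there are over a billion candidates, so a real search would probably need a more compact representation than a Python set of strings.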


I checked a few short sequences a few years ago, when ego-BLASTing was in vogue. The most interesting sequence that doesn't exist, IMO, is SATAN, which does not turn up on a BLAST search of the protein databases (even though it contains four of the commonest amino acids). It's the most powerful evidence I know of for ID!*

*(Of course, it's piss-poor evidence for ID, but I stand by my statement.)

Would there necessarily be any sort of correlation to viral DNA?

Also, can you elaborate on the Hook and Ladder or Statue of Liberty approaches?

From the New Scientist article it isn't clear if they've considered simple steric effects for the missing peptide sequences -- some series of amino acids may simply not fold well due to sidechain crowding and have never made it into a functional protein for that reason.

Interesting suggestion, Jonathan - I'm not sure if you could draw any conclusions regarding energy requirements for folding or sidechain crowding from the five amino acid sequences that were provided - as the tertiary folding of the peptide would be influenced greatly by the flanking amino acids. Regardless, I don't have any alternate suggestions.

Dave,

Surely you could at least bootstrap a la Tanford to get a rough estimate for free energy of folding, no?

My guess (sight unseen) is that many of these peptides are laden with polar residues, and are relatively expensive energetically to maintain.

The article does not indicate whether the scientists utilized any hook and ladder or Statue of Liberty approaches in their analysis.

Can you say a little more about these? Google wasn't much help

.. or am I displaying my newbish-ness by having missed a joke?

'Hook and Ladder' and 'Statue of Liberty' refer to trick plays rarely used in American football, but which were used to great effect in Boise State's recent victory over the much-favored team from Oklahoma University. These are highly humorous, but purely North American-centric references.

About "hook and ladder" and "statue of liberty": what Tom wrote. They're references to American football -- specifically a game involving Boise State.

I'm sure there are lots of reasons why certain peptide sequences are avoided -- Jonathan's being one of them, and Brian presenting another. Regarding steric effects, it would be interesting to have more info on the secondary/tertiary structures of more proteins. I'd bet certain peptide sequences would have different steric effects depending on the region of the protein in which they are involved.

What about simple combinatorics? With 20 amino acids, there would seem to be over 3 million possible sequences of 5. With DNA there would be more, especially if you allow non-coding sequences. Of course, real genomes can be pretty big, but even so, real DNA sequences aren't even close to random. In fact, there are plenty of genes that are conserved (or nearly so) even across multiple species. Given all that, the numbers given look pretty low to me....
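The arithmetic behind this comment is easy to check. A quick sketch, using only the counts quoted in the post (the "over 60,000" figure is treated as a lower bound, and the variable names are just for illustration):

```python
AMINO_ACIDS = 20
NUCLEOTIDES = 4

# Number of possible sequences of each length.
pentapeptides = AMINO_ACIDS ** 5   # all 5-residue peptides
dna_15mers = NUCLEOTIDES ** 15     # all 15-nucleotide stretches

# Counts of absent sequences as reported in the post.
missing_peptides = 746
missing_15mers = 60_000  # "over 60,000", so this is a lower bound

print(f"possible pentapeptides: {pentapeptides:,}")
print(f"possible DNA 15-mers:   {dna_15mers:,}")
print(f"missing peptide fraction: {missing_peptides / pentapeptides:.4%}")
print(f"missing 15-mer fraction:  {missing_15mers / dna_15mers:.4%}")
```

So roughly 0.02% of possible pentapeptides and at least 0.006% of possible 15-mers are reported missing, which is the scale the "looks pretty low" remark is reacting to.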

Assuming neutrality, DNA sequence conservation does not persist very long. Aligning noncoding regions across mammals or across Drosophila (let alone between these two taxa) is quite difficult. Amino acid conservation lasts fairly long, but finding the same five consecutive amino acids between two diverse taxa within animals (not even out to non-animal eukaryotes or non-eukaryotes) will be rare.

The fact that these sequences are non-independent (they share a common ancestor) is important for determining the probability of finding a particular sequence at random amongst all available sequences from all species. But when sampling across eukaryotes, archaea, and bacteria, that lack of independence shouldn't be as important.

Allow me to ask the layperson's dumb question: what's to be gained from identifying sequences that do not occur in DNA or finding some that may be genomic poison?

Besides the fact that it's an interesting question, it would probably be useful to find sequences to avoid when transforming (genetic engineering) organisms. But the New Scientist article suggests a couple other possibilities:

1. Tags for DNA samples in forensic analysis. If you use a unique sequence to tag a sample from a suspect it reduces the chance it will get mixed up with a sample from the crime scene.

2. Using the "poisonous" amino acid sequences as "self-destruct" buttons for genetically engineered organisms. If you ever wanted to destroy all the organisms you'd turn on a gene with that sequence (apply an environmental stimulus that activates expression of the gene).

My one complaint ... they didn't even test their hypothesis ... and such an easy one to test at that. So now tons of money will be poured into this (from the DOD) before their ideas are validated ONCE? WTF?

apalazzo: The grant they already have, by the article, is on the development of the use of "primes" to tag DNA samples without contaminating them meaningfully.

This seems reasonable, and seems at least moderately supported already; after all, if those sequences were found in any known human, well, then they wouldn't be on the list. ;)

It would be interesting to see if the 746 sequences of 5 amino acids that are not present in any known polypeptide are also missing from synthetic phage display libraries containing random heptamers.
