Last week, we embarked on an adventure with BLAST.
BLAST, short for Basic Local Alignment Search Tool, is a collection of programs, written by scientists at the NCBI (1), that are used to compare sequences of proteins or nucleic acids. BLAST is used in multiple ways, but last week my challenge to you, dear readers, was to pick a sequence, any sequence, from a set of 16 unknown sequences and use BLAST to identify that sequence.
This week, we'll examine the results.
I did the experiment, too, with a completely different unknown sequence that's pasted below. This sequence is not part of the data set that I put at the Geospiza Education site.
>unknown_seq
ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCC
TTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATC
AGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCC
TTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCT
GCTATGTGGCGCGGTATTATCCCGTGTTGACGCCGGGCAAGAGCAACTCGGTCG
CCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAA
GCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCAT
GAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATC
Looking at the letters, of course, doesn't really help me at all. All I see are A's, G's, C's, and T's.
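If you'd like the computer to do the staring for you, here is a minimal Python sketch that reads the FASTA record above and tallies the letters. The function name parse_fasta is my own invention for this example, not part of BLAST or any library, and the tally only summarizes the sequence. It won't identify it.

```python
from collections import Counter

# The unknown sequence from this post, in FASTA format.
FASTA = """>unknown_seq
ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCC
TTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATC
AGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCC
TTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCT
GCTATGTGGCGCGGTATTATCCCGTGTTGACGCCGGGCAAGAGCAACTCGGTCG
CCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAA
GCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCAT
GAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATC
"""


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string.

    Hypothetical helper written for this post.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Drop the leading '>' and join the sequence lines into one string.
    return lines[0][1:], "".join(lines[1:]).upper()


header, seq = parse_fasta(FASTA)
counts = Counter(seq)
gc_fraction = (counts["G"] + counts["C"]) / len(seq)

print("name:", header)
print("length:", len(seq))
print("base counts:", dict(counts))
print("GC fraction:", round(gc_fraction, 3))
```

Knowing the length and base composition is a handy sanity check that the whole sequence was copied, but that's about all it tells me.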
To solve the problem and identify the sequence, I have to compare my unidentified sequence to a collection of sequences that have already been identified by other people and see if my sequence matches any sequences that are already known.
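To give a feel for what "comparing to a collection" means, here is a toy Python sketch that ranks a few made-up database entries by how many short words (runs of 11 letters) they share with the start of my unknown sequence. This is only an illustration of the general idea of looking for shared stretches. It is not BLAST's actual algorithm, and the database names and sequences are invented for the example.

```python
def words(seq, k):
    """Return the set of all length-k substrings (words) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(query, subject, k=11):
    """Count the distinct length-k words found in both sequences."""
    return len(words(query, k) & words(subject, k))


# The first 36 bases of the unknown sequence from this post.
query = "ATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCC"

# A made-up "database"; the names and sequences are hypothetical.
toy_database = {
    "toy_match": "GGGG" + query[5:30] + "TTTT",       # contains part of the query
    "toy_unrelated": "ACACACACACACACACACACACACACACAC",  # shares nothing with it
}

# Rank the entries from most shared words to fewest.
ranked = sorted(
    toy_database,
    key=lambda name: shared_words(query, toy_database[name]),
    reverse=True,
)

for name in ranked:
    print(name, shared_words(query, toy_database[name]))
```

A real search has to do this against millions of sequences and then score the matching regions, which is why we let the NCBI's servers do the work.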
First, I copy my unknown sequence, then I follow the steps that are outlined in the BLAST for Beginners tutorial at the Geospiza Education web site. In the tutorial, I click the bright green arrows to move from page to page and see what to do.
My favorite way to use the tutorials is to open two web browser windows and resize the windows so they fit side by side on a computer screen. Then, I go through the tutorial in one window and do the steps myself in the other window.
(FYI: I started making these tutorials because I thought I would go crazy if I had to teach classes by spending fifty minutes saying "Click here" then "Click here" then "Click here".)
Eventually, I get to a page with results.
BLAST has looked into its crystal ball and we get:
Hmm, I see......
A graph with lots of red lines.
What does this mean?
Click the graph to see a larger version with some explanations.
To put it simply, the graph shows me that at least one hundred sequences in GenBank match my entire sequence.
If I look farther down the page, I come to more curious results.
Click the image to see a larger version.
To summarize what I see, I have a list of fifty results (only some of them are shown in this image). All the results have a score of 833 and an E value of 0.0, but the descriptions look like different things. C'mon, what do Dengue virus, SIV, and E. coli have in common?
(at least if we don't read carefully, wink, wink, nudge, nudge)
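A quick aside on those numbers. If I assume that 833 is a bit score, the standard relationship between bit score and E value is E = m × n × 2^(-S), where m is the query length, n is the total database length, and S is the bit score. The sketch below uses made-up values for m and n (I haven't looked up the real database size), and the function name is my own. It's only meant to show why a score that high gives an E value that gets reported as zero.

```python
def expected_chance_hits(bit_score, query_len, db_len):
    """E = m * n * 2^(-S'): chance hits expected at this bit score or better."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only for illustration.
QUERY_LEN = 500        # assumed query length in bases
DB_LEN = 10 ** 11      # assumed total database length in bases

for score in (20, 50, 833):
    e_value = expected_chance_hits(score, QUERY_LEN, DB_LEN)
    print(f"bit score {score}: E = {e_value:.3g}")
```

With these assumed sizes, a low score could easily turn up by chance, while a score of 833 gives a number so tiny that rounding it to 0.0 is perfectly reasonable. In other words, none of these fifty matches is an accident.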
Strange....
Why would my sequence match (at least) 50 different sequences in the nucleotide database?
Can you solve the mystery?
Copy the sequence at the beginning of this post and give it a try. Feel free to submit comments with your answer.
Or wait until next week, for more of the story.
References:
1. Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Hmm, I'm reaching here ... could this sequence be the origin of replication for plasmids as well as some viruses? Curious.
It is a beta-lactamase, an enzyme related to antibiotic resistance. BLAST it against a protein DB, and/or run it against PFAM
Coleen,
Good guess. It is a gene that's found in many plasmids.
Diego,
You are right but you're solving the problem the hard way. I'll show you an easier way to find the answer next week.