Nick Loman listed the fifty most sequenced bacterial genomes according to NCBI. A reader at Nick's blog came up with an improved list--one that reflects the genomes for which we actually have data (depending on who is doing the sequencing, a project will be registered with NCBI, often months before any sequencing is done). Here's the 'improved' top twenty:
173 Escherichia coli
82 Salmonella enterica
78 Staphylococcus aureus
69 Propionibacterium acnes
56 Streptococcus pneumoniae
56 Enterococcus faecalis
45 Bacillus cereus
42 Mycobacterium tuberculosis
36 Vibrio cholerae
29 Pseudomonas syringae
28 Listeria monocytogenes
27 Neisseria meningitidis
27 Helicobacter pylori
27 Enterococcus faecium
27 Acinetobacter baumannii
25 Yersinia pestis
23 Methanobrevibacter smithii
23 Clostridium difficile
23 Burkholderia pseudomallei
22 Campylobacter jejuni
It's kinda cool to realize that half of the E. coli genomes are, in part, my fault (obviously lots of people are involved with the project). I've discussed that project before, but those genomes are actually commensals (bacteria that live on us and in us and typically don't cause disease): while NIAID cares about pathogens (bacteria associated with disease), they realize that we also need non-pathogens to make sense of the pathogens. In fact, when you look at the list, most of the organisms are commensals, although my sense is that most of the sequenced strains were isolated from sick patients and are thought to be associated with disease.
Also, the list contains only de novo genomes: we start from DNA and wind up with a new sequence. These are not resequenced genomes (SNP calling) where we map a strain's mutations back to a previously sequenced genome (there was some confusion about that over at Nick's place).
In the coming attractions department, I feel pretty confident that we (here I mean the larger scientific community) will be increasing these numbers massively: E. coli will probably triple, S. aureus and Enterococcus will explode many-fold, B. cereus will triple. And this could very well be a large underestimate on my part.
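To put rough numbers on the tripling, here is a back-of-the-envelope sketch in Python. It takes "triple" literally as a factor of three, which is an assumption, and the helper name `project` is made up for illustration. The starting counts come straight from the list above.

```python
# Current counts, copied from the top-twenty list above.
current = {"Escherichia coli": 173, "Bacillus cereus": 45}

# Assumption: "triple" is read literally as a multiplier of 3.
TRIPLE = 3


def project(counts, factor):
    """Scale each genome count by the same factor (hypothetical helper)."""
    return {species: n * factor for species, n in counts.items()}


projected = project(current, TRIPLE)
for species, n in projected.items():
    print(f"{species}: {current[species]} -> {n}")
```

Under that assumption E. coli would land at around 519 genomes and B. cereus at around 135. For S. aureus and Enterococcus, "many-fold" doesn't pin down a factor, so they are left out.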
Of course, this then leads to the question of how one goes about analyzing hundreds of genomes. If people want to read a post about this, let me know (I have to give a talk this week though; you can tweet me at @mikethemadbiol or email, which is on the sidebar).
Yes, go Bacillus cereus! I find it kind of cool that we will be responsible for the tripling ;)
Please do post about the analysis part.
Another month or two and my lab will add five more S. aureus genomes to that list. Must do battle with the Salmonella enterica folk for 2nd place!
These numbers are not really surprising, since sequencing a whole genome is easier than cloning a single gene. In reality, E. coli has been sequenced more than 1000-fold if you include all the training runs for customers buying NGS systems. Of course these (drafts) are not published. However, it could be interesting to decipher the evolution of E. coli in the fridge?!