Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg

Nature recently had a nice news article on bio-wikis and biological databases connected to Wikipedia, in which Alex Bateman says they’re working on a protein-family wiki that will be hosted on Wikipedia, similar to the Rfam wiki, which he talked about at FEBS this summer. I am of course very excited about this, and hope that the new Pfam (?) wiki will come sooner rather than later. As pointed out earlier, the Nature article also underlines the problem of scientific wikis: currently there is no career incentive for researchers to spend their time editing wiki articles. That is a shame, and perhaps a system copying that of Rfam and RNA Biology could help in this direction. The only question is which journal(s) would be interested in such a commitment to open science…

Perhaps because of my roots in systems biology (or perhaps this is what drew me there in the first place), I have always had an interest in creating visually appealing images of data, often in the form of networks. I find that in bioinformatics, one of the hardest problems is often to make information understandable. For example, a BLAST output might say very little about how the genes or proteins are connected to each other, at least to the untrained eye.

Therefore, over the last few weeks I have fiddled around with various ways of viewing interesting portions of BLAST reports. By running all-against-all BLAST searches and outputting the data in table format (blastall option -m 8), I have been able to extract the hits I am interested in and export them into a Cytoscape-compatible format, with some accompanying metadata (scores, e-values, alignment length, etc.). The results are often pretty hard to parse by eye, which renders them a bit meaningless, but they have become more and more interesting as I have put more effort into the extraction script. Just as an example, I here provide a simple map of the best all-against-all matches in the Saccharomyces cerevisiae genome, as a Cytoscape network (click for full size):
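
To make the extraction step a bit more concrete, here is a minimal Python sketch of the general idea. It is not the actual extraction tool, just an illustration. It assumes the standard 12-column layout of the -m 8 table (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score). The Hit record and the function names are made up for this example, and "best match" is assumed to mean the non-self hit with the highest bit score for each query.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One line of BLAST tabular (-m 8) output, reduced to the fields used here."""
    query: str
    subject: str
    identity: float
    length: int
    evalue: float
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated BLAST lines, skipping blank lines and comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError("expected 12 tab-separated columns: %r" % line)
        hits.append(Hit(
            query=fields[0],
            subject=fields[1],
            identity=float(fields[2]),
            length=int(fields[3]),
            evalue=float(fields[10]),
            bitscore=float(fields[11]),
        ))
    return hits


def best_non_self_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """For each query, keep the highest-scoring hit to a different sequence."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.query == hit.subject:
            continue  # in an all-against-all search every sequence hits itself
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

With a report saved to a file, something like best_non_self_hits(parse_tabular(open("report.tab"))) would give one candidate edge per query sequence (the file name is just a placeholder).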

The largest circle consists of transposable elements (jumping DNA that inserts itself at multiple locations in the genome; it is no surprise that there are a lot of them, and that they are pretty conserved). The circle to the left of the transposon circle consists of genes located inside the telomeric regions. Why they show such high similarity I do not know, but it seems plausible that their telomeric location plays a role here. The third circle contains mostly members of the seripauperin multigene family, which are also located close to the telomeres. At the bottom you find the gene pairs that match each other. You could go on with all the smaller structures as well, but I am no yeast expert, so I will stop here, letting this serve as an example of what a BLAST report really looks like.

For this image, I have used a blastn report of all yeast ORFs (taken from yeastgenome.org) as input to my extraction tool, selected Cytoscape-compatible output, and used a maximal e-value of 0.00001 and an alignment length of at least 50 nucleotides as extraction criteria. I have also pooled the sequences according to chromosome number, and used the pooling to color code the nodes in Cytoscape. The edge width is tied to the alignment score: a high score gives a thick line, and a low score a thin one.
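
The selection criteria above can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions and not the extraction tool itself. It assumes the 12-column -m 8 table as input, drops self-hits, and takes the chromosome from the second letter of systematic yeast ORF names such as YAL001C (A = 1 up to P = 16), with everything else pooled as "other". The linear scaling of score to a width between 1 and 10 is an arbitrary choice made for the example; in practice the score column could just as well be mapped to edge width inside Cytoscape. All function and column names are made up here.

```python
def chromosome_of(orf_name):
    """Pool a systematic yeast ORF name by chromosome (second letter, A=1 ... P=16)."""
    if len(orf_name) >= 3 and orf_name[0] == "Y" and orf_name[1] in "ABCDEFGHIJKLMNOP":
        return str(ord(orf_name[1]) - ord("A") + 1)
    return "other"  # e.g. mitochondrial ORFs or non-systematic names


def select_edges(lines, max_evalue=1e-5, min_length=50):
    """Keep non-self hits passing the e-value and alignment length thresholds."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        length = int(fields[3])
        evalue = float(fields[10])
        score = float(fields[11])
        if query == subject:
            continue
        if evalue <= max_evalue and length >= min_length:
            edges.append({"source": query, "target": subject,
                          "score": score, "evalue": evalue, "length": length})
    return edges


def add_widths(edges, min_width=1.0, max_width=10.0):
    """Scale alignment score linearly to an edge width (assumed range 1 to 10)."""
    if not edges:
        return edges
    low = min(e["score"] for e in edges)
    high = max(e["score"] for e in edges)
    for e in edges:
        if high == low:
            e["width"] = min_width
        else:
            fraction = (e["score"] - low) / (high - low)
            e["width"] = min_width + fraction * (max_width - min_width)
    return edges


def edge_table(edges):
    """Format edges (with widths already added) as a tab-separated table."""
    rows = ["source\ttarget\tscore\tevalue\tlength\twidth"]
    for e in edges:
        rows.append("%s\t%s\t%s\t%s\t%d\t%.2f" % (
            e["source"], e["target"], e["score"], e["evalue"],
            e["length"], e["width"]))
    return "\n".join(rows)


def node_table(edges):
    """Format a node attribute table with the chromosome pool for each node."""
    nodes = sorted({e["source"] for e in edges} | {e["target"] for e in edges})
    rows = ["id\tchromosome"]
    rows.extend("%s\t%s" % (n, chromosome_of(n)) for n in nodes)
    return "\n".join(rows)
```

The two tables can then be loaded into Cytoscape as a network and a node attribute table, with the chromosome column driving node color and the width (or score) column driving edge width.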

I am still working on the extraction tool and will not provide any code yet. Input would, however, be appreciated. My personal opinion is that in the near future, the overload of newly produced DNA and protein sequences will choke us if we do not come up with more intuitive ways of displaying data. I don’t think that the network above is there yet. Still, it conveys information I would not have been able to understand from just looking at the BLAST output. The first attempts to get around the sequence overload problem won’t be the best ones. But we have got to start working on visualization methods today, so that we do not end up in sequences up to our shoulders in just a few years. Besides, a network image seems much more impressive than a number of lines of text…

I have fixed two small bugs in the blastgrep tool (see below), and the version number has been increased to 1.0.2. This update is recommended to everybody who downloaded the previous version of blastgrep. The new version of blastgrep can be downloaded using this link.

Version 1.0.2 fixes:
  • Fixed a bug with extracting information from queries without any matches
Version 1.0.1 fixes:
  • Fixed an inconsistency bug while using “-o count”

There’s a lot of stuff going on at the moment, and I will not be able to make it to this event myself, but I encourage everyone who is interested in the future of science, and able to go, to be there. It is important, interesting, and not expensive. Copy/paste from the website:

Join us at the first Open Science Summit, an attempt to gather all stakeholders who want to liberate our scientific and technological commons to enable an new era of decentralized, distributed innovation to solve humanity’s greatest challenges. (…) The Open Science Summit is the first and only event to consider what happens throughout the entire innovation chain as reform in one area influences the prospects in others.

Tickets are available until Wednesday (the 28th), and the event runs from July 29 to 31 at the International House Berkeley, CA. Please be there for me and represent a movement towards increased openness in science. See this previous post by me for my opinion on things.

I have added some software I have written to this page (see link to Software at the top of the page). Among these is the useful little Unix/Linux utility blastgrep, which functions as a grep adapted for extracting useful information from BLAST reports. I wrote it recently, as I increasingly found myself using complicated combinations of piped Unix commands to do the same thing. blastgrep makes it all easier. Use it as you wish, and if you do, please tell me about its bugs (hopefully none…)

I listened to a great talk by Alex Bateman (one of the guys behind Pfam and Rfam, and also involved in HMMER development) at FEBS yesterday. In addition to talking about the problems of increasing sequence amounts, Alex also provided some reflections on cooperation and knowledge-sharing, not only among fellow researchers, but also with a wider audience. The starting point of this discussion is Rfam, where the annotation of RNA families is entirely based on a community-driven wiki, tightly integrated with Wikipedia. This means that when a change is made in the Rfam annotation, the same change is also made on the corresponding Wikipedia page for this RNA family. And what’s the use of this? Well, as Alex says, for most of the keywords in molecular biology (and I would guess in all of science), the top hit on Google will be a Wikipedia entry. If not, the Wikipedia entry will be in the top ten list of hits, if a good Wiki page exists. This means that Wikipedia is the primary source of scientific information for the general public, as well as for many scientists. Wikipedia, not scientific journals.

The consequence of this is that to communicate your research subject, you should contribute to its Wikipedia page. In fact, Bateman argues, we have a responsibility as scientists to provide accurate and correct information to the public through the best sources available, which in most cases would be Wikipedia. To put this in perspective (and here I once again borrow Alex’s words), if somebody had told you ten years ago that there would be one single internet site that everybody would visit to find scientific information, and where discussion and continuous improvement would be allowed, encouraged and performed, most people would have said that was too good to be true. But that’s what Wikipedia offers. It is time to get rid of the Wiki-scepticism, and start improving it.

And so, what about the future of publishing? Bateman has worked hard to form an agreement with the journal RNA Biology to integrate publishing into the process of adding to the easily accessible public information. To have an article on a new RNA family published under the journal’s RNA families track, the family must not only be submitted to the Rfam database, but the authors must also provide a Wikipedia-formatted article, which undergoes the same peer-review process as the journal article. This ensures high-quality Wikipedia material, as well as making new scientific discoveries public.

I don’t think it is much of a stretch to guess that in the future, more journals and/or funding agencies will take similar approaches, as researchers and decision-makers discover the importance of correct, publicly available information. The scientific world is slowly moving towards being more open, also for non-scientists. This openness is extremely important in these times of climate scepticism, GMO controversy, extinction of species, and nuclear power debate. For the public to make proper decisions and send a clear message to the politicians, scientists need to be much better at communicating the current state of knowledge, or what many people prefer to call “truth”.

Here at FEBS, it strikes me for the first time that sex obviously can sell anything, even biology. With a mixture of disgust and interest in how much more attention it actually brings, I have been watching the two “antibody princess” girls who have been running around at the conference, trying to sell antibodies. Unfortunately, I have not brought my camera, so all I have is these pretty bad pictures taken with my phone. I will not name the company behind this, as I do not want to function as an inappropriate extra advertising space for them, but it’s interesting to note that the “sex sells” thing has reached into molecular biology. And it makes me wonder what’s next…

Time is running out if you want to attend the workshop session on mapping signal transduction, hosted by Stefan Hohmann and Marcus Krantz, which I will take part in. The deadline is on the 15th of May, so register soon if you have not already done so. You can find all important info here.

The workshop will take place on June 29th, between 13.00 and 15.30. The goal is to show some visualisation strategies for signal transduction pathways, and how to use pathway maps as a base to create mathematical models. There will be a brief introduction to mapping and modelling and to the software used (Cytoscape, CellDesigner). This will be followed by independent work with a set of small case studies that demonstrate the basic methodology. I will take part in answering questions and assisting during the case study part.

If you did not already know, or at least suspect, that pesticides used in agriculture could have a negative impact on species diversity, there is now proof. In this article:

the result of a joint study in eight European countries, we show that biodiversity indeed takes a hit from the use of pesticides, at several levels. We also conclude that action is needed to change the structure of large-scale agriculture. And why do I say we? This isn’t exactly microbiology, is it? Well, this is the first publication related to the field assistant work I did during the summers of 2007 and 2008. There is more in the pipeline, but this first publication at least shows that there are considerable risks with the way we use weed control.

Welcome

My name is Johan Bengtsson-Palme. I am doing research in microbiology and microbial ecology, primarily focusing on investigating antibiotic resistance of bacterial communities through the use of metagenomics and bioinformatics. I also have an interest in microbial taxonomy and improving the quality of reference databases. I am currently a member of Joakim Larsson’s group at the Sahlgrenska Academy, and reside in Gothenburg on the Swedish west coast. To contact me, feel free to send an e-mail to my firstname.lastname@microbiology.se