I proudly announce that today Metaxa has been officially released. Metaxa is a a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequence datasets. We have been working on Metaxa for quite some time, and it has now been in beta for about two months. However, it seems to be stable enough for public consumption. In addition, the software package is today presented in a talk at the SocBiN conference in Helsinki.
A more thorough post on the rationale behind Metaxa, and how it works will follow when I am not occupied by the SocBiN conference. A paper on Metaxa is to be published in the journal Antonie van Leeuwenhoek. The software can be downloaded from here.
For those of you who are not already fed up with my writings on biology stuff on the web site, two opportunities to hear me talk in real life has popped up in May. The first is already on May 2nd, on the Open Day in Life Sciences, arranged by the Science Faculty at the University of Gothenburg. I will talk about the search for detoxification systems in metagenomic sequence data (from a collections point of view, as that is the theme for the day). There will also be an opportunity be guided in the herbarium and the botanical garden, plus having lunch and an optional after-work drink at Botaniska Paviljongen. But hurry, last day of admission is tomorrow! Register here.
The second opportunity will be at the SocBiN-2011 bioinformatics conference in Helsinki, on the 12th of May. I will present in the session called “Bioinformatics of Metagenomics”, and talk about a software tool for rRNA classification. I really look forward to this Bioinformatics conference, there are a number of highly prominent and interesting speakers, and I have heard that Helsinki in May is very beautiful. Besides, I am going there with extremely nice people, adding up to potentially being the best biology venue I will attend this spring.
So, last week I started my Ph.D. in Joakim Larsson’s group at the Sahlgrenska Academy. While I am very happy about how things have evolved, I will also miss the ecotox group and the functional genomics group a lot (though both do their research within 10 minutes walking distance from my new place…) I spent last week getting through the usual administrative hassle; getting keys and cards, signing papers, installing bioinformatics software on my new monster of a computer etc. Slowly, the new room starts to feel like it is mine (after nailing phylogenetic trees, my favorite map of the amino acids, and my remember-why-Cytoscape-visualisation-might-not-be-a-good-idea-for-all-network-like-structures poster to the billboard).
So what will this change of positions mean? Will I quit doing research on microbial communities? Of course not! In my new position, my subject of investigation will be bacterial communities subjected to antibiotics. We will look for resistance genes in such communities, and try to answer questions like: How do a high antibiotic selection pressure affect abundance of resistance genes and mobile elements that could facilitate their transfer between bacteria? Can resistance genes found in environmental bacteria be transferred to the microbes of the human gut? Can the environmental bacteria tell us what resistance genes that will be present in clinical situations in the near future? All these questions could, at least partially, be answered by metagenomic approaches and good bioinformatics tools, and my role will be to come up with the solutions provide answers to them.
I am excited over this new project, which involves my favorite subject – metagenomics and community analysis – as well as important factors, such as the clinical connections, the possibility to add pieces to the antibiotic resistance puzzle, and the role of gene and species transfer in resistance development. I also like the fact that I will need to handle high-throughput sequence data, meaning that there will be many opportunities to develop tools, a task I highly enjoy. I think the next couple of years will be an exciting time.
I have reorganised my Software page a little bit, putting the smaller scripts on a separate page, to make the main software page tidier. The content of the pages is the same, and you still find bloutminer and metaorf on the main software page.
I have put some “new” software online. I have had this piece of code lying around for some time but never got to upload it as I didn’t view it as “finished”. It is still not finished, but I would nevertheless like to share it with a wider audience. So, today I introduce bloutminer – the BLAST output mining script I have been using lately. bloutminer allows you to specify e.g. an E-value cutoff, a length cutoff and a percent identity cutoff, and extract a list of the hits satisfying these cutoffs. It takes table output (blastall option -m 8 ) as input. This is the software I used for the BLAST visualisation I have discussed earlier.
I normally use an E-value cutoff of 10 for my BLAST searches, and then extracts hits with bloutminer, allowing me to change the cutoffs at a later stage without redoing the whole BLAST search. You can also “pool” sequences into groups, based on their sequence tags. bloutminer is work in progress, and may contain nasty bugs. It can be found on the Software page. Please improve it at will.
Perhaps because of my roots in systems biology (or the cause of going there in the first place), I have always had an interest in creating visually appealing images of data, many times in the form of networks. I find that often in bioinformatics, one of the hardest problems is to make information understandable. For example, a BLAST output might say very little about how the genes or proteins are connected to each other, at least to the untrained eye.
Therefore, during the last weeks I have fiddled around with various ways of viewing interesting portions of BLAST reports. By making all-against-all BLAST searches, and outputting the data in table format (blastall option -m 8), I have been able to extract the hits I am interested in and export them into a Cytoscape compatible format, with some accompanying metadata (scores, e-values, alignment length, etc.). The results are many times pretty unparsable by the eye, rendering them a bit meaningless, but have been more and more interesting as I have put more effort into the extraction script. Just as an example, I here provide a simple map of the best all-against-all matches in the Saccharomyces cerevisiae genome, as a Cytoscape network (click for full size):
The largest circle consists of transposable elements (jumping DNA which inserts itself at multiple locations in the genome, no surprise there is a lot of them, and that these are pretty conserved). The circle to the left of the transposon circle consists of genes located inside the telomeric regions. Why they show such high similarity I do not know, but it seems plausible that the telomere thing could play a role here. The third circle contain mostly members of the seripauperin multigene family, which is also located close to the telomeres. At the bottom you found the gene pairs, that match to each other. You could go on with all the smaller structures as well, but I am no yeast expert, so I will stop here, letting this serve as an example of what a BLAST report really look like.
For this image, I have used a blastn report of all yeast ORFs (taken from yeastgenome.org) as input to my extraction tool, selected Cytoscape compatible output, and used a maximal e-value of 0.00001 and an alignment length of at least 50 nts as criteria to be extracted. I have also pooled the sequences according to chromosome number. The pooling was used to color code the nodes in Cytoscape. The edge width is connected to alignment score, a high score renders a thick line, and a low score causes the line to be thin.
I am still working on the extraction tool and will not provide any code yet. Input would, however, be appreciated. My personal opinion is that in the near future, the overload of newly produced DNA and protein sequences will choke us if do not come up with more intuitive ways of displaying data. I don’t think that the network above is there yet. Still, it conveys information I would not have been able to understand from just looking at the BLAST output. The first attempts to come around the sequence overload problem won’t be the best ones. But we got to start working on visualization methods today, so that we do not end up with sequences over our shoulders in just a few years. Besides, a network image seems much more impressive than a number of lines of text…
I have added some software I have written to this page (see link to Software at the top of the page). Among these is the useful little Unix/Linux utility blastgrep, which functions as a grep adopted for extracting useful information from BLAST-reports. I wrote it recently as I increasingly use complicated combinations of piped Unix-commands to do the same thing. blastgrep makes it all more easy. Use it as you wish, and if you do, please tell me about its bugs (hopefully none…)
I listened to a great talk by Alex Bateman (one of the guys behind Pfam and Rfam, as well as involved in HMMER development) at FEBS yesterday. In addition to talking about the problems of increasing sequence amounts, Alex also provided some reflections on co-operativity and knowledge-sharing – not only among fellow researchers, but also to a wider audience. The starting point of this discussion is Rfam, where the annotation of RNA families is entirely based on a community-driven wiki, tightly integrated with Wikipedia. This means that to make a change in the Rfam annotation, the same change is also made at the corresponding Wikipedia page for this RNA family. And what’s the use of this? Well, as Alex says, for most of the keywords in molecular biology (and I would guess in all of science), the top hit on Google will be a Wikipedia entry. If not, the Wikipedia entry will be in the top ten list of hits, if a good Wiki page exists. This means that Wikipedia is the primary source of scientific information for the general public, as well as many scientists. Wikipedia – not scientific journals.
The consequence of this is that to communicate your research subject, you should contribute to its Wikipedia page. In fact, Bateman argues, we have a responsibility as scientists to provide accurate and correct information to the public through the best sources available, which in most cases would be Wikipedia. To put this in perspective (and here I once again borrow Alex’ words), if somebody told you ten years ago that there would be one single internet site that everybody would visit to find scientific information, and where discussion and continuous improvement would be allowed, encouraged and performed, most people would have said that was too good to be true. But that’s what Wikipedia offers. It is time to get rid of the Wiki-sceptisism, and start improving it.
And so, what about the future of publishing? Bateman has worked hard to form an agreement with the journal RNA Biology to integrate the publishing into the process of adding to the easily accessible public information. To have an article on a new RNA family published under the journal’s RNA families track, the family must not only be submitted to the Rfam database, but the authors must also provide a Wikipedia formatted article, which undergo the same peer-review process as the journal article. This ensures high-quality Wikipedia material, as well as making new scientific discoveries public.
I don’t think there’s a long stretch to guess that in the future, more journals and/or funding agencies will take on similar approaches, as researchers and decision-makers discover the importance of correct, publicly available information. The scientific world is slowly moving towards being more open, also for non-scientists. This openness is of extremely high importance in these times of climate scepticism, GMO controversy, extinction of species, and nuclear power debate. For the public to make proper decisions and send a clear message to the politicians, scientists need to be much better at communicating the current state of knowledge, or what many people prefer to call “truth”.
The time is running out if you want to attend to the workshop session on mapping signal transduction, hosted by Stefan Hohmann and Marcus Krantz, which I will take part in. Deadline is on the 15 of May, so register soon if you have not already done. You can find all important info here.
The workshop will take place on June 29:th, between 13.00 and 15.30. The goal is to show some visualisation strategies for signal transduction pathways, and how to use pathway maps as a base to create mathematical models. There will be a brief introduction to mapping and modelling and to the software used (Cytoscape, CellDesigner). This will be followed by independent work with a set of small case studies that demonstrates the basic methodology. I will take part in answering questions and assisting during the case study part.