I will present my master's thesis, “Metagenomic Analysis of Marine Periphyton Communities”, on Tuesday the 22nd of March, at 13.00. The presentation will take place in the room Folke Andreasson at Medicinaregatan 11 in Gothenburg. The presentation is open to everyone, but the number of seats is limited.
There is currently an interesting competition going on, organised by UCSC, called the Assemblathon. The idea is that participating research groups will try to assemble simulated short reads into a simulated genome, with the winner being the group doing it “best” (by some criteria set up by the evaluation team at the UC Davis Genome Center). The complete set of rules can be found here. The whole thing will culminate in a Genome Assembly Workshop at UC Santa Cruz in mid-March.
I think the competition is an interesting initiative, hopefully inspiring new, more efficient sequence assembly ideas. Those are desperately needed in these times of ever-increasing DNA sequence generation. In addition, there are already numerous genome assembly programs, but (as noted on the Assemblathon site) it is not obvious which one is the best in a given situation. Hopefully the competition can shed some light on that too. The deadline for participation is the sixth of February, and even though I am not competent enough to participate myself, I hope the ones who do are successful in their work.
Perhaps because of my roots in systems biology (or perhaps this is what drew me there in the first place), I have always had an interest in creating visually appealing images of data, often in the form of networks. I find that one of the hardest problems in bioinformatics is often to make information understandable. For example, a BLAST output might say very little about how the genes or proteins are connected to each other, at least to the untrained eye.
Therefore, during the last few weeks I have been fiddling with various ways of viewing interesting portions of BLAST reports. By running all-against-all BLAST searches and outputting the data in table format (blastall option -m 8), I have been able to extract the hits I am interested in and export them to a Cytoscape-compatible format, with some accompanying metadata (scores, e-values, alignment length, etc.). The results are often hard to parse by eye, which renders them a bit meaningless, but they have become more and more interesting as I have put more effort into the extraction script. Just as an example, here is a simple map of the best all-against-all matches in the Saccharomyces cerevisiae genome, as a Cytoscape network (click for full size):
The largest circle consists of transposable elements (jumping DNA that inserts itself at multiple locations in the genome; it is no surprise that there are a lot of them, or that they are pretty well conserved). The circle to the left of the transposon circle consists of genes located inside the telomeric regions. Why they show such high similarity I do not know, but it seems plausible that their location in the telomeres plays a role here. The third circle contains mostly members of the seripauperin multigene family, which are also located close to the telomeres. At the bottom you find the gene pairs that match each other. You could go on with all the smaller structures as well, but I am no yeast expert, so I will stop here, letting this serve as an example of what a BLAST report really looks like.
For this image, I used a blastn report of all yeast ORFs (taken from yeastgenome.org) as input to my extraction tool, selected Cytoscape-compatible output, and required a maximal e-value of 0.00001 and an alignment length of at least 50 nt for a hit to be extracted. I also pooled the sequences according to chromosome number, and used the pooling to color code the nodes in Cytoscape. The edge width reflects the alignment score: a high score gives a thick line, and a low score a thin one.
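To make the filtering step concrete, here is a minimal sketch in Python. It is not the extraction tool itself, which is not published yet, just an illustration of the same idea. It assumes the standard 12-column tabular layout produced by blastall -m 8, uses the thresholds mentioned above as defaults, and skips self-hits (my assumption; every sequence trivially matches itself in an all-against-all search). The function names and the column headers of the output table are made up for this example.

```python
import csv
import io

# Column order of the tabular BLAST output (blastall -m 8).
FIELDS = ["query", "subject", "identity", "length", "mismatches", "gap_opens",
          "q_start", "q_end", "s_start", "s_end", "evalue", "bitscore"]


def parse_evalue(text):
    """Convert an e-value string to float.

    Defensive assumption: some BLAST reports abbreviate 1e-108 as "e-108".
    """
    return float("1" + text if text.startswith("e") else text)


def extract_edges(handle, max_evalue=1e-5, min_length=50):
    """Yield (query, subject, bitscore, evalue, length) for hits passing the filters."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#") or len(row) < len(FIELDS):
            continue  # skip blank lines, comment lines and malformed rows
        hit = dict(zip(FIELDS, row))
        if hit["query"] == hit["subject"]:
            continue  # assumption: self-hits carry no information for the network
        evalue = parse_evalue(hit["evalue"])
        length = int(hit["length"])
        if evalue <= max_evalue and length >= min_length:
            yield hit["query"], hit["subject"], float(hit["bitscore"]), evalue, length


def write_edge_table(edges, out):
    """Write a tab-delimited edge table with one header row (hypothetical column names)."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "bitscore", "evalue", "length"])
    writer.writerows(edges)


# Made-up rows for illustration only; these are not real BLAST results.
SAMPLE = (
    "orfA\torfA\t100.00\t300\t0\t0\t1\t300\t1\t300\t0.0\t590\n"
    "orfA\torfB\t98.50\t363\t5\t0\t1\t363\t1\t363\t1e-180\t650\n"
    "orfA\torfC\t85.00\t30\t4\t0\t1\t30\t1\t30\t1e-08\t40.1\n"
    "orfD\torfE\t80.00\t120\t24\t0\t1\t120\t1\t120\t0.5\t30.2\n"
)

table = io.StringIO()
write_edge_table(extract_edges(io.StringIO(SAMPLE)), table)
print(table.getvalue(), end="")
```

A table like this can be loaded through Cytoscape's table import, choosing the source and target columns as the interacting nodes. The bit score column can then be mapped to edge width in the visual style settings, which is where the thick and thin lines come from. Coloring by chromosome would need a separate node attribute, which this sketch leaves out.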
I am still working on the extraction tool and will not provide any code yet. Input would, however, be appreciated. My personal opinion is that in the near future, the overload of newly produced DNA and protein sequences will choke us if we do not come up with more intuitive ways of displaying data. I don’t think that the network above is there yet. Still, it conveys information I would not have been able to understand from just looking at the BLAST output. The first attempts to get around the sequence overload problem won’t be the best ones. But we have to start working on visualization methods today, so that we do not end up in sequences up to our shoulders in just a few years. Besides, a network image seems much more impressive than a number of lines of text…
I have fixed two small bugs in the blastgrep tool (see below), and the version number has been increased to 1.0.2. This update is recommended to everybody who downloaded the previous version of blastgrep. The new version of blastgrep can be downloaded using this link.
- Fixed a bug with extracting information from queries without any matches
- Fixed an inconsistency bug while using “-o count”
I have added some software I have written to this page (see the link to Software at the top of the page). Among these is the useful little Unix/Linux utility blastgrep, which functions as a grep adapted for extracting useful information from BLAST reports. I wrote it recently because I increasingly found myself using complicated combinations of piped Unix commands to do the same thing. blastgrep makes it all easier. Use it as you wish, and if you do, please tell me about its bugs (hopefully none…)