Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg


Late last year, we introduced FARAO – the Flexible All-Round Annotation Organizer – a software tool that allows visualization of annotated features on contigs. Today, the Applications Note describing the software was published as an advance access paper in Bioinformatics (1). As I have described before, storing and visualizing annotation and coverage information in FARAO has a number of advantages. FARAO is able to:

  • Integrate annotation and coverage information for the same sequence set, enabling coverage estimates of annotated features
  • Scale across millions of sequences and annotated features
  • Filter sequences, such that only entries with annotations satisfying given criteria are output
  • Handle annotation and coverage data produced by a range of different bioinformatics tools
  • Handle custom parsers through a flexible interface, allowing for adaptation of the software to virtually any bioinformatic tool not supported out of the box
  • Produce high-quality EPS output
  • Integrate with MySQL databases

I have previously used FARAO to produce annotation figures in our paper on a polluted Indian lake (2), as well as in a paper on sewage treatment plants (which is in press and should be coming out any day now). We hope that the tool will find many more uses in other projects in the future!

References

  1. Hammarén R, Pal C, Bengtsson-Palme J: FARAO: The Flexible All-Round Annotation Organizer. Bioinformatics, advance access (2016). doi: 10.1093/bioinformatics/btw499
  2. Bengtsson-Palme J, Boulund F, Fick J, Kristiansson E, Larsson DGJ: Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 648 (2014). doi: 10.3389/fmicb.2014.00648

A problem with annotating contigs from genomic and metagenomic projects is that there are few tools that allow the visualization of the annotated features, particularly if those features come from different sources. To alleviate this problem, I have (with assistance from Rickard Hammarén and Chandan Pal) over the last few years developed a new annotation and read coverage visualization package – FARAO – which we today introduce to the public. FARAO has been used to produce the basis for the contig annotation figures in my paper on the polluted Indian lake. Storing and visualizing annotation and coverage information in FARAO has a number of advantages. FARAO is able to:

  • Integrate annotation and coverage information for the same sequence set, enabling coverage estimates of annotated features
  • Scale across millions of sequences and annotated features
  • Filter sequences, such that only entries with annotations satisfying given criteria are output
  • Handle annotation and coverage data produced by a range of different bioinformatics tools
  • Handle custom parsers through a flexible interface, allowing for adaptation of the software to virtually any bioinformatic tool
  • Produce high-quality EPS output
  • Integrate with MySQL databases
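To make the first point a bit more concrete, here is a minimal Python sketch of what a coverage estimate for an annotated feature amounts to. This is not FARAO's code or its interface; the function name, the per-base depth list and the 1-based inclusive coordinates are all assumptions made for illustration only.

```python
def feature_coverage(depth, start, end):
    """Mean read depth over a feature (hypothetical helper, not part of FARAO).

    depth is a list with one read-depth value per base of the contig;
    start and end are 1-based, inclusive feature coordinates.
    """
    if start < 1 or end > len(depth) or start > end:
        raise ValueError("feature coordinates fall outside the contig")
    window = depth[start - 1:end]
    return sum(window) / len(window)


# Made-up per-base depth for a six-base contig with one annotated feature
contig_depth = [0, 2, 4, 6, 8, 10]
print(feature_coverage(contig_depth, 2, 4))
```

Once annotation and coverage are stored for the same sequence set, a filter such as "only features with a mean coverage above some threshold" is just a comparison against this value.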

Today, FARAO moves from a private pre-release state to a public beta state. It is still possible that this version contains bugs that we have not discovered in our testing. If you find any unexpected behavior in this version of FARAO, please send me an e-mail and make us aware of the potential shortcomings of our software.

I have put some “new” software online. I have had this piece of code lying around for some time but never got to upload it as I didn’t view it as “finished”. It is still not finished, but I would nevertheless like to share it with a wider audience. So, today I introduce bloutminer – the BLAST output mining script I have been using lately. bloutminer allows you to specify e.g. an E-value cutoff, a length cutoff and a percent identity cutoff, and extract a list of the hits satisfying these cutoffs. It takes table output (blastall option -m 8) as input. This is the software I used for the BLAST visualisation I have discussed earlier.

I normally use an E-value cutoff of 10 for my BLAST searches, and then extract hits with bloutminer, allowing me to change the cutoffs at a later stage without redoing the whole BLAST search. You can also “pool” sequences into groups, based on their sequence tags. bloutminer is a work in progress, and may contain nasty bugs. It can be found on the Software page. Please improve it at will.
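For readers who want a feel for this kind of filtering, here is a minimal Python sketch. It is not bloutminer itself and does not mirror its options; the function names and default cutoffs are assumptions. It relies only on the standard twelve-column layout of BLAST tabular output (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score).

```python
import csv
import io

# Column order of BLAST tabular output (blastall -m 8)
COLUMNS = ["query", "subject", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_evalue(text):
    """Convert an E-value string to float.

    Defensive assumption: some legacy outputs write 'e-105' with no
    leading digit, which float() rejects, so a '1' is prepended.
    """
    text = text.strip()
    if text.startswith("e"):
        text = "1" + text
    return float(text)


def filter_hits(handle, max_evalue=1e-5, min_length=50, min_identity=0.0):
    """Yield hits (as dicts) that satisfy all three cutoffs."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(COLUMNS) or row[0].startswith("#"):
            continue  # skip blank, truncated or comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = parse_evalue(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if (hit["evalue"] <= max_evalue
                and hit["length"] >= min_length
                and hit["pident"] >= min_identity):
            yield hit


# Made-up example report with three hits
report = (
    "geneA\tgeneB\t98.50\t120\t2\t0\t1\t120\t1\t120\t1e-50\t210\n"
    "geneA\tgeneC\t80.00\t30\t6\t0\t1\t30\t5\t34\t0.5\t20.1\n"
    "geneB\tgeneD\t75.00\t200\t50\t0\t1\t200\t1\t200\te-30\t150\n"
)
for hit in filter_hits(io.StringIO(report), min_identity=70):
    print(hit["query"], hit["subject"], hit["evalue"])
```

Because the cutoffs are applied after the search, the same permissive report can be re-filtered with stricter values at any time, which is the whole point of searching with a generous E-value first.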

Perhaps because of my roots in systems biology (or the cause of going there in the first place), I have always had an interest in creating visually appealing images of data, many times in the form of networks. I find that often in bioinformatics, one of the hardest problems is to make information understandable. For example, a BLAST output might say very little about how the genes or proteins are connected to each other, at least to the untrained eye.

Therefore, during the last few weeks I have fiddled around with various ways of viewing interesting portions of BLAST reports. By making all-against-all BLAST searches, and outputting the data in table format (blastall option -m 8), I have been able to extract the hits I am interested in and export them into a Cytoscape compatible format, with some accompanying metadata (scores, e-values, alignment length, etc.). The results are often hard to parse by eye, rendering them a bit meaningless, but they have become more and more interesting as I have put more effort into the extraction script. Just as an example, I here provide a simple map of the best all-against-all matches in the Saccharomyces cerevisiae genome, as a Cytoscape network:

The largest circle consists of transposable elements (jumping DNA that inserts itself at multiple locations in the genome; no surprise that there are a lot of them, and that they are pretty conserved). The circle to the left of the transposon circle consists of genes located inside the telomeric regions. Why they show such high similarity I do not know, but it seems plausible that their telomeric location plays a role here. The third circle contains mostly members of the seripauperin multigene family, which is also located close to the telomeres. At the bottom you find the gene pairs that match each other. You could go on with all the smaller structures as well, but I am no yeast expert, so I will stop here, letting this serve as an example of what a BLAST report really looks like.

For this image, I have used a blastn report of all yeast ORFs (taken from yeastgenome.org) as input to my extraction tool, selected Cytoscape compatible output, and used a maximal e-value of 0.00001 and an alignment length of at least 50 nt as extraction criteria. I have also pooled the sequences according to chromosome number. The pooling was used to color code the nodes in Cytoscape. The edge width reflects the alignment score: a high score gives a thick line, and a low score a thin one.
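The export step can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the extraction tool used for the figure: the function names and the two-table layout (one tab-delimited edge table, one node table) are my own choices here, and the pooling assumes yeast systematic ORF names, where the second letter encodes the chromosome (A for chromosome I, B for II, and so on up to P).

```python
import csv
import io


def chromosome_pool(orf_name):
    """Pool label from a yeast systematic ORF name, e.g. 'YBR002W' -> 'chr2'.

    Names that do not follow the pattern (e.g. mitochondrial ORFs)
    are pooled as 'other'.
    """
    if len(orf_name) >= 2 and orf_name[0] == "Y" and "A" <= orf_name[1] <= "P":
        return "chr" + str(ord(orf_name[1]) - ord("A") + 1)
    return "other"


def best_hits(hits):
    """Keep the highest-scoring non-self hit for each query."""
    best = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # an ORF always matches itself; not informative
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return best


def write_network(hits, edge_handle, node_handle):
    """Write an edge table (with score) and a node table (with pool)."""
    edges = csv.writer(edge_handle, delimiter="\t", lineterminator="\n")
    edges.writerow(["source", "target", "score"])
    seen = set()
    for query, (subject, score) in sorted(best_hits(hits).items()):
        edges.writerow([query, subject, score])
        seen.update((query, subject))
    nodes = csv.writer(node_handle, delimiter="\t", lineterminator="\n")
    nodes.writerow(["id", "pool"])
    for name in sorted(seen):
        nodes.writerow([name, chromosome_pool(name)])


# Made-up hits as (query, subject, alignment score)
example = [("YAL001C", "YAL001C", 500.0), ("YAL001C", "YBR002W", 120.0),
           ("YAL001C", "YCL003W", 300.0), ("YBR002W", "YAL001C", 120.0)]
edge_out, node_out = io.StringIO(), io.StringIO()
write_network(example, edge_out, node_out)
print(edge_out.getvalue())
print(node_out.getvalue())
```

In Cytoscape, tables like these can be imported as a network and a node attribute table, after which the score column can be mapped to edge width and the pool column to node color in the style settings.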

I am still working on the extraction tool and will not provide any code yet. Input would, however, be appreciated. My personal opinion is that in the near future, the overload of newly produced DNA and protein sequences will choke us if we do not come up with more intuitive ways of displaying data. I don’t think that the network above is there yet. Still, it conveys information I would not have been able to understand from just looking at the BLAST output. The first attempts to get around the sequence overload problem won’t be the best ones. But we have got to start working on visualization methods today, so that we do not end up with sequences up to our shoulders in just a few years. Besides, a network image seems much more impressive than a number of lines of text…