I have fixed a long-standing bug in the bloutminer script, which has thereby been pushed to version 0.9.6. The new version fixes an issue when using the -o blast option without the -n option. The new version can be downloaded here.
Good news for everyone using my bloutminer script; it has received an update making it even more useful! Basically, I have added a function to extract the top N matches to each query (using the -n option), and I have also added the ability to output the filtered set of hits in the same tabular BLAST format as the input. Thereby, bloutminer can now be used in more settings to easily extract a subset of a large BLAST report (in tabular format, generated using the blastall -m 8 option). The script can be downloaded here: https://microbiology.se/software/
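For readers who want a feel for what a top-N filter on a tabular BLAST report involves, here is a minimal Python sketch. It is not bloutminer's actual code: the function name `top_n_hits` is made up for illustration, and ranking hits by bit score (the twelfth column of the -m 8 table) is an assumption about what "top" means.

```python
from collections import defaultdict


def top_n_hits(lines, n):
    """Keep at most n hits per query from BLAST tabular (-m 8) lines.

    Assumption: "top" means highest bit score (column 12).
    The kept lines are returned unchanged, so the output has the
    same tabular format as the input.
    """
    by_query = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        bit_score = float(fields[11])
        by_query[query_id].append((bit_score, line))

    kept = []
    for hits in by_query.values():
        # Stable sort: hits with equal scores keep their input order.
        hits.sort(key=lambda hit: hit[0], reverse=True)
        kept.extend(line for _, line in hits[:n])
    return kept
```

Because the lines are passed through untouched, the result can be fed into any other tool that reads the tabular format.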
I have reorganised my Software page a little bit, putting the smaller scripts on a separate page, to make the main software page tidier. The content of the pages is the same, and you will still find bloutminer and metaorf on the main software page.
I have put some “new” software online. I have had this piece of code lying around for some time but never got around to uploading it, as I didn’t view it as “finished”. It is still not finished, but I would nevertheless like to share it with a wider audience. So, today I introduce bloutminer, the BLAST output mining script I have been using lately. bloutminer allows you to specify e.g. an E-value cutoff, a length cutoff and a percent identity cutoff, and extract a list of the hits satisfying these cutoffs. It takes table output (blastall option -m 8) as input. This is the software I used for the BLAST visualisation I have discussed earlier.
I normally use an E-value cutoff of 10 for my BLAST searches, and then extract hits with bloutminer, allowing me to change the cutoffs at a later stage without redoing the whole BLAST search. You can also “pool” sequences into groups, based on their sequence tags. bloutminer is a work in progress, and may contain nasty bugs. It can be found on the Software page. Please improve it at will.
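The idea of running BLAST once with a permissive cutoff and tightening the filter afterwards can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bloutminer source: `filter_hits` and its parameter names are hypothetical, and the column positions follow the standard 12-column -m 8 layout (percent identity in column 3, alignment length in column 4, E-value in column 11).

```python
def filter_hits(lines, max_evalue=10.0, min_length=0, min_identity=0.0):
    """Yield BLAST tabular (-m 8) lines that satisfy all three cutoffs.

    Hypothetical helper for illustration; the defaults let every hit
    through when the search itself was run with an E-value cutoff of 10.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        identity = float(fields[2])   # percent identity
        length = int(fields[3])       # alignment length
        evalue = float(fields[10])    # E-value
        if (evalue <= max_evalue
                and length >= min_length
                and identity >= min_identity):
            yield line
```

Since the permissive report stays on disk, trying a stricter setting is just another call, for example `filter_hits(report, max_evalue=1e-5, min_length=50)`, with no new BLAST run.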
Perhaps because of my roots in systems biology (or the cause of going there in the first place), I have always had an interest in creating visually appealing images of data, many times in the form of networks. I find that often in bioinformatics, one of the hardest problems is to make information understandable. For example, a BLAST output might say very little about how the genes or proteins are connected to each other, at least to the untrained eye.
Therefore, during the last weeks I have fiddled around with various ways of viewing interesting portions of BLAST reports. By making all-against-all BLAST searches, and outputting the data in table format (blastall option -m 8), I have been able to extract the hits I am interested in and export them into a Cytoscape compatible format, with some accompanying metadata (scores, e-values, alignment length, etc.). The results are often pretty hard to parse by eye, rendering them a bit meaningless, but they have become more and more interesting as I have put more effort into the extraction script. Just as an example, I here provide a simple map of the best all-against-all matches in the Saccharomyces cerevisiae genome, as a Cytoscape network:
The largest circle consists of transposable elements (jumping DNA which inserts itself at multiple locations in the genome; no surprise that there are a lot of them, and that these are pretty conserved). The circle to the left of the transposon circle consists of genes located inside the telomeric regions. Why they show such high similarity I do not know, but it seems plausible that their telomeric location could play a role here. The third circle contains mostly members of the seripauperin multigene family, which is also located close to the telomeres. At the bottom you find the gene pairs that match each other. You could go on with all the smaller structures as well, but I am no yeast expert, so I will stop here, letting this serve as an example of what a BLAST report really looks like.
For this image, I have used a blastn report of all yeast ORFs (taken from yeastgenome.org) as input to my extraction tool, selected Cytoscape compatible output, and required a maximal e-value of 0.00001 and an alignment length of at least 50 nts for a hit to be extracted. I have also pooled the sequences according to chromosome number. The pooling was used to color code the nodes in Cytoscape. The edge width is tied to the alignment score: a high score renders a thick line, and a low score a thin one.
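To make the extraction step a little more concrete, here is a rough Python sketch of how a tabular BLAST report could be turned into an edge table that Cytoscape can import as a delimited text file. It is only an illustration of the general idea, not the extraction tool itself: the function name, the column headers and the choice to drop self-hits are all assumptions, and the chromosome pooling is left out.

```python
import csv


def write_edge_table(lines, handle, max_evalue=1e-5, min_length=50):
    """Write BLAST tabular (-m 8) hits as a tab-delimited edge table.

    Hypothetical helper for illustration. Each kept hit becomes one edge
    (query -> subject) carrying bit score, E-value and alignment length
    as edge attributes. Returns the number of edges written.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target", "bitscore", "evalue", "length"])
    edges = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank, comment or malformed lines
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # assumption: self-hits add nothing to the network
        if float(fields[10]) <= max_evalue and int(fields[3]) >= min_length:
            writer.writerow([query, subject, fields[11], fields[10], fields[3]])
            edges += 1
    return edges
```

With a table like this, the bit score column can be mapped to edge width in Cytoscape's style settings, which gives the thick-line, thin-line effect described above.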
I am still working on the extraction tool and will not provide any code yet. Input would, however, be appreciated. My personal opinion is that in the near future, the overload of newly produced DNA and protein sequences will choke us if we do not come up with more intuitive ways of displaying data. I don’t think that the network above is there yet. Still, it conveys information I would not have been able to understand from just looking at the BLAST output. The first attempts to get around the sequence overload problem won’t be the best ones. But we have to start working on visualization methods today, so that we do not end up in sequences up to our shoulders in just a few years. Besides, a network image seems much more impressive than a number of lines of text…