Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg

Browsing Posts tagged HMMER

An ITSx user yesterday made me aware of an information problem (thanks Suzanne!) regarding the use of ITSx in combination with the HMMER 3.1 beta. I have not been entirely clear on why you might get the “Error: bad format, binary auxfiles, (…) binary auxfiles are in an outdated HMMER format (3/b); please hmmpress your HMM file again” error message when running ITSx with HMMER 3.1 installed. You might think that following the instructions for Metaxa would do the trick. As you will notice, however, it will not. Instead, you will be presented with the following error message: “Error: Failed to open binary auxfiles”. This is because Metaxa 1.1.2 will re-create the HMM-files if needed, whereas ITSx will not. Instead, ITSx has the option "--reset T", which can be added to the command line to recreate the HMM-files for the currently installed HMMER version (regardless of which 3.x version).

Thus, the solution for the “bad format, binary auxfiles” error is to simply add "--reset T" (without quotes) to the ITSx command line and run the software again. You only need to do this once, unless you update HMMER and/or get the same error message again for some other reason. The Metaxa-post has been updated to clarify this as well.
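
For those who launch ITSx from a script, the fix above could be sketched roughly like this in Python. Only the "--reset T" option comes from this post; the "-i" and "-o" options are ITSx's usual input and output switches as far as I recall them, and the file names are made-up placeholders, so check everything against the ITSx help text before relying on it.

```python
import shlex

def itsx_reset_command(input_fasta, output_prefix, itsx_binary="ITSx"):
    """Build an ITSx command line that also asks ITSx to recreate its HMM-files.

    input_fasta and output_prefix are hypothetical placeholders; "--reset T"
    is the option described in the post above.
    """
    return [itsx_binary, "-i", input_fasta, "-o", output_prefix, "--reset", "T"]

# Show the command line that would be run (nothing is executed here).
cmd = itsx_reset_command("sequences.fasta", "itsx_out")
print(shlex.join(cmd))

# To actually run it (requires ITSx and HMMER 3.x to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Since the reset only needs to happen once, later runs can drop the last two arguments again.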

An ITSx user informed me a couple of days ago of an issue that caused ITSx to sometimes accidentally remove the HMM-files in the database when multiple ITSx jobs were run in parallel. Although this issue should be relatively rare, it was also very easy to fix. Therefore, we have already pushed out a new version of ITSx (1.0.3), which is available for download here.

In short, the bug was introduced because I overlooked this usage scenario when fixing another bug related to the HMM-files in an earlier pre-release. Let’s keep our fingers crossed that version 1.0.3 will be more long-lived than 1.0.2!

As promised yesterday, I have now uploaded an update to ITSx, bringing it to version 1.0.2. So what’s new in this version?

First of all, ITSx has been taken out of beta and is now considered ready for production use. We no longer find any bugs in it, and since there’s now a wide range of people already using it for various purposes, we feel confident that any significant bugs would have been uncovered by now.

Secondly, this version of ITSx adds support for the new HMMER version (3.1b) released in May. So you can now go ahead and install HMMER 3.1 if you want to try out the new HMMER beta and still be able to use ITSx.

Finally, we have also updated the manual somewhat, hopefully making it a little easier to use ITSx for a first-time user.

Version 1.0.2 of ITSx can be downloaded from here. As previously, you may still report any bugs, strange behaviors, ideas for new features, or inconsistencies with certain lineages, by mailing to “itsx” at this domain name.

As you might be aware, a new version of HMMER has been out since late May. You might wonder how Metaxa (relying on HMMER3) will work if you update to the new version of HMMER, and I have finally got around to testing it! The answer, according to my somewhat limited testing, is that Metaxa 1.1.2 seems to be working fine with HMMER 3.1.

You might need to go into the database directory (“metaxa_db”; should be located in the same directory as the Metaxa binaries), and remove all the files ending with the suffixes .h3f, .h3i, .h3m and .h3p inside the “HMMs” directory. On most installations, this should not be necessary. Myself, I just plugged HMMER 3.1 in and started Metaxa, but if you get error messages complaining that “Error: bad format, binary auxfiles, .hmm:
binary auxfiles are in an outdated HMMER format (3/b); please hmmpress your HMM file again”, then you should try removing the files and re-running Metaxa. This might especially be a problem on older Metaxa versions. [Update: Note that this fix will likely not work with ITSx!]
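
The clean-up described above can be done by hand, but here is a minimal Python sketch of it. The directory names “metaxa_db” and “HMMs” and the four suffixes come from this post; the function name is my own invention, and the sketch assumes you run it from the directory containing the Metaxa binaries.

```python
from pathlib import Path

# The four suffixes of the binary auxiliary files mentioned in the post.
HMMER_BINARY_SUFFIXES = {".h3f", ".h3i", ".h3m", ".h3p"}

def remove_pressed_hmm_files(hmm_dir):
    """Delete the binary auxfiles in hmm_dir and return the removed file names.

    The plain-text HMM files (and anything else) are left untouched.
    """
    removed = []
    for path in sorted(Path(hmm_dir).iterdir()):
        if path.is_file() and path.suffix in HMMER_BINARY_SUFFIXES:
            path.unlink()
            removed.append(path.name)
    return removed

if __name__ == "__main__":
    # Assumed location: the "HMMs" directory inside "metaxa_db".
    hmm_dir = Path("metaxa_db") / "HMMs"
    if hmm_dir.is_dir():
        for name in remove_pressed_hmm_files(hmm_dir):
            print("removed", name)
```

After this, re-running Metaxa should re-create the files for the installed HMMER version, as described above.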

Bear in mind that I have not run thorough testing on Metaxa and HMMER 3.1, and probably won’t for the 1.1.2 version, since there’s a 2.0 version waiting just around the corner…

Additionally, if you experience problems with Megraft, you should try the same fix as for Metaxa, but with the Megraft database directory instead. Regarding ITSx, a minor update will be released very soon, which will also address HMMER 3.1b compatibility. [Update: See this post for how to work around HMMER 3.1 problems with ITSx.]

Happy barcoding everyone!

For a couple of years, I have been working with microbial ecology and diversity, and how such features can be assessed using molecular barcodes, such as the SSU (16S/18S) rRNA sequence (the Metaxa and Megraft packages). However, I have also been aiming at the ITS region, and how that can be used in barcoding (see e.g. the guidelines we published last year). It is therefore a great pleasure to introduce my next gem for community analysis; a software tool for detection and extraction of the ITS1 and ITS2 regions of ITS sequences from environmental communities. The tool is dubbed ITSx, and supersedes the more specific fungal ITS extractor written by Henrik Nilsson and colleagues. Henrik is once more the mastermind behind this completely rewritten version, in which I have done the lion’s share of the programming. Among the new features in ITSx are:

  • Robust support for the Cantharellus, Craterellus, and Tulasnella genera of fungi
  • Support for nineteen additional eukaryotic groups on top of the already present support for fungi (specifically these groups: Tracheophyta (vascular plants), Bryophyta (bryophytes), Marchantiophyta (liverworts), Chlorophyta (green algae), Rhodophyta (red algae), Phaeophyceae (brown algae), Metazoa (metazoans), Oomycota (oomycetes), Alveolata (alveolates), Amoebozoa (amoebozoans), Euglenozoa, Rhizaria, Bacillariophyta (diatoms), Eustigmatophyceae (eustigmatophytes), Raphidophyceae (raphidophytes), Synurophyceae (synurids), Haptophyceae (haptophytes), Apusozoa, and Parabasalia (parabasalids))
  • Multi-processor support
  • Extensive output options
  • Virtually zero false-positive extractions

ITSx is today moved from a private pre-release state to a public beta state. No code changes have been made since February, indicating that the last pre-release candidate is now ready to fly on its own. As far as our testing has revealed, this version seems to be bug free. In reality though, researchers tend to find the most unexpected usage scenarios. So please, if you find any unexpected behavior in this version of ITSx, send me an e-mail and make us aware of the potential shortcomings of our software.

We expect this open-source software to boost research in microbial ecology based on barcoding of the ITS region, and hope that the research community will evaluate its performance also among the eukaryote groups that we have less experience with.

Those attending the Metagenomics lab (part of the basic NGS course for PhD students given at GU this week) can find the material for the lab on this page:
http://microbiology.se/ngs-metagenomics-lab/
Of course, the page is open for anyone else as well, although you won’t get the support that the GU students are given.

Yesterday, our paper on Megraft – a software tool to graft ribosomal small subunit (16S/18S) fragments onto full-length SSU sequences – became available as an accepted online early article in Research in Microbiology. Megraft is built upon the notion that when examining the depth of a community sequencing effort, researchers often use rarefaction analysis of the ribosomal small subunit (SSU/16S/18S) gene in a metagenome. However, the SSU sequences in metagenomic libraries generally are present as fragmentary, non-overlapping entries, which poses a great problem for this analysis. Megraft aims to remedy this problem by grafting the input SSU fragments from the metagenome (obtained by e.g. Metaxa) onto full-length SSU sequences. The software also uses a variability model which accounts for observed and unobserved variability. This way, Megraft enables accurate assessment of species richness and sequencing depth in metagenomic datasets.
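
For readers unfamiliar with rarefaction analysis, here is a minimal Python sketch of the classical calculation: the expected number of taxa seen when n sequences are drawn at random, without replacement, from a sample with known per-taxon counts. Note that this is the textbook rarefaction formula only, not the grafting algorithm or the variability model of Megraft, and the counts in the example are made up for illustration.

```python
from math import comb

def expected_richness(counts, n):
    """Expected number of taxa observed in a random subsample of n sequences.

    counts holds the number of sequences assigned to each taxon. A taxon with
    c sequences is missed entirely with probability C(N - c, n) / C(N, n),
    where N is the total number of sequences.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of sequences")
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts)

# Made-up example: three taxa with 5, 3 and 2 sequences, respectively.
example_counts = [5, 3, 2]
for depth in range(0, sum(example_counts) + 1, 2):
    print(depth, round(expected_richness(example_counts, depth), 3))
```

Plotting these values against the subsample size gives the rarefaction curve; a curve that is still rising steeply at full depth suggests that the community has not been sequenced deeply enough.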

The algorithm, efficiency and accuracy of Megraft are thoroughly described in the paper. It should be noted that this is not a panacea for species richness estimates in metagenomics, but it is a huge step forward over existing approaches. Megraft shares some similarities with EMIRGE (Miller et al., 2011), which is a software package for reconstruction of full-length ribosomal genes from paired-end Illumina sequences. Megraft, however, is set apart in that it has a strong focus on rarefaction, and functions also when the number of sequences is small, which is often the case in 454 and Sanger-based metagenomics studies. Thus, EMIRGE and Megraft seek to solve a roughly similar problem, but for different sequencing technologies and sequencing scales.

Megraft is available for download here, and the paper can be read here.

  1. Bengtsson, J., Hartmann, M., Unterseher, M., Vaishampayan, P., Abarenkov, K., Durso, L., Bik, E.M., Garey, J.R., Eriksson, K.M., Nilsson R.H. (2012). Megraft: A software package to graft
  2. Miller, C. S., Baker, B. J., Thomas, B. C., Singer, S. W., & Banfield, J. F. (2011). EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biology, 12(5), R44. doi:10.1186/gb-2011-12-5-r44

I am extremely happy to announce that Metaxa 1.1 (first announced back in July) has finally left the beta stage, and is now designated as a feature-complete 1.1 update. We consider this update stable for production use. The 1.1 update utilizes hmmsearch instead of hmmscan for higher extraction speeds and better accuracy. This clever trick was inspired by a blog post by HMMER’s creator Sean Eddy on hmmscan vs hmmsearch (http://selab.janelia.org/people/eddys/blog/?p=424). As the speedup comes from the extraction step, the speed increase will be largest for huge data sets with only a small proportion of actual SSU sequences (typically large 454 metagenomes).
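
To make the hmmsearch/hmmscan difference a bit more concrete, here is a small Python sketch that builds the two command lines. It only illustrates how the two HMMER programs are called; it is not how Metaxa invokes them internally, and all file names are hypothetical placeholders. The "--cpu" and "--tblout" options exist in both programs.

```python
def hmmsearch_command(hmm_file, sequence_file, table_out, cpus=1):
    """hmmsearch: the profile file is the query, the sequence file is the target."""
    return ["hmmsearch", "--cpu", str(cpus), "--tblout", table_out,
            hmm_file, sequence_file]

def hmmscan_command(pressed_hmm_db, sequence_file, table_out, cpus=1):
    """hmmscan: the sequences are the queries, the profile database is the target.

    The profile database must first have been prepared with hmmpress.
    """
    return ["hmmscan", "--cpu", str(cpus), "--tblout", table_out,
            pressed_hmm_db, sequence_file]

# Hypothetical file names, for illustration only.
print(" ".join(hmmsearch_command("SSU.hmm", "reads.fasta", "search.tbl", cpus=4)))
print(" ".join(hmmscan_command("SSU.hmm", "reads.fasta", "scan.tbl", cpus=4)))
```

The positional arguments look the same, but the roles of profiles and sequences are swapped between the two programs, and that is where the speed difference for a few profiles against many reads comes from.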

What took so long, you might ask, as I promised an imminent release already in August. Well, during testing a difference in scoring was discovered. This difference did not have any implications for long sequences (> ~350 bp), but caused Metaxa to have problems on short reads (most evident on ~150 bp and shorter). Therefore, the scoring system had to be redesigned, which in turn required more extensive testing. Now, however, Metaxa 1.1 has a fine-tuned scoring system, which by default is based on scores instead of E-values, and in some instances has even better detection accuracy than the old Metaxa version. We encourage everyone to try out this new version of Metaxa (although the 1.0.2 version will remain available for download). It should be bug free, but we cannot ensure 100% compatibility in all usage scenarios. Therefore, we are happy if you report any bugs or inconsistencies to the e-mail address: metaxa (at] microbiology [dot) se.

The new version of Metaxa can be downloaded here: http://microbiology.se/software/metaxa/ Please note that the manual has not yet been updated, so use the help feature for the up-to-date options. Happy SSU detecting!

I’m working on an update to Metaxa that will bring at least double speed to the program (and even more when run on really large data sets on many cores). While there is still no real release version of this update (version 1.1), I have today posted a public “beta”, which you can use for testing purposes. Do not use this version for anything important (e.g. research) as it contains at least one known bug (and maybe even more that I haven’t discovered yet). If you are interested, I would appreciate it if you downloaded this version and e-mailed any bugs or inconsistencies you find to me (firstname.lastname[at]microbiology.se).

Note that to install this version, you first need to download and install the current version of Metaxa (1.0.2). The new version can then be used with the old version’s databases.

Download the Metaxa 1.1 beta here

I listened to a great talk by Alex Bateman (one of the guys behind Pfam and Rfam, as well as involved in HMMER development) at FEBS yesterday. In addition to talking about the problems of increasing sequence amounts, Alex also provided some reflections on co-operativity and knowledge-sharing – not only among fellow researchers, but also to a wider audience. The starting point of this discussion is Rfam, where the annotation of RNA families is entirely based on a community-driven wiki, tightly integrated with Wikipedia. This means that to make a change in the Rfam annotation, the same change is also made at the corresponding Wikipedia page for this RNA family. And what’s the use of this? Well, as Alex says, for most of the keywords in molecular biology (and I would guess in all of science), the top hit on Google will be a Wikipedia entry. If not, the Wikipedia entry will be in the top ten list of hits, if a good Wiki page exists. This means that Wikipedia is the primary source of scientific information for the general public, as well as many scientists. Wikipedia – not scientific journals.

The consequence of this is that to communicate your research subject, you should contribute to its Wikipedia page. In fact, Bateman argues, we have a responsibility as scientists to provide accurate and correct information to the public through the best sources available, which in most cases would be Wikipedia. To put this in perspective (and here I once again borrow Alex’ words), if somebody told you ten years ago that there would be one single internet site that everybody would visit to find scientific information, and where discussion and continuous improvement would be allowed, encouraged and performed, most people would have said that was too good to be true. But that’s what Wikipedia offers. It is time to get rid of the Wiki-scepticism, and start improving it.

And so, what about the future of publishing? Bateman has worked hard to form an agreement with the journal RNA Biology to integrate the publishing into the process of adding to the easily accessible public information. To have an article on a new RNA family published under the journal’s RNA families track, the family must not only be submitted to the Rfam database, but the authors must also provide a Wikipedia-formatted article, which undergoes the same peer-review process as the journal article. This ensures high-quality Wikipedia material, as well as making new scientific discoveries public.

I don’t think it is much of a stretch to guess that in the future, more journals and/or funding agencies will take on similar approaches, as researchers and decision-makers discover the importance of correct, publicly available information. The scientific world is slowly moving towards being more open, also for non-scientists. This openness is of extremely high importance in these times of climate scepticism, GMO controversy, extinction of species, and nuclear power debate. For the public to make proper decisions and send a clear message to the politicians, scientists need to be much better at communicating the current state of knowledge, or what many people prefer to call “truth”.