I am very happy to announce that a first public beta version of Metaxa2 version 2.2 has been released today! This new version brings two big and a number of small improvements to the Metaxa2 software (1). The first major addition is the introduction of the Metaxa2 Database Builder, which allows the user to create custom databases for virtually any genetic barcoding region. The second addition, which is related to the first, is that the classifier has been rewritten to have a more solid mathematical foundation. I have been promising that these updates were coming “soon” for one and a half years, but finally the end-product is good enough to see some real world testing. Bear in mind though that this is still a beta version that could contain obscure bugs. Here follows a list of new features (with further elaboration on a few below):
- The Metaxa2 Database Builder
- Support for additional barcoding genes: virtually any genetic region can now be used for taxonomic classification in Metaxa2
- The Metaxa2 database repository, which can be accessed through the new metaxa2_install_database tool
- Improved classification scoring model for better clarity and sensitivity
- A bundled COI database for arthropods, showing off the capabilities of the database builder
- Support for compressed input files (gzip, zip, bzip, dsrc)
- Support for auto-detection of database locations
- Added output of probable taxonomic origin for sequences with reliability scores at each rank, made possible by the updated classifier
- Added the -x option for running only the extraction without the classification step
- Improved memory handling for very large rRNA datasets in the classifier (millions of sequences)
- This update also fixes a bug in the metaxa2_rf tool that could cause bias in very skewed datasets with small numbers of taxa
The new version of Metaxa2 can be downloaded here, and for those interested I will spend the rest of this post outlining the Metaxa2 Database Builder. The information below is also available in a slightly extended version in the software manual.
The major enhancement in Metaxa2 version 2.2 is the ability to use custom databases for classification. This means that the user can now make their own database for their own barcoding region of choice, or download additional databases from the Metaxa2 Database Repository. The selection of other databases is made through the “-g” option already existing in Metaxa2. As part of these changes, we have also updated the classification scoring model for better stringency and sensitivity across multiple databases and different genes. The old scoring system can still be used by setting the --scoring_model option to “old”.
There are two different main operating modes of the Metaxa2 Database Builder, as well as a hybrid mode combining the features of the two other modes. The divergent and conserved modes work in almost completely different ways and deal with two different types of barcoding regions. The divergent mode is designed to deal with barcoding regions that exhibit fairly large variation between taxa within the same taxonomic domain. Such regions include, e.g., the eukaryotic ITS region, or the trnL gene used for plant barcoding. In the other mode – the conserved mode – a highly conserved barcoding region is expected (at least within the different taxonomic domains). Genes that fall into this category would be, e.g., the 16S SSU rRNA, and the bacterial rpoB gene. This option would most likely also be suitable for barcoding within certain groups of e.g. plants, where similarity of the barcoding regions can be expected to be high. There is also a third mode – the hybrid mode – that incorporates features of both the others. The hybrid mode is more experimental in nature, but could be useful in situations where both the other modes perform worse than desired.
In the divergent (default) mode, the database builder will start by clustering the input sequences at 20% identity using USEARCH (2). All clusters generated from this process are then individually aligned using MAFFT (3). Those alignments are split into two regions, which are used to build two hidden Markov models for each cluster of sequences. These models will be less precise, but more sensitive than those generated in the conserved mode. In the divergent mode, the database builder will attempt to extract full-length sequences from the input data, but this may be less successful than in the conserved mode.
In the conserved mode, on the other hand, the database builder will first extract the barcoding region from the input sequences using models built from a reference sequence provided (see above) and the Metaxa2 extractor (1). It will then align all the extracted sequences using MAFFT and determine the conservation of each position in the alignment. When the criteria for degree of conservation are met, all conserved regions are extracted individually and are then re-aligned separately using MAFFT. The re-aligned sequences are used to build hidden Markov models representing the conserved regions with HMMER (4). In this mode, the classification database will only consist of the extracted full-length sequences.
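The step of scoring conservation per alignment position and picking out conserved regions can be illustrated with a minimal sketch. This is not the database builder's actual code: the function names, the scoring rule (share of sequences carrying the most common residue in a column, with gaps counting against conservation), and the `threshold` and `min_length` values are all assumptions made for illustration.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column as the fraction of sequences that carry
    the most common non-gap residue (hypothetical scoring rule).

    `alignment` is a list of equal-length, gap-padded strings.
    """
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if residues:
            top_count = Counter(residues).most_common(1)[0][1]
        else:
            top_count = 0  # a column made up of gaps only
        scores.append(top_count / n_sequences)
    return scores


def conserved_regions(scores, threshold=0.8, min_length=3):
    """Return half-open (start, end) intervals where every column scores
    at least `threshold` and the run is at least `min_length` columns long.
    Both cut-offs are illustrative values, not Metaxa2 defaults.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions
```

Each interval returned by `conserved_regions` corresponds to one block of columns that would then be cut out and re-aligned on its own before a profile is built from it.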
In the hybrid mode, finally, the database builder will cluster the input sequences at 20% identity using USEARCH, and then proceed with the conserved mode approach on each cluster separately.
The actual taxonomic classification in Metaxa2 is done using a sequence database. It was shown in the original Metaxa2 paper that replacing the built-in database with a generic non-processed one was detrimental to performance in terms of accuracy (1). In the database builder, we have tried to incorporate some of the aspects of the manual database curation we did for the built-in database that can be automated. By default, all these filtration steps are turned off, but enabling them might drastically increase the accuracy of classifications based on the database.
To assess the accuracy of the constructed database, the Metaxa2 Database Builder allows for testing the detection ability and classification accuracy of the constructed database. This is done by sub-dividing the database sequences into subsets and rebuilding the database using a smaller (by default 90%), randomly selected, set of the sequence data (5). The remaining sequences (10% by default) are then classified using Metaxa2 with the subset database. The number of detections, and the numbers of correctly or incorrectly classified entries are recorded and averaged over a number of iterations (10 by default). This allows for obtaining a picture of the lower end of the accuracy of the database. However, since the evaluation only uses a subset of all sequences included in the full database, the performance of the full database actually constructed is likely to be slightly better. The evaluation can be turned on using the “--evaluate T” option.
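The evaluation scheme described above is a repeated random hold-out, and its logic can be sketched in a few lines. The sketch below is not how the database builder is implemented: `build` and `classify` are hypothetical callables standing in for the database construction and the Metaxa2 classification, and the toy versions at the end exist only to make the example runnable.

```python
import random


def evaluate_database(records, build, classify,
                      train_fraction=0.9, iterations=10, seed=0):
    """Repeated hold-out evaluation.

    records  -- list of (sequence, taxon) pairs
    build    -- hypothetical callable: training records -> database
    classify -- hypothetical callable: (database, sequence) -> taxon or None
    Returns the per-iteration averages of detected, correct and incorrect.
    """
    rng = random.Random(seed)
    totals = {"detected": 0, "correct": 0, "incorrect": 0}
    for _ in range(iterations):
        shuffled = records[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train, held_out = shuffled[:cut], shuffled[cut:]
        database = build(train)  # rebuild from the 90% subset
        for sequence, true_taxon in held_out:
            predicted = classify(database, sequence)
            if predicted is None:
                continue  # not detected at all
            totals["detected"] += 1
            if predicted == true_taxon:
                totals["correct"] += 1
            else:
                totals["incorrect"] += 1
    return {key: value / iterations for key, value in totals.items()}


# Toy stand-ins (assumptions for illustration only): the "database" maps the
# first three bases of each training sequence to its taxon.
def toy_build(train):
    return {sequence[:3]: taxon for sequence, taxon in train}


def toy_classify(database, sequence):
    return database.get(sequence[:3])
```

Because every iteration classifies sequences that were left out of the rebuilt database, the averages reflect the lower end of what the full database can be expected to achieve.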
Metaxa2 2.2 also introduces the database repository, from which the user can download additional databases for Metaxa2. To download new databases from the repository, the metaxa2_install_database command is used. This is a simple piece of software but requires internet access to function.
- Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461 (2010).
- Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772–780 (2013).
- Eddy SR: Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195 (2011).
- Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628
Today, Microbiome put online a paper lead-authored by my colleague Fanny Berglund – one of Erik Kristiansson’s brilliant PhD students – in which we identify 76 novel metallo-β-lactamases (1). This feat was made possible because of a new computational method designed by Fanny, which uses a hidden Markov model based on known B1 metallo-β-lactamases. We analyzed over 10,000 bacterial genomes and plasmids and over 5 terabases of metagenomic data and could thereby predict 76 novel genes. These genes clustered into 59 new families of metallo-β-lactamases (given a 70% identity threshold). We also verified the functionality of 21 of these genes experimentally, and found that 18 were able to hydrolyze imipenem when inserted into Escherichia coli. Two of the novel genes contained atypical zinc-binding motifs in their active sites. Finally, we show that the B1 metallo-β-lactamases can be divided into five major groups based on their phylogenetic origin. It seems that nearly all of the previously characterized mobile B1 β-lactamases we identify in this study were likely to have originated from chromosomal genes present in species within the Proteobacteria, particularly Shewanella spp.
This study more than doubles the number of known B1 metallo-β-lactamases. As with the study by Boulund et al. (2) which we published last month on computational discovery of novel fluoroquinolone resistance genes (which used a very similar approach but on a completely different type of genes), this study also supports the hypothesis that environmental bacterial communities act as sources of uncharacterized antibiotic resistance genes (3-7). Fanny has done a fantastic job on this paper, and I highly recommend reading it in its entirety (it’s open access so you have virtually no excuse not to). It can be found here.
- Berglund F, Marathe NP, Österlund T, Bengtsson-Palme J, Kotsakis S, Flach C-F, Larsson DGJ, Kristiansson E: Identification of 76 novel B1 metallo-β-lactamases through large-scale screening of genomic and metagenomic data. Microbiome, 5, 134 (2017). doi: 10.1186/s40168-017-0353-8
- Boulund F, Berglund F, Flach C-F, Bengtsson-Palme J, Marathe NP, Larsson DGJ, Kristiansson E: Computational discovery and functional validation of novel fluoroquinolone resistance genes in public metagenomic data sets. BMC Genomics, 18, 682 (2017). doi: 10.1186/s12864-017-4064-0
- Bengtsson-Palme J, Larsson DGJ: Antibiotic resistance genes in the environment: prioritizing risks. Nature Reviews Microbiology, 13, 369 (2015). doi: 10.1038/nrmicro3399-c1
- Allen HK, Donato J, Wang HH et al.: Call of the wild: antibiotic resistance genes in natural environments. Nature Reviews Microbiology, 8, 251–259 (2010).
- Berendonk TU, Manaia CM, Merlin C et al.: Tackling antibiotic resistance: the environmental framework. Nature Reviews Microbiology, 13, 310–317 (2015).
- Martinez JL: Bottlenecks in the transferability of antibiotic resistance from natural ecosystems to human bacterial pathogens. Frontiers in Microbiology, 2, 265 (2011).
- Finley RL, Collignon P, Larsson DGJ et al.: The scourge of antibiotic resistance: the important role of the environment. Clinical Infectious Diseases, 57, 704–710 (2013).
Today, I am very happy to announce that after years in the making and months in testing, the next generation of ITSx, version 1.1, is ready to step into the public light and scrutiny. I have today uploaded a public beta version of the ITSx 1.1 release, which I encourage everyone who has enjoyed using ITSx to try out.
The 1.1 release of ITSx includes a wide range of new features, including:
- A 2-10x performance increase (depending on the dataset), since ITSx now utilizes hmmsearch instead of hmmscan to detect the ITS regions and distributes the CPU cores better
- Improved ITS detection among fungi and chlorophyta, by addition of new HMM-profiles
- The HMM profile format for ITSx has been updated to HMMER3/f (thus ITSx now requires HMMER version 3.1 or later)
- Better handling of interrupted HMMER searches
- Added the --require_anchor option to only include sequences where the complete anchor is found in the output
- Added the possibility for partial sequence output for the SSU, LSU and 5.8S regions
- Fixed a bug causing problems when reading sequence data from standard input
A lot of the code has changed in this version, which means that there might still be bugs lingering in the program. Since I will be on vacation throughout July, I encourage everyone to submit bug reports and questions, but I cannot promise to respond to them until August.
I hope that you will enjoy this new ITSx release, which you can download here. Happy barcoding!
An ITSx user yesterday made me aware of an information problem (thanks Suzanne!) regarding the use of ITSx in combination with the HMMER 3.1 beta. I have not been entirely clear on why you might get the “Error: bad format, binary auxfiles, (…) binary auxfiles are in an outdated HMMER format (3/b); please hmmpress your HMM file again” error message when running ITSx with HMMER 3.1 installed. You might think that following the instructions for Metaxa might do the trick. As you will notice, however, it will not. Instead you will be presented with the following error message: “Error: Failed to open binary auxfiles”. This is because while Metaxa 1.1.2 will re-create the HMM-files if needed, ITSx does not. Instead, ITSx has the option "--reset T" which can be added to the command line to recreate the HMM-files for the current HMMER version installed (regardless of which 3.x version).
Thus, the solution for the “bad format, binary auxfiles” error is to simply add "--reset T" (without quotes) to the ITSx command line and run the software again. You only need to do this once, unless you update HMMER and/or get the same error message again for some other reason. The Metaxa-post has been updated to clarify this as well.
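For those who call ITSx from a script, the workaround can be automated along the following lines. This is a rough sketch under several assumptions: that the executable is on the path as `ITSx`, that input and output are passed with `-i` and `-o`, and that the HMMER error text ends up in the captured standard output or standard error. The helper names are made up for this example.

```python
import subprocess

# Both error messages quoted above contain this phrase.
AUXFILE_MARKER = "binary auxfiles"


def build_itsx_command(input_fasta, output_prefix, reset=False):
    """Assemble the ITSx command line, optionally adding '--reset T'."""
    command = ["ITSx", "-i", input_fasta, "-o", output_prefix]
    if reset:
        command += ["--reset", "T"]
    return command


def needs_reset(program_output):
    """True if the captured output mentions the binary auxfiles problem."""
    return AUXFILE_MARKER in program_output


def run_itsx(input_fasta, output_prefix):
    """Run ITSx once, and re-run it with '--reset T' if HMMER complains
    about the binary auxfiles. Returns the last CompletedProcess."""
    result = subprocess.run(build_itsx_command(input_fasta, output_prefix),
                            capture_output=True, text=True)
    if needs_reset(result.stdout + result.stderr):
        result = subprocess.run(
            build_itsx_command(input_fasta, output_prefix, reset=True),
            capture_output=True, text=True)
    return result
```

Since the reset only has to happen once per HMMER update, the second run is triggered only when the error text is actually seen.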
An ITSx user informed me a couple of days ago of an issue that caused ITSx to sometimes accidentally remove the HMM-files in the database when multiple ITSx jobs were run in parallel. Although this issue should be relatively rare, it was also very easy to fix. Therefore, we are already pushing out a new version of ITSx (1.0.3), which is available for download here.
In short, the bug was introduced because I overlooked this usage scenario when fixing another bug related to the HMM-files in an earlier pre-release. Let’s keep our fingers crossed that version 1.0.3 will be more long-lived than 1.0.2!
First of all, ITSx has now been taken out of beta and is considered ready for production use. We no longer find any bugs in it, and since there’s now a wide range of people already using it for various purposes, we feel confident that any significant bugs would have been uncovered by now.
Secondly, I have also added support for the new HMMER version (3.1b) released in May in this version of ITSx. So you can now go ahead and install HMMER 3.1 if you want to try out the new HMMER beta and still be able to use ITSx.
Finally, we have also updated the manual somewhat, hopefully making it a little easier to use ITSx for a first-time user.
Version 1.0.2 of ITSx can be downloaded from here. As previously, you may still report any bugs, strange behaviors, ideas for new features, or inconsistencies with certain lineages, by mailing to “itsx” at this domain name.
As you might be aware, a new version of HMMER is out since late May. You might wonder how Metaxa (relying on HMMER3) will work if you update to the new version of HMMER, and I have finally got around to testing it! The answer, according to my somewhat limited testing, is that Metaxa 1.1.2 seems to be working fine with HMMER 3.1.
You might need to go into the database directory (“metaxa_db”; should be located in the same directory as the Metaxa binaries), and remove all the files ending with suffixes .h3f .h3i .h3m and .h3p inside the “HMMs” directory. On most installations, this should not be necessary. Myself, I just plugged HMMER 3.1 in and started Metaxa, but if you get error messages complaining that “Error: bad format, binary auxfiles, binary auxfiles are in an outdated HMMER format (3/b); please hmmpress your HMM file again”, then you should try removing the files and re-running Metaxa. This might especially be a problem on older Metaxa versions. [Update: Note that this fix will likely not work with ITSx!]
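If you would rather not delete the files by hand, the clean-up can be scripted. The following is a small sketch, not part of Metaxa: the function names are made up, and it assumes the layout described above, with an “HMMs” directory directly inside the database directory. It defaults to a dry run that only lists the files.

```python
from pathlib import Path

# Binary auxiliary files created by hmmpress.
PRESSED_SUFFIXES = {".h3f", ".h3i", ".h3m", ".h3p"}


def find_pressed_hmm_files(database_dir):
    """List the pressed HMM files inside <database_dir>/HMMs."""
    hmm_dir = Path(database_dir) / "HMMs"
    return sorted(path for path in hmm_dir.iterdir()
                  if path.is_file() and path.suffix in PRESSED_SUFFIXES)


def remove_pressed_hmm_files(database_dir, dry_run=True):
    """Delete the pressed HMM files and return the affected paths.
    With dry_run=True (the default) nothing is deleted."""
    files = find_pressed_hmm_files(database_dir)
    if not dry_run:
        for path in files:
            path.unlink()
    return files
```

Calling `remove_pressed_hmm_files("metaxa_db")` first shows what would be removed, and repeating the call with `dry_run=False` performs the deletion. The plain HMM profile files are left untouched, since only the four binary suffixes are matched.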
Bear in mind that I have not run thorough testing on Metaxa and HMMER 3.1, and probably won’t for the 1.1.2 version, since there’s a 2.0 version waiting just around the corner…
Additionally, if you experience problems with Megraft, you should try the same fix as for Metaxa, but with the Megraft database directory instead. Regarding ITSx, a minor update will be released very soon, which also will address HMMER 3.1b compatibility. [Update: See this post for how to work around HMMER 3.1 problems with ITSx.]
Happy barcoding everyone!
For a couple of years, I have been working with microbial ecology and diversity, and how such features can be assessed using molecular barcodes, such as the SSU (16S/18S) rRNA sequence (the Metaxa and Megraft packages). However, I have also been aiming at the ITS region, and how that can be used in barcoding (see e.g. the guidelines we published last year). It is therefore a great pleasure to introduce my next gem for community analysis; a software tool for detection and extraction of the ITS1 and ITS2 regions of ITS sequences from environmental communities. The tool is dubbed ITSx, and supersedes the more specific fungal ITS extractor written by Henrik Nilsson and colleagues. Henrik is once more the mastermind behind this completely rewritten version, in which I have done the lion’s share of the programming. Among the new features in ITSx are:
- Robust support for the Cantharellus, Craterellus, and Tulasnella genera of fungi
- Support for nineteen additional eukaryotic groups on top of the already present support for fungi (specifically these groups: Tracheophyta (vascular plants), Bryophyta (bryophytes), Marchantiophyta (liverworts), Chlorophyta (green algae), Rhodophyta (red algae), Phaeophyceae (brown algae), Metazoa (metazoans), Oomycota (oomycetes), Alveolata (alveolates), Amoebozoa (amoebozoans), Euglenozoa, Rhizaria, Bacillariophyta (diatoms), Eustigmatophyceae (eustigmatophytes), Raphidophyceae (raphidophytes), Synurophyceae (synurids), Haptophyceae (haptophytes), Apusozoa, and Parabasalia (parabasalids))
- Multi-processor support
- Extensive output options
- Virtually zero false-positive extractions
ITSx is today moved from a private pre-release state to a public beta state. No code changes have been made since February, indicating that the last pre-release candidate is now ready to fly on its own. As far as our testing has revealed, this version seems to be bug free. In reality though, researchers tend to find the most unexpected usage scenarios. So please, if you find any unexpected behavior in this version of ITSx, send me an e-mail and make us aware of the potential shortcomings of our software.
We expect this open-source software to boost research in microbial ecology based on barcoding of the ITS region, and hope that the research community will evaluate its performance also among the eukaryote groups that we have less experience with.
Those attending the Metagenomics lab (part of the basic NGS course for PhD students given at GU this week), can find the material for the lab on this page:
Of course, the page is open for anyone else as well, although you won’t get the support that the GU students are given.
Yesterday, our paper on Megraft – a software tool to graft ribosomal small subunit (16S/18S) fragments onto full-length SSU sequences – became available as an accepted online early article in Research in Microbiology. Megraft is built upon the notion that when examining the depth of a community sequencing effort, researchers often use rarefaction analysis of the ribosomal small subunit (SSU/16S/18S) gene in a metagenome. However, the SSU sequences in metagenomic libraries generally are present as fragmentary, non-overlapping entries, which poses a great problem for this analysis. Megraft aims to remedy this problem by grafting the input SSU fragments from the metagenome (obtained by e.g. Metaxa) onto full-length SSU sequences. The software also uses a variability model which accounts for observed and unobserved variability. This way, Megraft enables accurate assessment of species richness and sequencing depth in metagenomic datasets.
The algorithm, efficiency and accuracy of Megraft are thoroughly described in the paper. It should be noted that this is not a panacea for species richness estimates in metagenomics, but it is a huge step forward over existing approaches. Megraft shares some similarities with EMIRGE (Miller et al., 2011), which is a software package for reconstruction of full-length ribosomal genes from paired-end Illumina sequences. Megraft, however, is set apart in that it has a strong focus on rarefaction, and functions also when the number of sequences is small, which is often the case in 454 and Sanger-based metagenomics studies. Thus, EMIRGE and Megraft seek to solve a roughly similar problem, but for different sequencing technologies and sequencing scales.
Bengtsson, J., Hartmann, M., Unterseher, M., Vaishampayan, P., Abarenkov, K., Durso, L., Bik, E.M., Garey, J.R., Eriksson, K.M., Nilsson R.H. (2012). Megraft: A software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes and similar environmental datasets. Research in Microbiology, doi: 10.1016/j.resmic.2012.07.001.
- Miller, C. S., Baker, B. J., Thomas, B. C., Singer, S. W., & Banfield, J. F. (2011). EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biology, 12(5), R44. doi:10.1186/gb-2011-12-5-r44