Metaxa2 Genome mode fixes
Yes, Saturdays are somewhat weird days for software updates, but if you’re doing weekend work anyway, why wait to push bug fixes to the community? A very minor bug-fix update to Metaxa2 was released today, bringing the software to version 2.2.3.
Two things have changed in this version, both related to the genome mode. 1) We fixed a file reading bug in the ‘genome’ mode of the software. This bug caused the last sequence in an input FASTA file not to be read unless there was a newline after it. Since the ‘genome’ mode is rarely used by most users, we suspect not a lot of users have been affected by this bug.
2) While we were at it, we changed the behavior of the ‘genome’ mode to mirror that of the ‘auto’ mode, as the strict genome mode dropped sequences shorter than 2500 bp. We considered this behavior counter-intuitive and contrary to what most users would want, so the ‘genome’ mode now behaves the same as the ‘auto’ mode and no longer drops short sequences.
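For readers curious about how a bug like the one in point 1 typically arises, here is a minimal Python sketch of a FASTA reader that returns the last record whether or not the file ends with a newline. This is only an illustration, not the actual Metaxa2 code (which is not shown in this post), and the function name `read_fasta` is made up for the example.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None and line:
            chunks.append(line.strip())
    # Flush the final record outside the loop, so it is emitted
    # regardless of whether the input ends with a newline.
    if header is not None:
        yield header, "".join(chunks)


# The last record below has no trailing newline, yet it is still returned.
text = ">seq1\nACGT\n>seq2\nGG\nCC"
records = list(read_fasta(text.splitlines(keepends=True)))
print(records)
```

The key point is the final flush after the loop: a reader that only emits a record when it sees the next header or a terminating newline will silently lose the last sequence.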
No other changes have been made in this version. The update can be found at the Metaxa2 software page.
Minor ITSx update
A new version of ITSx is released today. This minor update contains two minor bug fixes and two small new features.
The first bug was that, for large input files, ITSx wrote empty sequences to the FASTA file for no detections. This has now been fixed.
The second bug fix is a bit more fuzzy and involved some fine-tuning of how large input files are handled in ITSx to stabilise E-value and score cut-offs.
The two new features are:
- The possibility to put the temporary directory in a custom location using the
- ITSx now warns when the input file contains sequences with identical identifiers, which usually leads to sequences being dropped from the input file.
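If you want to check your own input files for identical identifiers before running anything, a small Python sketch along these lines would do. It is not how ITSx itself implements the warning, and it assumes that the identifier is the first whitespace-delimited token after the ">" sign; the function name `duplicate_identifiers` is invented for this example.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_identifiers(lines: Iterable[str]) -> List[str]:
    """Return identifiers that occur more than once, in first-seen order."""
    counts = Counter(
        line[1:].split()[0]          # assumed identifier: first token after ">"
        for line in lines
        if line.startswith(">") and line[1:].split()
    )
    return [name for name, n in counts.items() if n > 1]


fasta = [">a first copy", "ACGT", ">b", "GG", ">a second copy", "TT"]
dups = duplicate_identifiers(fasta)
for name in dups:
    print(f"Warning: identifier '{name}' occurs more than once")
```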
The new update brings ITSx to version 1.1.3. Thanks to the users who have spotted bugs and suggested new features! Happy barcoding everyone!
Minor update of Metaxa2
Today, we released a minor update to Metaxa2, bringing it to version 2.2.2. The new version includes some bug fixes related to the Metaxa2 Database Repository, as well as a new “--temp” option allowing the user to specify the location for the temporary files. No other changes have been made in this version.
The update can be found at the Metaxa2 software page.
Metaxa2 update compatible with HMMER 3.3
Exactly two years after we released the Metaxa2 database builder, here’s the first update to the software. Unfortunately, it is just a boring bug fix, but the good part is that it brings back compatibility with the new version of HMMER (3.3) released in November 2019 (as noted here). It seems like it is mainly the Database Builder that has been impacted by this incompatibility, but we recommend everyone to update.
We have tried to bug check this version as well as we can to make sure we did not break any features while introducing this new compatibility. We think that this version is bug free, but as we wanted to push this out quickly, please be more observant than usual of odd behaviour, and make sure to report any bugs!
The update can be downloaded here: https://microbiology.se/sw/Metaxa2_2.2.1.tar.gz
Major problem with Metaxa2 and HMMER 3.3
Update: There is now an updated version of Metaxa2 that addresses this problem. Find it here.
We have recently discovered that the new version of HMMER (3.3) released in November 2019 has introduced new restrictions that make it partially incompatible with Metaxa2. The most apparent problem is in the Database Builder software, which will not build profiles properly in most cases. Instead, HMMER will return an error and only some profiles will be created.
We currently do not know if this also affects the functionality of Metaxa2 itself, but we are investigating this.
For now, the solution to this problem is to use the previous version of HMMER (version 3.2.1) while we investigate further. That version can be downloaded here: http://hmmer.org/download.html
I am sorry about not discovering this earlier; it only came to our attention this week!
ITSx bug fixes
ITSx has been updated with some minor bug fixes (solving bugs that caused big problems for a small subset of users).
The first bug was that the no detections file generated in a previous run was not removed before it was written to (if it happened to have the same name in a subsequent run). This could cause weird errors where sequences which were not part of the input file were reported as not detected, and subsequently inconsistent counts for the number of missing sequences. This bug should now be fixed (although I have to admit that it is hard to test for this error in all possible scenarios).
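To illustrate the general pattern behind this kind of bug (and not the actual ITSx code), here is a small Python sketch. It assumes the output is a plain list of identifiers, and the file name and the `write_no_detections` function are hypothetical. Opening the file in write mode truncates whatever an earlier run left behind, whereas append mode would keep the stale entries.

```python
import os
import tempfile
from typing import List


def write_no_detections(path: str, identifiers: List[str]) -> None:
    """Write identifiers to path, discarding anything a previous run left there."""
    # Mode "w" truncates an existing file; mode "a" would keep stale entries.
    with open(path, "w") as handle:
        for name in identifiers:
            handle.write(name + "\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "run.no_detections.txt")  # hypothetical file name
    write_no_detections(path, ["old1", "old2"])  # first run
    write_no_detections(path, ["new1"])          # second run, same output name
    with open(path) as handle:
        leftover = handle.read().splitlines()

print(leftover)
```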
The second bug was very serious for anyone who worked with ITS sequences from Chlorophyta. The ‘-t’ option did not accept ‘G’ (the code for Chlorophyta) as an option, while it did accept ‘green algae’ or ‘chlorophyta’. The Chlorophyta profiles were also included in the default ‘all’ profiles mode, and thus this error did not manifest itself for the vast majority of users. I am sorry for the mess this must have caused for the Chlorophyta researchers using ITSx and thank the users of the software for pointing this error out.
Sorry for these bug fixes taking so long! It has been a very unusual and stressful spring and summer, and I hope to be able to be more responsive in the future. The new update brings ITSx to version 1.1.2. No other changes except the two bug fixes have been made in this version.
ITSx truncate bug fix
I just uploaded a mini update to ITSx, fixing a bug that caused the --truncate option not to be accepted by the software in ITSx 1.1. This bug fix brings the software to version 1.1.1. No other changes have been introduced in this version. Download the update here. Happy barcoding!
Minor update to the COI database of Metaxa2
A few days ago, my attention was turned to a duplicate in the COI database bundled with Metaxa2 2.2. While this duplicate sequence should not cause any troubles for Metaxa2 itself, it has created issues for people using the database itself together with, e.g., QIIME. Therefore, I have today issued a very very minor update to the Metaxa2 2.2 package as well as the entry in the Metaxa2 Database Repository, removing the duplicate sequence. I deemed that this was not significant enough to issue a new version, particularly as no code was changed and it did not cause issues for the software itself, so the version will stay at 2.2 for the time being. Happy barcoding!
Bug hunting in the Metaxa2 beta
Due to an extremely embarrassing for-loop error in the classifier of the most recent Metaxa2 beta (beta 8), which was released a few weeks ago, the classifier would often (on certain platforms and configurations) enter an endless loop and hang. I apologize for this mistake, which has been corrected in the new beta 9 released today, available from this download link. No other changes have been made since the previous version. Thanks for your patience (and thanks Kaisa Thorell for first bringing my attention to the error!)
New beta brings major Metaxa2 updates
I am very happy to announce that a first public beta version of Metaxa2 version 2.2 has been released today! This new version brings two big and a number of small improvements to the Metaxa2 software (1). The first major addition is the introduction of the Metaxa2 Database Builder, which allows the user to create custom databases for virtually any genetic barcoding region. The second addition, which is related to the first, is that the classifier has been rewritten to have a more solid mathematical foundation. I have been promising that these updates were coming “soon” for one and a half years, but finally the end-product is good enough to see some real world testing. Bear in mind though that this is still a beta version that could contain obscure bugs. Here follows a list of new features (with further elaboration on a few below):
- The Metaxa2 Database Builder
- Support for additional barcoding genes; virtually any genetic region can now be used for taxonomic classification in Metaxa2
- The Metaxa2 database repository, which can be accessed through the new metaxa2_install_database tool
- Improved classification scoring model for better clarity and sensitivity
- A bundled COI database for arthropods, showing off the capabilities of the database builder
- Support for compressed input files (gzip, zip, bzip, dsrc)
- Support for auto-detection of database locations
- Added output of probable taxonomic origin for sequences with reliability scores at each rank, made possible by the updated classifier
- Added the -x option for running only the extraction without the classification step
- Improved memory handling for very large rRNA datasets in the classifier (millions of sequences)
- This update also fixes a bug in the metaxa2_rf tool that could cause bias in very skewed datasets with small numbers of taxa
The new version of Metaxa2 can be downloaded here, and for those interested I will spend the rest of this post outlining the Metaxa2 Database Builder. The information below is also available in a slightly extended version in the software manual.
The major enhancement in Metaxa2 version 2.2 is the ability to use custom databases for classification. This means that the user can now make their own database for their own barcoding region of choice, or download additional databases from the Metaxa2 Database Repository. The selection of other databases is made through the “-g” option already existing in Metaxa2. As part of these changes, we have also updated the classification scoring model for better stringency and sensitivity across multiple databases and different genes. The old scoring system can still be used by setting the --scoring_model option to “old”.
There are two different main operating modes of the Metaxa2 Database Builder, as well as a hybrid mode combining the features of the two other modes. The divergent and conserved modes work in almost completely different ways and deal with two different types of barcoding regions. The divergent mode is designed to deal with barcoding regions that exhibit fairly large variation between taxa within the same taxonomic domain. Such regions include, e.g., the eukaryotic ITS region, or the trnL gene used for plant barcoding. In the other mode, the conserved mode, a highly conserved barcoding region is expected (at least within the different taxonomic domains). Genes that fall into this category would be, e.g., the 16S SSU rRNA, and the bacterial rpoB gene. This option would most likely also be suitable for barcoding within certain groups of, e.g., plants, where similarity of the barcoding regions can be expected to be high. There is also a third mode, the hybrid mode, that incorporates features of both of the others. The hybrid mode is more experimental in nature, but could be useful in situations where both the other modes perform worse than desired.
In the divergent (default) mode, the database builder will start by clustering the input sequences at 20% identity using USEARCH (2). All clusters generated from this process are then individually aligned using MAFFT (3). Those alignments are split into two regions, which are used to build two hidden Markov models for each cluster of sequences. These models will be less precise, but more sensitive than those generated in the conserved mode. In the divergent mode, the database builder will attempt to extract full-length sequences from the input data, but this may be less successful than in the conserved mode.
In the conserved mode, on the other hand, the database builder will first extract the barcoding region from the input sequences using models built from a reference sequence provided (see above) and the Metaxa2 extractor (1). It will then align all the extracted sequences using MAFFT and determine the conservation of each position in the alignment. When the criteria for degree of conservation are met, all conserved regions are extracted individually and are then re-aligned separately using MAFFT. The re-aligned sequences are used to build hidden Markov models representing the conserved regions with HMMER (4). In this mode, the classification database will only consist of the extracted full-length sequences.
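To give a feel for what "determining the conservation of each position" can mean, here is a minimal Python sketch. It is not the database builder's actual algorithm: the conservation measure (fraction of sequences sharing the most common non-gap residue), the threshold, and the minimum region length are all assumptions chosen for illustration, and the function names are made up.

```python
from collections import Counter
from typing import List, Tuple


def column_conservation(alignment: List[str]) -> List[float]:
    """Fraction of sequences sharing the most common non-gap residue per column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = Counter(c for c in column if c != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


def conserved_regions(
    scores: List[float], threshold: float, min_len: int
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges where scores stay >= threshold."""
    regions, start = [], None
    for i, s in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


# Toy alignment of four sequences (equal length, "-" marks a gap).
alignment = ["ACGTAC", "ACGAAC", "ACG-TC", "ACGCGC"]
scores = column_conservation(alignment)
regions = conserved_regions(scores, threshold=0.9, min_len=2)
print(scores, regions)
```

In this toy example the first three columns form one conserved region, while the single conserved column at the end is too short to count. In the real workflow, each such region would then be re-aligned and turned into a hidden Markov model.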
In the hybrid mode, finally, the database builder will cluster the input sequences at 20% identity using USEARCH, and then proceed with the conserved mode approach on each cluster separately.
The actual taxonomic classification in Metaxa2 is done using a sequence database. It was shown in the original Metaxa2 paper that replacing the built-in database with a generic non-processed one was detrimental to performance in terms of accuracy (1). In the database builder, we have tried to incorporate some of the aspects of the manual database curation we did for the built-in database that can be automated. By default, all these filtration steps are turned off, but enabling them might drastically increase the accuracy of classifications based on the database.
To assess the accuracy of the constructed database, the Metaxa2 Database Builder allows for testing the detection ability and classification accuracy of the constructed database. This is done by sub-dividing the database sequences into subsets and rebuilding the database using a smaller (by default 90%), randomly selected, set of the sequence data (5). The remaining sequences (10% by default) are then classified using Metaxa2 with the subset database. The number of detections, and the numbers of correctly or incorrectly classified entries are recorded and averaged over a number of iterations (10 by default). This allows for obtaining a picture of the lower end of the accuracy of the database. However, since the evaluation only uses a subset of all sequences included in the full database, the performance of the full database actually constructed is likely to be slightly better. The evaluation can be turned on using the “--evaluate T” option.
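The evaluation procedure can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the database builder's implementation: `build_and_classify` is a hypothetical callback standing in for "rebuild the database from the training subset and classify with Metaxa2", and the toy classifier below simply looks up the first three bases of a sequence.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[str, str]  # (sequence, true taxon label)
Classifier = Callable[[str], Optional[str]]


def evaluate(
    records: List[Record],
    build_and_classify: Callable[[List[Record]], Classifier],
    holdout: float = 0.1,
    iterations: int = 10,
    seed: int = 0,
) -> Dict[str, float]:
    """Average detections and correct/incorrect classifications over random splits."""
    rng = random.Random(seed)
    totals = {"detected": 0, "correct": 0, "incorrect": 0}
    for _ in range(iterations):
        shuffled = list(records)
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * holdout))
        test, train = shuffled[:n_test], shuffled[n_test:]
        classify = build_and_classify(train)  # database built from ~90% of the data
        for seq, truth in test:               # remaining ~10% is classified
            predicted = classify(seq)
            if predicted is None:
                continue                      # not detected at all
            totals["detected"] += 1
            totals["correct" if predicted == truth else "incorrect"] += 1
    return {key: value / iterations for key, value in totals.items()}


def toy_builder(train: List[Record]) -> Classifier:
    """Hypothetical stand-in for a database build: look up the 3-base prefix."""
    table = {seq[:3]: label for seq, label in train}
    return lambda seq: table.get(seq[:3])


records = [("AAA" + "G" * i, "taxonA") for i in range(10)]
records += [("CCC" + "G" * i, "taxonB") for i in range(10)]
result = evaluate(records, toy_builder)
print(result)
```

With 20 toy records, two sequences are held out in each of the 10 iterations, and the reported numbers are per-iteration averages, mirroring the defaults described above.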
Metaxa2 2.2 also introduces the database repository, from which the user can download additional databases for Metaxa2. To download new databases from the repository, the metaxa2_install_database command is used. This is a simple piece of software but requires internet access to function.
1. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399
2. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461 (2010).
3. Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772–780 (2013).
4. Eddy SR: Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195 (2011).
5. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628