Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery

Browsing Posts tagged Bugs

I am very happy to announce that a first public beta version of Metaxa2 version 2.2 has been released today! This new version brings two big and a number of small improvements to the Metaxa2 software (1). The first major addition is the introduction of the Metaxa2 Database Builder, which allows the user to create custom databases for virtually any genetic barcoding region. The second addition, which is related to the first, is that the classifier has been rewritten to have a more solid mathematical foundation. I have been promising that these updates were coming “soon” for one and a half years, but finally the end-product is good enough to see some real world testing. Bear in mind though that this is still a beta version that could contain obscure bugs. Here follows a list of new features (with further elaboration on a few below):

  • The Metaxa2 Database Builder
  • Support for additional barcoding genes, virtually any genetic region can now be used for taxonomic classification in Metaxa2
  • The Metaxa2 database repository, which can be accessed through the new metaxa2_install_database tool
  • Improved classification scoring model for better clarity and sensitivity
  • A bundled COI database for athropods, showing off the capabilities of the database builder
  • Support for compressed input files (gzip, zip, bzip, dsrc)
  • Support for auto-detection of database locations
  • Added output of probable taxonomic origin for sequences with reliability scores at each rank, made possible by the updated classifier
  • Added the -x option for running only the extraction without the classification step
  • Improved memory handling for very large rRNA datasets in the classifier (millions of sequences)
  • This update also fixes a bug in the metaxa2_rf tool that could cause bias in very skewed datasets with small numbers of taxa

The new version of Metaxa2 can be downloaded here, and for those interested I will spend the rest of this post outlining the Metaxa2 Database Builder. The information below is also available in a slightly extended version in the software manual.

The major enhancement in Metaxa2 version 2.2 is the ability to use custom databases for classification. This means that the user can now make their own database for their own barcoding region of choice, or download additional databases from the Metaxa2 Database Repository. The selection of other databases is made through the “-g” option already existing in Metaxa2. As part of these changes, we have also updated the classification scoring model for better stringency and sensitivity across multiple databases and different genes. The old scoring system can still be used by specifying the –scoring_model option to “old”.

There are two different main operating modes of the Metaxa2 Database Builder, as well as a hybrid mode combining the features of the two other modes. The divergent and conserved modes work in almost completely different ways and deal with two different types of barcoding regions. The divergent mode is designed to deal with barcoding regions that exhibit fairly large variation between taxa within the same taxonomic domain. Such regions include, e.g., the eukaryotic ITS region, or the trnL gene used for plant barcoding. In the other mode – the conserved mode – a highly conserved barcoding region is expected (at least within the different taxonomic domains). Genes that fall into this category would be, e.g., the 16S SSU rRNA, and the bacterial rpoB gene. This option would most likely also be suitable for barcoding within certain groups of e.g. plants, where similarity of the barcoding regions can be expected to be high. There is also a third mode – the hybrid mode – that incorporates features of both the other. The hybrid mode is more experimental in nature, but could be useful in situations where both the other modes perform poorer than desired.

In the divergent (default) mode, the database builder will start by clustering the input sequences at 20% identity using USEARCH (2). All clusters generated from this process are then individually aligned using MAFFT (3). Those alignments are split into two regions, which are used to build two hidden Markov models for each cluster of sequences. These models will be less precise, but more sensitive than those generated in the conserved mode. In the divergent mode, the database builder will attempt to extract full-length sequences from the input data, but this may be less successful than in the conserved mode.

In the conserved mode, on the other hand, the database builder will first extract the barcoding region from the input sequences using models built from a reference sequence provided (see above) and the Metaxa2 extractor (1). It will then align all the extracted sequences using MAFFT and determine the conservation of each position in the alignment. When the criteria for degree of conservation are met, all conserved regions are extracted individually and are then re-aligned separately using MAFFT. The re-aligned sequences are used to build hidden Markov models representing the conserved regions with HMMER (4). In this mode, the classification database will only consist of the extracted full-length sequences.

In the hybrid mode, finally, the database builder will cluster the input sequences at 20% identity using USEARCH, and then proceed with the conserved mode approach on each cluster separately .

The actual taxonomic classification in Metaxa2 is done using a sequence database. It was shown in the original Metaxa2 paper that replacing the built-in database with a generic non-processed one was detrimental to performance in terms of accuracy (1). In the database builder, we have tried to incorporate some of the aspects of the manual database curation we did for the built-in database that can be automated. By default, all these filtration steps are turned off, but enabling them might drastically increase the accuracy of classifications based on the database.

To assess the accuracy of the constructed database, the Metaxa2 Database Builder allows for testing the detection ability and classification accuracy of the constructed database. This is done by sub-dividing the database sequences into subsets and rebuilding the database using a smaller (by default 90%), randomly selected, set of the sequence data (5). The remaining sequences (10% by default) are then classified using Metaxa2 with the subset database. The number of detections, and the numbers of correctly or incorrectly classified entries are recorded and averaged over a number of iterations (10 by default). This allows for obtaining a picture of the lower end of the accuracy of the database. However, since the evaluation only uses a subset of all sequences included in the full database, the performance of the full database actually constructed is likely to be slightly better. The evaluation can be turned on using the “–evaluate T” option.

Metaxa2 2.2 also introduces the database repository, from which the user can download additional databases for Metaxa2. To download new databases from the repository, the metaxa2_install_database command is used. This is a simple piece of software but requires internet access to function.

References

  1. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399 [Paper link]
  2. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461 (2010).
  3. Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772–780 (2013).
  4. Eddy SR: Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195 (2011).
  5. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628

I just wanted to share an experience with the FARAO software we recently published a paper about, and its compatibility with the GD and libpng libraries (used for creating PNG files). I have got questions from users about how to get this to work, and to test it out I decided to try to install it on my Mac. It turned out that it is nearly impossible to get this to work. These two packages are extremely picky with versions and dependencies. After trying for about on hour, I gave up and turned to my Linux machine. Surprisingly, I could not get it to work from scratch there either, despite that I have had it running (with some previous version combination) when we programmed and tested FARAO.

I find this extremely annoying myself, and I will try to look into other solutions for PNG or JPEG output from FARAO. In the mean time, I can only recommend to instead use the EPS output option, which produces more nice-looking figures and is considerably easier to set up. I am sorry about this and hope to be able to provide a better solution soon.

Metaxa2 has been updated again today to version 2.1.3. This update adds a few features to the Metaxa2 Diversity Tools (metaxa2_uc and metaxa2_rf). The core Metaxa2 programs remain the same as for the previous Metaxa2 versions. The new features were suggested as part of the review process of a Metaxa2-related manuscript, and we thank the anonymous reviewers for their great suggestions!

New features and bug fixes in this update:

  • Added the Chao1, iChao1 and ACE estimators in addition to the original species abundance (“Bengtsson-Palme”) model in metaxa2_rf
  • Added the Raup-Crick dissimilarity method to the metaxa2_uc tool
  • Added a warning message when data is highly skewed for metaxa2_uc
  • Improved robustness of the ‘model’ mode of metaxa2_uc for highly skewed sample groups
  • Fixed a bug causing miscalculation of Euclidean distances on binary data in metaxa2_uc

The updated version of Metaxa2 can be downloaded here.

Happy barcoding!

TriMetAss has today been updated to version 1.2. The new version addresses a number of minor issues, some of which I thought was fixed with the previous version. The update can be found here.

The main problem with the previous version of TriMetAss was that the Trinity developers had changed many options in the Trinity software, which rendered more recent versions of Trinity incompatible with TriMetAss. TriMetAss was not the only external software using Trinity that was affected by these changes. As far as my testing goes, these incompatibilities should now be fixed, by improved Trinity version determination in TriMetAss. This is still not a guarantee for future changes though, so just to make sure, use one of the Trinity versions tested with TriMetAss (versions v2.1.1 or trinityrnaseq_r2013_08_14).

This time I would like to thank Artemis Louyakis at the Univesity of Florida and Tatsuya Unno at the Jeju National University (Korea) for their input on TriMetAss.

I have today uploaded an updated version of Metaxa2 (version 2.1.2). This update primarily improves the memory performance of the Metaxa2 Diversity Tools. The core Metaxa2 programs remain the same as for the previous Metaxa2 versions.

New features and bug fixes in this update:

  • Dramatically improved memory performance of metaxa2_uc
  • Added the 'min' option to the -s flag in metaxa2_uc, which will cause the program to sample the number of entries present in the smallest sample from each sample
  • Fixes a bug that disregarded the level specified by the -l option in metaxa2_si
  • Minor updates and improvements on the manual

The updated version of Metaxa2 can be downloaded here.
Happy barcoding!

Today I have released Metaxa2 version 2.1.1, containing a fix to an embarrassing bug in the new metaxa2_uc program (part of the Metaxa2 Diversity Tools). A late change of the names of the different modes of that tool had not propagated to all parts of the code, and therefore only the “model” mode was functional in the previous version. No other changes to the Metaxa2 package has been made in this update, which can be downloaded here.

Metaxa2 has been updated to version 2.0.2 and can be downloaded from the Metaxa2 web site. The 2.0.2 update fixes two minor bugs; one causing the “.graph” file to display incorrect or no names for the regions of the LSU regions, and one causing misreporting of the number of sequences in single-end FASTQ files (paired-end files were reported correctly). The update also brings a slightly improved classifier. Thanks to Marco Severgnini for reporting the FASTQ file issue! The update is available here.

A minor bug in the “its1.full_and_partial.fasta” file has been fixed in a minor update to ITSx (1.0.11) released to day. The bug occasionally caused newline characters at the end of a sequence to be skipped and the next entry to begin at the same row. The bug only manifested itself when ITSx was used with the --partial option and only in the above mentioned FASTA file. If you have been affected by the bug, you should have noticed as the resulting FASTA file would be considered corrupted by most bioinformatics software. The updated version of ITSx can be downloaded here.

After a long delay-time in testing ITSx version 1.0.10 has been made public. The new version patches a bug causing the 3′ anchor not being properly written to file when using the “--anchor hmm” option. If a number was used for the “--anchor” option, this bug did not apply. Thus, if you have not been using the “--anchor” option together with “hmm”, you have not been affected in any way by this bug. Nevertheless, I encourage updating in case you would use the “--anchor hmm” option in the future. The update can be downloaded here. Happy barcoding!

ITSx has today been updated, bringing it to version 1.0.8. This update adds the “--only_full” option, which restricts output in the ITS1, 5.8S and ITS2 files to only the files that contain the full region, i.e. that both surrounding domains have been detected. The update also fixes a bug with the --anchor option, and can be downloaded here. Happy barcoding!