Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery

I am very happy to announce that a first public beta version of Metaxa2 version 2.2 has been released today! This new version brings two big and a number of small improvements to the Metaxa2 software (1). The first major addition is the introduction of the Metaxa2 Database Builder, which allows the user to create custom databases for virtually any genetic barcoding region. The second addition, which is related to the first, is that the classifier has been rewritten to have a more solid mathematical foundation. I have been promising that these updates were coming “soon” for one and a half years, but finally the end product is good enough to see some real-world testing. Bear in mind, though, that this is still a beta version that could contain obscure bugs. Here follows a list of new features (with further elaboration on a few below):

  • The Metaxa2 Database Builder
  • Support for additional barcoding genes; virtually any genetic region can now be used for taxonomic classification in Metaxa2
  • The Metaxa2 database repository, which can be accessed through the new metaxa2_install_database tool
  • Improved classification scoring model for better clarity and sensitivity
  • A bundled COI database for arthropods, showing off the capabilities of the database builder
  • Support for compressed input files (gzip, zip, bzip, dsrc)
  • Support for auto-detection of database locations
  • Added output of probable taxonomic origin for sequences with reliability scores at each rank, made possible by the updated classifier
  • Added the -x option for running only the extraction without the classification step
  • Improved memory handling for very large rRNA datasets in the classifier (millions of sequences)
  • This update also fixes a bug in the metaxa2_rf tool that could cause bias in very skewed datasets with small numbers of taxa

The new version of Metaxa2 can be downloaded here, and for those interested I will spend the rest of this post outlining the Metaxa2 Database Builder. The information below is also available in a slightly extended version in the software manual.

The major enhancement in Metaxa2 version 2.2 is the ability to use custom databases for classification. This means that the user can now make their own database for their own barcoding region of choice, or download additional databases from the Metaxa2 Database Repository. Other databases are selected through the “-g” option already existing in Metaxa2. As part of these changes, we have also updated the classification scoring model for better stringency and sensitivity across multiple databases and different genes. The old scoring system can still be used by setting the --scoring_model option to “old”.

There are two main operating modes of the Metaxa2 Database Builder, as well as a hybrid mode combining features of the two. The divergent and conserved modes work in almost completely different ways and deal with two different types of barcoding regions. The divergent mode is designed to deal with barcoding regions that exhibit fairly large variation between taxa within the same taxonomic domain. Such regions include, e.g., the eukaryotic ITS region, or the trnL gene used for plant barcoding. In the other mode – the conserved mode – a highly conserved barcoding region is expected (at least within the different taxonomic domains). Genes that fall into this category would be, e.g., the 16S SSU rRNA and the bacterial rpoB gene. This option would most likely also be suitable for barcoding within certain groups of, e.g., plants, where similarity of the barcoding regions can be expected to be high. There is also a third mode – the hybrid mode – that incorporates features of both the others. The hybrid mode is more experimental in nature, but could be useful in situations where both the other modes perform worse than desired.

In the divergent (default) mode, the database builder will start by clustering the input sequences at 20% identity using USEARCH (2). All clusters generated from this process are then individually aligned using MAFFT (3). Those alignments are split into two regions, which are used to build two hidden Markov models for each cluster of sequences. These models will be less precise, but more sensitive than those generated in the conserved mode. In the divergent mode, the database builder will attempt to extract full-length sequences from the input data, but this may be less successful than in the conserved mode.

In the conserved mode, on the other hand, the database builder will first extract the barcoding region from the input sequences using models built from a reference sequence provided (see above) and the Metaxa2 extractor (1). It will then align all the extracted sequences using MAFFT and determine the conservation of each position in the alignment. When the criteria for degree of conservation are met, all conserved regions are extracted individually and are then re-aligned separately using MAFFT. The re-aligned sequences are used to build hidden Markov models representing the conserved regions with HMMER (4). In this mode, the classification database will only consist of the extracted full-length sequences.
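To make the idea of "determining the conservation of each position" more concrete, here is a minimal Python sketch. It is not the Database Builder's actual code: the conservation measure (fraction of sequences sharing the most common residue in a column), the 0.8 threshold, the minimum region length and both function names are assumptions chosen for illustration only.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, for each alignment column, the fraction of sequences
    that carry the most common non-gap character in that column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        column = [seq[i].upper() for seq in alignment]
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Dividing by the full column height makes gaps count against conservation.
        scores.append(top_count / len(column))
    return scores


def conserved_regions(scores, threshold=0.8, min_length=3):
    """Return half-open (start, end) intervals of consecutive columns
    scoring at or above the (assumed) threshold."""
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                regions.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_length:
        regions.append((start, len(scores)))
    return regions


# Toy alignment, invented for this example.
toy_alignment = [
    "ACGTACGT-A",
    "ACGTTCGTAA",
    "ACGTGCGTCA",
    "ACGTCCGT-A",
]
toy_scores = column_conservation(toy_alignment)
toy_regions = conserved_regions(toy_scores)
print(toy_regions)
```

In a real setting each conserved interval would then be cut out of the alignment and re-aligned separately, as described above; the sketch stops at locating the intervals.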

In the hybrid mode, finally, the database builder will cluster the input sequences at 20% identity using USEARCH, and then proceed with the conserved mode approach on each cluster separately.

The actual taxonomic classification in Metaxa2 is done using a sequence database. It was shown in the original Metaxa2 paper that replacing the built-in database with a generic non-processed one was detrimental to performance in terms of accuracy (1). In the database builder, we have tried to incorporate some of the aspects of the manual database curation we did for the built-in database that can be automated. By default, all these filtration steps are turned off, but enabling them might drastically increase the accuracy of classifications based on the database.

To assess the accuracy of the constructed database, the Metaxa2 Database Builder allows for testing its detection ability and classification accuracy. This is done by sub-dividing the database sequences into subsets and rebuilding the database using a smaller (by default 90%), randomly selected, set of the sequence data (5). The remaining sequences (10% by default) are then classified using Metaxa2 with the subset database. The number of detections, and the numbers of correctly or incorrectly classified entries, are recorded and averaged over a number of iterations (10 by default). This gives a picture of the lower end of the accuracy of the database. However, since the evaluation only uses a subset of all sequences included in the full database, the performance of the full database actually constructed is likely to be slightly better. The evaluation can be turned on using the “--evaluate T” option.
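The evaluation procedure can be sketched in a few lines of Python. This is only an illustration of the repeated 90/10 hold-out idea, not the builder's implementation: the `build` and `classify` callbacks, the record format and the toy data are all hypothetical stand-ins for the real database construction and Metaxa2 classification steps.

```python
import random


def evaluate_database(records, build, classify, test_fraction=0.1,
                      iterations=10, seed=0):
    """records: list of (sequence_id, true_taxon) pairs.
    build(training_records) -> database object (hypothetical callback).
    classify(database, sequence_id) -> predicted taxon, or None if undetected.
    Returns the mean number of detected, correct and incorrect test
    sequences per iteration."""
    rng = random.Random(seed)
    totals = {"detected": 0, "correct": 0, "incorrect": 0}
    n_test = max(1, round(len(records) * test_fraction))
    for _ in range(iterations):
        shuffled = records[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        database = build(train)  # rebuild from the 90% subset
        for seq_id, true_taxon in test:  # classify the held-out 10%
            predicted = classify(database, seq_id)
            if predicted is None:
                continue
            totals["detected"] += 1
            if predicted == true_taxon:
                totals["correct"] += 1
            else:
                totals["incorrect"] += 1
    return {key: value / iterations for key, value in totals.items()}


# Toy stand-ins: the "database" is just the set of taxa seen in training,
# and the "classifier" reads the taxon from the sequence identifier.
def toy_build(training_records):
    return {taxon for _, taxon in training_records}


def toy_classify(database, seq_id):
    taxon = seq_id.split("_")[0]
    return taxon if taxon in database else None


toy_records = [(f"{taxon}_{i}", taxon)
               for taxon in ("Apis", "Bombus") for i in range(10)]
summary = evaluate_database(toy_records, toy_build, toy_classify)
print(summary)
```

Because every iteration trains on fewer sequences than the full database contains, the averaged numbers are best read as a conservative estimate, in line with the caveat above.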

Metaxa2 2.2 also introduces the database repository, from which the user can download additional databases for Metaxa2. To download new databases from the repository, the metaxa2_install_database command is used. This is a simple piece of software but requires internet access to function.

References

  1. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399 [Paper link]
  2. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461 (2010).
  3. Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772–780 (2013).
  4. Eddy SR: Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195 (2011).
  5. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628

ITSx in Bioconda

Mattias de Hollander at the Netherlands Institute of Ecology kindly informed me that they recently added the ITSx 1.1b version to the Bioconda package manager. This will make it easy for Conda users to install ITSx automatically into their systems and pipelines. The Bioconda version can be found here. I would like to thank Mattias for this initiative and hope that the Bioconda version of ITSx will find much use!

Today, I am very happy to announce that after years in the making and months in testing, the next generation of ITSx, version 1.1, is ready to step into the public light and scrutiny. I have today uploaded a public beta version of the ITSx 1.1 release, which I encourage everyone who has enjoyed using ITSx to try out.

The 1.1 release of ITSx includes a wide range of new features, including:

  • A 2-10x performance increase (depending on the dataset), since ITSx now utilizes hmmsearch instead of hmmscan to detect the ITS regions and distributes the CPU cores better
  • Improved ITS detection among fungi and chlorophyta, by addition of new HMM-profiles
  • The HMM profile format for ITSx has been updated to HMMER3/f (thus ITSx now requires HMMER version 3.1 or later)
  • Better handling of interrupted HMMER searches
  • Added the --require_anchor option to only include sequences where the complete anchor is found in the output
  • Added the possibility for partial sequence output for the SSU, LSU and 5.8S regions
  • Fixed a bug causing problems when reading sequence data from standard input

A lot of the code has changed in this version, which means that there might still be bugs lingering in the program. Since I will be on vacation throughout July, I encourage everyone to submit bug reports and questions, but I cannot promise to respond to them until August.

I hope that you will enjoy this new ITSx release, which you can download here. Happy barcoding!

Yesterday, Molecular Ecology Resources put online an unedited version of a recent paper which I co-authored. This time, Rodney Richardson at Ohio State University has done tremendous work evaluating three taxonomic classification tools – the RDP Naïve Bayesian Classifier, RTAX and UTAX – on a set of DNA barcoding regions commonly used for plants, namely the ITS2, and the matK, rbcL, trnL and trnH genes.

In the paper (1), we discuss the results, merits and limitations of the classifiers. In brief, we found that:

  • There is a considerable trade-off between accuracy and sensitivity for the classifiers tested, which indicates a need for improved sequence classification tools (2)
  • UTAX was superior with respect to error rate, but was exceedingly stringent and thus suffered from a low assignment rate
  • The RDP Naïve Bayesian Classifier displayed high sensitivity and low error at the family and order levels, but had a genus-level error rate of 9.6 percent
  • RTAX showed high sensitivity at all taxonomic ranks, but at the same time consistently produced high error rates
  • The choice of locus has significant effects on the classification sensitivity of all tested tools
  • All classifiers showed strong relationships between database completeness, classification sensitivity and classification accuracy

We believe that the methods of comparison we have used are simple and robust, and thereby provide a methodological and conceptual foundation for future software evaluations. On a personal note, I would thoroughly enjoy working with Rodney and Reed again; I had a great time discussing the ins and outs of taxonomic classification with them! The paper can be found here.

References and notes

  1. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, Early view (2016). doi: 10.1111/1755-0998.12628 [Paper link]
  2. This is something that several classifiers also showed in the evaluation we did for the Metaxa2 paper (3). Interestingly enough, Metaxa2 is better at maintaining high accuracy even as sensitivity is increased.
  3. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]

I just wanted to share an experience with the FARAO software we recently published a paper about, and its compatibility with the GD and libpng libraries (used for creating PNG files). I have received questions from users about how to get this to work, and to test it out I decided to try to install it on my Mac. It turned out that it is nearly impossible to get this to work. These two packages are extremely picky with versions and dependencies. After trying for about an hour, I gave up and turned to my Linux machine. Surprisingly, I could not get it to work from scratch there either, even though I had it running (with some previous version combination) when we programmed and tested FARAO.

I find this extremely annoying myself, and I will try to look into other solutions for PNG or JPEG output from FARAO. In the meantime, I can only recommend using the EPS output option instead, which produces nicer-looking figures and is considerably easier to set up. I am sorry about this and hope to be able to provide a better solution soon.

Late last year, we introduced FARAO – the Flexible All-Round Annotation Organizer – a software tool that allows visualization of annotated features on contigs. Today, the Applications Note describing the software was published as an advance access paper in Bioinformatics (1). As I have described before, storing and visualizing annotation and coverage information in FARAO has a number of advantages. FARAO is able to:

  • Integrate annotation and coverage information for the same sequence set, enabling coverage estimates of annotated features
  • Scale across millions of sequences and annotated features
  • Filter sequences, such that only entries with annotations satisfying certain given criteria are output
  • Handle annotation and coverage data produced by a range of different bioinformatics tools
  • Handle custom parsers through a flexible interface, allowing for adaptation of the software to virtually any bioinformatic tool not supported out of the box
  • Produce high-quality EPS output
  • Integrate with MySQL databases

I have previously used FARAO to produce annotation figures in our paper on a polluted Indian lake (2), as well as in a paper on sewage treatment plants (which is in press and should be coming out any day now). We hope that the tool will find many more uses in other projects in the future!

References

  1. Hammarén R, Pal C, Bengtsson-Palme J: FARAO: The Flexible All-Round Annotation Organizer. Bioinformatics, advance access (2016). doi: 10.1093/bioinformatics/btw499 [Paper link]
  2. Bengtsson-Palme J, Boulund F, Fick J, Kristiansson E, Larsson DGJ: Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 648 (2014). doi: 10.3389/fmicb.2014.00648 [Paper link]

Today marks the five-year anniversary of the Metaxa software’s initial release. Much has happened to the software since; Metaxa started off as an rRNA extraction utility for metagenomic data (1), including coarse classification to organism/organelle type. Since then, it has gained full-scale taxonomic classification ability better than or on par with other software packages (2), much greater speed, support for the LSU gene, and a range of related software tools (3), and has spurred development of other tools such as ITSx (4). I have also been involved in no fewer than four peer-reviewed publications directly related to the software (1-3,5).

But it does not end here; these five years were just the beginning. We are – in different constellations – working on further enhancements to Metaxa2, including support for more genes, an updated classification database, and better customization options. I am still very much devoted to keeping Metaxa2 alive and relevant as a tool for taxonomic analysis of metagenomes, applicable whenever accuracy is a key parameter. Thanks for being part of the community for these five years!

References

  1. Bengtsson J, Eriksson KM, Hartmann M, Wang Z, Shenoy BD, Grelet G, Abarenkov K, Petri A, Alm Rosenblad M, Nilsson RH: Metaxa: A software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets. Antonie van Leeuwenhoek, 100, 3, 471–475 (2011). doi:10.1007/s10482-011-9598-6. [Paper link]
  2. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]
  3. Bengtsson-Palme J, Thorell K, Wurzbacher C, Sjöling Å, Nilsson RH: Metaxa2 Diversity Tools: Easing microbial community analysis with Metaxa2. Ecological Informatics, 33, 45–50 (2016). doi: 10.1016/j.ecoinf.2016.04.004 [Paper link]
  4. Bengtsson-Palme J, Ryberg M, Hartmann M, Branco S, Wang Z, Godhe A, De Wit P, Sánchez-García M, Ebersberger I, de Souza F, Amend AS, Jumpponen A, Unterseher M, Kristiansson E, Abarenkov K, Bertrand YJK, Sanli K, Eriksson KM, Vik U, Veldre V, Nilsson RH: Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for use in environmental sequencing. Methods in Ecology and Evolution, 4, 10, 914–919 (2013). doi: 10.1111/2041-210X.12073 [Paper link]
  5. Bengtsson-Palme J, Hartmann M, Eriksson KM, Nilsson RH: Metaxa, overview. In: Nelson K. (Ed.) Encyclopedia of Metagenomics: SpringerReference (www.springerreference.com). Springer-Verlag Berlin Heidelberg (2013). doi: 10.1007/978-1-4614-6418-1_239-6 [Link]

Yesterday, Ecological Informatics put our paper describing Metaxa2 Diversity Tools online (1). Metaxa2 Diversity Tools was introduced with Metaxa2 version 2.1 and consists of

  • metaxa2_dc – a tool for collecting several .taxonomy.txt output files into one large abundance matrix, suitable for analysis in, e.g., R
  • metaxa2_rf – generates resampling rarefaction curves (2) based on the .taxonomy.txt output
  • metaxa2_si – species inference based on guessing species data from the other species present in the .taxonomy.txt output file
  • metaxa2_uc – a tool for determining if the community composition of a sample is significantly different from others through resampling analysis

At the same time as I did this update to the web site, I also took the opportunity to update the Metaxa2 FAQ to better reflect recent updates to the Metaxa2 software.

Metaxa2 Diversity Tools
One often requested feature of Metaxa2 (3) has been the ability to make simple analyses from the data after classification. The Metaxa2 Diversity Tools included in Metaxa2 2.1 is a seed for such an effort (although not close to a full-fledged community analysis package comparable to QIIME (4) or Mothur (5)). It currently consists of four tools.

The Metaxa2 Data Collector (metaxa2_dc) is the simplest of them (but probably the most requested), designed to merge the output of several *.level_X.txt files from the Metaxa2 Taxonomic Traversal Tool into one large abundance matrix, suitable for further analysis in, for example, R. The Metaxa2 Species Inference tool (metaxa2_si) can be used to further infer taxon information on, for example, the species level at a lower reliability than what would be permitted by the Metaxa2 classifier, using a complementary algorithm. The idea is that if only a single species is present in, e.g., a family and a read is assigned to this family, but not classified to the species level, that sequence will be inferred to belong to the same species as the other reads, given that it has more than 97% sequence identity to its best reference match. This can be useful if the user really needs species or genus classifications but many organisms in the studied species group have similar rRNA sequences, making it hard for the Metaxa2 classifier to classify sequences to the species level.
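The species inference rule described above can be illustrated with a short Python sketch. This is a simplified reading of the rule and not the metaxa2_si source code: the record layout (dictionaries with 'id', 'family', 'species' and 'identity' keys), the function name and the toy reads are assumptions made up for the example.

```python
from collections import defaultdict


def infer_species(reads, min_identity=97.0):
    """reads: list of dicts with keys 'id', 'family', 'species' (None when
    the read is unclassified at the species level) and 'identity' (percent
    identity to the best reference match).
    Returns {read id: species}, with inferred species filled in."""
    # Collect the species actually observed within each family.
    species_by_family = defaultdict(set)
    for read in reads:
        if read["species"] is not None:
            species_by_family[read["family"]].add(read["species"])

    result = {}
    for read in reads:
        species = read["species"]
        if species is None:
            candidates = species_by_family[read["family"]]
            # Infer only when the family holds exactly one species and the
            # read is more than 97% identical to its best reference match.
            if len(candidates) == 1 and read["identity"] > min_identity:
                species = next(iter(candidates))
        result[read["id"]] = species
    return result


# Toy reads, invented for this example.
toy_reads = [
    {"id": "r1", "family": "Apidae", "species": "Apis mellifera", "identity": 99.5},
    {"id": "r2", "family": "Apidae", "species": None, "identity": 98.2},
    {"id": "r3", "family": "Apidae", "species": None, "identity": 95.0},
    {"id": "r4", "family": "Vespidae", "species": "Vespa crabro", "identity": 99.0},
    {"id": "r5", "family": "Vespidae", "species": "Vespula vulgaris", "identity": 99.1},
    {"id": "r6", "family": "Vespidae", "species": None, "identity": 99.0},
]
inferred = infer_species(toy_reads)
print(inferred)
```

In the toy data only r2 gains a species label: r3 falls below the identity cutoff, and r6 belongs to a family in which two species were observed, so no unambiguous inference is possible.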

The Metaxa2 Rarefaction analysis tool (metaxa2_rf) performs a resampling rarefaction analysis (2) based on the output from the Metaxa2 classifier, taking into account also the unclassified portion of rRNAs. The Metaxa2 Uniqueness of Community analyzer (metaxa2_uc), finally, allows analysis of whether the community composition of two or more samples or groups is significantly different. Using resampling of the community data, the null hypothesis that the taxonomic content of two communities is drawn from the same set of taxa (given certain abundances) is tested. All these tools are further described in the manual and the recent paper (1).
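As a rough illustration of resampling-based rarefaction, the following Python sketch repeatedly draws random subsets of reads at increasing depths and averages the number of distinct taxa observed. It is not the metaxa2_rf implementation; the function name, the step size, the number of replicates and the choice to treat unclassified reads as one extra category are all assumptions for this example.

```python
import random


def rarefaction_curve(counts, step=10, replicates=100, seed=0):
    """counts: {taxon: number of reads}.
    Returns a list of (depth, mean number of distinct taxa) pairs obtained
    by subsampling reads without replacement."""
    rng = random.Random(seed)
    # Expand the abundance table into one entry per read.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    depths = list(range(step, len(pool) + 1, step))
    if not depths or depths[-1] != len(pool):
        depths.append(len(pool))  # always end at the full sample size
    curve = []
    for depth in depths:
        richness = [len(set(rng.sample(pool, depth)))
                    for _ in range(replicates)]
        curve.append((depth, sum(richness) / replicates))
    return curve


# Toy abundance table, invented for this example; "Unclassified" is kept
# as its own category here, which is an assumption of the sketch.
toy_counts = {"Bacillus": 50, "Pseudomonas": 30, "Vibrio": 15, "Unclassified": 5}
toy_curve = rarefaction_curve(toy_counts, step=25)
for depth, mean_taxa in toy_curve:
    print(depth, mean_taxa)
```

A curve that flattens well before the full sequencing depth suggests that most of the detectable taxa have been sampled, while a curve that is still rising indicates that deeper sequencing would likely reveal more.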

The latest version of Metaxa2, including the Metaxa2 Diversity Tools, can be downloaded here.

References

  1. Bengtsson-Palme J, Thorell K, Wurzbacher C, Sjöling Å, Nilsson RH: Metaxa2 Diversity Tools: Easing microbial community analysis with Metaxa2. Ecological Informatics, 33, 45–50 (2016). doi: 10.1016/j.ecoinf.2016.04.004 [Paper link]
  2. Gotelli NJ, Colwell RK: Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology Letters, 4, 379–391 (2001). doi:10.1046/j.1461-0248.2001.00230.x
  3. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399 [Paper link]
  4. Caporaso JG, Kuczynski J, Stombaugh J et al.: QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335–336 (2010).
  5. Schloss PD, Westcott SL, Ryabin T et al.: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75, 7537–7541 (2009).

Metaxa2 has been updated again today to version 2.1.3. This update adds a few features to the Metaxa2 Diversity Tools (metaxa2_uc and metaxa2_rf). The core Metaxa2 programs remain the same as for the previous Metaxa2 versions. The new features were suggested as part of the review process of a Metaxa2-related manuscript, and we thank the anonymous reviewers for their great suggestions!

New features and bug fixes in this update:

  • Added the Chao1, iChao1 and ACE estimators in addition to the original species abundance (“Bengtsson-Palme”) model in metaxa2_rf
  • Added the Raup-Crick dissimilarity method to the metaxa2_uc tool
  • Added a warning message when data is highly skewed for metaxa2_uc
  • Improved robustness of the ‘model’ mode of metaxa2_uc for highly skewed sample groups
  • Fixed a bug causing miscalculation of Euclidean distances on binary data in metaxa2_uc
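For readers unfamiliar with the Chao1 estimator mentioned in the list above, here is a minimal Python sketch of the standard textbook formula: the observed richness plus F1²/(2·F2), where F1 and F2 are the numbers of taxa seen exactly once and exactly twice, with the bias-corrected form F1(F1−1)/2 used when there are no doubletons. Whether metaxa2_rf uses exactly this variant is not stated here, so treat the code as a generic illustration with a made-up function name and toy counts.

```python
def chao1(counts):
    """counts: iterable of per-taxon read counts.
    Returns the Chao1 estimate of total taxon richness."""
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                       # taxa actually seen
    f1 = sum(1 for c in observed if c == 1)     # singletons
    f2 = sum(1 for c in observed if c == 2)     # doubletons
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, which stays defined when there are no doubletons.
    return s_obs + f1 * (f1 - 1) / 2


# Toy counts, invented for this example.
estimate_with_doubletons = chao1([10, 5, 2, 2, 1, 1, 1, 0])
estimate_without_doubletons = chao1([4, 1, 1])
print(estimate_with_doubletons, estimate_without_doubletons)
```

The estimate grows with the number of singletons, reflecting the intuition that many rarely seen taxa imply many taxa not yet seen at all.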

The updated version of Metaxa2 can be downloaded here.

Happy barcoding!

TriMetAss has today been updated to version 1.2. The new version addresses a number of minor issues, some of which I thought were fixed with the previous version. The update can be found here.

The main problem with the previous version of TriMetAss was that the Trinity developers had changed many options in the Trinity software, which rendered more recent versions of Trinity incompatible with TriMetAss. TriMetAss was not the only external software using Trinity that was affected by these changes. As far as my testing goes, these incompatibilities should now be fixed, by improved Trinity version determination in TriMetAss. This is still not a guarantee for future changes though, so just to make sure, use one of the Trinity versions tested with TriMetAss (versions v2.1.1 or trinityrnaseq_r2013_08_14).

This time I would like to thank Artemis Louyakis at the University of Florida and Tatsuya Unno at the Jeju National University (Korea) for their input on TriMetAss.