I have today uploaded an updated version of Metaxa2 (version 2.1.2). This update primarily improves the memory performance of the Metaxa2 Diversity Tools. The core Metaxa2 programs remain the same as for the previous Metaxa2 versions.
New features and bug fixes in this update:
- Dramatically improved memory performance of metaxa2_uc
- Added the 'min' option to the -s flag in metaxa2_uc, which causes the program to draw from each sample the number of entries present in the smallest sample
- Fixed a bug that disregarded the level specified by the -l option in metaxa2_si
- Minor updates and improvements on the manual
The updated version of Metaxa2 can be downloaded here.
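To illustrate what sampling every sample down to the size of the smallest one means, here is a minimal Python sketch. It is not the metaxa2_uc code: the function name, the toy data and the choice to draw without replacement are all assumptions made for the example.

```python
import random
from collections import Counter

def subsample_to_smallest(samples, seed=0):
    """Draw as many entries from each sample as the smallest sample holds.

    samples: dict mapping sample name -> list of taxon labels (one per entry).
    Drawing without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)
    depth = min(len(entries) for entries in samples.values())
    return {name: Counter(rng.sample(entries, depth))
            for name, entries in samples.items()}

# Hypothetical toy data: sample A has 10 entries, sample B has 5
samples = {
    "A": ["Bacteria"] * 6 + ["Archaea"] * 4,
    "B": ["Bacteria"] * 3 + ["Eukaryota"] * 2,
}
subsampled = subsample_to_smallest(samples)
```

After this, both samples contain five entries each, so differences in sequencing depth no longer influence the comparison.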
A problem with annotating contigs from genomic and metagenomic projects is that there are few tools that allow the visualization of the annotated features, particularly if those features come from different sources. To alleviate this problem, I have (with assistance from Rickard Hammarén and Chandan Pal) over the last years developed a new annotation and read coverage visualization package – FARAO – which we today introduce to the public. FARAO has been used to produce the basis for the contig annotation figures in my paper on the polluted Indian lake. Storing and visualizing annotation and coverage information in FARAO has a number of advantages. FARAO is able to:
- Integrate annotation and coverage information for the same sequence set, enabling coverage estimates of annotated features
- Scale across millions of sequences and annotated features
- Filter sequences, such that only entries with annotations satisfying certain given criteria will be outputted
- Handle annotation and coverage data produced by a range of different bioinformatics tools
- Handle custom parsers through a flexible interface, allowing for adaption of the software to virtually any bioinformatic tool
- Produce high-quality EPS output
- Integrate with MySQL databases
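As an illustration of the first point in the list above, combining coverage and annotation data makes it possible to estimate the coverage of each annotated feature. The Python sketch below shows the idea only; it is not FARAO code, and the function name, the per-base coverage list and the 1-based inclusive coordinates are assumptions made for the example.

```python
def feature_coverage(coverage, features):
    """Mean per-base read coverage for each annotated feature.

    coverage: list of per-base read depths for one sequence
    features: list of (name, start, end) with 1-based, inclusive coordinates
    """
    result = {}
    for name, start, end in features:
        window = coverage[start - 1:end]
        result[name] = sum(window) / len(window) if window else 0.0
    return result

# Hypothetical contig of 10 bp with two annotated features
coverage = [2, 2, 4, 4, 4, 6, 6, 0, 0, 0]
features = [("geneA", 1, 5), ("geneB", 6, 10)]
means = feature_coverage(coverage, features)
```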
FARAO is today moved from a private pre-release state to a public beta state. It is still possible that this version contains bugs that we have not discovered in our testing. Please send me an e-mail and make us aware of the potential shortcomings of our software if you find any unexpected behavior in this version of FARAO.
Today I have released Metaxa2 version 2.1.1, containing a fix to an embarrassing bug in the new metaxa2_uc program (part of the Metaxa2 Diversity Tools). A late change of the names of the different modes of that tool had not propagated to all parts of the code, and therefore only the “model” mode was functional in the previous version. No other changes to the Metaxa2 package have been made in this update, which can be downloaded here.
I am very happy to announce that Metaxa2 version 2.1 has been released today. This new version brings a lot of important improvements to the Metaxa2 software (1), in particular by the introduction of the Metaxa2 Diversity Tools. This is the list of new features (further elaboration follows below):
- The Metaxa2 Diversity Tools:
  - metaxa2_dc – a tool for collecting several .taxonomy.txt output files into one large abundance matrix, suitable for analysis in, e.g., R
  - metaxa2_rf – generates rarefaction curves based on the .taxonomy.txt output
  - metaxa2_si – species inference based on guessing species data from the other species present in the .taxonomy.txt output file
  - metaxa2_uc – a tool for determining if the community composition of a sample is significantly different from others through resampling analysis
- Added a new detection mode for detection of multiple rRNA in the same sequence, e.g. a genome
- Added the --reference option to improve the use of Metaxa2 as a tool to sort out host rRNA sequences from a dataset
- Added the --split_pairs option, causing Metaxa2 to output paired-end sequences into two separate files, which is nice for further analysis of rRNA reads
- The default setting for the --align option has been changed to ‘
- Automatic detection of which BLAST package is installed
- Fixed a bug causing the last read of paired-end FASTA input to be ignored
- Fixed an occasionally occurring BLAST+ related warning message
- Fixed a bug that could cause the classifier to crash on highly divergent BLAST matches
The new version of Metaxa2 can be downloaded here, and for those interested I will spend the rest of this post outlining the new features.
Metaxa2 Diversity Tools
One often requested feature of Metaxa2 is the ability to perform simple further analyses of the data after classification. The Metaxa2 Diversity Tools included in Metaxa2 2.1 are a seed for such an effort (although not close to a full-fledged community analysis package such as QIIME (2) or Mothur (3)). The set currently consists of four tools.
The Metaxa2 Data Collector (metaxa2_dc) is the simplest of them (but probably the most requested), designed to merge the output of several *.level_X.txt files from the Metaxa2 Taxonomic Traversal Tool into one large abundance matrix, suitable for further analysis in, for example, R. The Metaxa2 Species Inference tool (metaxa2_si) can be used to further infer taxon information on, for example, the species level at a lower reliability than what would be permitted by the Metaxa2 classifier, using a complementary algorithm. The idea is that if only a single species is present in, e.g., a family and a read is assigned to this family, but not classified to the species level, that sequence will be inferred to belong to the same species as the other reads, given that it has more than 97% sequence identity to its best reference match. This can be useful if the user really needs species or genus classifications but many organisms in the studied species group have similar rRNA sequences, making it hard for the Metaxa2 classifier to classify sequences to the species level.
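The species inference idea can be sketched in a few lines of Python. This is only my reading of the rule described above, not the metaxa2_si implementation: the function name, the record layout (dictionaries with 'family', 'species' and 'identity' keys) and the toy reads are assumptions made for the example.

```python
from collections import defaultdict

def infer_species(reads, min_identity=97.0):
    """Fill in missing species labels when a family contains a single species.

    reads: list of dicts with keys 'family', 'species' (None if the read was
    not classified to species level) and 'identity' (percent identity to the
    best reference match).
    """
    species_by_family = defaultdict(set)
    for read in reads:
        if read["species"] is not None:
            species_by_family[read["family"]].add(read["species"])

    inferred = []
    for read in reads:
        species = read["species"]
        candidates = species_by_family.get(read["family"], set())
        # Infer only if exactly one species was seen in the family and the
        # read has more than 97% identity to its best reference match
        if species is None and len(candidates) == 1 and read["identity"] > min_identity:
            species = next(iter(candidates))
        inferred.append({**read, "species": species})
    return inferred

# Hypothetical toy input
reads = [
    {"family": "Enterobacteriaceae", "species": "Escherichia coli", "identity": 99.5},
    {"family": "Enterobacteriaceae", "species": None, "identity": 98.2},
    {"family": "Enterobacteriaceae", "species": None, "identity": 95.0},
    {"family": "Bacillaceae", "species": "Bacillus subtilis", "identity": 99.0},
    {"family": "Bacillaceae", "species": "Bacillus cereus", "identity": 99.1},
    {"family": "Bacillaceae", "species": None, "identity": 99.3},
]
result = infer_species(reads)
```

In this toy input only the second read gets a species assigned. The third falls below the identity threshold, and the last belongs to a family in which two species were observed.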
The Metaxa2 Rarefaction analysis tool (metaxa2_rf) performs a rarefaction analysis based on the output from the Metaxa2 classifier, taking into account also the unclassified portion of rRNAs. The Metaxa2 Uniqueness of Community analyzer (metaxa2_uc), finally, allows analysis of whether the community composition of two or more samples or groups is significantly different. Using resampling of the community data, the null hypothesis that the taxonomic content of two communities is drawn from the same set of taxa (given certain abundances) is tested. All these tools are further described in the manual.
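For readers unfamiliar with rarefaction, the principle can be sketched as follows. This is a generic illustration and not the metaxa2_rf code: the function name, parameters and toy data are hypothetical, and treating the unclassified reads as one label of their own is an assumption made for the example.

```python
import random

def rarefaction_curve(assignments, step=2, replicates=100, seed=0):
    """Mean number of distinct taxa observed at increasing subsample depths.

    assignments: list with one taxon label per classified rRNA read.
    """
    rng = random.Random(seed)
    curve = []
    for depth in range(step, len(assignments) + 1, step):
        total = 0
        for _ in range(replicates):
            # Draw 'depth' reads without replacement and count the taxa seen
            total += len(set(rng.sample(assignments, depth)))
        curve.append((depth, total / replicates))
    return curve

# Hypothetical toy sample of 10 reads
assignments = (["Proteobacteria"] * 4 + ["Firmicutes"] * 3
               + ["Bacteroidetes"] * 2 + ["Unclassified"])
curve = rarefaction_curve(assignments)
```

Plotting depth against the mean number of taxa gives the rarefaction curve. A curve that flattens out suggests that deeper sequencing would reveal few additional taxa.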
The genome mode
Metaxa2 has long been said not to be useful for predicting rRNA in longer sequences, such as full genomes or chromosomes, since it has traditionally only looked for a single rRNA hit. With Metaxa2 2.1, it is now possible to use Metaxa2 on longer sequences to detect multiple rRNA occurrences. To do this, you need to change the operating mode using the new --mode option to either ‘auto’ or ‘genome’. The auto mode will treat sequences longer than 2500 bp as “genome” sequences and look for multiple matches in these.
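The decision made by the auto mode can be summarized in a small Python sketch. The function name and return values are hypothetical, and treating every other mode value as single-hit detection is an assumption of the example. Only the 2500 bp threshold and the two mode names come from the description above.

```python
GENOME_LENGTH_THRESHOLD = 2500  # bp, as stated for the auto mode

def detection_strategy(sequence_length, mode="auto"):
    """Return 'multiple' if several rRNA hits should be looked for, else 'single'."""
    if mode == "genome":
        return "multiple"
    if mode == "auto" and sequence_length > GENOME_LENGTH_THRESHOLD:
        return "multiple"
    # Assumption of this sketch: any other case means single-hit detection
    return "single"
```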
The reference mode
Another feature request that has been addressed in the new Metaxa2 version is the ability to filter out certain sequences from the data set. For example, you may want to exclude all rRNA sequences that are derived from the host organism, but keep the analysis of all other rRNA reads. This is now possible by supplying a file of reference rRNA sequences to exclude in FASTA format to the
Experimental Usearch support
Finally, we have toyed around with support for Usearch (4) instead of BLAST (5) as the search algorithm for the classification step. However, this is far from fine-tuned and it is included as an experimental feature that you may use at your own risk! We recommend that you do not use it for classification of data for publication yet. However, we are interested in how this works for you, so if you like, you may try running the Usearch algorithm in parallel with your BLAST-based analysis, compare the results and send me your input on how it works. You can read more about using Usearch at the end of the Metaxa2 manual.
- Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399 [Paper link]
- Caporaso JG, Kuczynski J, Stombaugh J et al.: QIIME allows analysis of high-throughput community sequencing data. Nature Methods, 7, 335–336 (2010).
- Schloss PD, Westcott SL, Ryabin T et al.: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Applied and Environmental Microbiology, 75, 7537–7541 (2009).
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461 (2010).
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389–3402 (1997).
- Multiple input files can now be specified by adding several -1 and -2 options.
- TriMetAss now automatically stops if the candidate reads are the same for two iterations in a row.
- Support for recent versions of Trinity that no longer contain the Trinity.pl script.
- A minor bug causing TriMetAss to use more memory than necessary has been fixed.
- Fixed the --stop_total option so that TriMetAss actually uses this option (rather than
- Allowed complicated paths to be supplied for the output directory.
I would like to thank users Rickard Hammarén, Dr. Tatsuya Unno, Dr. Gisle Vestergaard and Dr. Joseph Nesme for providing me with the information underlying these fixes. Thanks a lot!
Metaxa2 has been updated to version 2.0.2 and can be downloaded from the Metaxa2 web site. The 2.0.2 update fixes two minor bugs: one causing the “.graph” file to display incorrect or no names for the regions of the LSU, and one causing misreporting of the number of sequences in single-end FASTQ files (paired-end files were reported correctly). The update also brings a slightly improved classifier. Thanks to Marco Severgnini for reporting the FASTQ file issue! The update is available here.
Some of you who think ITSx is running slowly despite being assigned multiple CPUs, particularly on datasets with only one kind of sequences (e.g. fungal) using the -t F option, might be interested in trying out Andrew Krohn’s parallel ITSx implementation. The solution essentially employs a bash script spawning multiple ITSx instances running on different portions of the input file. Although there are some limitations to the script (e.g. you cannot select a custom name for the output and you will only get the ITS1 and ITS2 + full sequences FASTA files, as far as I understand the script), it may prove useful for many of you until we write up a proper solution to the poor multi-thread performance of ITSx (planned for version 1.1). In the coming months, I recommend that you check this solution out! See also the wiki documentation.
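The principle of running on different portions of the input file can be illustrated with a short Python sketch that splits a set of FASTA records into chunks. This is not Andrew Krohn’s script (which is written in bash); the function names and the round-robin distribution of records are assumptions made for the example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def split_into_chunks(records, n_chunks):
    """Distribute records round-robin over at most n_chunks non-empty chunks."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    return [chunk for chunk in chunks if chunk]

# Hypothetical toy input with five sequences, split for two workers
fasta_text = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAAAA\n>s4\nCCCC\n>s5\nGGGG\n"
records = parse_fasta(fasta_text)
chunks = split_into_chunks(records, 2)
```

Each chunk would then be written to its own file and handed to a separate ITSx process, and the outputs concatenated afterwards.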
My speed tests show the following (on a quite small test set of fungal ITS sequences):
ITSx parallel on 16 CPUs, all ITS types (option “
3 min, 16 sec
ITSx parallel on 16 CPUs, only fungal ITS types (option “
ITSx native on 16 CPUs, all ITS types (options “-t all --cpu 16”):
4 min, 59 sec
ITSx native on 16 CPUs, only fungal types (options “-t f --cpu 16”):
5 min, 50 sec
Why the fungal-only run took longer in the native implementation is a mystery to me, but it probably shows why there is a need to rewrite the multithreading code, as we did with Metaxa a couple of years ago. Stay tuned for ITSx updates!
A minor bug affecting the “its1.full_and_partial.fasta” file has been fixed in a minor update to ITSx (1.0.11) released today. The bug occasionally caused newline characters at the end of a sequence to be skipped and the next entry to begin on the same row. The bug only manifested itself when ITSx was used with the --partial option and only in the above-mentioned FASTA file. If you have been affected by the bug, you should have noticed, as the resulting FASTA file would be considered corrupted by most bioinformatics software. The updated version of ITSx can be downloaded here.
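If you want to check an old output file for this problem, a simple heuristic is to look for sequence lines that contain a “>” character, since that indicates a header fused onto the end of a sequence. The Python sketch below shows such a check. It is not part of ITSx, and the function name and toy lines are hypothetical.

```python
def find_fused_records(lines):
    """Return 1-based numbers of sequence lines that contain a '>' character.

    A '>' inside a line that does not start with '>' suggests that a newline
    was skipped and the next FASTA entry begins on the same row.
    """
    return [number for number, line in enumerate(lines, start=1)
            if not line.startswith(">") and ">" in line]

# Hypothetical toy file content: the entry 'c' is fused onto line 4
lines = [">a", "ACGT", ">b", "GGCC>c", "TTAA"]
bad_lines = find_fused_records(lines)
```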
With the publication of my latest paper last week (1), I also would like to highlight some of the software underpinning the findings a bit. To get around the problem that extremely common resistance genes could be present in multiple contexts and variants, causing assemblers such as Velvet (2) to perform sub-optimally, we have written a software tool that utilizes Vmatch (3) and Trinity (4) to iteratively construct contigs from reads associated with resistance genes. This could of course be used in many other situations as well, when you want to specifically assemble a certain portion of a metagenome, but suspect that that portion might be found in multiple contexts.
TriMetAss is a Perl program, employing Vmatch and Trinity to construct multi-context contigs. TriMetAss uses extracted reads associated with, e.g., resistance genes as seeds for a Vmatch search against the complete set of read pairs, extracting reads matching with at least 49 bp (by default) to any of the seed reads. These reads are then assembled using Trinity. The resulting contigs are then used as seeds for another Vmatch search against the complete set of reads, as above. All matches (including the previously matching read pairs) are then again used for a Trinity assembly. This iterative process is repeated until a stop criterion is met, e.g. when the total number of assembled nucleotides starts to drop rather than increase. The software can be downloaded here.
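The iterative procedure can be sketched in Python as follows. This is an illustration of the loop only, not the TriMetAss code (which is written in Perl): the read search and the assembler are passed in as functions standing in for Vmatch and Trinity, and all names, as well as the toy data in which reads and contigs are represented as intervals, are assumptions made for the example.

```python
def iterative_assembly(seeds, find_matching_reads, assemble,
                       contig_length=len, max_iterations=50):
    """Repeatedly recruit reads matching the current seeds and reassemble them."""
    reads = set()
    contigs, best_total = [], 0
    for _ in range(max_iterations):
        # Keep previously matching reads and add the new matches
        new_reads = reads | set(find_matching_reads(seeds))
        if new_reads == reads:
            break  # no new candidate reads were recruited
        candidate = assemble(new_reads)
        total = sum(contig_length(c) for c in candidate)
        if total < best_total:
            break  # the number of assembled nucleotides started to drop
        reads, contigs, best_total = new_reads, candidate, total
        seeds = candidate  # the new contigs seed the next search
    return contigs

# Toy stand-ins: reads and contigs are (start, end) intervals on a line
READS = [(0, 6), (4, 10), (8, 14), (12, 18), (30, 36)]

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def toy_find(seeds):
    """Stand-in for the read search: reads overlapping any seed."""
    return [r for r in READS if any(overlaps(r, s) for s in seeds)]

def toy_assemble(reads):
    """Stand-in for the assembler: merge overlapping intervals."""
    merged = []
    for start, end in sorted(reads):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))
    return merged

contigs = iterative_assembly([(5, 7)], toy_find, toy_assemble,
                             contig_length=lambda c: c[1] - c[0])
```

In the toy run, the contig grows from the seed region in both directions over three iterations until no further reads are recruited, while the unrelated read at positions 30 to 36 is never pulled in.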
- Bengtsson-Palme J, Boulund F, Fick J, Kristiansson E, Larsson DGJ: Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 648 (2014). doi: 10.3389/fmicb.2014.00648
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821–829 (2008). doi:10.1101/gr.074492.107
- Kurtz S: The Vmatch large scale sequence analysis software (2010). http://vmatch.de/
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011). doi:10.1038/nbt.1883
An update to Metaxa2 that has long remained in internal testing has been deemed bug-free (as far as we can tell) and has been uploaded to the Metaxa2 web site. The update brings a slightly improved classifier, and is the first release that we declare fully stable, although we have found no problems with the previously available version (release candidate 3). This also means that we take a jump directly from version 2.0, release candidate 3 to version 2.0.1 without passing a final 2.0 release. The update is available here.