The person behind this is really Björn Grüning at the University of Freiburg. I am immensely thankful for the work he has put into this. Our intention is to make sure that both the Galaxy version and the Bioconda version are maintained in parallel with the one on this website, and kept continuously up to date!
Since F1000Research uses a somewhat different publication scheme than most journals, I still haven’t fully understood whether this paper is formally published after peer review, but I am starting to assume it is. There have been very few changes since the last version, so I will be lazy and basically repost what I wrote in April when the first version (the “preprint”) was posted online. The paper (1) is the result of a workshop arranged by the JRC in Italy in 2017. It describes various challenges arising from the process of designing a benchmark strategy for bioinformatics pipelines for the identification of antimicrobial resistance genes in next-generation sequencing data.
The paper discusses issues regarding the benchmarking datasets used, testing samples, evaluation criteria for the performance of different tools, and how the benchmarking dataset should be created and distributed. Specifically, we address the following questions:
- How should a benchmark strategy handle the current and expanding universe of NGS platforms?
- What should be the quality profile (in terms of read length, error rate, etc.) of in silico reference materials?
- Should different sets of reference materials be produced for each platform? In that case, how can we ensure that no bias is introduced in the process?
- Should in silico reference material be composed of the output of real experiments, or simulated read sets? If a combination is used, what is the optimal ratio?
- How is it possible to ensure that the simulated output has been simulated “correctly”?
- For real experiment datasets, how can the presence of sensitive information be avoided?
- Regarding the quality metrics in the benchmark datasets (e.g. error rate, read quality), should these values be fixed for all datasets, or fall within specific ranges? How wide can/should these ranges be?
- How should the benchmark manage the different mechanisms by which bacteria acquire resistance?
- What is the set of resistance genes/mechanisms that need to be included in the benchmark? How should this set be agreed upon?
- Should datasets representing different sample types (e.g. isolated clones, environmental samples) be included in the same benchmark?
- Is a correct representation of different bacterial species (host genomes) important?
- How can the “true” value of the samples, against which the pipelines will be evaluated, be guaranteed?
- What is needed to demonstrate that the original sample has been correctly characterised, in case real experiments are used?
- How should the target performance thresholds (e.g. specificity, sensitivity, accuracy) for the benchmark suite be set?
- What is the impact of these performance thresholds on the required size of the sample set?
- How can the benchmark stay relevant when new resistance mechanisms are regularly characterized?
- How is the continued quality of the benchmark dataset ensured?
- Who should generate the benchmark resource?
- How can the benchmark resource be efficiently shared?
Of course, we have not answered all of these questions, but I think we have arrived at a decent description of the problems, which we see as an important foundation for solving these issues and implementing the benchmarking standard. Some of these issues were tackled in our review paper from last year on using metagenomics to study resistance genes in microbial communities (2). The paper also connects somewhat to the database curation paper we published in 2016 (3), although this time the strategies deal with the testing datasets rather than the actual databases. The paper is the first outcome of the workshop arranged by the JRC on “Next-generation sequencing technologies and antimicrobial resistance”, held October 4-5, 2017 in Ispra, Italy. You can find the paper here (it’s open access).
On another note, the new paper describing the UNITE database (4) has now been assigned a formal issue, as has the paper on tandem repeat barcoding in fungi published in Molecular Ecology Resources last year (5).
References and notes
- Angers-Loustau A, Petrillo M, Bengtsson-Palme J, Berendonk T, Blais B, Chan KG, Coque TM, Hammer P, Heß S, Kagkli DM, Krumbiegel C, Lanza VF, Madec J-Y, Naas T, O’Grady J, Paracchini V, Rossen JWA, Ruppé E, Vamathevan J, Venturi V, Van den Eede G: The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research, 7, 459 (2018). doi: 10.12688/f1000research.14509.1
- Bengtsson-Palme J, Larsson DGJ, Kristiansson E: Using metagenomics to investigate human and environmental resistomes. Journal of Antimicrobial Chemotherapy, 72, 2690–2703 (2017). doi: 10.1093/jac/dkx199
- Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034
- Nilsson RH, Larsson K-H, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, Kennedy P, Picard K, Glöckner FO, Tedersoo L, Saar I, Kõljalg U, Abarenkov K: The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Research, 47, D1, D259–D264 (2019). doi: 10.1093/nar/gky1022
- Wurzbacher C, Larsson E, Bengtsson-Palme J, Van den Wyngaert S, Svantesson S, Kristiansson E, Kagami M, Nilsson RH: Introducing ribosomal tandem repeat barcoding for fungi. Molecular Ecology Resources, 19, 1, 118–127 (2019). doi: 10.1111/1755-0998.12944
I just uploaded a mini update to ITSx, fixing a bug that caused the
--truncate option not to be accepted by the software in ITSx 1.1. This bug fix brings the software to version 1.1.1. No other changes have been introduced in this version. Download the update here. Happy barcoding!
On Friday, Molecular Ecology Resources put online Christian Wurzbacher’s latest paper, of which I am also a coauthor. The paper presents three sets of general primers that allow for amplification of the complete ribosomal operon from the ribosomal tandem repeats, covering all the ribosomal markers (ETS, SSU, ITS1, 5.8S, ITS2, LSU, and IGS) (1). This paper is important because it introduces a technique to utilize third-generation sequencing (PacBio and Nanopore) to generate high-quality reference data (equivalent to or better than Sanger sequencing) in a high-throughput manner. The paper shows that the accuracy of the Nanopore-generated sequences was 99.85%, which is comparable to the 99.78% accuracy described for Sanger sequencing.
My main contribution to this paper is the consensus sequence generation script – Consension – which is available from my software page. Importantly, there are huge gaps in the reference databases we use for taxonomic classification and this method will facilitate the integration of reference data from all of the ribosomal markers. We hope that this work will stimulate large-scale generation of ribosomal reference data covering several marker genes, linking previously spread-out information together.
A few days ago I posted that Bioinformatics had published our paper on the Metaxa2 Database Builder (1). Today, I am happy to report that PeerJ has published the first paper in which the database builder is used to create a new Metaxa2 (2) database! My colleagues at Ohio State University have used the software to build a database for the COI gene (3), which is commonly used in arthropod barcoding. The barcode region was extracted from COI sequences from arthropod whole mitochondrial genomes, and used to create a database containing sequences from all major arthropod clades, including all insect orders, all arthropod classes, and the Onychophora, Tardigrada and Mollusca outgroups.
Similar to what we did in our evaluation of taxonomic classifiers used on non-rRNA barcoding regions (4), we performed a cross-validation analysis to characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and the classification error probability. We used this analysis to select a reliability score threshold that minimized error. We then estimated classification sensitivity, false discovery rate and overclassification (the propensity to classify sequences from taxa not represented in the reference database).
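The threshold-selection idea can be sketched in a few lines of Python. This is purely an illustration under my own assumptions (toy data, a plain error-rate criterion over accepted predictions), not the actual Metaxa2 implementation or its real scoring scheme:

```python
# Illustrative sketch of reliability-score threshold selection
# (NOT the actual Metaxa2 code): each cross-validation prediction
# has a reliability score and a flag for whether the assigned taxon
# was correct; predictions below the threshold are left unclassified.

def error_at_threshold(scores, correct, threshold):
    """Fraction of accepted predictions that are wrong at this threshold."""
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    if not accepted:
        return 0.0  # nothing accepted, so no classification errors
    return sum(1 for c in accepted if not c) / len(accepted)

def pick_threshold(scores, correct, candidates):
    """Lowest candidate threshold with the minimal error rate.

    min() keeps the first candidate on ties, so among equally
    error-free thresholds the least conservative one is chosen,
    preserving as much sensitivity as possible.
    """
    return min(candidates, key=lambda t: error_at_threshold(scores, correct, t))

# Toy hold-out data: low-scoring predictions are wrong more often.
scores  = [95, 90, 88, 75, 70, 60, 55, 40]
correct = [True, True, True, True, False, True, False, False]

best = pick_threshold(scores, correct, candidates=range(0, 101, 5))
print(best)  # → 75: the lowest threshold at which no accepted prediction is wrong
```

In practice, picking a threshold purely by error rate would have to be balanced against the fraction of sequences left unclassified, which is why the paper also reports sensitivity and overclassification alongside the error estimates.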
Since the database builder was still in its early stages of development when we started this work, the software itself saw several improvements as a result of this project. We believe that our work on the COI database, as well as on the recently released database builder software, will help researchers in designing and evaluating classification databases for metabarcoding of arthropods and beyond. The database is included in the new Metaxa2 2.2 release, and is also downloadable from the Metaxa2 Database Repository (1). The open access paper can be found here.
- Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker. Bioinformatics, advance article (2018). doi: 10.1093/bioinformatics/bty482
- Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399
- Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM: A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ, 6, e5126 (2018). doi: 10.7717/peerj.5126
- Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628
One of the questions I have received regarding Metaxa2 is whether it is possible to use it on other DNA barcodes. My answer has been “technically, yes, but creating a custom database for every additional barcode is a very cumbersome process”. Not anymore: the newly introduced Metaxa2 Database Builder makes this process automatic, with the user just supplying a FASTA file of sequences from the region in question and a file containing the taxonomy information for the sequences (in GenBank, NSD XML, Metaxa2 or SILVA-style formats). The preprint (1) has been out for some time, but today Bioinformatics published the paper describing the software (2).
The paper not only details how the database builder works, but also shows that it works on a number of different barcoding regions, albeit with different results in terms of accuracy. Still, even with seemingly high misclassification rates for some DNA barcodes, the software performs better than a simple BLAST-based taxonomic assignment (76.5% vs. 41.4% correct classifications for matK, and 76.2% vs. 45.1% for trnL). The database builder has already found use in building a COI database for arthropods (3), and we envision a range of uses in the near future.
As the paper is now published, I have also moved the Metaxa2 software (4) from beta status to a fully-fledged version 2.2 update. Hopefully, this release is bug-free, but my experience is that when the community gets their hands on the software, they tend to discover things our team has missed. I would like to thank the entire team working on this, particularly Rodney Richardson (who initiated this entire thing) and Henrik Nilsson. The software can be downloaded here. Happy barcoding!
- Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Taxonomic identification from metagenomic or metabarcoding data using any genetic marker. bioRxiv 253377 (2018). doi: 10.1101/253377 [Link]
- Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker. Bioinformatics, advance article (2018). doi: 10.1093/bioinformatics/bty482 [Paper link]
- Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM: A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ Preprints, 6, e26662v1 (2018). doi: 10.7287/peerj.preprints.26662v1 [Link]
- Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]