Category: Bioinformatics

Mumame – Quantifying mutations in metagenomes

Let me get straight to something somewhat beside the point here: summer students can achieve amazing things! One such student I had the pleasure to work with this summer is Shruthi Magesh, and a preprint based on work she did with me at the Wisconsin Institute for Discovery just got published on bioRxiv (1). The preprint describes a software tool called Mumame, which uses database information on mutations in DNA or protein sequences to search metagenomic datasets and quantifies the relative proportion of resistance mutations over wild-type sequences.

In the preprint (1), we first of all show that Mumame works on amplicon data where we already knew the true outcome (2). Second, we show that we can detect differences in mutation frequencies in controlled experiments (2,3). Lastly, we use the tool to gain some further information about resistance patterns in sediments from polluted environments in India (4,5). Together these analyses show that one of the most central aspects for Mumame to be able to find mutations is having a very high number of sequenced reads in all libraries (preferably more than 50 million per library), because these mutations are generally rare – even in polluted environments and microcosms exposed to antibiotics. We expect Mumame to be a useful addition to metagenomic studies of e.g. antibiotic resistance, and to increase the detail with which metagenomes can be screened for phenotypically important differences.
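To illustrate the kind of quantity involved, here is a minimal Python sketch of how a relative proportion of mutated over wild-type reads could be computed from read counts at one position. This is not Mumame's actual code: the function name, the input layout and the counts are all made up for illustration.

```python
def mutation_proportion(mutant_reads, wildtype_reads):
    """Return the fraction of reads carrying the mutation at one position.

    Hypothetical helper: returns None when no reads cover the position,
    since a proportion is undefined without coverage.
    """
    total = mutant_reads + wildtype_reads
    if total == 0:
        return None
    return mutant_reads / total


# Made-up counts for two libraries at the same (hypothetical) position.
libraries = {
    "exposed": {"mutant": 12, "wildtype": 188},
    "control": {"mutant": 1, "wildtype": 199},
}

for name, counts in libraries.items():
    prop = mutation_proportion(counts["mutant"], counts["wildtype"])
    print(f"{name}: {prop:.3f}")
```

With only a couple of hundred reads covering a position, a single extra mutant read shifts the proportion noticeably, which is one way to see why deep sequencing matters for rare mutations.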

While I did write the code for the software (with a lot of input from Viktor Jonsson, who is also a coauthor on the preprint, on the statistical analysis), Shruthi did the software testing and evaluations, and the paper would not have been possible had she not wanted a bioinformatics summer project related to metagenomics, aside from her laboratory work. The resulting preprint is available from bioRxiv and the Mumame software is freely available from this site.

References

  1. Magesh S, Jonsson V, Bengtsson-Palme J: Quantifying point-mutations in metagenomic data. bioRxiv, 438572 (2018). doi: 10.1101/438572 [Link]
  2. Kraupner N, Ebmeyer S, Bengtsson-Palme J, Fick J, Kristiansson E, Flach C-F, Larsson DGJ: Selective concentration for ciprofloxacin in Escherichia coli grown in complex aquatic bacterial biofilms. Environment International, 116, 255–268 (2018). doi: 10.1016/j.envint.2018.04.029 [Paper link]
  3. Lundström S, Östman M, Bengtsson-Palme J, Rutgersson C, Thoudal M, Sircar T, Blanck H, Eriksson KM, Tysklind M, Flach C-F, Larsson DGJ: Minimal selective concentrations of tetracycline in complex aquatic bacterial biofilms. Science of the Total Environment, 553, 587–595 (2016). doi: 10.1016/j.scitotenv.2016.02.103 [Paper link]
  4. Bengtsson-Palme J, Boulund F, Fick J, Kristiansson E, Larsson DGJ: Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 648 (2014). doi: 10.3389/fmicb.2014.00648 [Paper link]
  5. Kristiansson E, Fick J, Janzon A, Grabic R, Rutgersson C, Weijdegård B, Söderström H, Larsson DGJ: Pyrosequencing of antibiotic-contaminated river sediments reveals high levels of resistance and gene transfer elements. PLoS ONE, 6, e17038 (2011). doi: 10.1371/journal.pone.0017038

Published paper: Ribosomal tandem repeat barcoding for fungi

On Friday, Molecular Ecology Resources put online Christian Wurzbacher’s latest paper, of which I am also a coauthor. The paper presents three sets of general primers that allow for amplification of the complete ribosomal operon from the ribosomal tandem repeats, covering all the ribosomal markers (ETS, SSU, ITS1, 5.8S, ITS2, LSU, and IGS) (1). This paper is important because it introduces a technique to utilize third generation sequencing (PacBio and Nanopore) to generate high‐quality reference data (equivalent to or better than Sanger sequencing) in a high‐throughput manner. The paper shows that the quality of the Nanopore generated sequences was 99.85%, which is comparable with the 99.78% accuracy described for Sanger sequencing.

My main contribution to this paper is the consensus sequence generation script – Consension – which is available from my software page. Importantly, there are huge gaps in the reference databases we use for taxonomic classification and this method will facilitate the integration of reference data from all of the ribosomal markers. We hope that this work will stimulate large-scale generation of ribosomal reference data covering several marker genes, linking previously spread-out information together.
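The general idea behind consensus generation can be sketched as a per-column majority vote over aligned reads. This is a simplified illustration and not how Consension itself is implemented; the function name, the gap handling and the toy reads are assumptions made for the example.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Build a consensus by majority vote in each alignment column.

    Assumes all reads are already aligned to the same length.
    Columns where the gap character wins are dropped from the consensus.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:
            consensus.append(base)
    return "".join(consensus)


# Toy example: three noisy reads of the same (made-up) region.
reads = [
    "ACGT-ACGA",
    "ACGTTACGA",
    "ACCT-ACGA",
]
print(majority_consensus(reads))  # ACGTACGA
```

Individual reads can carry errors (here one substitution and one insertion), but as long as the errors fall in different places the vote recovers the underlying sequence.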

Reference

  1. Wurzbacher C, Larsson E, Bengtsson-Palme J, Van den Wyngaert S, Svantesson S, Kristiansson E, Kagami M, Nilsson RH: Introducing ribosomal tandem repeat barcoding for fungi. Molecular Ecology Resources, Accepted article (2018). doi: 10.1111/1755-0998.12944 [Paper link]

DAIRYdb added to Metaxa2

Last week, I uploaded a new database to the Metaxa2 Database Repository, called DAIRYdb. DAIRYdb (1) is a manually curated reference database for 16S rRNA amplicon sequences from dairy products. Significant efforts have been put into improving annotation algorithms, such as Metaxa2 (2), while less attention has been put into curation of reliable and consistent databases (3). Previous studies have shown that databases restricted to the studied environment improve unambiguous taxonomy annotation to the species level, thanks to consistent taxonomy, lack of blanks and reduced competition between different reference taxonomies (4-5). The usage of DAIRYdb in combination with different classification tools allows taxonomy annotation accuracy of over 90% at species level for microbiome samples from dairy products, where species identification is mandatory due to the affiliation to few closely related genera of most dominant lactic acid bacteria.

The database can be added to your Metaxa2 (version 2.2 or later) installation by using the following command:

metaxa2_install_database -g SSU_DAIRYdb_v1.1.2

Further adaptations of the DAIRYdb can be found on GitHub and the preprint has been deposited in bioRxiv (1). DAIRYdb was developed by Marco Meola, Etienne Rifa and their collaborators, who also provided most of the text for this post. Thanks Marco for this excellent addition to the database collection!

References

  1. Meola M, Rifa E, Shani N, Delbes C, Berthoud H, Chassard C: DAIRYdb: A manually curated gold standard reference database for improved taxonomy annotation of 16S rRNA gene sequences from dairy products. bioRxiv, 386151 (2018). doi: 10.1101/386151
  2. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399
  3. Edgar RC: Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences. PeerJ, 6, e4652 (2018). doi: 10.7717/peerj.4652
  4. Ritari J, Salojärvi J, Last L, de Vos WM: Improved taxonomic assignment of human intestinal 16S rRNA sequences by a dedicated reference database. BMC Genomics, 16, 1, 1056 (2015). doi: 10.1186/s12864-015-2265-y
  5. Newton ILG, Roeselers G: The effect of training set on the classification of honey bee gut microbiota using the naïve bayesian classifier. BMC Microbiology, 12, 1, 221 (2012). doi: 10.1186/1471-2180-12-221

Published paper: Predicting the uncharacterized resistome

Over the weekend, Microbiome put online my most recent paper (1) – a project which started as an idea I got when I finished up my PhD thesis in 2016. One of my main points in the thesis (2), which was also made again in our recent review on environmental factors influencing resistance development (3), is that the greatest risks associated with antibiotic resistance in the environment may not be the resistance genes already circulating in pathogens (which are relatively easily quantified), but the ones associated with recruitment of novel resistance genes from bacteria in the environment (2-4). The latter genes are, however, impossible to quantify due to the fact that they are unknown. But what if we could use knowledge of the diversity and abundance of known resistance genes to estimate the same properties of the yet uncharacterized resistome? That would be a great advantage in e.g. ranking of risk environments, as then some property that is easily monitored can be used to inform risk management of both known and unknown resistance factors.

This just published paper explores this possibility, by quantifying the abundance and diversity of resistance genes in 1109 metagenomes from various environments (1). I have taken two different approaches. First, I took out smaller subsets of genes from the reference database (in this case Resqu, a database of antibiotic resistance genes with verified resistance functions, detected on mobile genetic elements), and used those subsets to estimate resistome diversity and abundance in the 1109 metagenomes. Then these predictions were compared to the results of the entire database. I then, in a second step, investigated if these predictions could be extended to a set of truly novel resistance genes, i.e. the resistance genes present in the FARME database, collecting data from functional metagenomics inserts (5,6).
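The subset approach can be sketched roughly as follows: compute a diversity measure from a small subset of genes, compute the same measure from the full gene set, and check how well the two agree across samples. This is a toy illustration under my own assumptions (random subsets, richness as the measure, Pearson correlation as the agreement score, invented gene names and counts), not the analysis code used in the paper.

```python
import random


def richness(sample_counts, genes):
    """Number of the given genes detected (count > 0) in one sample."""
    return sum(1 for gene in genes if sample_counts.get(gene, 0) > 0)


def pearson(xs, ys):
    """Plain Pearson correlation; returns NaN if either list is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def subset_vs_full(samples, all_genes, subset_size, seed=0):
    """Correlate richness from a random gene subset with full-set richness."""
    rng = random.Random(seed)
    subset = rng.sample(sorted(all_genes), subset_size)
    full = [richness(s, all_genes) for s in samples]
    partial = [richness(s, subset) for s in samples]
    return pearson(partial, full)


# Invented gene names and counts for four hypothetical metagenomes.
genes = ["geneA", "geneB", "geneC", "geneD"]
samples = [
    {"geneA": 5},
    {"geneA": 2, "geneB": 1},
    {"geneA": 1, "geneB": 3, "geneC": 4},
    {"geneA": 1, "geneB": 1, "geneC": 1, "geneD": 2},
]
print(subset_vs_full(samples, genes, subset_size=2))
```

A correlation close to 1 would mean that the subset ranks the samples almost the same way as the full database does.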

The results show that generally the diversity and abundance of known antibiotic resistance genes can be used to predict the same properties of undescribed resistance genes (see figure above). However, the extent of this predictability is, importantly, dependent on the type of environment investigated. The study also shows that carefully selected small sets of resistance genes can describe total resistance gene diversity remarkably well. This means that knowledge gained from large-scale quantifications of known resistance genes can be utilized as a proxy for unknown resistance factors. This is important for current and proposed monitoring efforts for environmental antibiotic resistance (7-11) and has implications for the design of risk ranking strategies and the choices of measures and methods for describing resistance gene abundance and diversity in the environment.

The study also investigated which diversity measures were best suited to estimate total diversity. Surprisingly, some diversity measures described the total diversity of resistance genes remarkably badly. Most prominently, the Simpson diversity index consistently showed poor performance, and while the Shannon index performed relatively better, there is still no reason to select the Shannon index over normalized (rarefied) richness of resistance genes. The ACE estimator fluctuated substantially compared to the other diversity measures, while the Chao1 estimator more consistently showed performance very similar to richness. Therefore, either richness or the Chao1 estimator should be used for ranking resistance gene diversity, while the Shannon, Simpson, and ACE measures should be avoided.
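For readers unfamiliar with these measures, here is a small Python sketch of four of them, computed from a vector of per-gene read counts. The exact formulations used in the paper are not given in this post, so the variants below (Simpson as 1 − Σp², bias-corrected Chao1) are my assumptions, and the counts are invented. The sketch also assumes the counts have already been rarefied to a common depth.

```python
import math


def richness(counts):
    """Number of genes observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over observed genes."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson diversity expressed as 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of genes seen exactly once and twice.
    """
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return richness(counts) + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Invented read counts for six resistance genes in one sample.
counts = [10, 5, 2, 1, 1, 0]
print(richness(counts))  # 5
print(chao1(counts))     # 5.5
print(round(shannon(counts), 3), round(simpson(counts), 3))
```

Richness and Chao1 only care about which genes are present (and how many are rare), whereas Shannon and Simpson are dominated by the evenness of the most abundant genes, which may be part of why they behave so differently here.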

Importantly, this study implies that the recruitment of novel antibiotic resistance genes from the environment to human pathogens is essentially random. Therefore, when ranking risks associated with antibiotic resistance in environmental settings, the knowledge gained from large-scale quantification of known resistance genes can be utilized as a proxy for the unknown resistance factors (although this proxy is not perfect). Thus, high-risk environments for resistance development and dissemination would, for example, be aquaculture, animal husbandry, discharges from antibiotic manufacturing, and untreated sewage (3,8,12-15). Further attention should probably be paid to antibiotic contaminated soils, as this study points to soils as a vast source of resistance genes not yet encountered in human pathogens. This has also been suggested previously by others (16-19). The results of this study can be used to guide monitoring efforts for environmental antibiotic resistance, to design risk ranking strategies, and to choose appropriate measures and methods for describing resistance gene abundance and diversity in the environment. The entire open access paper is available here.

References

  1. Bengtsson-Palme J: The diversity of uncharacterized antibiotic resistance genes can be predicted from known gene variants – but not always. Microbiome, 6, 125 (2018). doi: 10.1186/s40168-018-0508-2
  2. Bengtsson-Palme J: Antibiotic resistance in the environment: a contribution from metagenomic studies. Doctoral thesis (medicine), Department of Infectious Diseases, Institute of Biomedicine, Sahlgrenska Academy, University of Gothenburg, 2016. [Link]
  3. Bengtsson-Palme J, Kristiansson E, Larsson DGJ: Environmental factors influencing the development and spread of antibiotic resistance. FEMS Microbiology Reviews, 42, 1, 68–80 (2018). doi: 10.1093/femsre/fux053
  4. Bengtsson-Palme J, Larsson DGJ: Antibiotic resistance genes in the environment: prioritizing risks. Nature Reviews Microbiology, 13, 369 (2015). doi: 10.1038/nrmicro3399-c1
  5. Wallace JC, Port JA, Smith MN, Faustman EM: FARME DB: a functional antibiotic resistance element database. Database, 2017, baw165 (2017).
  6. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM: Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chemical Biology, 5, R245–249 (1998).
  7. Berendonk TU, Manaia CM, Merlin C, Fatta-Kassinos D, Cytryn E, Walsh F, et al.: Tackling antibiotic resistance: the environmental framework. Nature Reviews Microbiology, 13, 310–317 (2015).
  8. Pruden A, Larsson DGJ, Amézquita A, Collignon P, Brandt KK, Graham DW, et al.: Management options for reducing the release of antibiotics and antibiotic resistance genes to the environment. Environmental Health Perspectives, 121, 878–885 (2013).
  9. Review on Antimicrobial Resistance: Antimicrobials in agriculture and the environment: reducing unnecessary use and waste. O’Neill J, ed. London: Wellcome Trust & HM Government (2015).
  10. Angers-Loustau A, Petrillo M, Bengtsson-Palme J, Berendonk T, Blais B, Chan KG, Coque TM, Hammer P, Heß S, Kagkli DM, Krumbiegel C, Lanza VF, Madec J-Y, Naas T, O’Grady J, Paracchini V, Rossen JWA, Ruppé E, Vamathevan J, Venturi V, Van den Eede G: The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research, 7, 459 (2018). doi: 10.12688/f1000research.14509.1
  11. Larsson DGJ, Andremont A, Bengtsson-Palme J, Brandt KK, de Roda Husman AM, Fagerstedt P, Fick J, Flach C-F, Gaze WH, Kuroda M, Kvint K, Laxminarayan R, Manaia CM, Nielsen KM, Ploy M-C, Segovia C, Simonet P, Smalla K, Snape J, Topp E, van Hengel A, Verner-Jeffreys DW, Virta MPJ, Wellington EM, Wernersson A-S: Critical knowledge gaps and research needs related to the environmental dimensions of antibiotic resistance. Environment International, 117, 132–138 (2018). doi: 10.1016/j.envint.2018.04.041
  12. Allen HK, Donato J, Wang HH, Cloud-Hansen KA, Davies J, Handelsman J: Call of the wild: antibiotic resistance genes in natural environments. Nature Reviews Microbiology, 8, 251–259 (2010).
  13. Graham DW, Collignon P, Davies J, Larsson DGJ, Snape J: Underappreciated role of regionally poor water quality on globally increasing antibiotic resistance. Environmental Science & Technology, 48, 11746–11747 (2014).
  14. Larsson DGJ: Pollution from drug manufacturing: review and perspectives. Philosophical Transactions of the Royal Society of London, Series B Biological Sciences, 369, 20130571 (2014).
  15. Cabello FC, Godfrey HP, Buschmann AH, Dölz HJ: Aquaculture as yet another environmental gateway to the development and globalisation of antimicrobial resistance. Lancet Infectious Diseases, 16, e127–133 (2016).
  16. Forsberg KJ, Reyes A, Wang B, Selleck EM, Sommer MOA, Dantas G: The shared antibiotic resistome of soil bacteria and human pathogens. Science, 337, 1107–1111 (2012).
  17. Allen HK, Moe LA, Rodbumrer J, Gaarder A, Handelsman J: Functional metagenomics reveals diverse beta-lactamases in a remote Alaskan soil. ISME Journal, 3, 243–251 (2009).
  18. Riesenfeld CS, Goodman RM, Handelsman J: Uncultured soil bacteria are a reservoir of new antibiotic resistance genes. Environmental Microbiology, 6, 981–989 (2004).
  19. McGarvey KM, Queitsch K, Fields S: Wide variation in antibiotic resistance proteins identified by functional metagenomic screening of a soil DNA library. Applied and Environmental Microbiology, 78, 1708–1714 (2012).

Published paper: A Metaxa2 database for the arthropod COI locus

A few days ago I posted that Bioinformatics had published our paper on the Metaxa2 Database Builder (1). Today, I am happy to report that PeerJ has published the first paper in which the database builder is used to create a new Metaxa2 (2) database! My colleagues at Ohio State University have used the software to build a database for the COI gene (3), which is commonly used in arthropod barcoding. The region used was extracted from COI sequences from arthropod whole mitochondrion genomes, and employed to create a database containing sequences from all major arthropod clades, including all insect orders, all arthropod classes and the Onychophora, Tardigrada and Mollusca outgroups.

Similar to what we did in our evaluation of taxonomic classifiers used on non-rRNA barcoding regions (4), we performed a cross-validation analysis to characterize the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and classification error probability. We used this analysis to select a reliability score threshold which minimized error. We then estimated classification sensitivity, false discovery rate and overclassification, the propensity to classify sequences from taxa not represented in the reference database.
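As a rough illustration of the threshold selection step, the sketch below takes held-out predictions as (reliability score, correct or not) pairs and picks the candidate threshold with the lowest error among the sequences that pass it. The function names, the candidate thresholds and the data are invented, and the tie-breaking rule (prefer the lowest threshold, which keeps the most sequences) is my own assumption rather than something taken from the paper.

```python
def error_rate_at(threshold, predictions):
    """Error rate among predictions whose score meets the threshold.

    `predictions` is a list of (reliability_score, is_correct) pairs.
    Returns None when no prediction passes the threshold.
    """
    kept = [correct for score, correct in predictions if score >= threshold]
    if not kept:
        return None
    return 1.0 - sum(kept) / len(kept)


def sensitivity_at(threshold, predictions):
    """Fraction of all test sequences classified correctly at the threshold."""
    hits = sum(1 for score, correct in predictions if score >= threshold and correct)
    return hits / len(predictions)


def pick_threshold(predictions, candidates):
    """Return (threshold, error) with the lowest error; ties go to the lowest threshold."""
    best = None
    for threshold in sorted(candidates):
        err = error_rate_at(threshold, predictions)
        if err is None:
            continue
        if best is None or err < best[1]:
            best = (threshold, err)
    return best


# Invented cross-validation results: (reliability score, was the call correct?)
held_out = [
    (95, True), (90, True), (85, True), (80, False),
    (75, True), (60, False), (55, False),
]
threshold, error = pick_threshold(held_out, candidates=[50, 60, 70, 80, 90])
print(threshold, error, sensitivity_at(threshold, held_out))
```

The example also shows the trade-off involved: the strictest threshold gives the lowest error here, but only a small share of the test sequences is then classified at all.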

Since the database builder was still in its early inception stages when we started doing this work, the software itself saw several improvements because of this project. We believe that our work on the COI database, as well as on the recently released database builder software, will help researchers in designing and evaluating classification databases for metabarcoding on arthropods and beyond. The database is included in the new Metaxa2 2.2 release, and is also downloadable from the Metaxa2 Database Repository (1). The open access paper can be found here.

References

  1. Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker. Bioinformatics, advance article (2018). doi: 10.1093/bioinformatics/bty482
  2. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399
  3. Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM: A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ, 6, e5126 (2018). doi: 10.7717/peerj.5126
  4. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628

Published paper: Metaxa2 Database Builder

One of the questions I have received regarding Metaxa2 is if it is possible to use it on other DNA barcodes. My answer has been “technically, yes, but it is a very cumbersome process of creating a custom database for every additional barcode”. Not anymore: the newly introduced Metaxa2 Database Builder makes this process automatic, with the user just supplying a FASTA file of sequences from the region in question and a file containing the taxonomy information for the sequences (in GenBank, INSD XML, Metaxa2 or SILVA-style formats). The preprint (1) has been out for some time, but today Bioinformatics published the paper describing the software (2).

The paper not only details how the database builder works, but also shows that it works on a number of different barcoding regions, albeit with different results in terms of accuracy. Still, even with seemingly high misclassification rates for some DNA barcodes, the software performs better than a simple BLAST-based taxonomic assignment (76.5% vs. 41.4% correct classifications for matK, and 76.2% vs. 45.1% for trnL). The database builder has already found use in building a COI database for arthropods (3), and we envision a range of uses in the near future.

As the paper is now published, I have also moved the Metaxa2 software (4) from beta status to a full-fledged version 2.2 update. Hopefully, this release should be bug free, but my experience is that when the community gets their hands on the software they tend to discover things our team has missed. I would like to thank the entire team working on this, particularly Rodney Richardson (who initiated this entire thing) and Henrik Nilsson. The software can be downloaded here. Happy barcoding!

References

  1. Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Taxonomic identification from metagenomic or metabarcoding data using any genetic marker. bioRxiv 253377 (2018). doi: 10.1101/253377 [Link]
  2. Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker. Bioinformatics, advance article (2018). doi: 10.1093/bioinformatics/bty482 [Paper link]
  3. Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM: A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ Preprints, 6, e26662v1 (2018). doi: 10.7287/peerj.preprints.26662v1 [Link]
  4. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]

Published paper: A novel Na-binding site in sialic acid symporters

I have been quite occupied with other things the last couple of days, so I am late on the ball here. Anyway, on May 1st, Nature Communications published a paper on the protein structure of SiaT, a sialic acid transporter from Proteus mirabilis (1). Many pathogens use sialic acids as an energy source or as an external coating to evade the immune defense (2). Therefore, many bacteria that colonize sialylated environments have transporters which specifically import sialic acids. SiaT is one of those transporters, belonging to the sodium solute symporter (SSS) family (3) (which for some weird reason is associated with the Pfam family “SSF”, an eternal source of confusion in discussions within this project). The SSS proteins use Na+ gradients to drive the import of desired substrates (4). Based on the protein structure, our team found that SiaT binds two Na+ ions. One binds to the conserved, well-known, Na2 site, but the other Na+ binds to a new position, which we term Na3. This position (this is where my part of the work comes in) is conserved in many SSS family members. We finally used functional and molecular dynamics studies to validate the substrate-binding site and demonstrate that both Na+ sites regulate N-acetylneuraminic acid transport.

As I hinted, I am not venturing into protein structures – that part of this work has been performed by an excellent team associated with Dr. Rosmarie Friemann. Instead, my part is essentially summarized in these two sentences of the manuscript: “We analysed all SSS sequences that contained the primary Na2 site (21,467) to determine the degree of conservation of the Na3 site, allowing for threonine at either Ser345 or Ser346. Na3 is present in 19.6% (4212) of these sequences including hSGLT1, which transports two Na+, but not vSGLT or hSGLT2, which transport only one Na+” (1). That’s a few months of work condensed into 55 words. Still, the exciting thing here is that we find an evolutionarily conserved Na-binding site, which has so far eluded detection.
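In spirit, the conservation count boils down to checking, for every aligned sequence, whether the residues at the relevant alignment columns are serine or threonine, and then reporting the fraction of sequences that pass. The sketch below is a heavily simplified illustration and not the actual analysis: it only looks at two columns standing in for Ser345 and Ser346, and the function names, toy alignment and column indices are invented.

```python
def has_site(aligned_seq, col_a, col_b, allowed=("S", "T")):
    """True if both alignment columns hold an allowed residue (Ser or Thr)."""
    return aligned_seq[col_a] in allowed and aligned_seq[col_b] in allowed


def conservation(aligned_seqs, col_a, col_b):
    """Count and fraction of sequences carrying the site."""
    hits = sum(1 for seq in aligned_seqs if has_site(seq, col_a, col_b))
    return hits, hits / len(aligned_seqs)


# Toy alignment; columns 3 and 4 stand in for Ser345/Ser346.
toy = ["MKLSSA", "MKLTSA", "MKLAGA", "MKLSTA", "MKL-SA"]
print(conservation(toy, 3, 4))  # (3, 0.6)

# Arithmetic check of the figure quoted from the paper.
print(f"{4212 / 21467:.1%}")  # 19.6%
```

The last line simply confirms that 4212 out of 21,467 sequences corresponds to the 19.6% given in the quote.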

The results of this work provide a better understanding of how secondary active transporters harness additional energy from ion gradients. It may be possible to exploit differences in this mechanism between different SSS family members (and other transporters with the LeuT fold) to develop new antimicrobials, something that is urgently needed in the face of the rapidly increasing antibiotic resistance.

The structure of Proteus mirabilis SiaT

References

  1. Wahlgren WY°, North RA°, Dunevall E°, Paz A, Scalise M, Bisognano P, Bengtsson-Palme J, Goyal P, Claesson E, Caing-Carlsson R, Andersson R, Beis K, Nilsson U, Farewell A, Pochini L, Indiveri C, Grabe M, Dobson RCJ, Abramson J, Ramaswamy S, Friemann R: Substrate-bound outward-open structure of a Na+-coupled sialic acid symporter reveals a novel Na+ site. Nature Communications, 9, 1753 (2018). doi: 10.1038/s41467-018-04045-7
  2. Vimr ER, Kalivoda KA, Deszo EL, Steenburgen SM: Diversity of microbial sialic acid metabolism. Microbiology and Molecular Biology Reviews, 68, 132–153 (2004).
  3. North RA, Horne CR, Davies JS, Remus DM, Muscroft-Taylor AC, Goyal P, Wahlgren WY, Ramaswamy S, Friemann R, Dobson RCJ: “Just a spoonful of sugar…”: import of sialic acid across bacterial cell membranes. Biophysical Reviews, 10, 219–227 (2017).
  4. Severi E, Hosie AH, Hawkhead JA, Thomas GH: Characterization of a novel sialic acid transporter of the sodium solute symporter (SSS) family and in vivo comparison with known bacterial sialic acid transporters. FEMS Microbiology Letters, 304, 47–54 (2010).

New preprint: benchmarking resistance gene identification

This weekend, F1000Research put online the non-peer-reviewed version of the paper resulting from a workshop arranged by the JRC in Italy last year (1). (I will refer to this as a preprint, but at F1000Research the line is quite blurry between preprint and published paper.) The paper describes various challenges arising from the process of designing a benchmark strategy for bioinformatics pipelines (2) in the identification of antimicrobial resistance genes in next generation sequencing data.

The paper discusses issues about the benchmarking datasets used, testing samples, evaluation criteria for the performance of different tools, and how the benchmarking dataset should be created and distributed. Specifically, we address the following questions:

  • How should a benchmark strategy handle the current and expanding universe of NGS platforms?
  • What should be the quality profile (in terms of read length, error rate, etc.) of in silico reference materials?
  • Should different sets of reference materials be produced for each platform? In that case, how to ensure no bias is introduced in the process?
  • Should in silico reference material be composed of the output of real experiments, or simulated read sets? If a combination is used, what is the optimal ratio?
  • How is it possible to ensure that the simulated output has been simulated “correctly”?
  • For real experiment datasets, how to avoid the presence of sensitive information?
  • Regarding the quality metrics in the benchmark datasets (e.g. error rate, read quality), should these values be fixed for all datasets, or fall within specific ranges? How wide can/should these ranges be?
  • How should the benchmark manage the different mechanisms by which bacteria acquire resistance?
  • What is the set of resistance genes/mechanisms that need to be included in the benchmark? How should this set be agreed upon?
  • Should datasets representing different sample types (e.g. isolated clones, environmental samples) be included in the same benchmark?
  • Is a correct representation of different bacterial species (host genomes) important?
  • How can the “true” value of the samples, against which the pipelines will be evaluated, be guaranteed?
  • What is needed to demonstrate that the original sample has been correctly characterised, in case real experiments are used?
  • How should the target performance thresholds (e.g. specificity, sensitivity, accuracy) for the benchmark suite be set?
  • What is the impact of these performance thresholds on the required size of the sample set?
  • How can the benchmark stay relevant when new resistance mechanisms are regularly characterized?
  • How is the continued quality of the benchmark dataset ensured?
  • Who should generate the benchmark resource?
  • How can the benchmark resource be efficiently shared?

Of course, we have not answered all these questions, but I think we have come down to a decent description of the problems, which we see as an important foundation for solving these issues and implementing the benchmarking standard. Some of these issues were tackled in our review paper from last year on using metagenomics to study resistance genes in microbial communities (3). The paper also somewhat connects to the database curation paper we published in 2016 (4), although this time the strategies deal with the testing datasets rather than the actual databases. The paper is the first outcome of the workshop arranged by the JRC on “Next-generation sequencing technologies and antimicrobial resistance” held October 4-5 last year in Ispra, Italy. You can find the paper here (it’s open access).
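To make the performance measures mentioned in the questions above concrete, here is a small Python sketch of how sensitivity, specificity and accuracy could be computed for one pipeline on one benchmark sample, given the set of resistance genes truly present and the set the pipeline reports. This is only an illustration under my own assumptions: the paper does not prescribe this scheme, and the function name and the toy gene sets are made up.

```python
def benchmark_metrics(truth, predicted, universe):
    """Sensitivity, specificity and accuracy for one pipeline on one sample.

    truth:     set of genes known to be present in the sample
    predicted: set of genes reported by the pipeline
    universe:  set of all genes covered by the benchmark
    """
    tp = len(truth & predicted)             # present and reported
    fp = len(predicted - truth)             # reported but absent
    fn = len(truth - predicted)             # present but missed
    tn = len(universe - truth - predicted)  # absent and not reported
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / total if total else None,
    }


# Toy benchmark with five genes; the truth and prediction sets are invented.
universe = {"blaTEM", "tetA", "sul1", "qnrS", "mecA"}
truth = {"blaTEM", "tetA", "sul1"}
predicted = {"blaTEM", "tetA", "qnrS"}
print(benchmark_metrics(truth, predicted, universe))
```

Even this toy case hints at one of the open questions: with only five genes in the benchmark, a single wrong call moves each metric by a large step, so the achievable resolution of any performance threshold depends directly on the size of the sample set.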

References and notes

  1. Angers-Loustau A, Petrillo M, Bengtsson-Palme J, Berendonk T, Blais B, Chan KG, Coque TM, Hammer P, Heß S, Kagkli DM, Krumbiegel C, Lanza VF, Madec J-Y, Naas T, O’Grady J, Paracchini V, Rossen JWA, Ruppé E, Vamathevan J, Venturi V, Van den Eede G: The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research, 7, 459 (2018). doi: 10.12688/f1000research.14509.1
  2. You may remember that I hate the term “pipeline” for bioinformatics protocols. I would have preferred if it was called workflows or similar, but the term “pipeline” has taken hold and I guess this is a battle where I have essentially lost. The bioinformatics workflows will be known as pipelines, for better and worse.
  3. Bengtsson-Palme J, Larsson DGJ, Kristiansson E: Using metagenomics to investigate human and environmental resistomes. Journal of Antimicrobial Chemotherapy, 72, 2690–2703 (2017). doi: 10.1093/jac/dkx199
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034

Published paper: Annotating fungi from the built environment part II

MycoKeys earlier this week published a paper describing the results of a workshop in Aberdeen in April last year, where we refined annotations for fungal ITS sequences from the built environment (1). This was a follow-up on a workshop in May 2016 (2) and the results have been implemented in the UNITE database and shared with other online resources. The paper has also been highlighted at microBEnet. I have very little time to further comment on this at this very moment, but I believe, as I wrote last time, that distributed initiatives like this (and the ones I have been involved in in the past (3,4)) serve a very important purpose for establishing better annotation of sequence data (5). The full paper can be found here.

References

  1. Nilsson RH, Taylor AFS, Adams RI, Baschien C, Bengtsson-Palme J, Cangren P, Coleine C, Daniel H-M, Glassman SI, Hirooka Y, Irinyi L, Iršenaite R, Martin-Sánchez PM, Meyer W, Oh S-O, Sampaio JP, Seifert KA, Sklenár F, Stubbe D, Suh S-O, Summerbell R, Svantesson S, Unterseher M, Visagie CM, Weiss M, Woudenberg J, Wurzbacher C, Van den Wyngaert S, Yilmaz N, Yurkov A, Kõljalg U, Abarenkov K: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from an April 10-11, 2017 workshop (Aberdeen, UK). MycoKeys, 28, 65–82 (2018). doi: 10.3897/mycokeys.28.20887 [Paper link]
  2. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Unterseher M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  4. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  5. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034

Bug hunting in the Metaxa2 beta

Due to an extremely embarrassing for-loop error in the classifier of the most recent Metaxa2 beta (beta 8), which was released a few weeks ago, the classifier would often (on certain platforms and configurations) enter an endless loop and hang. I apologize for this mistake, which has been corrected in the new beta 9 released today, available from this download link. No other changes have been made since the previous version. Thanks for your patience (and thanks to Kaisa Thorell for first bringing the error to my attention!)