Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery

I have been quite occupied with other things the last couple of days, so I am late on the ball here. Anyway, on May 1st, Nature Communications published a paper on the protein structure of SiaT, a sialic acid transporter from Proteus mirabilis (1). Many pathogens use sialic acids as an energy source or as an external coating to evade the immune defense (2). Therefore, many bacteria that colonize sialylated environments have transporters which specifically import sialic acids. SiaT is one of those transporters, belonging to the sodium solute symporter (SSS) family (3) (which for some weird reason is associated with the Pfam family “SSF”, an eternal source of confusion in discussions within this project). The SSS proteins use Na+ gradients to drive the import of desired substrates (4). Based on the protein structure, our team found that SiaT binds two Na+ ions. One binds to the conserved, well-known Na2 site, but the other Na+ binds to a new position, which we term Na3. This position (this is where my part of the work comes in) is conserved in many SSS family members. Finally, we used functional and molecular dynamics studies to validate the substrate-binding site and demonstrate that both Na+ sites regulate N-acetylneuraminic acid transport.

As I hinted, I am not venturing into protein structures – that part of this work has been performed by an excellent team associated with Dr. Rosmarie Friemann. Instead, my part is essentially summarized in these two sentences of the manuscript: “We analysed all SSS sequences that contained the primary Na2 site (21,467) to determine the degree of conservation of the Na3 site, allowing for threonine at either Ser345 or Ser346. Na3 is present in 19.6% (4212) of these sequences including hSGLT1, which transports two Na+, but not vSGLT or hSGLT2, which transport only one Na+” (1). That’s a few months of work condensed into 55 words. Still, the exciting thing here is that we find an evolutionarily conserved Na-binding site, which has so far eluded detection.
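To make the numbers in the quote concrete, here is a minimal Python sketch. The first part only re-derives the 19.6% figure from the two counts given in the paper. The second part illustrates one possible reading of "allowing for threonine at either Ser345 or Ser346", namely that both corresponding alignment columns must hold serine or threonine. The function names, the toy alignment and the column indices are hypothetical and are not taken from the actual analysis, which also involved the other residues of the site.

```python
def na3_fraction(n_with_na3: int, n_with_na2: int) -> float:
    """Fraction of Na2-containing sequences that also carry the Na3 site."""
    return n_with_na3 / n_with_na2


def has_ser_or_thr(aligned_seq: str, columns: tuple[int, ...]) -> bool:
    """True if every given alignment column holds Ser (S) or Thr (T)."""
    return all(aligned_seq[c] in "ST" for c in columns)


# Counts quoted from the paper: 4212 of 21,467 sequences carry Na3.
fraction = na3_fraction(4212, 21467)
print(f"{fraction:.1%}")  # 19.6%

# Hypothetical toy alignment; columns 3 and 4 stand in for the alignment
# columns that correspond to Ser345 and Ser346 in SiaT.
toy_columns = (3, 4)
toy_alignment = {
    "seq_a": "MKLSSG",  # Ser/Ser
    "seq_b": "MKLTSG",  # Thr/Ser
    "seq_c": "MKLAAG",  # Ala/Ala, no candidate site
}
hits = [name for name, seq in toy_alignment.items()
        if has_ser_or_thr(seq, toy_columns)]
print(hits)  # ['seq_a', 'seq_b']
```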

The results of this work provide a better understanding of how secondary active transporters harness additional energy from ion gradients. It may be possible to exploit differences in this mechanism between different SSS family members (and other transporters with the LeuT fold) to develop new antimicrobials, something that is urgently needed in the face of the rapidly increasing antibiotic resistance.

The structure of Proteus mirabilis SiaT

References

  1. Wahlgren WY°, North RA°, Dunevall E°, Paz A, Scalise M, Bisognano P, Bengtsson-Palme J, Goyal P, Claesson E, Caing-Carlsson R, Andersson R, Beis K, Nilsson U, Farewell A, Pochini L, Indiveri C, Grabe M, Dobson RCJ, Abramson J, Ramaswamy S, Friemann R: Substrate-bound outward-open structure of a Na+-coupled sialic acid symporter reveals a novel Na+ site. Nature Communications, 9, 1753 (2018). doi: 10.1038/s41467-018-04045-7
  2. Vimr ER, Kalivoda KA, Deszo EL, Steenburgen SM: Diversity of microbial sialic acid metabolism. Microbiology and Molecular Biology Reviews, 68, 132–153 (2004).
  3. North RA, Horne CR, Davies JS, Remus DM, Muscroft-Taylor AC, Goyal P, Wahlgren WY, Ramaswamy S, Friemann R, Dobson RCJ: “Just a spoonful of sugar…”: import of sialic acid across bacterial cell membranes. Biophysical Reviews, 10, 219–227 (2017).
  4. Severi E, Hosie AH, Hawkhead JA, Thomas GH: Characterization of a novel sialic acid transporter of the sodium solute symporter (SSS) family and in vivo comparison with known bacterial sialic acid transporters. FEMS Microbiology Letters, 304, 47–54 (2010).

My colleagues in Gothenburg have published a new paper in Environment International, in which I was involved in the bioinformatics analyses. In the paper, for which Nadine Kraupner did the lion’s share of the work, we establish minimal selective concentrations (MSCs) for resistance to the antibiotic ciprofloxacin in Escherichia coli grown in complex microbial communities (1). We also determine the community responses at the taxonomic and resistance gene levels. Nadine has made use of Sara Lundström’s aquarium system (2) to grow biofilms under exposure to sublethal levels of antibiotics. Using the system, we find that 1 μg/L ciprofloxacin selects for the resistance gene qnrD, while 10 μg/L ciprofloxacin is required to detect changes in phenotypic resistance. In short, the different endpoints studied (and their corresponding MSCs) were:

  • CFU counts from test tubes, grown on R2A plates with 2 mg/L ciprofloxacin – MSC = 5 μg/L
  • CFU counts from aquaria, grown on R2A plates with 0.25 or 2 mg/L ciprofloxacin – MSC = 10 μg/L
  • Chromosomal resistance mutations – MSC ~ 10 μg/L
  • Increased resistance gene abundances, metagenomics – MSC range: 1 μg/L
  • Changes to taxonomic diversity – 1 µg/L
  • Changes to taxonomic community composition – MSC ~ 1-10 μg/L

We have previously reported a predicted no-effect concentration (PNEC) for resistance of 0.064 µg/L for ciprofloxacin (3), which corresponds fairly well with the MSCs determined experimentally here, being around a factor of ten off. However, we cannot exclude that in other experimental systems, ciprofloxacin could be selective at even lower concentrations, and thus the PNEC may still be relevant. The selective concentrations we report for ciprofloxacin are close to those that have been reported in sewage treatment plants (3-5), suggesting the possibility for weak selection of resistance. Several recent reports have underscored the need to populate the thus far conceptual models for resistance development in the environment with actual numbers (6-10). Determining selective concentrations for different antibiotics in actual community settings is an important step on the road towards building accurate quantitative models for resistance emergence and propagation.
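For readers who want to see how far apart the predicted and measured values are, the comparison can be written out as a short Python sketch. It uses only the PNEC of 0.064 µg/L and three of the MSCs listed above; the dictionary keys are just labels chosen for this example. The lowest MSC (1 µg/L) comes out at roughly 16 times the PNEC, so "around a factor of ten" should be read as an order-of-magnitude statement.

```python
# Predicted no-effect concentration for ciprofloxacin (ref. 3), in ug/L.
PNEC_UG_PER_L = 0.064

# Experimentally determined MSCs from the list above, in ug/L.
msc_ug_per_l = {
    "CFU counts, test tubes": 5.0,
    "CFU counts, aquaria": 10.0,
    "resistance gene abundances (metagenomics)": 1.0,
}

# How many times higher each measured MSC is than the PNEC.
fold_above_pnec = {
    endpoint: msc / PNEC_UG_PER_L for endpoint, msc in msc_ug_per_l.items()
}

for endpoint, fold in fold_above_pnec.items():
    print(f"{endpoint}: {fold:.1f}x the PNEC")
```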

References

  1. Kraupner N, Ebmeyer S, Bengtsson-Palme J, Fick J, Kristiansson E, Flach C-F, Larsson DGJ: Selective concentration for ciprofloxacin in Escherichia coli grown in complex aquatic bacterial biofilms. Environment International, 116, 255–268 (2018). doi: 10.1016/j.envint.2018.04.029
  2. Lundström SV, Östman M, Bengtsson-Palme J, Rutgersson C, Thoudal M, Sircar T, Blanck H, Eriksson KM, Tysklind M, Flach C-F, Larsson DGJ: Minimal selective concentrations of tetracycline in complex aquatic bacterial biofilms. Science of the Total Environment, 553, 587–595 (2016). doi: 10.1016/j.scitotenv.2016.02.103
  3. Bengtsson-Palme J, Larsson DGJ: Concentrations of antibiotics predicted to select for resistant bacteria: Proposed limits for environmental regulation. Environment International, 86, 140-149 (2016). doi: 10.1016/j.envint.2015.10.015
  4. Michael I, Rizzo L, McArdell CS, Manaia CM, Merlin C, Schwartz T, Dagot C, Fatta-Kassinos D: Urban wastewater treatment plants as hotspots for the release of antibiotics in the environment: a review. Water Research, 47, 957–995 (2013). doi:10.1016/j.watres.2012.11.027
  5. Bengtsson-Palme J, Hammarén R, Pal C, Östman M, Björlenius B, Flach C-F, Kristiansson E, Fick J, Tysklind M, Larsson DGJ: Elucidating selection processes for antibiotic resistance in sewage treatment plants using metagenomics. Science of the Total Environment, 572, 697–712 (2016). doi: 10.1016/j.scitotenv.2016.06.228
  6. Ågerstrand M, Berg C, Björlenius B, Breitholtz M, Brunstrom B, Fick J, Gunnarsson L, Larsson DGJ, Sumpter JP, Tysklind M, Rudén C: Improving environmental risk assessment of human pharmaceuticals. Environmental Science and Technology (2015). doi:10.1021/acs.est.5b00302
  7. Bengtsson-Palme J, Kristiansson E, Larsson DGJ: Environmental factors influencing the development and spread of antibiotic resistance. FEMS Microbiology Reviews, 42, 1, 68–80 (2018). doi: 10.1093/femsre/fux053
  8. Joint Programming Initiative on Antimicrobial Resistance: JPIAMR Workshop on Environmental Dimensions of AMR: Summary and recommendations. JPIAMR (2017).
  9. Angers A, Petrillo M, Patak A, Querci M, Van den Eede G: The Role and Implementation of Next-Generation Sequencing Technologies in the Coordinated Action Plan against Antimicrobial Resistance. JRC Conference and Workshop Report, EUR 28619 (2017). doi: 10.2760/745099
  10. Larsson DGJ, Andremont A, Bengtsson-Palme J, Brandt KK, de Roda Husman AM, Fagerstedt P, Fick J, Flach C-F, Gaze WH, Kuroda M, Kvint K, Laxminarayan R, Manaia CM, Nielsen KM, Ploy M-C, Segovia C, Simonet P, Smalla K, Snape J, Topp E, van Hengel A, Verner-Jeffreys DW, Virta MPJ, Wellington EM, Wernersson A-S: Critical knowledge gaps and research needs related to the environmental dimensions of antibiotic resistance. Environment International, in press (2018). doi: 10.1016/j.envint.2018.04.041

This weekend, F1000Research put online the non-peer-reviewed version of the paper resulting from a workshop arranged by the JRC in Italy last year (1). (I will refer to this as a preprint, but at F1000Research the line is quite blurry between preprint and published paper.) The paper describes various challenges arising from the process of designing a benchmark strategy for bioinformatics pipelines (2) in the identification of antimicrobial resistance genes in next generation sequencing data.

The paper discusses issues about the benchmarking datasets used, testing samples, evaluation criteria for the performance of different tools, and how the benchmarking dataset should be created and distributed. Specifically, we address the following questions:

  • How should a benchmark strategy handle the current and expanding universe of NGS platforms?
  • What should be the quality profile (in terms of read length, error rate, etc.) of in silico reference materials?
  • Should different sets of reference materials be produced for each platform? In that case, how to ensure no bias is introduced in the process?
  • Should in silico reference material be composed of the output of real experiments, or simulated read sets? If a combination is used, what is the optimal ratio?
  • How is it possible to ensure that the simulated output has been simulated “correctly”?
  • For real experiment datasets, how to avoid the presence of sensitive information?
  • Regarding the quality metrics in the benchmark datasets (e.g. error rate, read quality), should these values be fixed for all datasets, or fall within specific ranges? How wide can/should these ranges be?
  • How should the benchmark manage the different mechanisms by which bacteria acquire resistance?
  • What is the set of resistance genes/mechanisms that need to be included in the benchmark? How should this set be agreed upon?
  • Should datasets representing different sample types (e.g. isolated clones, environmental samples) be included in the same benchmark?
  • Is a correct representation of different bacterial species (host genomes) important?
  • How can the “true” value of the samples, against which the pipelines will be evaluated, be guaranteed?
  • What is needed to demonstrate that the original sample has been correctly characterised, in case real experiments are used?
  • How should the target performance thresholds (e.g. specificity, sensitivity, accuracy) for the benchmark suite be set?
  • What is the impact of these performance thresholds on the required size of the sample set?
  • How can the benchmark stay relevant when new resistance mechanisms are regularly characterized?
  • How is the continued quality of the benchmark dataset ensured?
  • Who should generate the benchmark resource?
  • How can the benchmark resource be efficiently shared?
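Several of the questions above refer to performance thresholds such as specificity, sensitivity and accuracy. As a reminder of what these mean when a pipeline's resistance gene calls are compared against a "true" answer, here is a minimal Python sketch using the standard definitions. The class name and the counts are made up for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Outcome counts for one pipeline on one benchmark dataset."""
    tp: int  # resistance genes present and reported
    fp: int  # resistance genes reported but not present
    tn: int  # genes absent and not reported
    fn: int  # resistance genes present but missed

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total


# Hypothetical counts for a single benchmark run.
run = ConfusionCounts(tp=45, fp=10, tn=940, fn=5)
print(f"sensitivity: {run.sensitivity():.3f}")
print(f"specificity: {run.specificity():.3f}")
print(f"accuracy:    {run.accuracy():.3f}")
```

Note how accuracy stays high here mostly because true negatives dominate, which is one reason the choice of threshold metrics matters for a benchmark like this.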

Of course, we have not answered all these questions, but I think we have come down to a decent description of the problems, which we see as an important foundation for solving these issues and implementing the benchmarking standard. Some of these issues were tackled in our review paper from last year on using metagenomics to study resistance genes in microbial communities (3). The paper also somewhat connects to the database curation paper we published in 2016 (4), although this time the strategies deal with the testing datasets rather than the actual databases. The paper is the first outcome of the workshop arranged by the JRC on “Next-generation sequencing technologies and antimicrobial resistance” held October 4-5 last year in Ispra, Italy. You can find the paper here (it’s open access).

References and notes

  1. Angers-Loustau A, Petrillo M, Bengtsson-Palme J, Berendonk T, Blais B, Chan KG, Coque TM, Hammer P, Heß S, Kagkli DM, Krumbiegel C, Lanza VF, Madec J-Y, Naas T, O’Grady J, Paracchini V, Rossen JWA, Ruppé E, Vamathevan J, Venturi V, Van den Eede G: The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research, 7, 459 (2018). doi: 10.12688/f1000research.14509.1
  2. You may remember that I hate the term “pipeline” for bioinformatics protocols. I would have preferred if it was called workflows or similar, but the term “pipeline” has taken hold and I guess this is a battle where I have essentially lost. The bioinformatics workflows will be known as pipelines, for better and worse.
  3. Bengtsson-Palme J, Larsson DGJ, Kristiansson E: Using metagenomics to investigate human and environmental resistomes. Journal of Antimicrobial Chemotherapy, 72, 2690–2703 (2017). doi: 10.1093/jac/dkx199
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034

I was recently invited to review a manuscript for a journal (1). After half of the time until the review deadline had passed, I received a mail stating that “In the interest of your time and the authors’ time, I am making a decision without the benefit of your input.” While I do understand that big journals receive many submissions and that the editor made this decision in the interest of time, I think that it should also be kept in mind that I had already spent approximately three hours scrutinizing the manuscript. Due to the decision of the editor, these are now three hours of work down the drain.

Furthermore, I was not informed what the decision was, and my access to the paper in the reviewing system was revoked. This means that I don’t even know if my opinions are concordant with the rest of the reviewers or not. Perhaps I had actually spotted crucial errors that the other reviewers had missed? Or maybe the paper was rejected, and my input was therefore no longer needed. I don’t know, because I was not informed.

These days, I receive many requests to perform manuscript reviews. A journal treating its reviewers like this causes me to lose all willingness to review for that journal again. To me, making a decision to dismiss reviewers without even asking them if they are about to submit comments, signals that a journal does not value its peer reviewers, and that I can spend my time better elsewhere. Similar to authors and editors, I do not want to waste my time on tasks that end up being of no use.

On the upside, this decision by the editor has freed some time for me to write up this rant, including the following advice: If you are an editor of a journal and you want to keep the reviewers (who, I remind you, work for free and are largely unrecognized for their work) happy, try to avoid pissing them off by dismissing their work. It does not hurt to ask them if they are about to submit their comments, or if they – given that a sufficient number of review reports have been submitted – would rather withdraw from the review process. This may add an extra day or two to the process, but I think that in the long run authors, editors and reviewers alike would agree that the overall quality of the peer review process would benefit from those few extra days.

Footnotes

  1. I am not going to name the journal here, nor the identity of the editor, because that is not my point. I am not after singling out certain people here, but I want to address an overall behavior that annoys me. That said, the last three papers I have been a co-author on in this journal took five to seven weeks from the authors’ correction of the proofs until publication online. I find it stunning that with these delays, the journal dismisses the reviewers it has invited because they don’t produce peer reviews quicker than the deadline proposed by the journal. The real bottleneck in this process is not extensive review times, at least not in my experience.

I’ve been having a very intense start of the year with the move to the US and getting the family accustomed to Madison (which has taken time and energy, but gone really well). I just wanted to make you aware that I have started posting at the Wisconsin Blog again and hope to be sharing research-related stuff from my year in the US there. For more personal stuff, our family has set up a blog (in Swedish) at this address: https://palmeiamerikat.blogspot.com. You are very welcome to follow our adventure there!

MycoKeys earlier this week published a paper describing the results of a workshop in Aberdeen in April last year, where we refined annotations for fungal ITS sequences from the built environment (1). This was a follow-up on a workshop in May 2016 (2) and the results have been implemented in the UNITE database and shared with other online resources. The paper has also been highlighted at microBEnet. I have very little time to further comment on this at this very moment, but I believe, as I wrote last time, that distributed initiatives like this (and the ones I have been involved in in the past (3,4)) serve a very important purpose for establishing better annotation of sequence data (5). The full paper can be found here.

References

  1. Nilsson RH, Taylor AFS, Adams RI, Baschien C, Bengtsson-Palme J, Cangren P, Coleine C, Daniel H-M, Glassman SI, Hirooka Y, Irinyi L, Iršenaite R, Martin-Sánchez PM, Meyer W, Oh S-O, Sampaio JP, Seifert KA, Sklenár F, Stubbe D, Suh S-O, Summerbell R, Svantesson S, Unterseher M, Visagie CM, Weiss M, Woudenberg J, Wurzbacher C, Van den Wyngaert S, Yilmaz N, Yurkov A, Kõljalg U, Abarenkov K: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from an April 10-11, 2017 workshop (Aberdeen, UK). MycoKeys, 28, 65–82 (2018). doi: 10.3897/mycokeys.28.20887
  2. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Unterseher M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  4. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  5. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034

Due to an extremely embarrassing for-loop error in the classifier of the most recent Metaxa2 beta (beta 8), which was released a few weeks ago, the classifier often would (on certain platforms and configurations) enter an endless loop and hang. I apologize for this mistake, which has been corrected in the new beta 9 released today, available from this download link. No other changes have been made since the previous version. Thanks for your patience (and thanks to Kaisa Thorell for first bringing the error to my attention!)

I am very happy to announce that a first public beta version of Metaxa2 version 2.2 has been released today! This new version brings two big and a number of small improvements to the Metaxa2 software (1). The first major addition is the introduction of the Metaxa2 Database Builder, which allows the user to create custom databases for virtually any genetic barcoding region. The second addition, which is related to the first, is that the classifier has been rewritten to have a more solid mathematical foundation. I have been promising that these updates were coming “soon” for one and a half years, but finally the end-product is good enough to see some real-world testing. Bear in mind though that this is still a beta version that could contain obscure bugs. Here follows a list of new features (with further elaboration on a few below):

  • The Metaxa2 Database Builder
  • Support for additional barcoding genes, virtually any genetic region can now be used for taxonomic classification in Metaxa2
  • The Metaxa2 database repository, which can be accessed through the new metaxa2_install_database tool
  • Improved classification scoring model for better clarity and sensitivity
  • A bundled COI database for arthropods, showing off the capabilities of the database builder
  • Support for compressed input files (gzip, zip, bzip, dsrc)
  • Support for auto-detection of database locations
  • Added output of probable taxonomic origin for sequences with reliability scores at each rank, made possible by the updated classifier
  • Added the -x option for running only the extraction without the classification step
  • Improved memory handling for very large rRNA datasets in the classifier (millions of sequences)
  • This update also fixes a bug in the metaxa2_rf tool that could cause bias in very skewed datasets with small numbers of taxa

The new version of Metaxa2 can be downloaded here, and for those interested I will spend the rest of this post outlining the Metaxa2 Database Builder. The information below is also available in a slightly extended version in the software manual.

The major enhancement in Metaxa2 version 2.2 is the ability to use custom databases for classification. This means that the user can now make their own database for their own barcoding region of choice, or download additional databases from the Metaxa2 Database Repository. The selection of other databases is made through the “-g” option already existing in Metaxa2. As part of these changes, we have also updated the classification scoring model for better stringency and sensitivity across multiple databases and different genes. The old scoring system can still be used by setting the --scoring_model option to “old”.
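As a rough illustration of how these two options combine on the command line, here is a small Python sketch that only assembles and prints a Metaxa2 invocation without running it. The "-g" and scoring model options are the ones described above (the latter typed here with a double dash, which is an assumption about how it is entered). The "-i" and "-o" options for input and output, the file names and the database name "COI" are assumptions made for this example, so check the software manual before relying on them.

```python
import shlex


def metaxa2_command(input_fasta, output_prefix, database=None, old_scoring=False):
    """Assemble (but do not run) a Metaxa2 command line as a list of arguments."""
    # -i and -o are assumed here to be the input and output options.
    args = ["metaxa2", "-i", input_fasta, "-o", output_prefix]
    if database is not None:
        # -g selects which database to classify against.
        args += ["-g", database]
    if old_scoring:
        # Fall back to the pre-2.2 scoring system.
        args += ["--scoring_model", "old"]
    return args


# Hypothetical file names and database name.
cmd = metaxa2_command("reads.fasta", "sample1", database="COI", old_scoring=True)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string makes it straightforward to hand over to `subprocess.run` later without worrying about shell quoting.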

There are two main operating modes of the Metaxa2 Database Builder, as well as a hybrid mode combining the features of the two other modes. The divergent and conserved modes work in almost completely different ways and deal with two different types of barcoding regions. The divergent mode is designed to deal with barcoding regions that exhibit fairly large variation between taxa within the same taxonomic domain. Such regions include, e.g., the eukaryotic ITS region, or the trnL gene used for plant barcoding. In the other mode – the conserved mode – a highly conserved barcoding region is expected (at least within the different taxonomic domains). Genes that fall into this category would be, e.g., the 16S SSU rRNA, and the bacterial rpoB gene. This option would most likely also be suitable for barcoding within certain groups of e.g. plants, where similarity of the barcoding regions can be expected to be high. There is also a third mode – the hybrid mode – that incorporates features of both the others. The hybrid mode is more experimental in nature, but could be useful in situations where both the other modes perform worse than desired.

In the divergent (default) mode, the database builder will start by clustering the input sequences at 20% identity using USEARCH (2). All clusters generated from this process are then individually aligned using MAFFT (3). Those alignments are split into two regions, which are used to build two hidden Markov models for each cluster of sequences. These models will be less precise, but more sensitive than those generated in the conserved mode. In the divergent mode, the database builder will attempt to extract full-length sequences from the input data, but this may be less successful than in the conserved mode.

In the conserved mode, on the other hand, the database builder will first extract the barcoding region from the input sequences using models built from a user-provided reference sequence and the Metaxa2 extractor (1). It will then align all the extracted sequences using MAFFT and determine the conservation of each position in the alignment. When the criteria for degree of conservation are met, all conserved regions are extracted individually and are then re-aligned separately using MAFFT. The re-aligned sequences are used to build hidden Markov models representing the conserved regions with HMMER (4). In this mode, the classification database will only consist of the extracted full-length sequences.
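The step of determining per-position conservation and picking out conserved regions can be illustrated with a minimal Python sketch. This is not the code used in the database builder. Here conservation is assumed to mean the fraction of sequences sharing the most common non-gap residue in a column, and the threshold of 0.8, the minimum region length of 3 and the toy alignment are all made up for the example.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common non-gap residue."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n_seqs)
    return scores


def conserved_regions(scores, threshold=0.8, min_length=3):
    """Half-open (start, end) intervals where scores stay at or above threshold."""
    regions, start = [], None
    # The trailing 0.0 is a sentinel that closes a run reaching the last column.
    for i, score in enumerate(scores + [0.0]):
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    return regions


# Toy alignment (equal-length rows, '-' marks a gap).
toy_alignment = [
    "ACGTACGGTTAC",
    "ACGTTCGGTTAC",
    "ACGTGAGGTTAC",
    "ACGT-TGGTTAC",
]
scores = column_conservation(toy_alignment)
regions = conserved_regions(scores)
print(regions)  # [(0, 4), (6, 12)]
```

In the toy data, columns 4 and 5 are variable and split the alignment into two conserved stretches, each of which would then be re-aligned and turned into its own model in the workflow described above.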

In the hybrid mode, finally, the database builder will cluster the input sequences at 20% identity using USEARCH, and then proceed with the conserved mode approach on each cluster separately.

The actual taxonomic classification in Metaxa2 is done using a sequence database. It was shown in the original Metaxa2 paper that replacing the built-in database with a generic non-processed one was detrimental to performance in terms of accuracy (1). In the database builder, we have tried to incorporate some of the aspects of the manual database curation we did for the built-in database that can be automated. By default, all these filtration steps are turned off, but enabling them might drastically increase the accuracy of classifications based on the database.

To assess the accuracy of the constructed database, the Metaxa2 Database Builder allows for testing the detection ability and classification accuracy of the constructed database. This is done by sub-dividing the database sequences into subsets and rebuilding the database using a smaller (by default 90%), randomly selected, set of the sequence data (5). The remaining sequences (10% by default) are then classified using Metaxa2 with the subset database. The number of detections, and the numbers of correctly or incorrectly classified entries are recorded and averaged over a number of iterations (10 by default). This allows for obtaining a picture of the lower end of the accuracy of the database. However, since the evaluation only uses a subset of all sequences included in the full database, the performance of the full database actually constructed is likely to be slightly better. The evaluation can be turned on using the “--evaluate T” option.
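The evaluation procedure is essentially a repeated hold-out test, which can be sketched in a few lines of Python. This is only an outline of the idea under simplifying assumptions: the real tool rebuilds the whole database and runs Metaxa2 on the held-out sequences, whereas here a toy nearest-neighbour function stands in for the classifier. The function names, the toy sequences and the taxon labels are hypothetical.

```python
import random


def evaluate_database(entries, classify, train_fraction=0.9, iterations=10, seed=0):
    """Repeated hold-out evaluation.

    entries:  list of (sequence, taxon) pairs
    classify: function(train_entries, sequence) -> taxon, or None if undetected
    Returns per-iteration averages of detected, correct and incorrect counts.
    """
    rng = random.Random(seed)
    totals = {"detected": 0, "correct": 0, "incorrect": 0}
    for _ in range(iterations):
        shuffled = entries[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train, held_out = shuffled[:cut], shuffled[cut:]
        for sequence, true_taxon in held_out:
            predicted = classify(train, sequence)
            if predicted is None:
                continue  # not detected at all
            totals["detected"] += 1
            totals["correct" if predicted == true_taxon else "incorrect"] += 1
    return {key: value / iterations for key, value in totals.items()}


def nearest_neighbour(train, sequence):
    """Toy stand-in classifier: taxon of the training sequence with most matches."""
    best = max(train, key=lambda e: sum(a == b for a, b in zip(e[0], sequence)))
    return best[1]


def variants(base, substitute, taxon, length=10):
    """All single-substitution variants of a homopolymer, labelled with taxon."""
    return [(base * i + substitute + base * (length - i - 1), taxon)
            for i in range(length)]


# Hypothetical database: 10 sequences for each of two taxa.
toy_entries = variants("A", "G", "taxon_A") + variants("C", "T", "taxon_B")
result = evaluate_database(toy_entries, nearest_neighbour)
print(result)  # {'detected': 2.0, 'correct': 2.0, 'incorrect': 0.0}
```

With 20 entries and a 90/10 split, two sequences are held out per iteration, and the two toy taxa are so distinct that both are always classified correctly. Real barcoding data will of course be far less forgiving.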

Metaxa2 2.2 also introduces the database repository, from which the user can download additional databases for Metaxa2. To download new databases from the repository, the metaxa2_install_database command is used. This is a simple piece of software but requires internet access to function.

References

  1. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved Identification and Taxonomic Classification of Small and Large Subunit rRNA in Metagenomic Data. Molecular Ecology Resources (2015). doi: 10.1111/1755-0998.12399
  2. Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26, 2460–2461 (2010).
  3. Katoh K, Standley DM: MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30, 772–780 (2013).
  4. Eddy SR: Accelerated profile HMM searches. PLoS Computational Biology, 7, e1002195 (2011).
  5. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, 17, 4, 760–769 (2017). doi: 10.1111/1755-0998.12628