Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery

In the 2019 database issue, Nucleic Acids Research will include a new paper on the UNITE database for molecular identification of fungi (1). I have been involved in the development of UNITE in different ways since 2012, most prominently via the ITSx (2) and Atosh software which are ticking under the hood of the database.

In this update paper, we introduce a redesigned handling of unclassifiable species hypotheses, integration with the taxonomic backbone of the Global Biodiversity Information Facility, and support for an unlimited number of parallel taxonomic classification systems. The database now contains around one million fungal ITS sequences that can be used for reference, which are clustered into roughly 459,000 species hypotheses (3). Each species hypothesis is assigned a digital object identifier (DOI), which enables unambiguous reference across studies. The paper is available as open access and the UNITE database is available open source from here.

References

  1. Nilsson RH, Larsson K-H, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, Kennedy P, Picard K, Glöckner FO, Tedersoo L, Saar I, Kõljalg U, Abarenkov K: The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Research, Advance article, gky1022 (2018). doi: 10.1093/nar/gky1022
  2. Bengtsson-Palme J, Ryberg M, Hartmann M, Branco S, Wang Z, Godhe A, De Wit P, Sánchez-García M, Ebersberger I, de Souza F, Amend AS, Jumpponen A, Unterseher M, Kristiansson E, Abarenkov K, Bertrand YJK, Sanli K, Eriksson KM, Vik U, Veldre V, Nilsson RH: Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for use in environmental sequencing. Methods in Ecology and Evolution, 4, 10, 914–919 (2013). doi: 10.1111/2041-210X.12073
  3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481

On Friday, Molecular Ecology Resources put online Christian Wurzbacher’s latest paper, of which I am also a coauthor. The paper presents three sets of general primers that allow for amplification of the complete ribosomal operon from the ribosomal tandem repeats, covering all the ribosomal markers (ETS, SSU, ITS1, 5.8S, ITS2, LSU, and IGS) (1). This paper is important because it introduces a technique to utilize third generation sequencing (PacBio and Nanopore) to generate high-quality reference data (equivalent to or better than Sanger sequencing) in a high-throughput manner. The paper shows that the accuracy of the Nanopore-generated sequences was 99.85%, which is comparable with the 99.78% accuracy described for Sanger sequencing.

My main contribution to this paper is the consensus sequence generation script – Consension – which is available from my software page. Importantly, there are huge gaps in the reference databases we use for taxonomic classification and this method will facilitate the integration of reference data from all of the ribosomal markers. We hope that this work will stimulate large-scale generation of ribosomal reference data covering several marker genes, linking previously spread-out information together.
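To give a feeling for what consensus generation means in this context, here is a minimal sketch of a column-wise majority vote over reads that have already been aligned to each other. This is an illustration only and not the algorithm used in Consension; the function name `majority_consensus`, the `min_fraction` threshold and the toy reads are all assumptions made for the example.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-", min_fraction=0.5):
    """Return a column-wise majority consensus of pre-aligned, equally long reads.

    Hypothetical helper for illustration; not taken from Consension.
    """
    if not aligned_reads:
        raise ValueError("need at least one read")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent character in this alignment column
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) < min_fraction:
            base = "N"  # no character reaches the required support
        if base != gap:  # columns where the gap wins are dropped
            consensus.append(base)
    return "".join(consensus)


# Three toy reads with one substitution error and one insertion error
reads = ["ACGT-A", "ACGTTA", "ACCT-A"]
print(majority_consensus(reads))  # prints ACGTA
```

Random errors in individual reads rarely hit the same position in most reads, so they are voted out, which is why a consensus of many noisy reads can be far more accurate than any single read.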

Reference

  1. Wurzbacher C, Larsson E, Bengtsson-Palme J, Van den Wyngaert S, Svantesson S, Kristiansson E, Kagami M, Nilsson RH: Introducing ribosomal tandem repeat barcoding for fungi. Molecular Ecology Resources, Accepted article (2018). doi: 10.1111/1755-0998.12944 [Paper link]

One of the questions I have received regarding Metaxa2 is whether it is possible to use it on other DNA barcodes. My answer has been “technically, yes, but it is a very cumbersome process of creating a custom database for every additional barcode”. Not anymore: the newly introduced Metaxa2 Database Builder makes this process automatic, with the user just supplying a FASTA file of sequences from the region in question and a file containing the taxonomy information for the sequences (in GenBank, NSD XML, Metaxa2 or SILVA-style formats). The preprint (1) has been out for some time, but today Bioinformatics published the paper describing the software (2).

The paper not only details how the database builder works, but also shows that it works on a number of different barcoding regions, albeit with different results in terms of accuracy. Still, even with seemingly high misclassification rates for some DNA barcodes, the software performs better than a simple BLAST-based taxonomic assignment (76.5% vs. 41.4% correct classifications for matK, and 76.2% vs. 45.1% for trnL). The database builder has already found use in building a COI database for arthropods (3), and we envision a range of uses in the near future.

As the paper is now published, I have also moved the Metaxa2 software (4) from beta status to a full version 2.2 update. Hopefully, this release should be bug free, but my experience is that when the community gets their hands on the software they tend to discover things our team has missed. I would like to thank the entire team working on this, particularly Rodney Richardson (who initiated this entire thing) and Henrik Nilsson. The software can be downloaded here. Happy barcoding!

References

  1. Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Taxonomic identification from metagenomic or metabarcoding data using any genetic marker. bioRxiv 253377 (2018). doi: 10.1101/253377 [Link]
  2. Bengtsson-Palme J, Richardson RT, Meola M, Wurzbacher C, Tremblay ED, Thorell K, Kanger K, Eriksson KM, Bilodeau GJ, Johnson RM, Hartmann M, Nilsson RH: Metaxa2 Database Builder: Enabling taxonomic identification from metagenomic and metabarcoding data using any genetic marker. Bioinformatics, advance article (2018). doi: 10.1093/bioinformatics/bty482 [Paper link]
  3. Richardson RT, Bengtsson-Palme J, Gardiner MM, Johnson RM: A reference cytochrome c oxidase subunit I database curated for hierarchical classification of arthropod metabarcoding data. PeerJ Preprints, 6, e26662v1 (2018). doi: 10.7287/peerj.preprints.26662v1 [Link]
  4. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]

MycoKeys earlier this week published a paper describing the results of a workshop in Aberdeen in April last year, where we refined annotations for fungal ITS sequences from the built environment (1). This was a follow-up on a workshop in May 2016 (2) and the results have been implemented in the UNITE database and shared with other online resources. The paper has also been highlighted at microBEnet. I have very little time to further comment on this at this very moment, but I believe, as I wrote last time, that distributed initiatives like this (and the ones I have been involved in in the past (3,4)) serve a very important purpose for establishing better annotation of sequence data (5). The full paper can be found here.

References

  1. Nilsson RH, Taylor AFS, Adams RI, Baschien C, Bengtsson-Palme J, Cangren P, Coleine C, Daniel H-M, Glassman SI, Hirooka Y, Irinyi L, Iršenaite R, Martin-Sánchez PM, Meyer W, Oh S-O, Sampaio JP, Seifert KA, Sklenár F, Stubbe D, Suh S-O, Summerbell R, Svantesson S, Unterseher M, Visagie CM, Weiss M, Woudenberg J, Wurzbacher C, Van den Wyngaert S, Yilmaz N, Yurkov A, Kõljalg U, Abarenkov K: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from an April 10-11, 2017 workshop (Aberdeen, UK). MycoKeys, 28, 65–82 (2018). doi: 10.3897/mycokeys.28.20887 [Paper link]
  2. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Untersehrer M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  4. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  5. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034

ITSx in Bioconda

Mattias de Hollander at the Netherlands Institute of Ecology kindly informed me that they recently added the ITSx 1.1b version to the Bioconda package manager. This will make it easy for Conda users to install ITSx automatically into their systems and pipelines. The Bioconda version can be found here. I would like to thank Mattias for this initiative and hope that the Bioconda version of ITSx will find much use!

Today, I am very happy to announce that after years in the making and months in testing, the next generation of ITSx, version 1.1, is ready to step into the public light and scrutiny. I have today uploaded a public beta version of the ITSx 1.1 release, which I encourage everyone who has enjoyed using ITSx to try out.

The 1.1 release of ITSx includes a wide range of new features, including:

  • A 2-10x performance increase (depending on the dataset), since ITSx now utilizes hmmsearch instead of hmmscan to detect the ITS regions and distributes the CPU cores better
  • Improved ITS detection among fungi and chlorophyta, by addition of new HMM-profiles
  • The HMM profile format for ITSx has been updated to HMMER3/f (thus ITSx now requires HMMER version 3.1 or later)
  • Better handling of interrupted HMMER searches
  • Added the --require_anchor option to only include sequences where the complete anchor is found in the output
  • Added the possibility for partial sequence output for the SSU, LSU and 5.8S regions
  • Fixed a bug causing problems when reading sequence data from standard input

A lot of the code has changed in this version, which means that there might still be bugs lingering in the program. Since I will be on vacation throughout July, I encourage everyone to submit bug reports and questions, but I cannot promise to respond to them until August.

I hope that you will enjoy this new ITSx release, which you can download here. Happy barcoding!

Yesterday, Molecular Ecology Resources put online an unedited version of a recent paper which I co-authored. This time, Rodney Richardson at Ohio State University has done a tremendous job of evaluating three taxonomic classification tools – the RDP Naïve Bayesian Classifier, RTAX and UTAX – on a set of DNA barcoding regions commonly used for plants, namely the ITS2, and the matK, rbcL, trnL and trnH genes.

In the paper (1), we discuss the results, merits and limitations of the classifiers. In brief, we found that:

  • There is a considerable trade-off between accuracy and sensitivity for the classifiers tested, which indicates a need for improved sequence classification tools (2)
  • UTAX was superior with respect to error rate, but was exceedingly stringent and thus suffered from a low assignment rate
  • The RDP Naïve Bayesian Classifier displayed high sensitivity and low error at the family and order levels, but had a genus-level error rate of 9.6 percent
  • RTAX showed high sensitivity at all taxonomic ranks, but at the same time consistently produced the highest error rates
  • The choice of locus has significant effects on the classification sensitivity of all tested tools
  • All classifiers showed strong relationships between database completeness, classification sensitivity and classification accuracy
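The trade-off described in the list above can be made concrete with a small calculation. The sketch below assumes that sensitivity means the fraction of sequences that receive any assignment, and that error rate means the fraction of assigned sequences that are wrong; the paper may define these differently, and the function name `summarize` and the toy genus labels are made up for the example.

```python
def summarize(calls):
    """Return (sensitivity, error_rate) for a list of (predicted, truth) pairs.

    predicted is None when the classifier declined to assign the sequence.
    The two definitions used here are assumptions made for this sketch.
    """
    total = len(calls)
    assigned = [(pred, truth) for pred, truth in calls if pred is not None]
    wrong = sum(1 for pred, truth in assigned if pred != truth)
    sensitivity = len(assigned) / total if total else 0.0
    error_rate = wrong / len(assigned) if assigned else 0.0
    return sensitivity, error_rate


# A stringent classifier: few assignments, none of them wrong
stringent = [("Quercus", "Quercus"), (None, "Acer"), (None, "Salix"), ("Pinus", "Pinus")]
# A permissive classifier: everything assigned, one assignment wrong
permissive = [("Quercus", "Quercus"), ("Acer", "Acer"), ("Populus", "Salix"), ("Pinus", "Pinus")]

print(summarize(stringent))   # prints (0.5, 0.0)
print(summarize(permissive))  # prints (1.0, 0.25)
```

In this toy case the stringent classifier never errs but leaves half of the sequences unassigned, while the permissive one assigns everything at the cost of a 25 percent error rate.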

We believe that the methods of comparison we have used are simple and robust, and thereby provide a methodological and conceptual foundation for future software evaluations. On a personal note, I will thoroughly enjoy working with Rodney and Reed again; I had a great time discussing the ins and outs of taxonomic classification with them! The paper can be found here.

References and notes

  1. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, Early view (2016). doi: 10.1111/1755-0998.12628 [Paper link]
  2. This is something that several classifiers also showed in the evaluation we did for the Metaxa2 paper (3). Interestingly enough, Metaxa2 is better at maintaining high accuracy also as sensitivity is increased.
  3. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]

MycoKeys today put a paper online which I was involved in. The paper describes the results of a workshop in May, when we added and refined annotations for fungal ITS sequences according to the MIxS-Built Environment annotation standard (1). Fungi have been associated with a range of unwanted effects in the built environment, including asthma, decay of building materials, and food spoilage. However, the state of the metadata annotation of fungal DNA sequences from the built environment is very much incomplete in public databases. The workshop aimed to ease a small part of this problem by distributing the re-annotation of public fungal ITS sequences across 36 persons. In total, we added or changed 45,488 data points drawing from published literature, including addition of 8,430 instances of countries of collection, 5,801 instances of building types, and 3,876 instances of surface-air contaminants. The results have been implemented in the UNITE database and shared with other online resources. I believe that distributed initiatives like this (and the ones I have been involved in in the past (2,3)) serve a very important purpose for establishing better annotation of sequence data, an issue I have brought up also for sequences outside of barcoding genes (4). The full paper can be found here.

References

  1. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Untersehrer M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034

Those of you who think ITSx is running slowly despite being assigned multiple CPUs, particularly on datasets with only one kind of sequence (e.g. fungal) using the -t F option, might be interested in trying out Andrew Krohn’s parallel ITSx implementation. The solution essentially employs a bash script that spawns multiple ITSx instances running on different portions of the input file. Although there are some limitations to the script (e.g. you cannot select a custom name for the output, and as far as I understand the script you will only get the ITS1, ITS2 and full sequence FASTA files), it may prove useful for many of you until we write up a proper solution to the poor multi-thread performance of ITSx (planned for version 1.1). In the coming months, I recommend that you check this solution out! See also the wiki documentation.
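The core idea of running on different portions of the input file can be sketched as follows. This is not Andrew Krohn’s script, only a minimal illustration of the splitting step; the helper names `read_fasta` and `split_records` and the toy sequences are assumptions. In a real setup each part would be written to its own FASTA file and handed to a separate ITSx process.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts):
    """Distribute records round-robin over n_parts lists of similar size."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return parts


# Toy input with five sequences, split into two portions
fasta_text = ">s1\nACGT\n>s2\nGGCC\nTTAA\n>s3\nAAAA\n>s4\nCCCC\n>s5\nGGGG\n"
parts = split_records(read_fasta(fasta_text.splitlines()), 2)
print([len(part) for part in parts])  # prints [3, 2]
```

Round-robin distribution keeps the portions close to equal in size, so no single process is left with most of the work.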

My speed tests show the following (on a quite small test set of fungal ITS sequences):
ITSx parallel on 16 CPUs, all ITS types (option “-t all”): 3 min, 16 sec
ITSx parallel on 16 CPUs, only fungal ITS types (option “-t f”): 54 sec
ITSx native on 16 CPUs, all ITS types (options “-t all --cpu 16”): 4 min, 59 sec
ITSx native on 16 CPUs, only fungal types (options “-t f --cpu 16”): 5 min, 50 sec

Why the fungal-only run took longer in the native implementation is a mystery to me, but it probably shows why there is a need to rewrite the multithreading code, as we did with Metaxa a couple of years ago. Stay tuned for ITSx updates!
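For those who want the timings above as speedup factors, the arithmetic is simple. The numbers are the ones from my test; the variable and function names are just for this snippet.

```python
def seconds(minutes, secs):
    """Convert a minutes/seconds timing to seconds."""
    return 60 * minutes + secs


# Wall-clock times from the speed test above, keyed by (implementation, -t option)
timings = {
    ("parallel", "all"): seconds(3, 16),
    ("parallel", "f"): seconds(0, 54),
    ("native", "all"): seconds(4, 59),
    ("native", "f"): seconds(5, 50),
}

for profile in ("all", "f"):
    speedup = timings[("native", profile)] / timings[("parallel", profile)]
    print(f"-t {profile}: {speedup:.1f}x")
# prints:
# -t all: 1.5x
# -t f: 6.5x
```

So the parallel script was about 1.5 times faster with all ITS types and about 6.5 times faster for fungal-only detection on this test set.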

A couple of days ago, a paper I have co-authored describing an ITS sequence dataset for chimera control in fungi went online as an advance online publication in Microbes and Environments. There are several software tools available for chimera detection (e.g. Henrik Nilsson’s fungal chimera checker (1) and UCHIME (2)), but these generally rely on the presence of a chimera-free reference dataset. Until now, no such dataset has been available for the fungal ITS region, and in this paper (3) we introduce a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database (4). This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. We estimated the dataset performance on a large set of artificial chimeras to be above 99.5%, and also used the dataset to remove nearly 1,000 chimeric fungal ITS sequences from the UNITE database. The dataset can be downloaded from the UNITE repository. Thereby, it is also possible for users to curate the dataset in the future through the UNITE interactive editing tools.
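For readers unfamiliar with the term, an artificial chimera is simply a sequence stitched together from two different parents. The sketch below shows the general idea with a single breakpoint; it is not the procedure used in the paper, and the function name `make_chimera`, the `breakpoint_fraction` parameter and the toy sequences are assumptions for illustration.

```python
def make_chimera(parent_a, parent_b, breakpoint_fraction=0.5):
    """Join the 5' part of parent_a to the 3' part of parent_b.

    Hypothetical helper: one breakpoint, placed at the same relative
    position in both parents.
    """
    if not 0.0 < breakpoint_fraction < 1.0:
        raise ValueError("breakpoint_fraction must be between 0 and 1")
    cut_a = int(len(parent_a) * breakpoint_fraction)
    cut_b = int(len(parent_b) * breakpoint_fraction)
    return parent_a[:cut_a] + parent_b[cut_b:]


# Two toy parents that are easy to tell apart by eye
print(make_chimera("AAAAAAAAAA", "CCCCCCCCCC"))  # prints AAAAACCCCC
```

Because the parents of such sequences are known, a set of them can be used to check how often a chimera detection tool, together with a reference dataset, flags them correctly.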

References:

  1. Nilsson RH, Abarenkov K, Veldre V, Nylinder S, Wit P de, Brosché S, Alfredsson JF, Ryberg M, Kristiansson E: An open source chimera checker for the fungal ITS region. Molecular Ecology Resources, 10, 1076–1081 (2010).
  2. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, 27, 16, 2194-2200 (2011). doi:10.1093/bioinformatics/btr381
  3. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, Advance Online Publication (2015). doi: 10.1264/jsme2.ME14121
  4. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481