A couple of days ago, a paper I have co-authored describing an ITS sequence dataset for chimera control in fungi went online as an advance online publication in Microbes and Environments. There are several software tools available for chimera detection (e.g. Henrik Nilsson‘s fungal chimera checker (1) and UCHIME (2)), but these generally rely on the presence of a chimera-free reference dataset. Until now, there was no such dataset is for the fungal ITS region, and we in this paper (3) introduce a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database (4). This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. We estimated the dataset performance on a large set of artificial chimeras to be above 99.5%, and also used the dataset to remove nearly 1,000 chimeric fungal ITS sequences from the UNITE database. The dataset can be downloaded from the UNITE repository. Thereby, it is also possible for users to curate the dataset in the future through the UNITE interactive editing tools.
- Nilsson RH, Abarenkov K, Veldre V, Nylinder S, Wit P de, Brosché S, Alfredsson JF, Ryberg M, Kristiansson E: An open source chimera checker for the fungal ITS region. Molecular Ecology Resources, 10, 1076–1081 (2010).
- Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, 27, 16, 2194-2200 (2011). doi:10.1093/bioinformatics/btr381
- Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, Advance Online Publication (2015). doi: 10.1264/jsme2.ME14121
- Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
My colleague Henrik Nilsson has been interviewed by the ResearchGate news team about the recent effort to better annotate ITS data for plant pathogenic fungi. It’s an interesting read, and I think Henrik nicely underscores why large-scale efforts for improving and correcting sequence annotations are important. You can read the interview here, and the paper they talk about is referenced below.
Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, Volume 67, Issue 1 (2014), 11–19. doi: 10.1007/s13225-014-0291-8 [Paper link]
A minor bug in the “its1.full_and_partial.fasta” file has been fixed in a minor update to ITSx (1.0.11) released to day. The bug occasionally caused newline characters at the end of a sequence to be skipped and the next entry to begin at the same row. The bug only manifested itself when ITSx was used with the
--partial option and only in the above mentioned FASTA file. If you have been affected by the bug, you should have noticed as the resulting FASTA file would be considered corrupted by most bioinformatics software. The updated version of ITSx can be downloaded here.
After a long delay-time in testing ITSx version 1.0.10 has been made public. The new version patches a bug causing the 3′ anchor not being properly written to file when using the “
--anchor hmm” option. If a number was used for the “
--anchor” option, this bug did not apply. Thus, if you have not been using the “
--anchor” option together with “hmm”, you have not been affected in any way by this bug. Nevertheless, I encourage updating in case you would use the “
--anchor hmm” option in the future. The update can be downloaded here. Happy barcoding!
Another paper I have made a contribution to have just recently been published in Molecular Ecology Resources. The paper (1), which is lead-authored by Xin-Cun Wang and Chang Liu at the Institute of Medicinal Plant Development in Beijing, investigates the usability of the ITS1 and ITS2 as separate barcodes across the Eukaryotes. The study is a large scale meta-analysis comparing available high-quality sequence data in as many taxonomic groups at possible from three different aspects: PCR amplification, DNA sequencing efficiency and species discrimination ability. Specifically, we have looked for the presence of DNA barcoding gaps, species discrimination efficiency, sequence length distribution, GC content distribution and primer universality, using bioinformatic approaches. We found that the ITS1 had significantly higher efficiencies than the ITS2 in 17 of 47 families and 20 of 49 investigated genera, which was markedly better than the performance of ITS2. We conclude that, in general, ITS1 represents a better DNA barcode than ITS2 for a majority of eukaryotic taxonomic groups. This of course doesn’t mean that using the ITS2 or the ITS region in its entirety should be dismissed, but our results can serve as a ground for making informed decisions about which region to choose for your amplicon sequencing project. The results complement what have previously been observed for e.g. fungi, where the difference between ITS1 and ITS2 were much less pronounced (2).
- Wang X-C, Liu C, Huang L, Bengtsson-Palme J, Chen H, Zhang J-H, Cai D, Li J-Q: ITS1: A DNA barcode better than ITS2 in eukaryotes? Molecular Ecology Resources. Early view. doi: 10.1111/1755-0998.12325 [Paper link]
- Blaalid R, Kumar S, Nilsson RH, Abarenkov K, Kirk PM, Kauserud H: ITS1 versus ITS2 as DNA metabarcodes for fungi. Molecular Ecology Resources. Volume 13, Issue2, Page 218-224. doi: 10.1111/1755-0998.12065 [Paper link]
I and one of the other developers of ITSx had a discussion a while ago about that using the
--anchor option should output the “anchor sequences” around the ITS regions also for the full-length output file (given that the
--truncate option is activated). I have today changed ITSx to employ this behaviour, updating it to version 1.0.9. The update also improves sensitivity when using the
--anchor HMM option slightly, and can be downloaded here. Happy barcoding!
ITSx has today been updated, bringing it to version 1.0.8. This update adds the “
--only_full” option, which restricts output in the ITS1, 5.8S and ITS2 files to only the files that contain the full region, i.e. that both surrounding domains have been detected. The update also fixes a bug with the
--anchor option, and can be downloaded here. Happy barcoding!
Another paper I have co-authored related to the UNITE database for fungal rDNA ITS sequences is now published as an Online Early article in Fungal Diversity. The paper describes an effort to improve the annotation of ITS sequences from fungal plant pathogens. Why is this important? Well, apart from fungal plant pathogens being responsible for great economic losses in agriculture, the paper is also conceptually important as it shows that together we can accomplish a substantial improvement to the metadata in sequence databases. In this work we have hunted down high-quality reference sequences for various plant pathogenic fungi, and re-annotated incorrectly or insufficiently annotated ITS sequences from the same fungal lineages. In total, the 59 authors have made 31,954 changes to UNITE database data, on average 540 changes per author. While one, or a few, persons could not feasibly have made this effort alone, this work shows that in larger consortia vast improvements can be made to the quality of databases, by distributing the work among many scientists. In many ways, this relates to proposals to “wikify” GenBank, and after Rfam and Pfam it might now be time to take the user-contribution model to, at least, the RefSeq portion of GenBank, which despite its description as being “comprehensive, integrated, non-redundant, [and] well-annotated” still contains errors and examples of non-usable annotation. More on that at a later point…
Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity Online early (2014). doi: 10.1007/s13225-014-0291-8 [Paper link]
Last week, I was informed by an ITSx user that the software behaved strangely when input files containing extremely long sequence identifiers were used. The bug is not likely to have affected a majority of users, but in any case it is now fixed, and ITSx can now handle sequence identifiers of any length. The new update brings ITSx to version 1.0.7, and it can be downloaded here. Happy barcoding!
A user informed me of unexpected behavior regarding potentially chimeric sequences in ITSx, and indeed it turned out to contain a bug that over-reported potential chimeras. This bug is totally unrelated to the new version released this week, and exists in all prior ITSx versions. I strongly encourage everyone to update to ITSx 1.0.6.
I would also like to underscore that ITSx is not a chimera-checker. It detects when sequences look unusual, but all such cases should be further investigated. If you follow this practice, you will see that in some cases ITSx might have over-reported chimeras, and in some instances it will have been correct in its suspicions (and thereby you would be largely unaffected by this bug).