Another paper I have co-authored related to the UNITE database for fungal rDNA ITS sequences is now published as an Online Early article in Fungal Diversity. The paper describes an effort to improve the annotation of ITS sequences from fungal plant pathogens. Why is this important? Well, apart from fungal plant pathogens being responsible for great economic losses in agriculture, the paper is also conceptually important as it shows that together we can accomplish a substantial improvement to the metadata in sequence databases. In this work we have hunted down high-quality reference sequences for various plant pathogenic fungi, and re-annotated incorrectly or insufficiently annotated ITS sequences from the same fungal lineages. In total, the 59 authors have made 31,954 changes to UNITE database data, on average 540 changes per author. While one, or a few, persons could not feasibly have made this effort alone, this work shows that in larger consortia vast improvements can be made to the quality of databases, by distributing the work among many scientists. In many ways, this relates to proposals to “wikify” GenBank, and after Rfam and Pfam it might now be time to take the user-contribution model to, at least, the RefSeq portion of GenBank, which despite its description as being “comprehensive, integrated, non-redundant, [and] well-annotated” still contains errors and examples of non-usable annotation. More on that at a later point…
Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity Online early (2014). doi: 10.1007/s13225-014-0291-8 [Paper link]
It seems like our paper on the recently launched database on resistance genes against antibacterial biocides and metals (BacMet) has gone online as an advance access paper in Nucleic Acids Research today. Chandan Pal – the first author of the paper, and one of my close colleagues as well as my roommate at work – has made a tremendous job taking the database from a list of genes and references, to a full-fledged browsable and searchable database with a really nice interface. I have contributed along the process, and wrote the lion’s share of the code for the BacMet-Scan tool that can be downloaded along with the database files.
BacMet is a curated source of bacterial resistance genes against antibacterial biocides and metals. All gene entries included have at least one experimentally confirmed resistance gene with references in scientific literature. However, we have also made a homology-based prediction of genes that are likely to share the same resistance function (the BacMet predicted dataset). We believe that the BacMet database will make it possible to better understand co- and cross-resistance of biocides and metals to antibiotics within bacterial genomes and in complex microbial communities from different environments.
The database can be easily accessed here: http://bacmet.biomedicine.gu.se, and use of the database in scientific work can cite the following paper, which recently appeared in Nucleic Acids Research:
Pal C, Bengtsson-Palme J, Rensing C, Kristiansson E, Larsson DGJ: BacMet: Antibacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research. Database issue, advance access. doi: 10.1093/nar/gkt1252 [Paper link]
I am happy to inform you that our paper on ITSx now is out online in Methods in Ecology and Evolution issue 4.10. Meanwhile, I am slowly getting my stuff together on an update that will bring some minor requested features. The publication brings the proper citation of the ITSx paper to be:
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A. S., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H. (2013), Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution, 4: 914–919. doi: 10.1111/2041-210X.12073
I have recently started to receive requests for full-text versions of my publications on ResearchGate. That’s great, but I have yet to figure out how to send them over, without breaking any agreements. As I am in a somewhat intensive work-period at the moment, please forgive me for not spending time on ResearchGate right now. And if you would like full-text versions of my publications, please send me an e-mail! I’ll be glad to help!
Our paper on the most recent developments of the UNITE database for fungal rDNA ITS sequences has just been published as an Early view article in Molecular Ecology. In this paper, we aim to ease two of the major problems facing the identification of newly generated fungal ITS sequences: the lack of a sufficiently goof reference dataset, and the lack of a way to refer to fungal species without a Latin name. As part of a solution, we have introduced the term species hypothesis for all fungal species represented by at least two ITS sequences. The UNITE database has an easy-to-use web-based sequence management system, and we encourage everybody that can improve on the annotations or metadata of a fungal lineage to do so.
My main contribution on this paper has been to tailor ITSx functionality for the UNITE database, so that ITS data could be more easily processed for the Species Hypotheses.
Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Accepted in Molecular Ecology. doi: 10.1111/mec.12481 [Paper link]
The paper describing our software tool ITSx has now gone online as an Early View paper on the Methods in Ecology and Evolution website. The software just recently left its beta-status behind, and with the paper out as well, we hope that as many people as possible will find use for the software in barcoding efforts of the ITS region. If you’re not familiar with the software – or its predecessor; the fungal ITS Extractor – here is a brief description of what it does:
ITSx is a Perl-based software tool that extracts the ITS1, 5.8S and ITS2 sequences – as well as full-length ITS sequences – from high-throughput sequencing data sets. To achieve this, we use carefully crafted hidden Markov models (HMMs), computed from large alignments of a total of 20 groups of eukaryotes. Testing has shown that ITSx has close to 100% detection accuracy, and virtually zero false-positive extractions. Additionally, it supports multiple processor cores, and is therefore suitable for running also on very large datasets. It is also able to eliminate non-ITS sequences from a given input dataset.
While ITSx supports extractions of ITS sequences from at least 20 different eukaryotic lineages, we ourselves have considerably less experience with many of the eukaryote groups outside of the fungi. We therefore release ITSx with the intent that the research community will evaluate its performance also in other parts of the eukaryote tree, and if necessary contribute data required to address also those lineages in a thorough way.
The ITSx paper can at the moment be cited as:
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A. S., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H. (2013), Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution. doi: 10.1111/2041-210X.12073
A long time ago, we (Martin Eriksson, Martin Hartmann, Henrik Nilsson and me) were invited to write an overview on Metaxa for the Encyclopedia of Metagenomics. I guess the workload for pulling such a project off is huge, so there’s no surprise that it has taken a while for it to be accepted, but now it is available for consumption.
Meanwhile, Metaxa have been getting regular updates, and I hope to soon be able to show you a new major update to the software, bringing it up to the next generation of metagenomics. More on that soon.
I was creating the diagram below an upcoming presentation, and I realized that the exponential growth in published metagenomics papers might be coming to an end. Interestingly enough the small drop in pace the recent years (701 -> 983 -> 1148) reminds me of the Hype Cycle, where we would (if my projection holds) have reached the “Peak of Inflated Expectations”, which means that we will see a rapid drop in the number of metagenomics publications in the next few years, as the field moves on.
The thought is interesting, but it seems a little bit early to draw any conclusions from the number of publications, yet. It is still kind of strange to note, though, that more than 20% of metagenomics publications (740/3547) are review papers. Come on, let’s do some science first and then review it… Anyway, it’ll be interesting to see what 2013 has in store for us.
I have co-authored a paper together with, among others, Henrik Nilsson that was published today in MycoKeys. The paper deals with checking quality of DNA sequences prior to using them for research purposes. In our opinion, a lot of the software available for sequence quality management is rather complex and resource intensive. Not everyone have the skills to master such software, and in addition computational resources might also be scarce. Luckily, there’s a lot that can be done in quality control of DNA sequences just using manual means and a web browser. This paper puts these means together into one comprehensible and easy-to-digest document. Our targeted audience is primaily biologists who do not have a strong background in computer science, and still have a dataset requiring DNA sequence quality control.
We have chosen to focus on the fungal ITS barcoding region, but the guidelines should be pretty general and applicable to most groups of organisms. In very short our five guidelines spells:
- ￼￼￼Establish that the sequences come from the intended gene or marker
Can be done using a multiple alignment of the sequences and verifying that they all feature some suitable, conserved sub-region (the 5.8S gene in the ITS case)
- Establish that all sequences are given in the correct (5’ to 3’) orientation
Examine the alignment for any sequences that do not align at all to the others; re-orient these; re-run the alignment step; and examine them again
- Establish that there are no (at least bad cases of) chimeras in the dataset
Run the sequences through BLAST in one of the large sequence databases, e.g. at NCBI (or in the ITS case, use the UNITE database), to verify that the best match comprises more or less the full length of the query sequences
- Establish that there are no other major technical errors in the sequences
Examine the BLAST results carefully, particularly the graphical overview and the pairwise alignment, for anomalies (there are some nice figures in the paper on how it should and should not look like)
- Establish that any taxonomic annotations given to the sequences make sense
Examine the BLAST hit list to see that the species names produced make sense
A much more thorough description of these guidelines can be found in the paper itself, which is available under open access from MycoKeys. There’s simply no reason not to go there and at least take a look at it. Happy quality control!
Nilsson RH, Tedersoo L, Abarenkov K, Ryberg M, Kristiansson E, Hartmann M, Schoch CL, Nylander JAA, Bergsten J, Porter TM, Jumpponen A, Vaishampayan P, Ovaskainen O, Hallenberg N, Bengtsson-Palme J, Eriksson KM, Larsson K-H, Larsson E, Kõljalg U: Five simple guidelines for establishing basic authenticity and reliability of newly generated fungal ITS sequences. MycoKeys. Issue 4 (2012), 37–63. doi: 10.3897/mycokeys.4.3606 [Paper link]
Bengtsson J, Hartmann M, Unterseher M, Vaishampayan P, Abarenkov K, Durso L, Bik EM, Garey JR, Eriksson KM, Nilsson RH: Megraft: A software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes. Research in Microbiology. Volume 163, Issues 6–7 (2012), 407–412, doi: 10.1016/j.resmic.2012.07.001. [Paper link]
Megraft is currently at version 1.0.1, but I have a slightly updated version in the pipeline which will be made available later this fall.