Another paper I have co-authored related to the UNITE database for fungal rDNA ITS sequences is now published as an Online Early article in Fungal Diversity. The paper describes an effort to improve the annotation of ITS sequences from fungal plant pathogens. Why is this important? Well, apart from fungal plant pathogens being responsible for great economic losses in agriculture, the paper is also conceptually important as it shows that together we can accomplish a substantial improvement to the metadata in sequence databases. In this work we have hunted down high-quality reference sequences for various plant pathogenic fungi, and re-annotated incorrectly or insufficiently annotated ITS sequences from the same fungal lineages. In total, the 59 authors have made 31,954 changes to UNITE database data, on average about 540 changes per author. While one person, or a few, could not feasibly have made this effort alone, this work shows that larger consortia can vastly improve the quality of databases by distributing the work among many scientists. In many ways, this relates to proposals to “wikify” GenBank, and after Rfam and Pfam it might now be time to take the user-contribution model to, at least, the RefSeq portion of GenBank, which despite its description as being “comprehensive, integrated, non-redundant, [and] well-annotated” still contains errors and examples of non-usable annotation. More on that at a later point…
Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity Online early (2014). doi: 10.1007/s13225-014-0291-8 [Paper link]
I got informed by a colleague that today is Taxonomist Appreciation Day! This is a very important day; quoting from the original post:
We need active work on taxonomy and systematics if our work is going to progress, and if we are to apply our findings. Without taxonomists, entire fields wouldn’t exist. We’d be working in darkness. (…) Taxonomists and systematists often work in obscurity, and some of the most painstaking projects come to fruition after long years with only a small dose of the recognition that is required.
So, send your favorite taxonomist(s) some love today, and remember they are the foundation for much of what we bioinformaticians do!
The new version of Metaxa – Metaxa2 – which I first started talking about more than 1.5 years ago, has finally been deemed stable enough for an official release! The release comes around the same time as we submitted a paper describing the changes in it, but I will briefly go through the changes here:
- Metaxa2 now handles extraction and classification of LSU rRNA sequences in addition to SSU rRNA
- The classification engine has been completely redesigned, and now enables accurate taxonomic classifications down to the genus – or in some cases – species level
- The classification database has been updated, and is now based on the SILVA 111 release
- The Metaxa2 Taxonomic Traversal Tool – metaxa2_ttt – has been added to the package, to ease the counting of rRNA sequences in different organism groups (at various taxonomic levels)
- Metaxa2 adds support for paired-end libraries
- It is now possible to directly input sequences in FASTQ format to Metaxa2
- The support for libraries with short read lengths (~100 bp) has been vastly improved (and short reads are now assumed by the default settings)
- Metaxa2 can do quality pre-filtering of reads in FASTQ-format
- Metaxa2 adds support for the modern BLAST+ package (although the old blastall version is still default)
- Compatibility with the HMMER 3.1 beta
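To give a feel for what counting rRNA sequences "at various taxonomic levels" means in practice, here is a minimal Python sketch. It is not how metaxa2_ttt is implemented, and it does not read Metaxa2 output files; the semicolon-separated lineage strings and the function name `count_at_level` are assumptions made purely for illustration.

```python
from collections import Counter


def count_at_level(lineages, level):
    """Count sequences per taxon at a given depth (1 = top rank).

    `lineages` holds one semicolon-separated lineage string per
    classified sequence (a hypothetical format for this sketch).
    Lineages that stop above the requested level are kept, but
    marked as unclassified at that level.
    """
    counts = Counter()
    for lineage in lineages:
        ranks = [r.strip() for r in lineage.split(";") if r.strip()]
        if not ranks:
            key = "unclassified"
        elif len(ranks) < level:
            key = ";".join(ranks) + ";unclassified"
        else:
            key = ";".join(ranks[:level])
        counts[key] += 1
    return counts


# Made-up classifications, one per rRNA read.
lineages = [
    "Bacteria;Proteobacteria;Gammaproteobacteria",
    "Bacteria;Proteobacteria;Alphaproteobacteria",
    "Bacteria;Firmicutes",
    "Eukaryota;Fungi;Ascomycota",
]

for level in (1, 2, 3):
    print(f"Level {level}:")
    for taxon, n in sorted(count_at_level(lineages, level).items()):
        print(f"  {taxon}\t{n}")
```

The same set of reads gives a coarse summary at level 1 and a finer one at level 3, with the Firmicutes read reported as unclassified below the rank it was assigned to.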
Metaxa2 brings together a large set of features that we have been gradually incorporating since 2011, many of which have been dependent on each other. Most of the new features and changes are thoroughly explained in the manual. While we hope Metaxa2 is bug free, there will likely be bugs caused by usage scenarios we have not envisioned. I therefore encourage anyone who comes across some unexpected behavior to send me an e-mail. In particular, I would like to know how the software performs using HMMER 3.1 and BLAST+, where testing has been limited compared to older parts of the code.
We hope that you will find Metaxa2 useful, and that it will bring taxonomic assessment of metagenomes another step forward! Metaxa2 can be downloaded here.
I am happy to inform you that our paper on ITSx is now out online in Methods in Ecology and Evolution issue 4.10. Meanwhile, I am slowly getting my stuff together on an update that will bring some minor requested features. With the publication, the proper citation for the ITSx paper is:
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A. S., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H. (2013), Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution, 4: 914–919. doi: 10.1111/2041-210X.12073
Our paper on the most recent developments of the UNITE database for fungal rDNA ITS sequences has just been published as an Early View article in Molecular Ecology. In this paper, we aim to ease two of the major problems facing the identification of newly generated fungal ITS sequences: the lack of a sufficiently good reference dataset, and the lack of a way to refer to fungal species without a Latin name. As part of a solution, we have introduced the term species hypothesis for all fungal species represented by at least two ITS sequences. The UNITE database has an easy-to-use web-based sequence management system, and we encourage everybody who can improve on the annotations or metadata of a fungal lineage to do so.
My main contribution on this paper has been to tailor ITSx functionality for the UNITE database, so that ITS data could be more easily processed for the Species Hypotheses.
Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Accepted in Molecular Ecology. doi: 10.1111/mec.12481 [Paper link]
The paper describing our software tool ITSx has now gone online as an Early View paper on the Methods in Ecology and Evolution website. The software just recently left its beta status behind, and with the paper out as well, we hope that as many people as possible will find use for the software in barcoding efforts of the ITS region. If you’re not familiar with the software – or its predecessor, the fungal ITS Extractor – here is a brief description of what it does:
ITSx is a Perl-based software tool that extracts the ITS1, 5.8S and ITS2 sequences – as well as full-length ITS sequences – from high-throughput sequencing data sets. To achieve this, we use carefully crafted hidden Markov models (HMMs), computed from large alignments of a total of 20 groups of eukaryotes. Testing has shown that ITSx has close to 100% detection accuracy, and virtually zero false-positive extractions. Additionally, it supports multiple processor cores, and is therefore suitable for running also on very large datasets. It is also able to eliminate non-ITS sequences from a given input dataset.
While ITSx supports extractions of ITS sequences from at least 20 different eukaryotic lineages, we ourselves have considerably less experience with many of the eukaryote groups outside of the fungi. We therefore release ITSx with the intent that the research community will evaluate its performance also in other parts of the eukaryote tree, and if necessary contribute data required to address also those lineages in a thorough way.
The ITSx paper can at the moment be cited as:
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A. S., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H. (2013), Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution. doi: 10.1111/2041-210X.12073
For a couple of years, I have been working with microbial ecology and diversity, and how such features can be assessed using molecular barcodes, such as the SSU (16S/18S) rRNA sequence (the Metaxa and Megraft packages). However, I have also been aiming at the ITS region, and how that can be used in barcoding (see e.g. the guidelines we published last year). It is therefore a great pleasure to introduce my next gem for community analysis; a software tool for detection and extraction of the ITS1 and ITS2 regions of ITS sequences from environmental communities. The tool is dubbed ITSx, and supersedes the more specific fungal ITS extractor written by Henrik Nilsson and colleagues. Henrik is once more the mastermind behind this completely rewritten version, in which I have done the lion’s share of the programming. Among the new features in ITSx are:
- Robust support for the Cantharellus, Craterellus, and Tulasnella genera of fungi
- Support for nineteen additional eukaryotic groups on top of the already present support for fungi (specifically these groups: Tracheophyta (vascular plants), Bryophyta (bryophytes), Marchantiophyta (liverworts), Chlorophyta (green algae), Rhodophyta (red algae), Phaeophyceae (brown algae), Metazoa (metazoans), Oomycota (oomycetes), Alveolata (alveolates), Amoebozoa (amoebozoans), Euglenozoa, Rhizaria, Bacillariophyta (diatoms), Eustigmatophyceae (eustigmatophytes), Raphidophyceae (raphidophytes), Synurophyceae (synurids), Haptophyceae (haptophytes), Apusozoa, and Parabasalia (parabasalids))
- Multi-processor support
- Extensive output options
- Virtually zero false-positive extractions
ITSx moves today from a private pre-release state to a public beta state. No code changes have been made since February, indicating that the last pre-release candidate is now ready to fly on its own. As far as our testing has revealed, this version seems to be bug free. In reality though, researchers tend to find the most unexpected usage scenarios. So please, if you find any unexpected behavior in this version of ITSx, send me an e-mail and make us aware of the potential shortcomings of our software.
We expect this open-source software to boost research in microbial ecology based on barcoding of the ITS region, and hope that the research community will evaluate its performance also among the eukaryote groups that we have less experience with.
One thing that I find slightly annoying is when people do not get the basic concepts right – or when debatable concepts are used without discussion of their implications. This annoys me further when it is done by senior scientists, who should know better. Sometimes, I guess this happens out of ignorance, and sometimes to be able to tie your subject to a certain buzzword concept. Neither is good, even though the former reason is a little more forgivable than the latter. One area where this problem becomes agonizingly evident is when molecular biologists or medical scientists move into ecology, as has happened with the advent of metagenomics. When the study of the human gut microflora turned into a large-scale sequencing effort, people who had previously studied bacteria grown on plates started facing a world of community ecology. However, I get the impression that way too often these people do not ask ecologists for advice, or even read up on the ecological literature. This, I suppose, is the reason why medical scientists can talk about how the human gut microflora can “evolve” into a stable community a couple of years after birth, even though words such as “development” or “succession” would be much more accurate to describe this change.
The marker gene flaw
To make clear what I mean, let us compare the human gut to a forest. If an open field is left to itself, larger plants will slowly inhabit it, and gradually different species will replace each other, until we have a fully developed forest. Similarly, the human gut microflora is at birth rather unstable, but stabilizes relatively quickly and within a few years we have a microbial community with “adult-like” characteristics. To arrive at this conclusion, scientists generally use the 16S (small sub-unit) genetic marker to study the bacterial species diversity. This works in pretty much the same way as going out into the forest and counting trees of different kinds.
Now, if I went out into the forest once and counted the tree species, waited for 50 years and then did the same thing again, I would presumably see that the forest species composition had changed. However, if I called this “evolution”, fellow scientists would laugh at me. Raspberry bushes do not evolve into birches, and birches do not evolve into firs. Instead, ecologists talk about “succession”: a progressive transformation of a community, going on until a stable community is formed. The concept of succession seems well-suited also to describe what is happening in the human gut, and should of course also be used in that setting. The most likely driver of the functional community changes is not that some bacterial species have evolved new functions, but rather that bacterial species performing these new functions have outcompeted the ones previously present.
In fact, I would argue that it is impossible to study evolution through a genetic marker such as the 16S gene (except in the rare case when you study evolution of the 16S gene itself). Instead, the only thing we can assess using a marker gene is how the copy number of the different gene variants changes over time (or space, or conditions). The copy number tells us about the species composition of the community at a given time, which can be used to measure successional changes. However, evolutionary changes would require heritable changes in the characteristics of biological populations, i.e. that their genetic material changes in some way. Unless that change happens in the marker gene of choice, we cannot measure it, and the alterations of composition we measure will only reflect differences in species abundances. These differences might have arisen from genetic (i.e. evolutionary) changes, but we cannot assess that.
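To make this concrete, here is a minimal Python sketch of what a marker gene survey actually lets us compute: relative abundances at each time point, and a measure of how much the composition has shifted between them. The genus names and read counts are made up for illustration, the helper names are my own, and Bray-Curtis dissimilarity is just one of several common choices for comparing communities.

```python
def relative_abundance(counts):
    """Convert raw marker gene counts into fractions of the community."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads counted")
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    0 means identical composition, 1 means no taxa in common.
    """
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total


# Hypothetical 16S read counts from the same gut at two time points.
infant = {"Bifidobacterium": 700, "Escherichia": 250, "Bacteroides": 50}
adult = {"Bifidobacterium": 100, "Escherichia": 50, "Bacteroides": 850}

shift = bray_curtis(relative_abundance(infant), relative_abundance(adult))
print(f"Bray-Curtis dissimilarity: {shift:.2f}")
```

Note what the number does and does not say: it quantifies a change in who is present and in what proportions, which is a successional change. Nothing in the calculation can tell us whether any of these taxa has changed genetically.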
What are we studying with metagenomics?
This brings us to the next problem, which is not only a problem of semantics and me getting annoyed, but a problem with real implications. What are we really studying using metagenomics? When we apply an environmental sequencing approach to a microbial community, we get a snapshot of the genetic material at a given time and site, under specific conditions. Usually, we aim to characterize the community from a taxonomic or functional perspective, and we often have some other community which we want to compare to. However, if we only collect data from different communities at one time point, or if we only study a community before and after exposure, we have no way of telling if differences stem from selective pressures or from a more random succession process. As most microbial habitats are not as well studied as the human gut, we know little about microbial community assembly and succession.
Also, in ecology a disturbance to a particular community is generally considered a starting point for a new succession process. This process may, or may not, return the community to the same stable state. However, if the disturbance is of a permanent nature, the new community will have to adapt to the new conditions, and the stable state will likely not have the same species distribution. Such an adaptation could be caused by genetic changes (which would clearly be an evolutionary process), or by simple replacement of sensitive species with tolerant ones. The latter would be a selective process, but not necessarily an evolutionary one. If the selection does not alter the genetic material within the species, but only the species composition, I would argue that this is also a case of succession.
Complications with resistance
This complicates the work with metagenomic data. If we study antibiotic resistance genes, and say that bacteria in an environment have evolved antibiotic resistance, that assertion rests on the assumption that genes responsible for resistance have either evolved within the bacteria present, or have (more likely) been transferred into their genomes via horizontal gene transfer. However, if the resistance profile we see is simply caused by a replacement of sensitive species with resistant ones, we have not really discovered something new evolving, but are only witnessing the spread of already resistant bacteria. In the gut, this would be a problem by itself, but say that we do the same study in the open environment. We already know that environmental bacteria have contained resistance genes for ages, so the real threat to human health here would be a spread from naturally resistant bacteria to human pathogens. However, as mentioned earlier, without extremely well thought-through methodology we cannot really see such transmissions of resistance genes. Here, the search for mobile elements, and large-scale comparisons of community composition and resistance profiles in contaminated and non-polluted areas, can play a huge role in shedding light on the question of spreading. However, this will require larger and better planned experiments using metagenomics than what is generally performed at the moment. The questions of microbial community assembly, dispersal, succession and adaptation are still largely unanswered, and our metagenomic and environmental sequencing approaches have just started to tinker around with the lid of the jar.
I proudly announce that today Metaxa has been officially released. Metaxa is a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequence datasets. We have been working on Metaxa for quite some time, and it has now been in beta for about two months. It now seems to be stable enough for public consumption. In addition, the software package is today presented in a talk at the SocBiN conference in Helsinki.
A more thorough post on the rationale behind Metaxa, and how it works will follow when I am not occupied by the SocBiN conference. A paper on Metaxa is to be published in the journal Antonie van Leeuwenhoek. The software can be downloaded from here.
If you did not already know, or at least suspect, that pesticides used in agriculture could have a negative impact on species diversity, there is now proof. In this article:
- Geiger et al. “Persistent negative effects of pesticides on biodiversity and biological control potential on European farmland”. Basic and Applied Ecology, Volume 11, Issue 2, March 2010.
the result of a joint study in eight European countries, we show that biodiversity indeed takes a hit from the use of pesticides, at several levels. Also, action is needed to change the structure of large-scale agriculture. And why do I say we? This isn’t exactly microbiology, is it? Well, this is the first publication related to the field assistant work I did during the summers of 2007 and 2008. There is more in the pipeline, but this first publication at least shows that there are considerable risks with the way we use weed control.