Another paper I have made a contribution to have just recently been published in Molecular Ecology Resources. The paper (1), which is lead-authored by Xin-Cun Wang and Chang Liu at the Institute of Medicinal Plant Development in Beijing, investigates the usability of the ITS1 and ITS2 as separate barcodes across the Eukaryotes. The study is a large scale meta-analysis comparing available high-quality sequence data in as many taxonomic groups at possible from three different aspects: PCR amplification, DNA sequencing efficiency and species discrimination ability. Specifically, we have looked for the presence of DNA barcoding gaps, species discrimination efficiency, sequence length distribution, GC content distribution and primer universality, using bioinformatic approaches. We found that the ITS1 had significantly higher efficiencies than the ITS2 in 17 of 47 families and 20 of 49 investigated genera, which was markedly better than the performance of ITS2. We conclude that, in general, ITS1 represents a better DNA barcode than ITS2 for a majority of eukaryotic taxonomic groups. This of course doesn’t mean that using the ITS2 or the ITS region in its entirety should be dismissed, but our results can serve as a ground for making informed decisions about which region to choose for your amplicon sequencing project. The results complement what have previously been observed for e.g. fungi, where the difference between ITS1 and ITS2 were much less pronounced (2).
- Wang X-C, Liu C, Huang L, Bengtsson-Palme J, Chen H, Zhang J-H, Cai D, Li J-Q: ITS1: A DNA barcode better than ITS2 in eukaryotes? Molecular Ecology Resources. Early view. doi: 10.1111/1755-0998.12325 [Paper link]
- Blaalid R, Kumar S, Nilsson RH, Abarenkov K, Kirk PM, Kauserud H: ITS1 versus ITS2 as DNA metabarcodes for fungi. Molecular Ecology Resources. Volume 13, Issue2, Page 218-224. doi: 10.1111/1755-0998.12065 [Paper link]
I just got word from BMC Genomics that my most recent paper has just been published (in provisional form; we still have not seen the edited proofs). In this paper (1), which I have co-authored with Anders Blomberg, Magnus Alm Rosenblad and Mikael Molin, we utilize metagenomic data from the GOS-expedition (2) together with fully sequenced bacterial genomes to show that:
- Detoxification genes in general are underrepresented in marine planktonic bacteria
- Surprisingly, the detoxification that show a differential distribution are more abundant in open ocean water than closer to the coast
- Peroxidases and peroxiredoxins seem to be the main line of defense against oxidative stress for bacteria in the marine milieu, rather than e.g. catalases
- The abundance of detoxification genes does not seem to increase with estimated pollution.
From this we conclude that other selective pressures than pollution likely play the largest role in shaping marine planktonic bacterial communities, such as for example nutrient limitations. This suggests substantial streamlining of gene copy number and genome sizes, in line with observations made in previous studies (3). Along the same lines, our findings indicate that the majority of marine bacteria would have a low capacity to adapt to increased pollution, which is relevant as large amounts of human pollutants and waste end up in the oceans every year. The study exemplifies the use of metagenomics data in ecotoxicology, and how we can examine anthropogenic consequences on life in the sea using approaches derived from genomics. You can read the paper in its entirety here.
- Bengtsson-Palme J, Alm Rosenblad M, Molin M, Blomberg A: Metagenomics reveals that detoxification systems are underrepresented in marine bacterial communities. BMC Genomics. Volume 15, Issue 749 (2014). doi: 10.1186/1471-2164-15-749 [Paper link]
- Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama ST, Joachimiak MP, Van Belle C, Chandonia J-M, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, et al: The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biology. 5:e16 (2007).
- Yooseph S, Nealson KH, Rusch DB, McCrow JP, Dupont CL, Kim M, Johnson J, Montgomery R, Ferriera S, Beeson KY, Williamson SJ, Tovchigrechko A, Allen AE, Zeigler LA, Sutton G, Eisenstadt E, Rogers Y-H, Friedman R, Frazier M, Venter JC: Genomic and functional adaptation in surface ocean planktonic prokaryotes. Nature. 468:60–66 (2010).
In an interesting development, Nature Publishing Group has launched a new initiative: Scientific Data – a online-only open access journal that publishes data sets without the demand of testing scientific hypotheses in connection to the data. That is, the data itself is seen as the valuable product, not any findings that might result from it. There is an immediate upside of this; large scientific data sets might be accessible to the research community in a way that enables proper credit for the sample collection effort. Since there is no demand for a full analysis of the data, the data itself might quicker be of use to others, without worrying that someone else might steal the bang of the data per se. I also see a possible downside, though. It would be easy to hold on to the data until you have analyzed it yourself, and then release it separately just about when you submit the paper on the analysis, generating extra papers and citation counts. I don’t know if this is necessarily bad, but it seems it could contribute to “publishing unit dilution”. Nevertheless, I believe that this is overall a good initiative, although how well it actually works will be up to us – the scientific community. Some info copied from the journal website:
Scientific Data’s main article-type is the Data Descriptor: peer-reviewed, scientific publications that provide an in-depth look at research datasets. Data Descriptors are a combination of traditional scientific publication content and structured information curated in-house, and are designed to maximize reuse and enable searching, linking and data mining. (…) Scientific Data aims to address the increasing need to make research data more available, citable, discoverable, interpretable, reusable and reproducible. We understand that wider data-sharing requires credit mechanisms that reward scientists for releasing their data, and peer evaluation mechanisms that account for data quality and ensure alignment with community standards.
Another paper I have co-authored related to the UNITE database for fungal rDNA ITS sequences is now published as an Online Early article in Fungal Diversity. The paper describes an effort to improve the annotation of ITS sequences from fungal plant pathogens. Why is this important? Well, apart from fungal plant pathogens being responsible for great economic losses in agriculture, the paper is also conceptually important as it shows that together we can accomplish a substantial improvement to the metadata in sequence databases. In this work we have hunted down high-quality reference sequences for various plant pathogenic fungi, and re-annotated incorrectly or insufficiently annotated ITS sequences from the same fungal lineages. In total, the 59 authors have made 31,954 changes to UNITE database data, on average 540 changes per author. While one, or a few, persons could not feasibly have made this effort alone, this work shows that in larger consortia vast improvements can be made to the quality of databases, by distributing the work among many scientists. In many ways, this relates to proposals to “wikify” GenBank, and after Rfam and Pfam it might now be time to take the user-contribution model to, at least, the RefSeq portion of GenBank, which despite its description as being “comprehensive, integrated, non-redundant, [and] well-annotated” still contains errors and examples of non-usable annotation. More on that at a later point…
Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity Online early (2014). doi: 10.1007/s13225-014-0291-8 [Paper link]
It seems like our paper on the recently launched database on resistance genes against antibacterial biocides and metals (BacMet) has gone online as an advance access paper in Nucleic Acids Research today. Chandan Pal – the first author of the paper, and one of my close colleagues as well as my roommate at work – has made a tremendous job taking the database from a list of genes and references, to a full-fledged browsable and searchable database with a really nice interface. I have contributed along the process, and wrote the lion’s share of the code for the BacMet-Scan tool that can be downloaded along with the database files.
BacMet is a curated source of bacterial resistance genes against antibacterial biocides and metals. All gene entries included have at least one experimentally confirmed resistance gene with references in scientific literature. However, we have also made a homology-based prediction of genes that are likely to share the same resistance function (the BacMet predicted dataset). We believe that the BacMet database will make it possible to better understand co- and cross-resistance of biocides and metals to antibiotics within bacterial genomes and in complex microbial communities from different environments.
The database can be easily accessed here: http://bacmet.biomedicine.gu.se, and use of the database in scientific work can cite the following paper, which recently appeared in Nucleic Acids Research:
Pal C, Bengtsson-Palme J, Rensing C, Kristiansson E, Larsson DGJ: BacMet: Antibacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research. Database issue, advance access. doi: 10.1093/nar/gkt1252 [Paper link]
I am happy to inform you that our paper on ITSx now is out online in Methods in Ecology and Evolution issue 4.10. Meanwhile, I am slowly getting my stuff together on an update that will bring some minor requested features. The publication brings the proper citation of the ITSx paper to be:
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A. S., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H. (2013), Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution, 4: 914–919. doi: 10.1111/2041-210X.12073
I have recently started to receive requests for full-text versions of my publications on ResearchGate. That’s great, but I have yet to figure out how to send them over, without breaking any agreements. As I am in a somewhat intensive work-period at the moment, please forgive me for not spending time on ResearchGate right now. And if you would like full-text versions of my publications, please send me an e-mail! I’ll be glad to help!
Our paper on the most recent developments of the UNITE database for fungal rDNA ITS sequences has just been published as an Early view article in Molecular Ecology. In this paper, we aim to ease two of the major problems facing the identification of newly generated fungal ITS sequences: the lack of a sufficiently goof reference dataset, and the lack of a way to refer to fungal species without a Latin name. As part of a solution, we have introduced the term species hypothesis for all fungal species represented by at least two ITS sequences. The UNITE database has an easy-to-use web-based sequence management system, and we encourage everybody that can improve on the annotations or metadata of a fungal lineage to do so.
My main contribution on this paper has been to tailor ITSx functionality for the UNITE database, so that ITS data could be more easily processed for the Species Hypotheses.
Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Accepted in Molecular Ecology. doi: 10.1111/mec.12481 [Paper link]
The paper describing our software tool ITSx has now gone online as an Early View paper on the Methods in Ecology and Evolution website. The software just recently left its beta-status behind, and with the paper out as well, we hope that as many people as possible will find use for the software in barcoding efforts of the ITS region. If you’re not familiar with the software – or its predecessor; the fungal ITS Extractor – here is a brief description of what it does:
ITSx is a Perl-based software tool that extracts the ITS1, 5.8S and ITS2 sequences – as well as full-length ITS sequences – from high-throughput sequencing data sets. To achieve this, we use carefully crafted hidden Markov models (HMMs), computed from large alignments of a total of 20 groups of eukaryotes. Testing has shown that ITSx has close to 100% detection accuracy, and virtually zero false-positive extractions. Additionally, it supports multiple processor cores, and is therefore suitable for running also on very large datasets. It is also able to eliminate non-ITS sequences from a given input dataset.
While ITSx supports extractions of ITS sequences from at least 20 different eukaryotic lineages, we ourselves have considerably less experience with many of the eukaryote groups outside of the fungi. We therefore release ITSx with the intent that the research community will evaluate its performance also in other parts of the eukaryote tree, and if necessary contribute data required to address also those lineages in a thorough way.
The ITSx paper can at the moment be cited as:
Bengtsson-Palme, J., Ryberg, M., Hartmann, M., Branco, S., Wang, Z., Godhe, A., De Wit, P., Sánchez-García, M., Ebersberger, I., de Sousa, F., Amend, A. S., Jumpponen, A., Unterseher, M., Kristiansson, E., Abarenkov, K., Bertrand, Y. J. K., Sanli, K., Eriksson, K. M., Vik, U., Veldre, V., Nilsson, R. H. (2013), Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for analysis of environmental sequencing data. Methods in Ecology and Evolution. doi: 10.1111/2041-210X.12073
A long time ago, we (Martin Eriksson, Martin Hartmann, Henrik Nilsson and me) were invited to write an overview on Metaxa for the Encyclopedia of Metagenomics. I guess the workload for pulling such a project off is huge, so there’s no surprise that it has taken a while for it to be accepted, but now it is available for consumption.
Meanwhile, Metaxa have been getting regular updates, and I hope to soon be able to show you a new major update to the software, bringing it up to the next generation of metagenomics. More on that soon.