Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg

Browsing Posts tagged Bioinformatics

Yesterday, Molecular Ecology Resources put online an unedited version of a recent paper which I co-authored. This time, Rodney Richardson at Ohio State University has made a tremendous work of evaluating three taxonomic classification software – the RDP Naïve Bayesian Classifier, RTAX and UTAX – on a set of DNA barcoding regions commonly used for plants, namely the ITS2, and the matK, rbcL, trnL and trnH genes.

In the paper (1), we discuss the results, merits and limitations of the classifiers. In brief, we found that:

  • There is a considerable trade-off between accuracy and sensitivity for the classifiers tested, which indicates a need for improved sequence classification tools (2)
  • UTAX was superior with respect to error rate, but was exceedingly stringent and thus suffered from a low assignment rate
  • The RDP Naïve Bayesian Classifier displayed high sensitivity and low error at the family and order levels, but had a genus-level error rate of 9.6 percent
  • RTAX showed high sensitivity at all taxonomic ranks, but at the same time consistently produced the high error rates
  • The choice of locus has significant effects on the classification sensitivity of all tested tools
  • All classifiers showed strong relationships between database completeness, classification sensitivity and classification accuracy

We believe that the methods of comparison we have used are simple and robust, and thereby provides a methodological and conceptual foundation for future software evaluations. On a personal note, I will thoroughly enjoy working with Rodney and Reed again; I had a great time discussing the ins and outs of taxonomic classification with them! The paper can be found here.

References and notes

  1. Richardson RT, Bengtsson-Palme J, Johnson RM: Evaluating and Optimizing the Performance of Software Commonly Used for the Taxonomic Classification of DNA Sequence Data. Molecular Ecology Resources, Early view (2016). doi: 10.1111/1755-0998.12628 [Paper link]
  2. This is something that several classifiers also showed in the evaluation we did for the Metaxa2 paper (3). Interestingly enough, Metaxa2 is better at maintaining high accuracy also as sensitivity is increased.
  3. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]

SBW2016

Comments off

Today the 15th Swedish Bioinformatics Workshop starts in Linköping. For the first time in many many years, I will not be there, which feels super-strange. (I even took part in organizing the workshop two years ago, in Gothenburg.) I would have thoroughly enjoyed the schedule (the workshops look amazing), and hope that everyone will have a very good time there without me. I do miss you! Please make sure I get another opportunity next year!

Late yesterday, Microbiome put online our most recent work, covering the resistomes to antibiotics, biocides and metals across a vast range of environments. In the paper (1), we perform the largest characterization of resistance genes, mobile genetic elements (MGEs) and bacterial taxonomic compositions to date, covering 864 different metagenomes from humans (350), animals (145) and external environments such as soil, water, sewage, and air (369 in total). All the investigated metagenomes were sequenced to at least 10 million reads each, using Illumina technology, making the results more comparable across environments than in previous studies (2-4).

We found that the environment types had clear differences both in terms of resistance profiles and bacterial community composition. Humans and animals hosted microbial communities with limited taxonomic diversity as well as low abundance and diversity of biocide/metal resistance genes and MGEs. On the contrary, the abundance of ARGs was relatively high in humans and animals. External environments, on the other hand, showed high taxonomic diversity and high diversity of biocide/metal resistance genes and MGEs. Water, sediment and soil generally carried low relative abundance and few varieties of known ARGs, whereas wastewater and sludge were on par with the human gut. The environments with the largest relative abundance and diversity of ARGs, including genes encoding resistance to last resort antibiotics, were those subjected to industrial antibiotic pollution and air samples from a Beijing smog event.

A paper investigating this vast amount of data is of course hard to describe in a blog post, so I strongly suggest the interested reader to head over to Microbiome’s page and read the full paper (1). However, here’s a ver short summary of the findings:

  • The median relative abundance of ARGs across all environments was 0.035 copies per bacterial 16S rRNA
  • Antibiotic-polluted environments have (by far) the highest abundances of ARGs
  • Urban air samples carried high abundance and diversity of ARGs
  • Human microbiota has high abundance and diversity of known ARGs, but low taxonomic diversity compared to the external environment
  • The human and animal resistomes are dominated by tetracycline resistance genes
  • Over half of the ARGs were only detected in external environments, while 20.5 % were found in human, animal and at least one of the external environments
  • Biocide and metal resistance genes are more common in external environments than in the human microbiota
  • Human microbiota carries low abundance and richness of MGEs compared to most external environments

Importantly, less than 1.5 % of all detected ARGs were found exclusively in the human microbiome. At the same time, 57.5 % of the known ARGs were only detected in metagenomes from environmental samples, despite that the majority of the investigated ARGs were initially encountered in pathogens. Still, our analysis suggests that most of these genes are relatively rare in the human microbiota. Environmental samples generally contained a wider distribution of resistance genes to a more diverse set of antibiotics classes. For example, the relative abundance of beta-lactam resistance genes was much larger in external environments than in human and animal microbiomes. This suggests that the external environment harbours many more varieties of resistance genes than the ones currently known from the clinic. Indeed, functional metagenomics has resulted in the discovery of many novel ARGs in external environments (e.g. 5). This all fits well with an overall much higher taxonomic diversity of environmental microbial communities. In terms of consequences associated with the potential transfer of ARGs to human pathogens, we argue that unknown resistance genes are of greater concern than those already known to circulate among human-associated bacteria (6).

This study describes the potential for many external environments, including those subjected to pharmaceutical pollution, air and wastewater/sludge, to serve as hotspots for resistance development and/or transmission of ARGs. In addition, our results indicate that these environments may play important roles in the mobilization of yet unknown ARGs and their further transmission to human pathogens. To provide guidance for risk-reducing actions we – based on this study – suggest strict regulatory measures of waste discharges from pharmaceutical industries and encourage more attention to air in the transmission of antibiotic resistance (1).

References

  1. Pal C, Bengtsson-Palme J, Kristiansson E, Larsson DGJ: The structure and diversity of human, animal and environmental resistomes. Microbiome, 4, 54 (2016). doi: 10.1186/s40168-016-0199-5
  2. Durso LM, Miller DN, Wienhold BJ. Distribution and quantification of antibiotic resistant genes and bacteria across agricultural and non-agricultural metagenomes. PLoS One. 2012;7:e48325.
  3. Nesme J, Delmont TO, Monier J, Vogel TM. Large-scale metagenomic-based study of antibiotic resistance in the environment. Curr Biol. 2014;24:1096–100.
  4. Fitzpatrick D, Walsh F. Antibiotic resistance genes across a wide variety of metagenomes. FEMS Microbiol Ecol. 2016. doi:10.1093/femsec/fiv168.
  5. Allen HK, Moe LA, Rodbumrer J, Gaarder A, Handelsman J. Functional metagenomics reveals diverse β-lactamases in a remote Alaskan soil. ISME J. 2009;3:243–51.
  6. Bengtsson-Palme J, Larsson DGJ: Antibiotic resistance genes in the environment: prioritizing risks. Nature Reviews Microbiology, 13, 369 (2015). doi: 10.1038/nrmicro3399-c1

I am part of the organizing committee for the newly invented annual meeting for GOTBIN – the Gothenburg Bioinformatics Network. We will arrange a meeting on December 6 to get the networking activities for 2017 kickstarted, and every bioinformatician in Gothenburg is invited!

GOTBIN was launched to bridge and bring together all researchers in Gothenburg who fully or partially dealt with bioinformatics in their research. Through the network it should be possible to quickly find other local researchers tackling the same research problems as you are; to find appropriate resources to run your analyses; and to discuss research or infrastructure problems as they arise. To keep the network alive and kicking, it is crucial to keep relations active. Furthermore, it is also crucial to interact with key persons in the GOTBIN network to keep the lists of active researchers, resources and discussion forums up to date.

To facilitate future communication, we invite everyone who works with bioinformatics in Gothenburg to participate in a get-together workshop. The purpose of this workshop is to find out who is working with what and where, and to better get to know each other. This is a great opportunity to meet your next collaboration partner, post-doc or supervisor! The event will take place in Birgit Thilander at Medicinareberget (just next to the large lecture hall Arvid Carlsson) on December 6th, from 9.00 to 12.00. Fika will be provided and we will arrange ice-breaker activities suitable for the number of participants, so please register at: https://goo.gl/forms/KYdiiZMBDf0F9hvp2 by the 16th of November.

We hope to find everyone with a research interest in bioinformatics there and that this will be the launch of the next era of GOTBIN in 2017! See you there!

I just want to highlight that the paper on strategies to improve database accuracy and usability we recently published in Proteomics (1) has been included in their most recent issue, which is a special issue focusing on Data Quality Issues in Proteomics. I highly recommend reading our paper (of course) and many of the other in the special issue. Happy reading!

On another note, I will be giving a talk next Wednesday (October 5th) on a seminar day on next generation sequencing in clinical microbiology, titled “Antibiotic resistance in the clinic and the environment – There and back again“. You are very welcome to the lecture hall at floor 3 in our building at Guldhedsgatan 10A here in Gothenburg if you are interested! (Bear in mind though that it all starts at 8.15 in the morning.)

Finally, it seems that I am going to the Next Generation Sequencing Congress in London this year, which will be very fun! Hope to see some of you dealing with sequencing there!

References

  1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034 [Paper link]

MycoKeys today put a paper online which I was involved in. The paper describes the results of a workshop in May, when we added and refined annotations for fungal ITS sequences according to the MIxS-Built Environment annotation standard (1). Fungi have been associated with a range of unwanted effects in the built environment, including asthma, decay of building materials, and food spoilage. However, the state of the metadata annotation of fungal DNA sequences from the built environment is very much incomplete in public databases. The workshop aimed to ease a little part of this problem, by distributing the re-annotation of public fungal ITS sequences across 36 persons. In total, we added or changed of 45,488 data points drawing from published literature, including addition of 8,430 instances of countries of collection, 5,801 instances of building types, and 3,876 instances of surface-air contaminants. The results have been implemented in the UNITE database and shared with other online resources. I believe, that distributed initiatives like this (and the ones I have been involved in in the past (2,3)) serve a very important purpose for establishing better annotation of sequence data, an issue I have brought up also for sequences outside of barcoding genes (4). The full paper can be found here.

References

  1. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Untersehrer M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034

I just wanted to share an experience with the FARAO software we recently published a paper about, and its compatibility with the GD and libpng libraries (used for creating PNG files). I have got questions from users about how to get this to work, and to test it out I decided to try to install it on my Mac. It turned out that it is nearly impossible to get this to work. These two packages are extremely picky with versions and dependencies. After trying for about on hour, I gave up and turned to my Linux machine. Surprisingly, I could not get it to work from scratch there either, despite that I have had it running (with some previous version combination) when we programmed and tested FARAO.

I find this extremely annoying myself, and I will try to look into other solutions for PNG or JPEG output from FARAO. In the mean time, I can only recommend to instead use the EPS output option, which produces more nice-looking figures and is considerably easier to set up. I am sorry about this and hope to be able to provide a better solution soon.

I am happy to announce that our Viewpoint article on strategies for improving sequence databases has now been published in the journal Proteomics. The paper (1) defines some central problems hampering genomic, proteomic and metagenomic analyses and suggests five strategies to improve the situation:

  1. Clearly separate experimentally verified and unverified sequence entries
  2. Enable a system for tracing the origins of annotations
  3. Separate entries with high-quality, informative annotation from less useful ones
  4. Integrate automated quality-control software whenever such tools exist
  5. Facilitate post-submission editing of annotations and metadata associated with sequences

The paper is not long, so I encourage you to read it in its entirety. We believe that spreading this knowledge and pushing solutions to problems related to poor annotation metadata is vastly important in this era of big data. Although we specifically address protein-coding genes in this paper, the same logic also applies to other types of biological sequences. In this way the paper is related to my previous work with Henrik Nilsson on improving annotation data for taxonomic barcoding genes (2-4). This paper was one of the main end-results of the GoBiG network, and the backstory on the paper follows below the references…

References

  1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, 30, 2, 145–150 (2015). doi: 10.1264/jsme2.ME14121

Backstory
In June 2013, the Gothenburg Bioinformatics Group for junior scientists (GoBiG) arranged a workshop with two themes: “Parallelized quantification of genes in large metagenomic datasets” and “Assigning functional predictions to NGS data”. The following discussion on how to database quality influenced results and what could be done to improve the situation was rather intense, and several good ideas were thrown around. I took notes from the meeting, and in the evening I put them down during a warm summer night at the balcony. In fact, the notes were good enough to be an early embryo for a manuscript. So I sent it to some of the most active GoBiG members (Kaisa Thorell and Fredrik Boulund), who were positive regarding the idea to turn it into a manuscript. I wrote it together more properly and we decided that everyone who contributed with ideas at the meeting would be invited to become co-authors. We submitted the manuscript in early 2014, only to see it (rather brutally) rejected. At that point most of us were sucked up in their own projects, so nothing happened to this manuscript for over a year. Then we decided to give it another go, updated the manuscript heavily and changed a few parts to better reflect the current database situation (at this point, e.g., UniProt had already started implementing some of our suggested ideas). Still, some of the proposed strategies were more radical in 2013 than they would be now, more than three years later. We asked the Proteomics editors if they would be interested in the manuscript, and they turned out to be very positive. Indeed, the entire experience with the editors at Proteomics has been very pleasant. I am very thankful to the GoBiG team for this time, and to the editors at Proteomics who saw the value of this manuscript.

Late last year, we introduced FARAO – the Flexible All-Round Annotation Organizer – a software tool that allows visualization of annotated features on contigs. Today, the Applications Note describing the software was published as an advance access paper in Bioinformatics (1). As I have described before, storing and visualizing annotation and coverage information in FARAO has a number of advantages. FARAO is able to:

  • Integrate annotation and coverage information for the same sequence set, enabling coverage estimates of annotated features
  • Scale across millions of sequences and annotated features
  • Filter sequences, such that only entries with annotations satisfying certain given criteria will be outputted
  • Handle annotation and coverage data produced by a range of different bioinformatics tools
  • Handle custom parsers through a flexible interface, allowing for adaption of the software to virtually any bioinformatic tool not supported out of the box
  • Produce high-quality EPS output
  • Integrate with MySQL databases

I have previously used FARAO to produce annotation figures in our paper on a polluted Indian lake (2), as well as in a paper on sewage treatment plants (which is in press and should be coming out any day now). We hope that the tool will find many more uses in other projects in the future!

References

  1. Hammarén R, Pal C, Bengtsson-Palme JFARAO: The Flexible All-Round Annotation Organizer. Bioinformatics, advance access (2016). doi: 10.1093/bioinformatics/btw499 [Paper link]
  2. Bengtsson-Palme J, Boulund F, Fick J, Kristiansson E, Larsson DGJ: Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 648 (2014). doi: 10.3389/fmicb.2014.00648 [Paper link]

Today marks the five year anniversary for the Metaxa software’s initial release. Much has happened to the software since; Metaxa started off as an rRNA extraction utility for metagenomic data (1), including coarse classification to organism/organelle type. Since it has gained full-scale taxonomic classification ability better or on par with other software packages (2), much greater speed, support for the LSU gene, gained a range of related software tools (3), and spurred development of other tools such as ITSx (4). I have also been involved in no less than four peer-reviewed publications directly related to the software (1-3,5).

But it does not end here; these five years were just the beginning. We are – in different constellations – working on further enhancements to Metaxa2, including support for more genes, an updated classification database, and better customization options. I am very much still devoted to keep Metaxa2 alive and relevant as a tool for taxonomic analysis of metagenomes, applicable whenever accuracy is a key parameter. Thanks for being part of the community for these five years!

References

  1. Bengtsson J, Eriksson KM, Hartmann M, Wang Z, Shenoy BD, Grelet G, Abarenkov K, Petri A, Alm Rosenblad M, Nilsson RH: Metaxa: A software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets. Antonie van Leeuwenhoek, 100, 3, 471–475 (2011). doi:10.1007/s10482-011-9598-6. [Paper link]
  2. Bengtsson-Palme J, Hartmann M, Eriksson KM, Pal C, Thorell K, Larsson DGJ, Nilsson RH: Metaxa2: Improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15, 6, 1403–1414 (2015). doi: 10.1111/1755-0998.12399 [Paper link]
  3. Bengtsson-Palme J, Thorell K, Wurzbacher C, Sjöling Å, Nilsson RH: Metaxa2 Diversity Tools: Easing microbial community analysis with Metaxa2. Ecological Informatics, 33, 45–50 (2016). doi: 10.1016/j.ecoinf.2016.04.004 [Paper link]
  4. Bengtsson-Palme J, Ryberg M, Hartmann M, Branco S, Wang Z, Godhe A, De Wit P, Sánchez-García M, Ebersberger I, de Souza F, Amend AS, Jumpponen A, Unterseher M, Kristiansson E, Abarenkov K, Bertrand YJK, Sanli K, Eriksson KM, Vik U, Veldre V, Nilsson RH: Improved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for use in environmental sequencing. Methods in Ecology and Evolution, 4, 10, 914–919 (2013). doi: 10.1111/2041-210X.12073 [Paper link]
  5. Bengtsson-Palme J, Hartmann M, Eriksson KM, Nilsson RH: Metaxa, overview. In:Nelson K. (Ed.) Encyclopedia of Metagenomics: SpringerReference (www.springerreference.com). Springer-Verlag Berlin Heidelberg (2013). doi: 10.1007/978-1-4614-6418-1_239-6 [Link]