Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery


This weekend, F1000Research put online the non-peer-reviewed version of the paper resulting from a workshop arranged by the JRC in Italy last year (1). (I will refer to this as a preprint, but at F1000Research the line is quite blurry between preprint and published paper.) The paper describes various challenges arising from the process of designing a benchmark strategy for bioinformatics pipelines (2) in the identification of antimicrobial resistance genes in next generation sequencing data.

The paper discusses issues about the benchmarking datasets used, testing samples, evaluation criteria for the performance of different tools, and how the benchmarking dataset should be created and distributed. Specifically, we address the following questions:

  • How should a benchmark strategy handle the current and expanding universe of NGS platforms?
  • What should be the quality profile (in terms of read length, error rate, etc.) of in silico reference materials?
  • Should different sets of reference materials be produced for each platform? In that case, how to ensure no bias is introduced in the process?
  • Should in silico reference material be composed of the output of real experiments, or simulated read sets? If a combination is used, what is the optimal ratio?
  • How is it possible to ensure that the simulated output has been simulated “correctly”?
  • For real experiment datasets, how to avoid the presence of sensitive information?
  • Regarding the quality metrics in the benchmark datasets (e.g. error rate, read quality), should these values be fixed for all datasets, or fall within specific ranges? How wide can/should these ranges be?
  • How should the benchmark manage the different mechanisms by which bacteria acquire resistance?
  • What is the set of resistance genes/mechanisms that need to be included in the benchmark? How should this set be agreed upon?
  • Should datasets representing different sample types (e.g. isolated clones, environmental samples) be included in the same benchmark?
  • Is a correct representation of different bacterial species (host genomes) important?
  • How can the “true” value of the samples, against which the pipelines will be evaluated, be guaranteed?
  • What is needed to demonstrate that the original sample has been correctly characterised, in case real experiments are used?
  • How should the target performance thresholds (e.g. specificity, sensitivity, accuracy) for the benchmark suite be set?
  • What is the impact of these performance thresholds on the required size of the sample set?
  • How can the benchmark stay relevant when new resistance mechanisms are regularly characterized?
  • How is the continued quality of the benchmark dataset ensured?
  • Who should generate the benchmark resource?
  • How can the benchmark resource be efficiently shared?
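To make the performance thresholds mentioned in the questions above a bit more concrete, here is a minimal Python sketch of how a single pipeline run could be scored against a sample with a known "true" gene content. This is not taken from the paper: the function name, the idea of scoring against a fixed gene universe, and the example gene sets are assumptions made purely for illustration.

```python
def benchmark_metrics(truth, predicted, universe):
    """Score one pipeline run against the known gene content of a sample.

    truth     -- genes known to be present in the sample
    predicted -- genes reported by the pipeline
    universe  -- all genes the benchmark considers (assumed here; needed for true negatives)
    """
    truth, predicted = set(truth), set(predicted)
    universe = set(universe) | truth | predicted

    tp = len(truth & predicted)             # present and reported
    fp = len(predicted - truth)             # reported but absent
    fn = len(truth - predicted)             # present but missed
    tn = len(universe - truth - predicted)  # absent and not reported

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Illustrative (made-up) sample: three genes truly present, one missed, one false call.
universe = {"blaCTX-M-15", "tetA", "sul1", "mecA", "vanA"}
truth = {"blaCTX-M-15", "tetA", "sul1"}
predicted = {"blaCTX-M-15", "tetA", "mecA"}
metrics = benchmark_metrics(truth, predicted, universe)
```

Note that specificity depends entirely on how the universe of candidate genes is defined, which is one reason why the question of which genes and mechanisms to include in the benchmark matters so much.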

Of course, we have not answered all these questions, but I think we have come down to a decent description of the problems, which we see as an important foundation for solving these issues and implementing the benchmarking standard. Some of these issues were tackled in our review paper from last year on using metagenomics to study resistance genes in microbial communities (3). The paper also somewhat connects to the database curation paper we published in 2016 (4), although this time the strategies deal with the testing datasets rather than the actual databases. The paper is the first outcome of the workshop arranged by the JRC on “Next-generation sequencing technologies and antimicrobial resistance” held October 4-5 last year in Ispra, Italy. You can find the paper here (it’s open access).

References and notes

  1. Angers-Loustau A, Petrillo M, Bengtsson-Palme J, Berendonk T, Blais B, Chan KG, Coque TM, Hammer P, Heß S, Kagkli DM, Krumbiegel C, Lanza VF, Madec J-Y, Naas T, O’Grady J, Paracchini V, Rossen JWA, Ruppé E, Vamathevan J, Venturi V, Van den Eede G: The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research, 7, 459 (2018). doi: 10.12688/f1000research.14509.1
  2. You may remember that I hate the term “pipeline” for bioinformatics protocols. I would have preferred if it was called workflows or similar, but the term “pipeline” has taken hold and I guess this is a battle where I have essentially lost. The bioinformatics workflows will be known as pipelines, for better and worse.
  3. Bengtsson-Palme J, Larsson DGJ, Kristiansson E: Using metagenomics to investigate human and environmental resistomes. Journal of Antimicrobial Chemotherapy, 72, 2690–2703 (2017). doi: 10.1093/jac/dkx199
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034

Mitochondrial DNA Part B today published a mitochondrial genome announcement paper (1) in which I was involved in doing the assemblies and annotating them. The paper describes the mitogenome of Calanus glacialis, a marine planktonic copepod, which is a keystone species in the Arctic Ocean. The mitogenome is 20,674 bp long, and includes 13 protein-coding genes, 2 rRNA genes and 22 tRNA genes. While this is of course not a huge paper, we believe that this new resource will be of interest in understanding the structure and dynamics of C. glacialis populations. The main work in this paper has been carried out by Marvin Choquet at Nord University in Bodø, Norway. So hats off to him for great work, thanks Marvin! The paper can be read here.

Reference

  1. Choquet M, Alves Monteiro HJ, Bengtsson-Palme J, Hoarau G: The complete mitochondrial genome of the copepod Calanus glacialis. Mitochondrial DNA Part B, 2, 2, 506–507 (2017). doi: 10.1080/23802359.2017.1361357 [Paper link]

As the 8th Next Generation Sequencing Congress in London is drawing to a close as I write this, I have a few reflections that might warrant sharing. The first thing that has been apparent this year compared to the two previous times I have visited the event (in 2012 and 2013) is that there was very little talk about where Illumina sequencing is heading next. Instead the discussion was about the applications of Illumina sequencing in the clinical setting; so apparently this is now so mainstream that we only expect slow progress towards longer reads. Apart from that, Illumina is a completed, mature technology. Instead, the spotlight is now pointing entirely towards long-read sequencing (PacBio, NanoPore) as the next big thing. However, the excitement around these technologies has also sort of faded compared to 2013, when they were soon to arrive. Indeed, it seems like there’s not much to be excited about in the sequencing field at the moment, or at least Oxford Global (who are hosting the conference) has failed to get these technologies here.

What also strikes me is the vast amounts of talk about RNAseq of cancer cells. The scope of this event has narrowed dramatically in the past three years. Which makes me substantially less interested in returning next year. If there is not much to be excited about, and the focus is only on cancer sequencing – despite the human microbiota being a very hot topic at the moment – what is the reason for non-cancer researchers to come to the event? There will need to be a stark shift towards another direction of this event if the arrangers want it to remain a broad NGS event. Otherwise, they may just as well go all in and rename the event the Next Generation Sequencing of Cancer Congress. But I hope they choose to widen the scope again; conferences discussing technology as a foundation for a variety of applications are important meeting points and spawning grounds for novel ideas.

I am happy to announce that our Viewpoint article on strategies for improving sequence databases has now been published in the journal Proteomics. The paper (1) defines some central problems hampering genomic, proteomic and metagenomic analyses and suggests five strategies to improve the situation:

  1. Clearly separate experimentally verified and unverified sequence entries
  2. Enable a system for tracing the origins of annotations
  3. Separate entries with high-quality, informative annotation from less useful ones
  4. Integrate automated quality-control software whenever such tools exist
  5. Facilitate post-submission editing of annotations and metadata associated with sequences
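As a rough illustration of how strategies 1, 2 and 5 could fit together in a single record, here is a minimal Python sketch. It is not how any existing database implements this, and all class names, field names and the accession are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationEvent:
    """One step in the annotation history of an entry (hypothetical structure)."""
    annotation: str
    source: str    # who or what produced the annotation, e.g. a curator or a tool
    evidence: str  # "experimental" or "predicted"
    note: str = ""


@dataclass
class SequenceEntry:
    accession: str
    history: list = field(default_factory=list)

    def annotate(self, annotation, source, evidence, note=""):
        # Post-submission edits append to the history instead of overwriting it,
        # so the origin of every annotation stays traceable.
        self.history.append(AnnotationEvent(annotation, source, evidence, note))

    @property
    def annotation(self):
        return self.history[-1].annotation if self.history else None

    @property
    def verified(self):
        # An entry counts as verified only if its current annotation
        # is backed by experimental evidence.
        return bool(self.history) and self.history[-1].evidence == "experimental"


entry = SequenceEntry("HYPO_0001")  # made-up accession
entry.annotate("beta-lactamase", source="homology search", evidence="predicted")
verified_before = entry.verified
entry.annotate("beta-lactamase", source="curator", evidence="experimental",
               note="confirmed by activity assay")
verified_after = entry.verified
```

The design choice here is that edits never destroy earlier annotations, so a user can always separate verified from unverified entries and see how an annotation came about.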

The paper is not long, so I encourage you to read it in its entirety. We believe that spreading this knowledge and pushing solutions to problems related to poor annotation metadata is vastly important in this era of big data. Although we specifically address protein-coding genes in this paper, the same logic also applies to other types of biological sequences. In this way the paper is related to my previous work with Henrik Nilsson on improving annotation data for taxonomic barcoding genes (2-4). This paper was one of the main end-results of the GoBiG network, and the backstory on the paper follows below the references…

References

  1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, 30, 2, 145–150 (2015). doi: 10.1264/jsme2.ME14121

Backstory
In June 2013, the Gothenburg Bioinformatics Group for junior scientists (GoBiG) arranged a workshop with two themes: “Parallelized quantification of genes in large metagenomic datasets” and “Assigning functional predictions to NGS data”. The following discussion on how database quality influenced results and what could be done to improve the situation was rather intense, and several good ideas were thrown around. I took notes from the meeting, and wrote them up on the balcony that warm summer evening. In fact, the notes were good enough to be an early embryo for a manuscript. So I sent them to some of the most active GoBiG members (Kaisa Thorell and Fredrik Boulund), who were positive regarding the idea to turn them into a manuscript. I wrote it up more properly and we decided that everyone who contributed ideas at the meeting would be invited to become co-authors. We submitted the manuscript in early 2014, only to see it (rather brutally) rejected. At that point most of us were caught up in our own projects, so nothing happened to this manuscript for over a year. Then we decided to give it another go, updated the manuscript heavily and changed a few parts to better reflect the current database situation (at this point, e.g., UniProt had already started implementing some of our suggested ideas). Still, some of the proposed strategies were more radical in 2013 than they would be now, more than three years later. We asked the Proteomics editors if they would be interested in the manuscript, and they turned out to be very positive. Indeed, the entire experience with the editors at Proteomics has been very pleasant. I am very thankful to the GoBiG team for this time, and to the editors at Proteomics who saw the value of this manuscript.

Yesterday, a paper I co-authored with my colleagues Chandan Pal, Erik Kristiansson and Joakim Larsson on the co-occurrences of resistance genes against antibiotics, biocides and metals in bacterial genomes and plasmids was published in BMC Genomics. In this paper (1) we utilize the publicly available, fully sequenced genomes and plasmids in GenBank to investigate the co-occurrence network of resistance genes, to better understand risks for co-selection for resistance against different types of compounds. In short, the findings of the paper are that:

  • ARGs are associated with BMRG-carrying bacteria and the co-selection potential of biocides and metals is specific towards certain antibiotics
  • Clinically important genera host the largest numbers of ARGs and BMRGs and those also have the highest co-selection potential
  • Bacteria isolated from human and domestic animal origins have the highest co-selection potential
  • Plasmids with co-selection potential tend to be conjugative and carry toxin-antitoxin systems
  • Mercury and QACs are potential co-selectors of ARGs on plasmids, however BMRGs are common on chromosomes and could still have indirect co-selection potential
  • 14% of bacteria and more than 70% of the plasmids completely lacked resistance genes

This analysis was possible thanks to the BacMet database of antibacterial biocide and metal resistance genes, published about two years ago (2). The visualization of the plasmid co-occurrence network we ended up with can be seen below. Note the strong connection between the mercury resistance mer operon and the antibiotic resistance genes to the right.

On a side note, it is interesting to note that the underrepresentation of detoxification systems in marine environments we noted last year (3) still seems to hold for genomes (and particularly plasmids), supporting the genome streamlining hypothesis (4).
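For readers wondering what a co-occurrence network boils down to computationally, here is a minimal Python sketch that counts how often pairs of resistance genes are found on the same replicon. This is only an illustration of the general idea, not the method used in the paper, and the plasmid names and gene contents below are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(gene_sets):
    """Count in how many replicons each pair of genes occurs together.

    gene_sets -- mapping from a replicon name (genome or plasmid) to its resistance genes
    Returns a Counter keyed by alphabetically sorted gene pairs; each pair is one
    edge of the network and the count is its weight.
    """
    pair_counts = Counter()
    for genes in gene_sets.values():
        # Deduplicate and sort so that (a, b) and (b, a) end up as the same edge.
        for gene_a, gene_b in combinations(sorted(set(genes)), 2):
            pair_counts[(gene_a, gene_b)] += 1
    return pair_counts


# Hypothetical plasmids; the last one carries no resistance genes at all.
plasmids = {
    "plasmid_A": ["merA", "sul1", "qacE"],
    "plasmid_B": ["merA", "sul1"],
    "plasmid_C": ["tetA"],
    "plasmid_D": [],
}
edges = cooccurrence_counts(plasmids)
```

In this toy example the mercury resistance gene and the sulfonamide resistance gene share the heaviest edge, which is the kind of pattern that would suggest co-selection potential.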

References:

  1. Pal C, Bengtsson-Palme J, Kristiansson E, Larsson DGJ: Co-occurrence of resistance genes to antibiotics, biocides and metals reveals novel insights into their co-selection potential. BMC Genomics, 16, 964 (2015). doi: 10.1186/s12864-015-2153-5 [Paper link]
  2. Pal C, Bengtsson-Palme J, Rensing C, Kristiansson E, Larsson DGJ: BacMet: Antibacterial Biocide and Metal Resistance Genes Database. Nucleic Acids Research, 42, D1, D737-D743 (2014). doi: 10.1093/nar/gkt1252 [Paper link]
  3. Bengtsson-Palme J, Alm Rosenblad M, Molin M, Blomberg A: Metagenomics reveals that detoxification systems are underrepresented in marine bacterial communities. BMC Genomics, 15, 749 (2014). doi: 10.1186/1471-2164-15-749 [Paper link]
  4. Giovannoni SJ, Cameron TJ, Temperton B: Implications of streamlining theory for microbial ecology. ISME Journal, 8, 1553-1565 (2014).

The paper we published in August on travelers carrying resistance genes with them in their gut microbiota has now been typeset and got proper volume and issue numbers assigned to it in Antimicrobial Agents and Chemotherapy. Take a look at it, I personally think it’s quite good-looking.

Also, if you understand Swedish, here is an interview with me broadcast on Swedish Radio last month about this study and its consequences.

The new citation for the paper is:

  • Bengtsson-Palme J, Angelin M, Huss M, Kjellqvist S, Kristiansson E, Palmgren H, Larsson DGJ, Johansson A: The human gut microbiome as a transporter of antibiotic resistance genes between continents. Antimicrobial Agents and Chemotherapy, 59, 10, 6551-6560 (2015). doi: 10.1128/AAC.00933-15 [Paper link]

Earlier today, my most recent paper (1) became available online, describing resistance gene patterns in the gut microbiota of Swedes before and after travel to the Indian peninsula and central Africa. In this work, we have used metagenomic sequencing of the intestinal microbiome of Swedish students returning from exchange programs to show that the abundance of antibiotic resistance genes in several classes is increased after travel. This work reiterates the findings of several papers describing uptake of resistant bacteria (2-8) or resistance genes (9-11) after travel to destinations with a worse resistance situation.

Our paper is important because it:

  1. Addresses the abundance of a vast range of resistance genes (more than 300).
  2. Finds evidence that the overall relative abundance of antibiotic resistance genes increased after travel, without any intake of antibiotics.
  3. Shows that the sensitivity of metagenomics was, despite very deep sequencing efforts, not sufficient to detect acquisition of the low-abundant (CTX-M) resistance genes responsible for observed ESBL phenotypes.
  4. Reveals a “core resistome” of resistance genes that are more or less omnipresent, and remain relatively stable regardless of travel, while changes seem to occur in the more variable part of the resistome.
  5. Hints at increased abundance of Proteobacteria after travel, although this increase could not specifically be linked to resistance gene increases.
  6. Uses de novo metagenomic assembly to physically link resistance genes in the same sample, giving hints of co-resistance patterns in the gut microbiome.

The paper was a collaboration with Martin Angelin, Helena Palmgren and Anders Johansson at Umeå University, and was made possible by bioinformatics support from SciLifeLab in Stockholm. I highly recommend reading it as a complement to e.g. the Forslund et al. paper (12) describing country-specific antibiotic resistance patterns in the gut microbiota.

Taken together, this study offers a broadened perspective on how the antibiotic resistance potential of the human gut microbiome changes after travel, providing an independent complement to previous studies targeting a limited number of bacterial species or antibiotic resistance genes. Understanding how resistance genes travel the globe is hugely important, since resistance in principle only needs to appear in a pathogen once; improper hygiene and travel may then spread novel resistance genes across continents rapidly (13,14).

References

  1. Bengtsson-Palme J, Angelin M, Huss M, Kjellqvist S, Kristiansson E, Palmgren H, Larsson DGJ, Johansson A: The human gut microbiome as a transporter of antibiotic resistance genes between continents. Antimicrob Agents Chemother Accepted manuscript posted online (2015). doi: 10.1128/AAC.00933-15 [Paper link]
  2. Gaarslev K, Stenderup J: Changes during travel in the composition and antibiotic resistance pattern of the intestinal Enterobacteriaceae flora: results from a study of mecillinam prophylaxis against travellers’ diarrhoea. Curr Med Res Opin 9:384–387 (1985).
  3. Paltansing S, Vlot JA, Kraakman MEM, Mesman R, Bruijning ML, Bernards AT, Visser LG, Veldkamp KE: Extended-spectrum β-lactamase-producing enterobacteriaceae among travelers from the Netherlands. Emerging Infect. Dis. 19:1206–1213 (2013).
  4. Ruppé E, Armand-Lefèvre L, Estellat C, El-Mniai A, Boussadia Y, Consigny PH, Girard PM, Vittecoq D, Bouchaud O, Pialoux G, Esposito-Farèse M, Coignard B, Lucet JC, Andremont A, Matheron S: Acquisition of carbapenemase-producing Enterobacteriaceae by healthy travellers to India, France, February 2012 to March 2013. Euro Surveill. 19 (2014).
  5. Kennedy K, Collignon P: Colonisation with Escherichia coli resistant to “critically important” antibiotics: a high risk for international travellers. Eur J Clin Microbiol Infect Dis 29:1501–1506 (2010).
  6. Tham J, Odenholt I, Walder M, Brolund A, Ahl J, Melander E: Extended-spectrum beta-lactamase-producing Escherichia coli in patients with travellers’ diarrhoea. Scand. J. Infect. Dis. 42:275–280 (2010).
  7. Östholm-Balkhed Å, Tärnberg M, Nilsson M, Nilsson LE, Hanberger H, Hällgren A, Travel Study Group of Southeast Sweden: Travel-associated faecal colonization with ESBL-producing Enterobacteriaceae: incidence and risk factors. J Antimicrob Chemother 68:2144–2153 (2013).
  8. Kantele A, Lääveri T, Mero S, Vilkman K, Pakkanen SH, Ollgren J, Antikainen J, Kirveskari J: Antimicrobials increase travelers’ risk of colonization by extended-spectrum betalactamase-producing enterobacteriaceae. Clin Infect Dis 60:837–846 (2015).
  9. von Wintersdorff CJH, Penders J, Stobberingh EE, Oude Lashof AML, Hoebe CJPA, Savelkoul PHM, Wolffs PFG: High rates of antimicrobial drug resistance gene acquisition after international travel, The Netherlands. Emerging Infect. Dis. 20:649–657 (2014).
  10. Tängdén T, Cars O, Melhus A, Löwdin E: Foreign travel is a major risk factor for colonization with Escherichia coli producing CTX-M-type extended-spectrum beta-lactamases: a prospective study with Swedish volunteers. Antimicrob Agents Chemother 54:3564–3568 (2010).
  11. Dhanji H, Patel R, Wall R, Doumith M, Patel B, Hope R, Livermore DM, Woodford N: Variation in the genetic environments of bla(CTX-M-15) in Escherichia coli from the faeces of travellers returning to the United Kingdom. J Antimicrob Chemother 66:1005–1012 (2011).
  12. Forslund K, Sunagawa S, Kultima JR, Mende DR, Arumugam M, Typas A, Bork P: Country-specific antibiotic use practices impact the human gut resistome. Genome Res 23:1163–1169 (2013).
  13. Bengtsson-Palme J, Larsson DGJ: Antibiotic resistance genes in the environment: prioritizing risks. Nat Rev Microbiol 13:396 (2015).
  14. Larsson DGJ: Antibiotics in the environment. Ups J Med Sci 119:108–112 (2014).

If you are thinking about doing a PhD and think that bioinformatics and antibiotic resistance are cool subjects, then now is your chance to come and join us for the next four years! There is a PhD position open in Joakim Larsson’s group, which means that if you get the job you will work with me, Joakim Larsson, Erik Kristiansson, Ørjan Samuelsen and Carl-Fredrik Flach on a super-interesting project relating to discovery of novel beta-lactamase genes (NoCURE). The project aims to better understand where, how and under what circumstances these genetic transfer events take place, in order to provide opportunities to limit or delay resistance development and thus increase the functional lifespan of precious antibiotics. The lion’s share of the work will be related to interpreting large-scale sequencing data generated by collaborators within the project; both genome sequencing and metagenomic data.

This is a great opportunity to prove your bioinformatics skills and use them for something urgently important. Full details about the position can be found here.

Metaxa2 is here!


The new version of Metaxa, Metaxa2, which I first started talking about more than 1.5 years ago, has finally been determined to be stable enough that we can officially release it! The release comes around the same time as we submitted a paper describing the changes in it, but I will briefly go through the changes here:

  • Metaxa2 now handles extraction and classification of LSU rRNA sequences in addition to SSU rRNA
  • The classification engine has been completely redesigned, and now enables accurate taxonomic classifications down to the genus – or in some cases – species level
  • The classification database has been updated, and is now based on the SILVA 111 release
  • The Metaxa2 Taxonomic Traversal Tool – metaxa2_ttt – has been added to the package, to ease the counting of rRNA sequences in different organism groups (at various taxonomic levels)
  • Metaxa2 adds support for paired-end libraries
  • It is now possible to directly input sequences in FASTQ format to Metaxa2
  • The support for libraries with short read lengths (~100 bp) has been vastly improved (and short reads are now assumed by the default settings)
  • Metaxa2 can do quality pre-filtering of reads in FASTQ-format
  • Metaxa2 adds support for the modern BLAST+ package (although the old blastall version is still default)
  • Compatibility with the HMMER 3.1 beta

Metaxa2 brings together a large set of features that we have been gradually incorporating since 2011, many of which have been dependent on each other. Most of the new features and changes are thoroughly explained in the manual. While we hope Metaxa2 is bug free, there will likely be bugs caused by usage scenarios we have not envisioned. I therefore encourage anyone who comes across some unexpected behavior to send me an e-mail. In particular, I would like to know how the software performs using HMMER 3.1 and BLAST+, where testing has been limited compared to older parts of the code.

We hope that you will find Metaxa2 useful, and that it will bring taxonomic assessment of metagenomes another step forward! Metaxa2 can be downloaded here.

You know the feeling when your assembler supports paired-end sequences, but your FASTQ quality filterer doesn’t care about which reads belong together as pairs? Meaning that you end up with a mess of sequences that you have to script together in some way. Gosh, that feeling is way too common. It is for situations like that I have put together the Paired-End ToolKit (PETKit), a collection of FASTQ/FASTA sequence handling programs written in Perl. Currently the toolkit contains three command-line tools that do sequence conversion, quality filtering, and ORF prediction, all adapted for paired-end sequences specifically. You can read more about the programs, which are released as open source software, on the PETKit page. At the moment they lack proper documentation, but running the software with the “--help” option should bring up a useful set of options for each tool. This is still considered beta software, so any bug reports, and especially suggestions, are welcome.

Also, if you have an idea of another problem that is unsolved or badly executed for paired-end sequences, let me know, and I will see if I can implement it in PETKit.
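To illustrate the pairing problem described above, here is a minimal Python sketch of a pair-aware quality filter. It is not PETKit's implementation (PETKit is written in Perl), and the filtering criterion is an assumption chosen for simplicity: a mean Phred score per read with the Phred+33 encoding, dropping the whole pair if either mate fails. The sketch also assumes that both files list the mates in the same order.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from a four-lines-per-record FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        header, sequence, _plus, quality = record
        yield header, sequence, quality


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_pairs(forward, reverse, min_mean_q=20):
    """Yield only those read pairs in which both mates pass the quality threshold.

    Mates are matched by position, so the two outputs stay in sync:
    if one mate fails, its partner is discarded as well.
    """
    for read_1, read_2 in zip(read_fastq(forward), read_fastq(reverse)):
        if (mean_quality(read_1[2]) >= min_mean_q
                and mean_quality(read_2[2]) >= min_mean_q):
            yield read_1, read_2


# Tiny made-up example: the forward mate of the second pair has poor quality ('#' = Q2).
forward = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\n####\n")
reverse = io.StringIO("@r1/2\nTTTT\n+\nIIII\n@r2/2\nGGGG\n+\nIIII\n")
kept = list(filter_pairs(forward, reverse))
```

Only the first pair survives here, even though the reverse mate of the second pair is of perfectly good quality on its own. Filtering each file separately would have kept that read and left the two files out of sync.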