Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery

Browsing Posts tagged Quality control

This weekend, F1000Research put online the non-peer-reviewed version of the paper resulting from a workshop arranged by the JRC in Italy last year (1). (I will refer to this as a preprint, but at F1000Research the line is quite blurry between preprint and published paper.) The paper describes various challenges arising from the process of designing a benchmark strategy for bioinformatics pipelines (2) in the identification of antimicrobial resistance genes in next generation sequencing data.

The paper discusses issues about the benchmarking datasets used, testing samples, evaluation criteria for the performance of different tools, and how the benchmarking dataset should be created and distributed. Specially, we address the following questions:

  • How should a benchmark strategy handle the current and expanding universe of NGS platforms?
  • What should be the quality profile (in terms of read length, error rate, etc.) of in silico reference materials?
  • Should different sets of reference materials be produced for each platform? In that case, how to ensure no bias is introduced in the process?
  • Should in silico reference material be composed of the output of real experiments, or simulated read sets? If a combination is used, what is the optimal ratio?
  • How is it possible to ensure that the simulated output has been simulated “correctly”?
  • For real experiment datasets, how to avoid the presence of sensitive information?
  • Regarding the quality metrics in the benchmark datasets (e.g. error rate, read quality), should these values be fixed for all datasets, or fall within specific ranges? How wide can/should these ranges be?
  • How should the benchmark manage the different mechanisms by which bacteria acquire resistance?
  • What is the set of resistance genes/mechanisms that need to be included in the benchmark? How should this set be agreed upon?
  • Should datasets representing different sample types (e.g. isolated clones, environmental samples) be included in the same benchmark?
  • Is a correct representation of different bacterial species (host genomes) important?
  • How can the “true” value of the samples, against which the pipelines will be evaluated, be guaranteed?
  • What is needed to demonstrate that the original sample has been correctly characterised, in case real experiments are used?
  • How should the target performance thresholds (e.g. specificity, sensitivity, accuracy) for the benchmark suite be set?
  • What is the impact of these performance thresholds on the required size of the sample set?
  • How can the benchmark stay relevant when new resistance mechanisms are regularly characterized?
  • How is the continued quality of the benchmark dataset ensured?
  • Who should generate the benchmark resource?
  • How can the benchmark resource be efficiently shared?

Of course, we have not answered all these questions, but I think we have come down to a decent description of the problems, which we see as an important foundation for solving these issues and implementing the benchmarking standard. Some of these issues were tackled in our review paper from last year on using metagenomics to study resistance genes in microbial communities (3). The paper also somewhat connects to the database curation paper we published in 2016 (4), although this time the strategies deal with the testing datasets rather than the actual databases. The paper is the first outcome of the workshop arranged by the JRC on “Next-generation sequencing technologies and antimicrobial resistance” held October 4-5 last year in Ispra, Italy. You can find the paper here (it’s open access).

References and notes

  1. Angers-Loustau A, Petrillo M, Bengtsson-Palme J, Berendonk T, Blais B, Chan KG, Coque TM, Hammer P, Heß S, Kagkli DM, Krumbiegel C, Lanza VF, Madec J-Y, Naas T, O’Grady J, Paracchini V, Rossen JWA, Ruppé E, Vamathevan J, Venturi V, Van den Eede G: The challenges of designing a benchmark strategy for bioinformatics pipelines in the identification of antimicrobial resistance determinants using next generation sequencing technologies. F1000Research, 7, 459 (2018). doi: 10.12688/f1000research.14509.1
  2. You may remember that I hate the term “pipeline” for bioinformatics protocols. I would have preferred if it was called workflows or similar, but the term “pipeline” has taken hold and I guess this is a battle where I have essentially lost. The bioinformatics workflows will be known as pipelines, for better and worse.
  3. Bengtsson-Palme J, Larsson DGJ, Kristiansson E: Using metagenomics to investigate human and environmental resistomes. Journal of Antimicrobial Chemotherapy, 72, 2690–2703 (2017). doi: 10.1093/jac/dkx199
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034

MycoKeys earlier this week published a paper describing the results of a workshop in Aberdeen in April last year, where we refined annotations for fungal ITS sequences from the built environment (1). This was a follow-up on a workshop in May 2016 (2) and the results have been implemented in the UNITE database and shared with other online resources. The paper has also been highlighted at microBEnet. I have very little time to further comment on this at this very moment, but I believe, as I wrote last time, that distributed initiatives like this (and the ones I have been involved in in the past (3,4)) serve a very important purpose for establishing better annotation of sequence data (5). The full paper can be found here.

References

  1. Nilsson RH, Taylor AFS, Adams RI, Baschien C, Bengtsson-Palme J, Cangren P, Coleine C, Daniel H-M, Glassman SI, Hirooka Y, Irinyi L, Iršenaite R, Martin-Sánchez PM, Meyer W, Oh S-O, Sampaio JP, Seifert KA, Sklenár F, Stubbe D, Suh S-O, Summerbell R, Svantesson S, Unterseher M, Visagie CM, Weiss M, Woudenberg J, Wurzbacher C, Van den Wyngaert S, Yilmaz N, Yurkov A, Kõljalg U, Abarenkov K: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from an April 10-11, 2017 workshop (Aberdeen, UK). MycoKeys, 28, 65–82 (2018). doi: 10.3897/mycokeys.28.20887 [Paper link]
  2. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Untersehrer M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  3. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  4. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  5. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034

I just want to highlight that the paper on strategies to improve database accuracy and usability we recently published in Proteomics (1) has been included in their most recent issue, which is a special issue focusing on Data Quality Issues in Proteomics. I highly recommend reading our paper (of course) and many of the other in the special issue. Happy reading!

On another note, I will be giving a talk next Wednesday (October 5th) on a seminar day on next generation sequencing in clinical microbiology, titled “Antibiotic resistance in the clinic and the environment – There and back again“. You are very welcome to the lecture hall at floor 3 in our building at Guldhedsgatan 10A here in Gothenburg if you are interested! (Bear in mind though that it all starts at 8.15 in the morning.)

Finally, it seems that I am going to the Next Generation Sequencing Congress in London this year, which will be very fun! Hope to see some of you dealing with sequencing there!

References

  1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16, 18, 2454–2460 (2016). doi: 10.1002/pmic.201600034 [Paper link]

MycoKeys today put a paper online which I was involved in. The paper describes the results of a workshop in May, when we added and refined annotations for fungal ITS sequences according to the MIxS-Built Environment annotation standard (1). Fungi have been associated with a range of unwanted effects in the built environment, including asthma, decay of building materials, and food spoilage. However, the state of the metadata annotation of fungal DNA sequences from the built environment is very much incomplete in public databases. The workshop aimed to ease a little part of this problem, by distributing the re-annotation of public fungal ITS sequences across 36 persons. In total, we added or changed of 45,488 data points drawing from published literature, including addition of 8,430 instances of countries of collection, 5,801 instances of building types, and 3,876 instances of surface-air contaminants. The results have been implemented in the UNITE database and shared with other online resources. I believe, that distributed initiatives like this (and the ones I have been involved in in the past (2,3)) serve a very important purpose for establishing better annotation of sequence data, an issue I have brought up also for sequences outside of barcoding genes (4). The full paper can be found here.

References

  1. Abarenkov K, Adams RI, Laszlo I, Agan A, Ambrioso E, Antonelli A, Bahram M, Bengtsson-Palme J, Bok G, Cangren P, Coimbra V, Coleine C, Gustafsson C, He J, Hofmann T, Kristiansson E, Larsson E, Larsson T, Liu Y, Martinsson S, Meyer W, Panova M, Pombubpa N, Ritter C, Ryberg M, Svantesson S, Scharn R, Svensson O, Töpel M, Untersehrer M, Visagie C, Wurzbacher C, Taylor AFS, Kõljalg U, Schriml L, Nilsson RH: Annotating public fungal ITS sequences from the built environment according to the MIxS-Built Environment standard – a report from a May 23-24, 2016 workshop (Gothenburg, Sweden). MycoKeys, 16, 1–15 (2016). doi: 10.3897/mycokeys.16.10000
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034

I am happy to announce that our Viewpoint article on strategies for improving sequence databases has now been published in the journal Proteomics. The paper (1) defines some central problems hampering genomic, proteomic and metagenomic analyses and suggests five strategies to improve the situation:

  1. Clearly separate experimentally verified and unverified sequence entries
  2. Enable a system for tracing the origins of annotations
  3. Separate entries with high-quality, informative annotation from less useful ones
  4. Integrate automated quality-control software whenever such tools exist
  5. Facilitate post-submission editing of annotations and metadata associated with sequences

The paper is not long, so I encourage you to read it in its entirety. We believe that spreading this knowledge and pushing solutions to problems related to poor annotation metadata is vastly important in this era of big data. Although we specifically address protein-coding genes in this paper, the same logic also applies to other types of biological sequences. In this way the paper is related to my previous work with Henrik Nilsson on improving annotation data for taxonomic barcoding genes (2-4). This paper was one of the main end-results of the GoBiG network, and the backstory on the paper follows below the references…

References

  1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, 30, 2, 145–150 (2015). doi: 10.1264/jsme2.ME14121

Backstory
In June 2013, the Gothenburg Bioinformatics Group for junior scientists (GoBiG) arranged a workshop with two themes: “Parallelized quantification of genes in large metagenomic datasets” and “Assigning functional predictions to NGS data”. The following discussion on how to database quality influenced results and what could be done to improve the situation was rather intense, and several good ideas were thrown around. I took notes from the meeting, and in the evening I put them down during a warm summer night at the balcony. In fact, the notes were good enough to be an early embryo for a manuscript. So I sent it to some of the most active GoBiG members (Kaisa Thorell and Fredrik Boulund), who were positive regarding the idea to turn it into a manuscript. I wrote it together more properly and we decided that everyone who contributed with ideas at the meeting would be invited to become co-authors. We submitted the manuscript in early 2014, only to see it (rather brutally) rejected. At that point most of us were sucked up in their own projects, so nothing happened to this manuscript for over a year. Then we decided to give it another go, updated the manuscript heavily and changed a few parts to better reflect the current database situation (at this point, e.g., UniProt had already started implementing some of our suggested ideas). Still, some of the proposed strategies were more radical in 2013 than they would be now, more than three years later. We asked the Proteomics editors if they would be interested in the manuscript, and they turned out to be very positive. Indeed, the entire experience with the editors at Proteomics has been very pleasant. I am very thankful to the GoBiG team for this time, and to the editors at Proteomics who saw the value of this manuscript.

I have had the pleasure to be chosen as a speaker for next week’s (ten days from now) Swedish Bioinformatics Workshop. My talk is entitled “Turn up the signal – wipe out the noise: Gaining insights into bacterial community functions using metagenomic data“, and will largely deal with the same questions as my talk on EDAR3 in May this year. As then, the talk will highlight the some particular pitfalls related to interpretation of data, and exemplify how flawed analysis practices can result in misleading conclusions regarding community function, and use examples from our studies of environments subjected to pharmaceutical pollution in India, the effect of travel on the human resistome, and modern municipal wastewater treatment processes.

The talk will take place on Thursday, September 24, 2015 at 16:30. The full program for the conference can be found here. And also, if you want a sneak peak of the talk, you can drop by on Friday 13.00 at Chemistry and Molecular Biology, where I will give a seminar on the same topic in the Monthly Bioinformatic Practical Meetings series.

I will be giving a talk at the Third International symposium on the environmental dimension of antibiotic resistance (EDAR2015) next month (five weeks from now. The talk is entitled “Turn up the signal – wipe out the noise: Gaining insights into antibiotic resistance of bacterial communities using metagenomic data“, and will deal with handling of metagenomic data in antibiotic resistance gene research. The talk will highlight the some particular pitfalls related to interpretation of data, and exemplify how flawed analysis practices can result in misleading conclusions regarding antibiotic resistance risks. I will particularly address how taxonomic composition influences the frequencies of resistance genes, the importance of knowledge of the functions of the genes in the databases used, and how normalization strategies influence the results. Furthermore, we will show how the context of resistance genes can allow inference of their potential to spread to human pathogens from environmental or commensal bacteria. All these aspects will be exemplified by data from our studies of environments subjected to pharmaceutical pollution in India, the effect of travel on the human resistome, and modern municipal wastewater treatment processes.

The talk will take place on Monday, May 18, 2015 at 13:20. The full scientific program for the conference can be found here. Registration for the conference is still possible, although not for the early-bird price. I look forward to see a lot of the people who will attend the conference, and hopefully also you!

It’s been a while since the PETKit got any attention from me. Partially, that has been due to a nasty bug that could produce no output for one of the read files in Pefcon when using FASTA input files, but mostly it has simply been due to lack of time to continue development on the package. Now, I have finally put all threads together (bug fixes, new features, documentation) and today the 1.1 version is released! The new features are:

  • A new tool has been added – peacat – that can be used to e.g. stitch contigs together that have been separated for one reason or another in an assembly
  • Another tool – pemap – has been added that can be used to determine whether an assembled contig is from a circular DNA element
  • The default offset value for FASTQ files has been set to 33 (as in Sanger and Illumina 1.8+ PHRED format)
  • The documentation has been vastly improved (but is still rather inferior)

Some good and some bad news regarding the PETKit. Good news first; I have written a fourth tool for the PETKit, which is included in the latest release (version 1.0.2b, download here). The new tool is called Pesort, and sorts input read pairs (or single reads) so that the read pairs occur in the same order. It also sorts out which reads that don’t have a pair and outputs them to a separate file. All this is useful if you for some reason have ended up with a scrambled read file (pair). This can e.g. happen if you want to further process the reads after running Khmer or investigate the reads remaining after mapping to a genome.

Then the bad news. There’s a critical bug in PETKit version 1.0.1b. This bug manifest itself when using custom offsets for quality scores (using the –offset option), and makes the Pearf and Pepp tools too strict – leading to that they discard reads that actually are of good quality. This does not affect the Pefcon program. If you use the PETKit for read filtering or ORF prediction, and have used custom offset values, I recommend that you re-run your data with the newly released PETKit version (1.0.2b), in which this bug has been fixed. If you have only used the default offset setting, your safe. I sincerely apologize for any inconveniences that this might have caused.

You know the feeling when your assembler supports paired-end sequences, but your FASTQ quality filterer doesn’t care about what pairs that belong together? Meaning that you end up with a mess of sequences that you have to script together in some way. Gosh, that feeling is way too common. It is for situations like that I have put together the Paired-End ToolKit (PETKit), a collection of FASTQ/FASTA sequence handling programs written in Perl. Currently the toolkit contains three command-line tools that does sequence conversion, quality filtering, and ORF prediction, all adapted for paired-end sequences specifically. You can read more about the programs, which are released as open source software, on the PETKit page. At the moment they lack proper documentation, but running the software with the “–help” option should bring up a useful set of options for each tool. This is still considered beta-software, so any bug reports, and especially suggestions, are welcome.

Also, if you have an idea of another problem that is unsolved or badly executed for paired-end sequences, let me know, and I will see if I can implement it in PETKit.