Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg

Browsing Posts in Open Science

In two weeks' time, on the 15th of June, I will participate in a seminar organised by Landstingens nätverk för läkemedel och miljö (the Swedish county council network for pharmaceuticals and environment; the seminar will be held in Swedish) in Stockholm. I will give a talk on our proposed emission limits for antibiotics published last year (the paper is available here), but there will also be talks on wastewater treatment, sustainable pharmaceutical usage and environmental standards for pharmaceuticals. The full program can be found here, and you may register here until June 9. The seminar is free of charge.

And if you are interested in this, I can also recommend the webinar given by Healthcare Without Harm next week (on June 8), which will cover sustainable procurement as a means of addressing pharmaceutical pollution in the environment. I will at least tune in to hear how the discussion goes here.

This morning, as I was leaving my daughter at daycare, one of the other kids at kindergarten asked me what I do for work. Trying to communicate what you do as a researcher to a five-year-old is quite an interesting task. Five-year-olds are smart – but not very knowledgeable, which leads to very interesting turns in the conversation. Here’s the entire dialogue, transcribed from memory and translated to English:

– What do you do for work?
– I work at the hospital, but I’m not a doctor.
– So you are a psychologist?
– No, I am something called a researcher. I try to understand why bacteria turn evil and make us sick.
– Does someone need to do that?
– Not really. But if we can understand why bugs go bad, we may be able to be sick for a much shorter time in the future. Or perhaps not get sick at all.
– Okay. Isn’t that hard?
– Yes it is.
– Okay. Bye!

A few things I learned from this conversation: 1) Explaining your research to young kids really makes you think about how to present what you do. 2) Kids really question the usefulness of your work (“Does someone need to do that?”). This is actually quite cool, because you need to think about how useful your work really is, in terms that a five-year-old can understand. 3) Society is awesome! To some extent, my work is a “luxury job”, i.e. maybe nobody strictly needs to do my work, but it is something we can afford because we share responsibilities and work together as a society, improving (hopefully) the world for all of us. In some sense, nobody strictly needs to be building houses; everyone could just build their own cottage. But building houses improves the standards for everyone, setting time aside for curing diseases, making music, researching microbial interactions, gardening, coffee roasting… Society is awesome.

First of all, I am happy to announce that the webinar I participated in on the (un)recognised pathways of AMR: Air pollution and food, organised by Healthcare Without Harm, has now been put online so that you can view it in case you missed the event. To be honest, it is probably not one of my best public appearances, but the topic is highly interesting.

Second, next week I am taking part in Vetenskapsfestivalen – the Science Festival in Gothenburg. Specifically, I will be one of the researchers participating in the Science Roulette, taking place in the big Ferris wheel at Liseberg. This will take place between 17.00 and 18.00 on May 11th. The idea is that people will be paired with researchers in diverse subjects, of which I am one, and then have a 20-minute chat while the wheel is spinning. Sounds like potential for a lot of fun, and I hope to see you there! I will discuss antibiotic resistance, and for how much longer we can trust that our antibiotics will work.

I am happy to announce that our Viewpoint article on strategies for improving sequence databases has now been published in the journal Proteomics. The paper (1) defines some central problems hampering genomic, proteomic and metagenomic analyses and suggests five strategies to improve the situation:

  1. Clearly separate experimentally verified and unverified sequence entries
  2. Enable a system for tracing the origins of annotations
  3. Separate entries with high-quality, informative annotation from less useful ones
  4. Integrate automated quality-control software whenever such tools exist
  5. Facilitate post-submission editing of annotations and metadata associated with sequences

The paper is not long, so I encourage you to read it in its entirety. We believe that spreading this knowledge and pushing solutions to problems related to poor annotation metadata is vastly important in this era of big data. Although we specifically address protein-coding genes in this paper, the same logic also applies to other types of biological sequences. In this way the paper is related to my previous work with Henrik Nilsson on improving annotation data for taxonomic barcoding genes (2-4). This paper was one of the main end-results of the GoBiG network, and the backstory on the paper follows below the references…

References

  1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034
  2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
  3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
  4. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, 30, 2, 145–150 (2015). doi: 10.1264/jsme2.ME14121

Backstory
In June 2013, the Gothenburg Bioinformatics Group for junior scientists (GoBiG) arranged a workshop with two themes: “Parallelized quantification of genes in large metagenomic datasets” and “Assigning functional predictions to NGS data”. The discussion that followed, on how database quality influenced results and what could be done to improve the situation, was rather intense, and several good ideas were thrown around. I took notes from the meeting, and in the evening I wrote them up on the balcony during a warm summer night. In fact, the notes were good enough to be an early embryo for a manuscript. So I sent them to some of the most active GoBiG members (Kaisa Thorell and Fredrik Boulund), who were positive regarding the idea of turning them into a manuscript. I wrote it up more properly and we decided that everyone who contributed ideas at the meeting would be invited to become co-authors. We submitted the manuscript in early 2014, only to see it (rather brutally) rejected. At that point most of us were caught up in our own projects, so nothing happened to this manuscript for over a year. Then we decided to give it another go, updated the manuscript heavily and changed a few parts to better reflect the current database situation (at this point, e.g., UniProt had already started implementing some of our suggested ideas). Still, some of the proposed strategies were more radical in 2013 than they would be now, more than three years later. We asked the Proteomics editors if they would be interested in the manuscript, and they turned out to be very positive. Indeed, the entire experience with the editors at Proteomics has been very pleasant. I am very thankful to the GoBiG team for this time, and to the editors at Proteomics who saw the value of this manuscript.

It is nice to see that Indian media has picked up the story about antibiotic resistance genes in the heavily polluted Kazipally lake. In this case, it is the Deccan Chronicle who have been reporting on our findings and briefly interviewed Prof. Joakim Larsson about the study. The issue of pharmaceutical pollution of the environment in drug-producing countries is still rather under-reported and public perception of the problem might be rather low. Therefore, it makes me happy to see an Indian newspaper reporting on the issue. The scientific publication referred to can be found here.

Those of you who find that ITSx runs slowly despite being assigned multiple CPUs, particularly on datasets with only one kind of sequence (e.g. fungal) using the -t F option, might be interested in trying out Andrew Krohn’s parallel ITSx implementation. The solution essentially employs a bash script spawning multiple ITSx instances running on different portions of the input file. Although there are some limitations to the script (e.g. you cannot select a custom name for the output and, as far as I understand the script, you will only get the ITS1, ITS2 and full sequence FASTA files), it may prove useful for many of you until we write up a proper solution to the poor multi-thread performance of ITSx (planned for version 1.1). In the coming months, I recommend that you check this solution out! See also the wiki documentation.
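To illustrate the general idea of running on "different portions of the input file", here is a minimal Python sketch of the splitting step only. It is not Andrew Krohn's script: the function names (`read_fasta`, `split_records`, `to_fasta`) are made up for this example, and the round-robin distribution is just one simple way to divide the records. Each resulting chunk would then be written to its own file and handed to a separate ITSx process.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin over at most n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    # Drop empty chunks (fewer records than requested chunks)
    return [chunk for chunk in chunks if chunk]


def to_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


if __name__ == "__main__":
    example = [">seq1", "ACGT", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]
    for number, chunk in enumerate(split_records(read_fasta(example), 2), 1):
        print(f"--- chunk {number} ---")
        print(to_fasta(chunk), end="")
```

Because every sequence is processed independently, the per-chunk outputs can simply be concatenated afterwards, which is what makes this kind of splitting attractive.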

My speed tests show the following (on a quite small test set of fungal ITS sequences):
ITSx parallel on 16 CPUs, all ITS types (option “-t all”): 3 min, 16 sec
ITSx parallel on 16 CPUs, only fungal ITS types (option “-t f”): 54 sec
ITSx native on 16 CPUs, all ITS types (options “-t all --cpu 16”): 4 min, 59 sec
ITSx native on 16 CPUs, only fungal ITS types (options “-t f --cpu 16”): 5 min, 50 sec

Why the fungal-only run took longer than the all-types run in the native implementation is a mystery to me, but it probably shows why there is a need to rewrite the multithreading code, as we did with Metaxa a couple of years ago. Stay tuned for ITSx updates!
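For those who prefer ratios to raw times, the timings above can be converted to seconds and compared directly. The small Python snippet below does only that arithmetic; the dictionary layout and function names are my own choices for the illustration, and the numbers come from a single run on a small test set, so they should be read as rough indications rather than a benchmark.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min, sec' timing to seconds."""
    return 60 * minutes + seconds


# Wall-clock times from the test above, keyed by (implementation, -t option)
timings = {
    ("parallel", "all"): to_seconds(3, 16),
    ("parallel", "f"): to_seconds(0, 54),
    ("native", "all"): to_seconds(4, 59),
    ("native", "f"): to_seconds(5, 50),
}


def speedup(profile):
    """How many times faster the parallel script was than native ITSx."""
    return timings[("native", profile)] / timings[("parallel", profile)]


for profile in ("all", "f"):
    print(f"-t {profile}: parallel script is {speedup(profile):.1f}x faster")
```

This gives roughly 1.5x for all ITS types and 6.5x for the fungal-only run, so the gain from the parallel script is largest precisely where the native multithreading behaves worst.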

A couple of days ago, a paper I have co-authored describing an ITS sequence dataset for chimera control in fungi went online as an advance online publication in Microbes and Environments. There are several software tools available for chimera detection (e.g. Henrik Nilsson’s fungal chimera checker (1) and UCHIME (2)), but these generally rely on the presence of a chimera-free reference dataset. Until now, no such dataset existed for the fungal ITS region. In this paper (3), we introduce a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database (4). This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. We estimated the dataset performance on a large set of artificial chimeras to be above 99.5%, and also used the dataset to remove nearly 1,000 chimeric fungal ITS sequences from the UNITE database. The dataset can be downloaded from the UNITE repository. This also makes it possible for users to curate the dataset in the future through the UNITE interactive editing tools.

References:

  1. Nilsson RH, Abarenkov K, Veldre V, Nylinder S, Wit P de, Brosché S, Alfredsson JF, Ryberg M, Kristiansson E: An open source chimera checker for the fungal ITS region. Molecular Ecology Resources, 10, 1076–1081 (2010).
  2. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R: UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, 27, 16, 2194–2200 (2011). doi: 10.1093/bioinformatics/btr381
  3. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, Advance Online Publication (2015). doi: 10.1264/jsme2.ME14121
  4. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481

In an interesting development, Nature Publishing Group has launched a new initiative: Scientific Data – an online-only open access journal that publishes data sets without the demand of testing scientific hypotheses in connection to the data. That is, the data itself is seen as the valuable product, not any findings that might result from it. There is an immediate upside to this: large scientific data sets might become accessible to the research community in a way that enables proper credit for the sample collection effort. Since there is no demand for a full analysis of the data, the data itself might be of use to others more quickly, without worrying that someone else might steal the bang of the data per se. I also see a possible downside, though. It would be easy to hold on to the data until you have analyzed it yourself, and then release it separately just about when you submit the paper on the analysis, generating extra papers and citation counts. I don’t know if this is necessarily bad, but it seems it could contribute to “publishing unit dilution”. Nevertheless, I believe that this is overall a good initiative, although how well it actually works will be up to us – the scientific community. Some info copied from the journal website:

Scientific Data’s main article-type is the Data Descriptor: peer-reviewed, scientific publications that provide an in-depth look at research datasets. Data Descriptors are a combination of traditional scientific publication content and structured information curated in-house, and are designed to maximize reuse and enable searching, linking and data mining. (…) Scientific Data aims to address the increasing need to make research data more available, citable, discoverable, interpretable, reusable and reproducible. We understand that wider data-sharing requires credit mechanisms that reward scientists for releasing their data, and peer evaluation mechanisms that account for data quality and ensure alignment with community standards.

I read an interesting note today in Nature regarding the willingness to review papers. The author of the note (Dan Graur) claims that scientists who publish many papers contribute less to peer review, and proposes a system in which “journals should ask senior authors to provide evidence of their contribution to peer review as a condition for considering their manuscripts.” I think that this is a very interesting thought, but I see other problems coming with it. Let us for example assume that a senior author is neglecting peer review not to be evil, but simply due to an already monumental workload. If we force peer review on such a person, what kind of reviews do we expect to get back? Will this person be able to fulfill a proper, high-quality peer review assignment? I doubt it.

On the other hand, I don’t have a good alternative either. If no one wants to do the peer reviewing, that system will inevitably break down. However, I think that it would be better to encourage peer review with positive bonuses rather than pressure – maybe faster handling times and higher priority for papers with authors who have done their share of peer reviewing in the last two years? Maybe cheaper publishing costs? In any case, I welcome that the subject is brought up for debate, since it is immensely important for the way we perform science today. Thanks Dan!

I have recently started to receive requests for full-text versions of my publications on ResearchGate. That’s great, but I have yet to figure out how to send them over, without breaking any agreements. As I am in a somewhat intensive work-period at the moment, please forgive me for not spending time on ResearchGate right now. And if you would like full-text versions of my publications, please send me an e-mail! I’ll be glad to help!