I am happy to announce that our Viewpoint article on strategies for improving sequence databases has now been published in the journal Proteomics. The paper (1) defines some central problems hampering genomic, proteomic and metagenomic analyses and suggests five strategies to improve the situation:
- Clearly separate experimentally verified and unverified sequence entries
- Enable a system for tracing the origins of annotations
- Separate entries with high-quality, informative annotation from less useful ones
- Integrate automated quality-control software whenever such tools exist
- Facilitate post-submission editing of annotations and metadata associated with sequences
The paper is not long, so I encourage you to read it in its entirety. We believe that spreading this knowledge and pushing for solutions to the problems caused by poor annotation metadata is hugely important in this era of big data. Although we specifically address protein-coding genes in this paper, the same logic also applies to other types of biological sequences. In this way, the paper is related to my previous work with Henrik Nilsson on improving annotation data for taxonomic barcoding genes (2-4). This paper was one of the main end results of the GoBiG network, and the backstory of the paper follows below the references…
1. Bengtsson-Palme J, Boulund F, Edström R, Feizi A, Johnning A, Jonsson VA, Karlsson FH, Pal C, Pereira MB, Rehammar A, Sánchez J, Sanli K, Thorell K: Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, Early view (2016). doi: 10.1002/pmic.201600034
2. Kõljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AFS, Bahram M, Bates ST, Bruns TT, Bengtsson-Palme J, Callaghan TM, Douglas B, Drenkhan T, Eberhardt U, Dueñas M, Grebenc T, Griffith GW, Hartmann M, Kirk PM, Kohout P, Larsson E, Lindahl BD, Lücking R, Martín MP, Matheny PB, Nguyen NH, Niskanen T, Oja J, Peay KG, Peintner U, Peterson M, Põldmaa K, Saag L, Saar I, Schüßler A, Senés C, Smith ME, Suija A, Taylor DE, Telleria MT, Weiß M, Larsson KH: Towards a unified paradigm for sequence-based identification of Fungi. Molecular Ecology, 22, 21, 5271–5277 (2013). doi: 10.1111/mec.12481
3. Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity, 67, 1, 11–19 (2014). doi: 10.1007/s13225-014-0291-8
4. Nilsson RH, Tedersoo L, Ryberg M, Kristiansson E, Hartmann M, Unterseher M, Porter TM, Bengtsson-Palme J, Walker D, de Sousa F, Gamper HA, Larsson E, Larsson K-H, Kõljalg U, Edgar R, Abarenkov K: A comprehensive, automatically updated fungal ITS sequence dataset for reference-based chimera control in environmental sequencing efforts. Microbes and Environments, 30, 2, 145–150 (2015). doi: 10.1264/jsme2.ME14121
In June 2013, the Gothenburg Bioinformatics Group for junior scientists (GoBiG) arranged a workshop with two themes: “Parallelized quantification of genes in large metagenomic datasets” and “Assigning functional predictions to NGS data”. The discussion that followed, on how database quality influenced results and what could be done to improve the situation, was rather intense, and several good ideas were thrown around. I took notes from the meeting, and that evening I wrote them up on the balcony during a warm summer night. In fact, the notes were good enough to be an early embryo of a manuscript, so I sent them to some of the most active GoBiG members (Kaisa Thorell and Fredrik Boulund), who were positive about the idea of turning them into a manuscript. I wrote it up more properly, and we decided that everyone who had contributed ideas at the meeting would be invited to become a co-author.
We submitted the manuscript in early 2014, only to see it (rather brutally) rejected. At that point most of us were caught up in our own projects, so nothing happened to the manuscript for over a year. Then we decided to give it another go, updated the manuscript heavily and changed a few parts to better reflect the current database situation (at this point, e.g., UniProt had already started implementing some of our suggested ideas). Still, some of the proposed strategies were more radical in 2013 than they are now, more than three years later. We asked the Proteomics editors if they would be interested in the manuscript, and they turned out to be very positive. Indeed, the entire experience with the editors at Proteomics has been very pleasant. I am very thankful to the GoBiG team for this time, and to the editors at Proteomics who saw the value of this manuscript.
This has happened along with the 25th release of the Pfam database (released 1st of April), and basically means that Wikipedia articles will be linked to Pfam families. Gradually, this will (hopefully) improve the annotation of Pfam families, which has in many cases been rather poor. The Xfam blog post related to Pfam release 25 says the change will happen gradually, which might actually be a good thing, given the quirks that might pop up.
(…) a major change is that Pfam annotation is now beginning to be co-ordinated via Wikipedia. Unlike Rfam, where every entry has a Wikipedia entry, we expect this to be a more gradual transition for Pfam, so not all entries currently have a corresponding Wikipedia article. For a more detailed discussion, check the help page. We actively encourage the addition of new/updated annotations via Wikipedia as they will appear far quicker than waiting for a Pfam release. If there are articles in Wikipedia that you think correspond to a family, then please mail us!
I have awaited this change for a long time, and am very happy that Pfam has finally taken this step. Congratulations and my sincerest thanks to the Pfam team! Now, let’s go editing!
In December, Alex Bateman, whose opinions on open science I support and have touched upon earlier, wrote a short correspondence letter to Nature in which he repeated the points of his talk at FEBS last summer. He concludes with the paragraph:
Many in the scientific community will admit to using Wikipedia occasionally, yet few have contributed content. For society’s sake, scientists must overcome their reluctance to embrace this resource.
I agree with this statement. However, as I also touched upon earlier, but would like to repeat – bold statements don’t make dreams come true – action does. Rfam, with its collaboration with RNA Biology and Wikipedia, is a great example of such action. So what other actions may be necessary to get researchers to contribute to the Wikipedian wisdom?
First of all, I do not think that the main obstacle to getting researchers to edit Wikipedia articles is a reluctance to do so because Wikipedia is “inconsistent with traditional academic scholarship”, though that might be a partial explanation. What I think is the major problem is the time-reward tradeoff. Given the focus on publishing peer-reviewed articles, the race for higher impact factors, and the general tendency to measure science by statistical measures, it should be no surprise that Wikipedia editing is far down on most scientists’ to-do lists, including mine. The reward for editing a Wikipedia article is a good feeling in your stomach that you have benefitted society. Good stomach feelings will, however, feed my children just as little as freedom of speech. Still, both Wikipedia editing and freedom of speech are extremely important, especially for a scientist.
Thus, there is a great need for a system that:
- Provides a reward or acknowledgement for Wikipedia editing.
- Makes Wikipedia editing economically sustainable.
- Encourages publishing of Wikipedia articles, or contributions to existing ones, as part of the scientific publishing process.
Such a system could include a “contribution factor” similar to the impact factor, in which contributions to Wikipedia and other open access forums were weighted, with or without a usefulness measure. Such a usefulness measure could easily be determined by links from other Wikipedia articles, or similar. I realise that there would be severe drawbacks to such a system, similar to those of the impact factor system. I am not a huge fan of impact factors (read e.g. Per Seglen’s 1997 BMJ article for some reasons why), but I do not see that system changing any time soon, and thus some kind of contribution factor could provide an additional statistical measure for evaluators to consider when examining scientists’ work.
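To make the idea a bit more concrete, here is a minimal sketch of how such a contribution factor could be computed. Everything in it is an assumption made for illustration: the `Contribution` record, the per-forum weights, and the choice to derive usefulness from the logarithm of the number of inbound links are all hypothetical, and nothing like this exists as an established metric.

```python
import math
from dataclasses import dataclass


@dataclass
class Contribution:
    """One contribution to an open forum (hypothetical record layout)."""
    article: str
    forum_weight: float   # assumed weight of the forum, e.g. 1.0 for Wikipedia
    inbound_links: int    # links from other articles pointing to this one


def usefulness(contribution: Contribution) -> float:
    # Assumed usefulness measure: grows slowly with the number of inbound
    # links, and is 1.0 for an article that nothing links to yet.
    return 1.0 + math.log1p(contribution.inbound_links)


def contribution_factor(contributions, use_usefulness=True) -> float:
    """Sum the weighted contributions, with or without the usefulness measure."""
    total = 0.0
    for c in contributions:
        score = c.forum_weight
        if use_usefulness:
            score *= usefulness(c)
        total += score
    return total


# Made-up example data, for illustration only.
edits = [
    Contribution("Example RNA family", forum_weight=1.0, inbound_links=0),
    Contribution("Example protein family", forum_weight=0.5, inbound_links=10),
]
print(contribution_factor(edits, use_usefulness=False))
print(contribution_factor(edits))
```

The logarithm is just one way to keep a few heavily linked articles from dominating the score; any such choice would carry the same kind of drawbacks as the impact factor itself.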
While a contribution factor would be an incentive for researchers to contribute to the common knowledge, it would still not provide an economic reason to do so. This could easily be changed by allowing, and maybe even requiring, scientists to contribute to Wikipedia and other public fora of scientific information as part of their science outreach duties. In fact, this public outreach duty (“tredje uppgiften” in Swedish) is regulated by Swedish law. In 2009, the universities in Sweden were assigned to “collaborate with the society and inform about their operations, and act such that scientific results produced at the university benefit society” (my translation). It seems rational that Wikipedia editing would be part of that duty, as that is the place where many (most?) people find information online today. Consequently, it is only up to the universities to demand 30 minutes of Wikipedia editing per week/month from their employees. Note here that I am referring to paid editing.
Another way of increasing the economic appeal of writing Wikipedia articles would be to encourage funding agencies and foundations to demand Wikipedia articles or similar as part of project reports. This would require researchers to make their findings public in order to get further funding, a move that would greatly raise the importance of adding to the common treasure of knowledge. However, I suspect that many funding agencies, as well as researchers, would be reluctant to accept such a solution.
Lastly, as shown by the Rfam/RNA Biology/Wikipedia relationship, scientific publishing itself could be tied to Wikipedia editing. This process could be started by e.g. open access journals such as PLoS ONE, either by demanding short Wikipedia notes to get an article published, or by simply providing prioritised publishing of articles which also have an accompanying Wiki-article. As mentioned previously, these short Wikipedia notes would also go through a peer-review process along with the full article. By tying this to the contribution factor, further incentives could be provided to get scientific progress into the hands of the general public.
Now, all these ideas put a huge burden on already hard-working scientists. I realise that they cannot all be introduced simultaneously. Opening up publishing requires time and thought, and should be done in small steps. But doing so is in the interest of scientists, the general public and the funders, as well as politicians. Because in the long run it will be hard to argue that society should pay for science when scientists are reluctant to even provide the public with an understandable version of the results. Instead of digging such a hole for ourselves, we should adapt the reward, evaluation, funding and publishing systems in a way that they benefit both researchers and the society we often say we serve.
- Bateman and Logan. Time to underpin Wikipedia wisdom. Nature (2010) vol. 468 (7325) pp. 765
- Seglen. Why the impact factor of journals should not be used for evaluating research. BMJ (1997) vol. 314 (7079) pp. 498-502
Nature recently had a nice news article on Bio-wikis and biological databases connected to Wikipedia, where Alex Bateman says they’re working on a protein-family wiki that will be hosted on Wikipedia, similar to the Rfam wiki, which he talked about at FEBS this summer. I am of course very excited about this, and hope that the new Pfam (?) wiki will come sooner rather than later. As pointed out earlier, the Nature article also underlines the problem with scientific wikis: currently there is no career incentive for researchers to spend their time editing wiki-articles. That is a shame, and perhaps a system copying that of Rfam and RNA Biology could help in this direction. The only question is which journal(s) would be interested in such a commitment to open science…
I listened to a great talk by Alex Bateman (one of the guys behind Pfam and Rfam, as well as involved in HMMER development) at FEBS yesterday. In addition to talking about the problems of ever-increasing amounts of sequence data, Alex also provided some reflections on cooperation and knowledge-sharing – not only among fellow researchers, but also with a wider audience. The starting point of this discussion is Rfam, where the annotation of RNA families is entirely based on a community-driven wiki, tightly integrated with Wikipedia. This means that when a change is made in the Rfam annotation, the same change is also made on the corresponding Wikipedia page for that RNA family. And what’s the use of this? Well, as Alex says, for most of the keywords in molecular biology (and I would guess in all of science), the top hit on Google will be a Wikipedia entry. If not, the Wikipedia entry will be in the top ten list of hits, if a good Wiki page exists. This means that Wikipedia is the primary source of scientific information for the general public, as well as for many scientists. Wikipedia – not scientific journals.
The consequence of this is that to communicate your research subject, you should contribute to its Wikipedia page. In fact, Bateman argues, we have a responsibility as scientists to provide accurate and correct information to the public through the best sources available, which in most cases would be Wikipedia. To put this in perspective (and here I once again borrow Alex’s words), if somebody had told you ten years ago that there would be one single internet site that everybody would visit to find scientific information, and where discussion and continuous improvement would be allowed, encouraged and performed, most people would have said that was too good to be true. But that’s what Wikipedia offers. It is time to get rid of the Wiki-scepticism, and start improving it.
And so, what about the future of publishing? Bateman has worked hard to form an agreement with the journal RNA Biology to integrate publishing into the process of adding to the easily accessible public information. To have an article on a new RNA family published under the journal’s RNA families track, the family must not only be submitted to the Rfam database, but the authors must also provide a Wikipedia-formatted article, which undergoes the same peer-review process as the journal article. This ensures high-quality Wikipedia material, as well as making new scientific discoveries public.
I don’t think it is much of a stretch to guess that in the future, more journals and/or funding agencies will take similar approaches, as researchers and decision-makers discover the importance of correct, publicly available information. The scientific world is slowly moving towards being more open, also to non-scientists. This openness is extremely important in these times of climate scepticism, GMO controversy, extinction of species, and nuclear power debate. For the public to make proper decisions and send a clear message to the politicians, scientists need to be much better at communicating the current state of knowledge, or what many people prefer to call “truth”.