An update to Megraft

You might remember that, a long time ago, I promised a minor update to Megraft. I then forgot to actually post the update. So it’s very much about time: here is the updated 1.0.2 version of Megraft. New in this version is improved handling of sequences containing N’s (unknown bases), and improved handling of sequences with strange sequence IDs (which sometimes confused Megraft 1.0.1). The update can be downloaded here.

Metagenomics and the Hype Cycle

I was creating the diagram below for an upcoming presentation, and I realized that the exponential growth in published metagenomics papers might be coming to an end. Interestingly enough, the small drop in growth pace over recent years (701 -> 983 -> 1148) reminds me of the Hype Cycle, where we would (if my projection holds) have reached the “Peak of Inflated Expectations”. That would mean we will see a rapid drop in the number of metagenomics publications in the next few years, as the field moves on.

The thought is interesting, but it seems a little early to draw any conclusions from the number of publications yet. It is still kind of strange to note, though, that more than 20% of metagenomics publications (740/3547) are review papers. Come on, let’s do some science first and then review it… Anyway, it’ll be interesting to see what 2013 has in store for us.

PETKit updated – Critical bug fix

Some good and some bad news regarding the PETKit. Good news first: I have written a fourth tool for the PETKit, which is included in the latest release (version 1.0.2b, download here). The new tool is called Pesort, and it sorts input read pairs (or single reads) so that paired reads occur in the same order in both files. It also identifies which reads don’t have a pair and outputs them to a separate file. All this is useful if you for some reason have ended up with a scrambled read file pair. This can happen e.g. if you want to further process the reads after running Khmer, or investigate the reads remaining after mapping to a genome.
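The idea behind Pesort is simple enough to sketch. Below is a minimal, hypothetical Python re-implementation of the pairing logic (the real Pesort is a Perl program with its own options and output conventions, so treat this only as an illustration of the approach): index one file’s reads by ID, stream through the other, and collect orphans separately.

```python
import re

# Hypothetical sketch of the pair-sorting idea (the real Pesort is a Perl
# tool). Reads from file 2 are indexed by ID; file 1 is then streamed
# through, emitting pairs in matching order and collecting orphans.

def read_fastq(path):
    """Yield (read_id, full_record) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            record = header + fh.readline() + fh.readline() + fh.readline()
            # Drop the "@" and any /1 or /2 mate suffix so mates share an ID
            read_id = re.sub(r"/[12]$", "", header.split()[0][1:])
            yield read_id, record

def pair_sort(fastq1, fastq2, out1, out2, singles):
    mates = dict(read_fastq(fastq2))
    orphans = []
    with open(out1, "w") as o1, open(out2, "w") as o2:
        for read_id, record in read_fastq(fastq1):
            mate = mates.pop(read_id, None)
            if mate is None:
                orphans.append(record)      # no partner in file 2
            else:
                o1.write(record)            # pairs come out in the same order
                o2.write(mate)
    with open(singles, "w") as s:
        for record in orphans:
            s.write(record)
        for record in mates.values():       # reads never seen in file 1
            s.write(record)
```

Note that this sketch holds one entire file in memory, which is fine for modest datasets but something a production tool would have to think harder about.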

Then the bad news. There’s a critical bug in PETKit version 1.0.1b. This bug manifests itself when using custom offsets for quality scores (using the --offset option), and makes the Pearf and Pepp tools too strict, causing them to discard reads that actually are of good quality. The Pefcon program is not affected. If you use the PETKit for read filtering or ORF prediction and have used custom offset values, I recommend that you re-run your data with the newly released PETKit version (1.0.2b), in which this bug has been fixed. If you have only used the default offset setting, you’re safe. I sincerely apologize for any inconvenience this might have caused.
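For context on why the offset matters: FASTQ encodes each Phred quality score as a single ASCII character, and the numeric score is recovered as the character’s code minus the offset (33 in Sanger/Illumina 1.8+ files, 64 in older Illumina files). A minimal illustration (not PETKit code, just the general principle):

```python
# FASTQ stores Phred quality scores as single ASCII characters:
#   Q = ord(char) - offset
# Sanger / Illumina 1.8+ files use offset 33; older Illumina files use 64.
# Decoding with the wrong offset shifts every score by 31, which is how an
# offset-handling bug can make a filter discard perfectly good reads.

def decode_qualities(qual_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(c) - offset for c in qual_string]

def mean_quality(qual_string, offset=33):
    """Mean Phred score of a quality string."""
    scores = decode_qualities(qual_string, offset)
    return sum(scores) / len(scores)
```

For example, the quality string "IIII" decodes to a mean score of 40 with offset 33, but only 9 if it is mistakenly decoded with offset 64.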

Metaxa updated to version 1.1.2

Some users have asked me to fix a table output bug in Metaxa, and I have finally gotten around to doing so. The fix is released today in the 1.1.2 Metaxa package (download here). This version also brings an updated manual (finally), as the User’s Guide had lagged behind since version 1.0. Please continue to report bugs to metaxa [at sign] microbiology [dot] se

Download the Metaxa package

Read the manual

Bloutminer updated

Good news for everyone using my bloutminer script: it has received an update making it even more useful! Basically, I have added a function to extract the top N matches for each query (using the -n option), and I have also added the ability to output a filtered set of sequences in the same tabular BLAST format that the input came in. bloutminer can thereby now be used in more settings to easily filter out a subset of a large BLAST report (in tabular format, generated using the blastall -m 8 option). The script can be downloaded here: https://microbiology.se/software/
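The top-N idea is straightforward once you know the tabular BLAST layout: one hit per line, tab-separated, with the query ID in column 1 and the bit score in column 12. A hypothetical sketch of that filtering step (not bloutminer’s actual code) might look like this:

```python
from collections import defaultdict

# Hypothetical sketch of the "top N matches per query" filter (the real
# bloutminer is a separate script with its own options). Tabular BLAST
# output (blastall -m 8) has one hit per line; field 1 is the query ID
# and field 12 the bit score.

def top_n_per_query(blast_lines, n):
    """Return the n highest-scoring hits for each query, as raw lines."""
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blanks and comment lines
        fields = line.split("\t")
        query, bitscore = fields[0], float(fields[11])
        hits[query].append((bitscore, line))
    out = []
    for query in hits:
        # Sort each query's hits by bit score, descending, and keep n
        for _, kept in sorted(hits[query], reverse=True)[:n]:
            out.append(kept)
    return out
```

Because the output lines are the unmodified input lines, the result stays in the same tabular format and can be fed straight into downstream tools.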

Introducing the PETKit

You know the feeling when your assembler supports paired-end sequences, but your FASTQ quality filterer doesn’t care about which reads belong together as pairs? Meaning that you end up with a mess of sequences that you have to script back together in some way. Gosh, that feeling is way too common. It is for situations like that that I have put together the Paired-End ToolKit (PETKit), a collection of FASTQ/FASTA sequence handling programs written in Perl. Currently the toolkit contains three command-line tools that do sequence conversion, quality filtering, and ORF prediction, all adapted specifically for paired-end sequences. You can read more about the programs, which are released as open source software, on the PETKit page. At the moment they lack proper documentation, but running the software with the “--help” option should bring up a useful set of options for each tool. This is still considered beta software, so any bug reports, and especially suggestions, are welcome.

Also, if you have an idea of another problem that is unsolved or badly executed for paired-end sequences, let me know, and I will see if I can implement it in PETKit.

Looking for a job?

The Core Facilities at Sahlgrenska are looking for a skilled bioinformatician who can support research projects employing the Core Facilities’ services. The employee will e.g. deal with setting up analysis pipelines for next-generation sequencing data. They (of course) want an experienced bioinformatician, who also knows programming (Java, C and/or C++, and scripting languages such as Perl or Python). It is also preferable if the applicant knows how to set up secure systems and is comfortable working in the Unix/Linux terminal. More on the position can be found at GU’s web site. The application period closes on the 17th of September.

Published paper: Guidelines for DNA quality checking

I have co-authored a paper, together with, among others, Henrik Nilsson, that was published today in MycoKeys. The paper deals with checking the quality of DNA sequences prior to using them for research purposes. In our opinion, a lot of the software available for sequence quality management is rather complex and resource-intensive. Not everyone has the skills to master such software, and in addition, computational resources might be scarce. Luckily, a lot of DNA sequence quality control can be done using just manual means and a web browser. This paper puts these means together into one comprehensible and easy-to-digest document. Our target audience is primarily biologists who do not have a strong background in computer science, but still have a dataset requiring DNA sequence quality control.

We have chosen to focus on the fungal ITS barcoding region, but the guidelines should be fairly general and applicable to most groups of organisms. In short, our five guidelines are:

  1. Establish that the sequences come from the intended gene or marker
    This can be done by making a multiple alignment of the sequences and verifying that they all feature some suitable, conserved sub-region (the 5.8S gene in the ITS case)
  2. Establish that all sequences are given in the correct (5’ to 3’) orientation
    Examine the alignment for any sequences that do not align at all to the others; re-orient these; re-run the alignment step; and examine them again
  3. Establish that there are no (at least bad cases of) chimeras in the dataset
    Run the sequences through BLAST against one of the large sequence databases, e.g. at NCBI (or, in the ITS case, the UNITE database), and verify that the best match comprises more or less the full length of the query sequence
  4. Establish that there are no other major technical errors in the sequences
    Examine the BLAST results carefully, particularly the graphical overview and the pairwise alignment, for anomalies (there are some nice figures in the paper showing what it should and should not look like)
  5. Establish that any taxonomic annotations given to the sequences make sense
    Examine the BLAST hit list to see that the species names produced make sense
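As a small aside on guideline 2: re-orienting a sequence simply means taking its reverse complement, which is trivial to do programmatically if the manual route gets tedious for large alignments. A minimal sketch (not from the paper, which deliberately sticks to manual and web-based means):

```python
# Re-orienting a DNA sequence (guideline 2) = taking its reverse complement.
# Ambiguity codes such as N are left unchanged by this minimal mapping.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A<->T, C<->G)."""
    return seq.translate(COMPLEMENT)[::-1]
```

After re-orienting the offending sequences, re-run the alignment step and check them again, as described above.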

A much more thorough description of these guidelines can be found in the paper itself, which is available under open access from MycoKeys. There’s simply no reason not to go there and at least take a look at it. Happy quality control!

Reference
Nilsson RH, Tedersoo L, Abarenkov K, Ryberg M, Kristiansson E, Hartmann M, Schoch CL, Nylander JAA, Bergsten J, Porter TM, Jumpponen A, Vaishampayan P, Ovaskainen O, Hallenberg N, Bengtsson-Palme J, Eriksson KM, Larsson K-H, Larsson E, Kõljalg U: Five simple guidelines for establishing basic authenticity and reliability of newly generated fungal ITS sequences. MycoKeys. Issue 4 (2012), 37–63. doi: 10.3897/mycokeys.4.3606 [Paper link]

Improving Swedish research – is there a need for a research elite?

I know that this is not supposed to be a political page, but writing this up, I realized that there is no way I can keep my political views entirely out of this post. So just a quick warning: the following text contains political opinions, and is a reflection of my views and beliefs rather than well-supported facts.

So, the Swedish minister for education, Jan Björklund, has announced the government’s plan to spend 3 billion SEK (~350 million EUR, ~450 million USD) on “elite” researchers over the next ten years. One main reason for doing so is to strengthen Swedish research in competition with American universities, and to be able to recruit top researchers from other countries to Sweden. While I welcome the prospect of more money for research, I have to say I am very skeptical about how this money is to be distributed. First of all, giving more money to the researchers that have already succeeded (I guess this is how you would define elite researchers – if someone has a better idea, please tell both me and Jan Björklund) is not going to generate more innovative research – just more of the same (or similar) things these researchers already do. If the government is serious in its claim that Swedish research has a lower-than-expected output (a questionable statement in itself), the best way of increasing that output would be to give more researchers the opportunity to put their ideas into action. Second, a huge problem for research in Sweden is that a lot of the scientists’ time is spent on other things – writing grant applications, administering courses, filling in forms, etc. Therefore, one way of improving research would be to put more money into funding at the university administration level, so that researchers actually have time to do what they are supposed to do. Below is my own four-point program for how I think Sweden should move forward to improve its scientific output.

1. Researchers need more time
My first point is that researchers need more time to do what they are supposed to do – science. This means that they cannot be expected to apply for money from six different research foundations every year, just to receive a very small amount of money that will keep them from getting thrown out for another 8 months. The short-term contracts that are currently the norm in Sweden create a system where way too much time is spent on writing grant applications – the majority of which will not succeed. In addition, researchers are often expected to be their own secretaries, as well as to organize courses (not only lecture in them). To solve this we need:

  • Longer contracts for scientists. A grant should be large enough to secure five years of salary, plus equipment costs. This allows for some time to actually get the science done, not just the time to write the next application.
  • Grants that come with a guaranteed five-year extension for projects that have fulfilled their goals in the first five years. This further secures the longevity of researchers and their projects. It would also allow universities to actually employ scientists, instead of the current system which is all about trying to work around the employment rules.
  • More money to university administration. It is simply more cost-efficient to have a secretary handling non-science-related tasks in the department or group, and financial staff handling the finances. The current system expects every researcher to be a jack of all trades – which effectively reduces one to a master of none. More money to administration means more time spent on research.

2. Broad funding creates a foundation for success
Another problem is that if only a few projects are funded repeatedly, the success of Swedish research is very much bound to the success of those projects. While large-scale and high-cost projects are definitely needed, there is also a need to invest in a variety of projects. Many applied ideas have originated from very non-applied research, and applied research needs fundamental research in order to move forward. However, in the shortsighted governmental view of science, the output has to be almost immediate, which means that applied projects are much more likely to be funded. Thus, projects that could make fundamental discoveries, but are more complicated and take longer, will be down-prioritized by both researchers and universities. To make the situation even worse, Björklund et al. have promised more money to universities that cut out non-productive research, which almost guarantees that any project with a ten-year timeframe will not even be started.

If we are serious about making Swedish research successful, we need to do exactly the opposite: fund a lot of different projects, both applied and fundamental, regardless of their short-term value. The ideas most likely to produce short-term results are probably also the least innovative in the long term. Consequently, we need to:

  • Spend research funding on a variety of projects, both of fundamental and applied nature.
  • Secure funding for “crazy” projects that span long periods of time, at least five to ten years.

3. If we don’t dare to fail, we will not have a chance to win
Finally, research funding must become better at taking risks. If we only bet our money on the most successful researchers, there is absolutely no chance for young scientists to get funded – unless, of course, they have been picked up by one of the right supervisors. This means that the same ideas get disseminated through the system over and over again, at the expense of more innovative ideas that could pop up in groups with less money to realize them. If these untested ideas in smaller groups get funded, some of them will undoubtedly fail to produce research of high societal value. But some of them will likely develop entirely new ideas, which in the long term might be much more fruitful than throwing money at the same groups over and over again. Suggestions:

  • Spend research funding broadly and with an active risk-gain management strategy.
  • Allow fundamental research to investigate completely new concepts – even if they are previously untested, and regardless of (or less dependent on) previous research output.
  • Invest in infrastructure for innovative research – and do so fast. The money spent on the sequencing facilities at Sci Life Lab in Stockholm is an excellent example of an infrastructure investment that gives a lot of researchers at different universities access to high-throughput sequencing, without each university having to invest in expensive sequencing platforms themselves. More such centers would both spur collaboration and allow for faster adoption of new technologies.

4. Competing with what we are best at
A mistake often made when trying to compete with the best in the class is to compete by doing the same things the best players do. This makes it extremely hard to win a game against exactly those players, as they are likely more experienced, have more resources, and already have the attention needed to secure the resources we compete for. Instead, one could try the Wayne Gretzky trick: skate to where the puck is heading, instead of where it is today. Another approach would be to invent a new arena for the puck to land in, where you have better control over the settings than your competitors (slightly similar to what Apple did when the iPod was released, and Microsoft couldn’t use Windows to leverage their mp3 player Zune).

For Sweden, this would mean that we should not just throw some bucks at the best players at our universities and hope that they will be happy with this (comparably small) amount of money. Instead, we should give them working conditions that are much better, or appealing from other standpoints. This could mean better job security, longer contracts, less administrative work, more secure grants, more freedom to decide over one’s own time, and greater possibilities to combine work and family – simply a better, more secure and nicer environment to work in. However, Björklund’s suggestions go the very opposite way: researchers should compete to be part of the elite community, and if you’re not in that group, you get thrown out. Therefore, I suggest (at the risk of repeating myself) that we should compete by:

  • Offering longer contracts and grants for scientists.
  • Giving scientists opportunities to combine work and family life.
  • Embracing all kinds of science, both fundamental and applied, both short-term and long-term.
  • Allowing researchers to take risks, even if they fail.
  • Giving universities enough funding to let scientists do the science and administrative personnel do the administration.
  • Funding large-scale collaborative infrastructure investments.
  • Thinking of how to create an environment that is appealing for scientists, not only from an economic perspective.

A note on other important aspects of funding
Finally, I have now been focusing a lot on breadth as opposed to directed funding for an elite research squad. It is, however, apparent that we also need to allocate funding to bring more women into the top positions of academia. Likely, a system that favors elite groups will also favor male researchers, judging from how the Swedish Foundation for Strategic Research picks its bets for the future. Also, it is important that young researchers without strong track records get funded; otherwise a lot of new and interesting ideas risk being lost.

In the fourth point of my proposal, I suggest that Sweden should compete at what Sweden is good at: viewing researchers as human beings, who are most likely to succeed in an environment where they can develop their ideas in a free and secure way. To me, it is surprising that a minister of education representing a liberal party wants to exert such control over what is good and bad research. Putting a working social security system in place around science seems much more logical than throwing money at those who already have it. But apparently I have forgotten that our current government is not interested in having a working social security system – their interest seems to lie in deconstructing those very structures.

Megraft paper in print

I just learned from Research in Microbiology that the paper on our software Megraft has now been assigned a volume and an issue. The proper way of referencing Megraft should consequently now be:

Bengtsson J, Hartmann M, Unterseher M, Vaishampayan P, Abarenkov K, Durso L, Bik EM, Garey JR, Eriksson KM, Nilsson RH: Megraft: A software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes. Research in Microbiology. Volume 163, Issues 6–7 (2012), 407–412. doi: 10.1016/j.resmic.2012.07.001 [Paper link]

Megraft is currently at version 1.0.1, but I have a slightly updated version in the pipeline which will be made available later this fall.