Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg


One of the highlights of the Swedish Bioinformatics Workshop 2014 was, of course, the dinner entertainment: a song specially crafted for the event. Fortunately, it has now been put online. For anyone who might not catch all the words, here are the complete lyrics (the song is based on “Java Jive” in the Manhattan Transfer arrangement):

The Bioinformatics ABC

Grab your coffee
Grab your tea
Put down your spoon now and listen to me
For the bioinformatics ABC
Wake up, wake up, wake up, wake up, wake up
(Boy)

A for ABYSS
B for BLAST
And C for Clustal, though it’s not that fast
Alternatives are Muscle and MAFFT
ABYSS and BLAST and Clustal, Muscle, MAFFT
(Yeah)

D count reads with DESeq or E for EdgeR
And F for FastQC and G for GLIMMER
H for HMMER using Markov Chains
(Explain)
Hidden hidden Markov model

I for Inchworm;
Jellyfish
Add Chrysalis and Butterfly and wish
Assemble fast with a sound that goes swish
Contig, contig, contig, contig, transcript
(Girl)

KBASE
Lasergene
And the ton of tools for metagenomics
MEGAN, Megraft,
MetaPhlan, MG-RAST, Meta-GeneMark
And that’s just mentioning a few of them
(Talk it boy)

N for Newbler, old-school it is
If you’re still using 454 it’s a bliss
O for Oases, P for PyroNoise
45-45-45-454-454

Q is for Quake for that great quality
And R is for all those neat statistics
S for the Spades assembler, oh yeah

T for TopHat
U for Uclust
V for Velvet

There is Wham to align
XMatchView to review
And YASS to pursue
(But do you)
Know any tools beginning with Z?
What?
Yeah, Zorro, Zorro, Zorro
Oh yeah

With the publication of my latest paper last week (1), I would also like to highlight some of the software underpinning the findings. Extremely common resistance genes can be present in multiple contexts and variants, which causes assemblers such as Velvet (2) to perform sub-optimally. To get around this problem, we have written a software tool that uses Vmatch (3) and Trinity (4) to iteratively construct contigs from reads associated with resistance genes. The tool could of course be used in many other situations as well, whenever you want to specifically assemble a certain portion of a metagenome but suspect that this portion occurs in multiple contexts.

TriMetAss is a Perl program that employs Vmatch and Trinity to construct multi-context contigs. TriMetAss takes extracted reads associated with, e.g., resistance genes and uses them as seeds for a Vmatch search against the complete set of read pairs, extracting reads that match any of the seed reads over at least 49 bp (by default). These reads are then assembled using Trinity. The resulting contigs are in turn used as seeds for another Vmatch search against the complete set of reads, as above. All matches (including the previously matching read pairs) are then assembled again with Trinity. This iterative process is repeated until a stopping criterion is met, e.g. when the total number of assembled nucleotides starts to drop rather than increase. The software can be downloaded here.
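The iterative seed-and-extend loop described above can be sketched in a few lines of Python. This is only a toy illustration of the idea, not the TriMetAss code. Exact shared k-mers stand in for the Vmatch search, a naive greedy overlap merger (`toy_assemble`) stands in for Trinity, and the loop stops as soon as the total assembled length no longer increases. All function names and the example reads are hypothetical.

```python
from typing import Callable, Iterable, List, Set


def kmers(seqs: Iterable[str], k: int) -> Set[str]:
    """All substrings of length k found in the given sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def overlap(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a equal to a prefix of b (0 if shorter than min_overlap)."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def toy_assemble(reads: List[str], min_overlap: int = 4) -> List[str]:
    """Greedy overlap merging; a stand-in for a real assembler such as Trinity."""
    contigs = list(dict.fromkeys(reads))  # de-duplicate, keep order
    while True:
        best = (0, -1, -1)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


def iterative_assembly(
    seeds: List[str],
    reads: List[str],
    assemble: Callable[[List[str]], List[str]],
    min_match: int = 49,
    max_iterations: int = 10,
) -> List[str]:
    """Recruit reads matching the seeds, assemble, re-seed with the contigs, repeat."""
    recruited: Set[str] = set()
    best_contigs: List[str] = []
    best_total = 0
    current_seeds = list(seeds)
    for _ in range(max_iterations):
        seed_kmers = kmers(current_seeds, min_match)
        # Exact k-mer sharing is an assumption standing in for the Vmatch search.
        recruited |= {
            r for r in reads
            if any(r[i:i + min_match] in seed_kmers
                   for i in range(len(r) - min_match + 1))
        }
        contigs = assemble(sorted(recruited))
        total = sum(len(c) for c in contigs)
        if total <= best_total:  # assumed stop rule: no further growth
            break
        best_contigs, best_total = contigs, total
        current_seeds = contigs
    return best_contigs


# Toy data: four overlapping reads from one region plus one unrelated read.
genome = "ACGTTGCAAGCTTAGGCC"
reads = ["ACGTTGCA", "TGCAAGCT", "AGCTTAGG", "CTTAGGCC", "TTTTTTTT"]
contigs = iterative_assembly(["GCAAGC"], reads, toy_assemble, min_match=4)
print(contigs)
```

Starting from a short seed, each round recruits reads further out from the seed region, while the unrelated read is never pulled in. A real run would of course work on read pairs and tolerate mismatches, which this sketch ignores.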

References:

  1. Bengtsson-Palme J, Boulund F, Fick J, Kristiansson E, Larsson DGJ: Shotgun metagenomics reveals a wide array of antibiotic resistance genes and mobile elements in a polluted lake in India. Frontiers in Microbiology, 5, 648 (2014). doi: 10.3389/fmicb.2014.00648
  2. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821–829 (2008). doi:10.1101/gr.074492.107
  3. Kurtz S: The Vmatch large scale sequence analysis software (2010). http://vmatch.de/
  4. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al.: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011). doi:10.1038/nbt.1883

After a long delay in testing, ITSx version 1.0.10 has been made public. The new version patches a bug that caused the 3′ anchor not to be written to file properly when using the “--anchor hmm” option. The bug did not apply if a number was given to the “--anchor” option. Thus, if you have not been using the “--anchor” option together with “hmm”, you have not been affected by this bug in any way. Nevertheless, I encourage you to update in case you want to use the “--anchor hmm” option in the future. The update can be downloaded here. Happy barcoding!

I would like to sincerely apologize for having been terrible at responding to support issues pertaining to ITSx, Metaxa, Atosh etc. lately. I am currently on 50% parental leave, and at the same time I am wrapping up three first-author papers, organizing a workshop and preparing a talk. To cope with everything else, I have let support issues lag a bit behind over the last few weeks. I have been ticking off most (all?) of my support questions over the last couple of days, but if there are any remaining issues that I have failed to reply to, please re-send them to me!

I will try to improve response times, but it is hard when I am working less than usual (also, note that I (strangely) don’t get paid for supporting software, so I have to do this in my “spare time”). My aim is to respond within a few days, so if I have not done so, please resend your e-mail with a friendly reminder that you are waiting for my response. Reminding me will very likely move your question up the priority pile.

So, my advice to soon-to-be dads is: Do take paternity leave. Do take a lot of it. Share responsibilities with your partner. Because what you get back is awesome. (And you also get a good reason not to answer support questions in time.) But finally, don’t plan to wrap up the last couple of years’ worth of work and arrange a conference at the same time as you take paternity leave. That will only make you feel inadequate on all fronts.

Keep the spirit high!

Another paper I have contributed to has just been published in Molecular Ecology Resources. The paper (1), which is lead-authored by Xin-Cun Wang and Chang Liu at the Institute of Medicinal Plant Development in Beijing, investigates the usability of ITS1 and ITS2 as separate barcodes across the eukaryotes. The study is a large-scale meta-analysis comparing available high-quality sequence data in as many taxonomic groups as possible from three different aspects: PCR amplification, DNA sequencing efficiency and species discrimination ability. Specifically, we have used bioinformatic approaches to look at the presence of DNA barcoding gaps, species discrimination efficiency, sequence length distribution, GC content distribution and primer universality. We found that ITS1 had significantly higher efficiencies than ITS2 in 17 of 47 families and 20 of 49 investigated genera, which was markedly better than the performance of ITS2. We conclude that, in general, ITS1 represents a better DNA barcode than ITS2 for a majority of eukaryotic taxonomic groups. This of course doesn’t mean that ITS2, or the ITS region in its entirety, should be dismissed, but our results can serve as a basis for making informed decisions about which region to choose for your amplicon sequencing project. The results complement what has previously been observed for e.g. fungi, where the difference between ITS1 and ITS2 was much less pronounced (2).
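For readers unfamiliar with the term, a DNA barcoding gap is commonly understood as the situation where the smallest distance between sequences from different species exceeds the largest distance between sequences from the same species. The sketch below illustrates that idea on made-up data. It assumes pre-aligned sequences of equal length and uses a plain p-distance (the proportion of differing positions); the distance measure and exact definition used in the paper may differ, and the function names and sequences are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(records: List[Tuple[str, str]]) -> float:
    """Minimum interspecific distance minus maximum intraspecific distance.

    A positive value indicates a barcoding gap for this data set.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    if not intra or not inter:
        raise ValueError("need both within- and between-species pairs")
    return min(inter) - max(intra)


# Made-up aligned sequences for two hypothetical species.
records = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAT"),
    ("species_B", "ACGAACCTAC"),
    ("species_B", "ACGAACCTAG"),
]
gap = barcoding_gap(records)
print(f"barcoding gap: {gap:.2f}")
```

In this toy data set, sequences within a species differ at one position in ten, while sequences from different species differ at two or more, so the gap is positive.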

References:

  1. Wang X-C, Liu C, Huang L, Bengtsson-Palme J, Chen H, Zhang J-H, Cai D, Li J-Q: ITS1: A DNA barcode better than ITS2 in eukaryotes? Molecular Ecology Resources. Early view. doi: 10.1111/1755-0998.12325 [Paper link]
  2. Blaalid R, Kumar S, Nilsson RH, Abarenkov K, Kirk PM, Kauserud H: ITS1 versus ITS2 as DNA metabarcodes for fungi. Molecular Ecology Resources. Volume 13, Issue 2, Pages 218–224. doi: 10.1111/1755-0998.12065 [Paper link]

I would like to bring to your attention that the abstract deadline for the Swedish Bioinformatics Workshop, held in Gothenburg in October, has been extended to September 15. So hurry up and contribute your latest research. We look forward to getting to know what you’re doing!

A while ago, one of the other developers of ITSx and I discussed whether the --anchor option should also output the “anchor sequences” around the ITS regions in the full-length output file (given that the --truncate option is activated). Today I have changed ITSx to behave this way, updating it to version 1.0.9. The update also slightly improves sensitivity when using the --anchor HMM option, and can be downloaded here. Happy barcoding!

I am part of the organizing committee for the Swedish Bioinformatics Workshop (#SBW2014) that will be held October 23-24 this year in Gothenburg. I would like to invite you all, especially master/PhD students and PostDocs in Sweden, to come and share the event with us!

SBW is an annual event that has been organized by different universities in Sweden. This year it will take place at the Wallenberg Conference Centre in Gothenburg and is arranged jointly by the University of Gothenburg and Chalmers University of Technology. SBW2014 will, as tradition dictates, be a meeting point for PhD students and postdocs working with any kind of bioinformatics within Sweden, and is therefore free of charge for these groups. We are proud to announce a program that includes invited speakers (such as Mick Watson from the Roslin Institute, Dawn Field from the University of Oxford, and Joakim Lundeberg from KTH) as well as participant presentations and poster sessions. This year, the program will also contain a number of workshop sessions where hands-on problems will be used as starting points for discussions on new bioinformatics approaches to these problems. This will give attendees with different methodological backgrounds opportunities to interact and work together to find synergies between fields and come up with creative solutions.

More information about the event including registration and abstract submission can be found at www.sbw2014.se.

I, and the rest of the organizers, look forward to meeting you in Gothenburg in October!

Webpage: http://www.sbw2014.se

Facebook: https://www.facebook.com/events/1450513325188910/

Google+: https://plus.google.com/events/cuhlpovcc275stut854dk5ussnk

If you want, you can spread the word, for example using this flyer!

ITSx has been updated today, bringing it to version 1.0.8. This update adds the “--only_full” option, which restricts output in the ITS1, 5.8S and ITS2 files to only those sequences that contain the full region, i.e. where both surrounding domains have been detected. The update also fixes a bug with the --anchor option, and can be downloaded here. Happy barcoding!

Another paper I have co-authored, related to the UNITE database for fungal rDNA ITS sequences, is now published as an Online Early article in Fungal Diversity. The paper describes an effort to improve the annotation of ITS sequences from fungal plant pathogens. Why is this important? Well, apart from fungal plant pathogens being responsible for great economic losses in agriculture, the paper is also conceptually important, as it shows that together we can accomplish a substantial improvement to the metadata in sequence databases. In this work we have hunted down high-quality reference sequences for various plant pathogenic fungi, and re-annotated incorrectly or insufficiently annotated ITS sequences from the same fungal lineages. In total, the 59 authors have made 31,954 changes to UNITE database data, about 540 changes per author on average. While one person, or a few, could not feasibly have made this effort alone, this work shows that larger consortia can vastly improve the quality of databases by distributing the work among many scientists. In many ways, this relates to proposals to “wikify” GenBank. After Rfam and Pfam, it might now be time to take the user-contribution model to, at least, the RefSeq portion of GenBank, which despite its description as being “comprehensive, integrated, non-redundant, [and] well-annotated” still contains errors and examples of non-usable annotation. More on that at a later point…

Paper reference:

Nilsson RH, Hyde KD, Pawlowska J, Ryberg M, Tedersoo L, Aas AB, Alias SA, Alves A, Anderson CL, Antonelli A, Arnold AE, Bahnmann B, Bahram M, Bengtsson-Palme J, Berlin A, Branco S, Chomnunti P, Dissanayake A, Drenkhan R, Friberg H, Frøslev TG, Halwachs B, Hartmann M, Henricot B, Jayawardena R, Jumpponen A, Kauserud H, Koskela S, Kulik T, Liimatainen K, Lindahl B, Lindner D, Liu J-K, Maharachchikumbura S, Manamgoda D, Martinsson S, Neves MA, Niskanen T, Nylinder S, Pereira OL, Pinho DB, Porter TM, Queloz V, Riit T, Sanchez-García M, de Sousa F, Stefaczyk E, Tadych M, Takamatsu S, Tian Q, Udayanga D, Unterseher M, Wang Z, Wikee S, Yan J, Larsson E, Larsson K-H, Kõljalg U, Abarenkov K: Improving ITS sequence data for identification of plant pathogenic fungi. Fungal Diversity Online early (2014). doi: 10.1007/s13225-014-0291-8 [Paper link]