Good news for everyone using my bloutminer script: it has received an update that makes it even more useful! I have added a function to extract the top N matches to each query (using the -n option), and I have also added the ability to output a filtered set of sequences in the same tabular BLAST format as the input. As a result, bloutminer can now be used in more settings to easily filter out a subset of a large BLAST report (in tabular format, generated using the blastall -m 8 option). The script can be downloaded here: https://microbiology.se/software/
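To illustrate what a top-N filter on a tabular BLAST report involves, here is a minimal Python sketch. It is not bloutminer's actual code: the function name `top_n_hits` is made up for this example, and ranking hits by bit score (the twelfth column of the -m 8 format) is an assumption about what "top" means.

```python
from collections import defaultdict

def top_n_hits(lines, n):
    """Keep the n highest-scoring hits per query from tabular BLAST lines.

    Assumes the 12-column blastall -m 8 layout, where column 1 is the
    query ID and column 12 is the bit score. Ranking by bit score is an
    assumption made for this sketch.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits[fields[0]].append((float(fields[11]), line))

    kept = []
    for scored in hits.values():
        # Stable sort: hits with equal scores keep their input order
        scored.sort(key=lambda pair: pair[0], reverse=True)
        kept.extend(line for _, line in scored[:n])
    return kept
```

Since the kept lines are returned unchanged, the output has the same tabular format as the input and can be passed on to other tools that read BLAST tables.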
You know the feeling when your assembler supports paired-end sequences, but your FASTQ quality filterer doesn’t care about which reads belong together as pairs? Meaning that you end up with a mess of sequences that you have to script together in some way. Gosh, that feeling is way too common. It is for situations like that I have put together the Paired-End ToolKit (PETKit), a collection of FASTQ/FASTA sequence handling programs written in Perl. Currently the toolkit contains three command-line tools that do sequence conversion, quality filtering, and ORF prediction, all adapted specifically for paired-end sequences. You can read more about the programs, which are released as open source software, on the PETKit page. At the moment they lack proper documentation, but running the software with the “--help” option should bring up a useful set of options for each tool. This is still considered beta software, so any bug reports, and especially suggestions, are welcome.
Also, if you have an idea of another problem that is unsolved or badly executed for paired-end sequences, let me know, and I will see if I can implement it in PETKit.
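For readers wondering what pair-aware quality filtering means in practice, here is a minimal Python sketch of the idea. It is not the PETKit code (which is written in Perl): the function names, the mean-quality criterion, the threshold of 20 and the Phred+33 encoding are all assumptions made for illustration, as is the requirement that both files list the mates in the same order.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality_string:
        return 0.0
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        yield lines[i], lines[i + 1], lines[i + 3]

def filter_pairs(forward_lines, reverse_lines, min_mean_quality=20):
    """Keep a read pair only if BOTH mates pass the quality threshold.

    Assumes the two inputs hold the mates in the same order, so that
    record k in one file pairs with record k in the other.
    """
    kept = []
    for fwd, rev in zip(parse_fastq(forward_lines), parse_fastq(reverse_lines)):
        if (mean_quality(fwd[2]) >= min_mean_quality
                and mean_quality(rev[2]) >= min_mean_quality):
            kept.append((fwd, rev))
    return kept
```

The point of the design is that a pair is accepted or rejected as a unit, so the forward and reverse outputs never fall out of sync.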
The Core Facilities at Sahlgrenska are looking for a skilled bioinformatician who can support research projects employing the Core Facilities’ services. The employee will, for example, set up analysis pipelines for next generation sequencing data. They (of course) want an experienced bioinformatician, who also knows programming (Java, C and/or C++, and scripting languages such as Perl or Python). It is also preferable if the applicant knows how to set up secure systems and is comfortable working in the Unix/Linux terminal. More on the position can be found at GU’s web site. The application period closes on the 17th of September.
I have co-authored a paper together with, among others, Henrik Nilsson that was published today in MycoKeys. The paper deals with checking the quality of DNA sequences prior to using them for research purposes. In our opinion, a lot of the software available for sequence quality management is rather complex and resource intensive. Not everyone has the skills to master such software, and in addition computational resources might also be scarce. Luckily, there’s a lot that can be done in quality control of DNA sequences just using manual means and a web browser. This paper puts these means together into one comprehensible and easy-to-digest document. Our target audience is primarily biologists who do not have a strong background in computer science, but still have a dataset requiring DNA sequence quality control.
We have chosen to focus on the fungal ITS barcoding region, but the guidelines should be pretty general and applicable to most groups of organisms. In short, our five guidelines are:
- Establish that the sequences come from the intended gene or marker
Can be done using a multiple alignment of the sequences and verifying that they all feature some suitable, conserved sub-region (the 5.8S gene in the ITS case)
- Establish that all sequences are given in the correct (5’ to 3’) orientation
Examine the alignment for any sequences that do not align at all to the others; re-orient these; re-run the alignment step; and examine them again
- Establish that there are no (at least bad cases of) chimeras in the dataset
Run the sequences through BLAST in one of the large sequence databases, e.g. at NCBI (or in the ITS case, use the UNITE database), to verify that the best match comprises more or less the full length of the query sequences
- Establish that there are no other major technical errors in the sequences
Examine the BLAST results carefully, particularly the graphical overview and the pairwise alignment, for anomalies (there are some nice figures in the paper showing what it should and should not look like)
- Establish that any taxonomic annotations given to the sequences make sense
Examine the BLAST hit list to see that the species names produced make sense
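The guidelines are meant to be followed manually, but two of the steps are easy to express in code for those who want to. The Python sketch below shows re-orienting a sequence by reverse-complementing it (guideline 2) and computing how much of a query its best BLAST match spans (guideline 3). The function names are invented for this example, and what counts as "more or less the full length" is left to the reader, since no cut-off is implied here.

```python
# IUPAC nucleotide codes and their complements (upper and lower case)
_CODES = "ACGTRYKMSWBDHVN"
_COMPS = "TGCAYRMKSWVHDBN"
COMPLEMENT = str.maketrans(_CODES + _CODES.lower(), _COMPS + _COMPS.lower())

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def query_coverage(q_start, q_end, query_length):
    """Fraction of the query spanned by a BLAST match.

    q_start and q_end are the 1-based, inclusive query coordinates of
    the match, as reported by BLAST.
    """
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return (abs(q_end - q_start) + 1) / query_length
```

A best match that spans only about half of the query, for instance, would be a reason to take a closer look at that sequence as a possible chimera.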
A much more thorough description of these guidelines can be found in the paper itself, which is available under open access from MycoKeys. There’s simply no reason not to go there and at least take a look at it. Happy quality control!
Nilsson RH, Tedersoo L, Abarenkov K, Ryberg M, Kristiansson E, Hartmann M, Schoch CL, Nylander JAA, Bergsten J, Porter TM, Jumpponen A, Vaishampayan P, Ovaskainen O, Hallenberg N, Bengtsson-Palme J, Eriksson KM, Larsson K-H, Larsson E, Kõljalg U: Five simple guidelines for establishing basic authenticity and reliability of newly generated fungal ITS sequences. MycoKeys. Issue 4 (2012), 37–63. doi: 10.3897/mycokeys.4.3606 [Paper link]
Bengtsson J, Hartmann M, Unterseher M, Vaishampayan P, Abarenkov K, Durso L, Bik EM, Garey JR, Eriksson KM, Nilsson RH: Megraft: A software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes. Research in Microbiology. Volume 163, Issues 6–7 (2012), 407–412, doi: 10.1016/j.resmic.2012.07.001. [Paper link]
Megraft is currently at version 1.0.1, but I have a slightly updated version in the pipeline which will be made available later this fall.
I am on my way to Copenhagen for the ISME14 conference that begins today. I’m quite excited about this event myself, and will present three posters (two as first author), and give a short talk on antibiotic resistance gene identification and metagenomics. My talk will be in the Bioinformatics in Microbial Ecology session on Thursday afternoon (at 13.30).
If you’d like to talk about Metaxa and Megraft, I will present an SSU-oriented poster in the Monday afternoon poster session (board number 267A). My antibiotic resistance gene poster will be presented on Thursday afternoon (board number 002A), and I really encourage everyone interested in metagenomics (especially metagenomic assembly) to come talk to me then! Finally, I am also partially responsible for a poster on periphyton metagenomics with Martin Eriksson as its main author. This poster is also presented on Monday, in the Microbial Dispersion and Biogeography session (board number 021A).
I hope to be able to make another post later tonight on which sessions are “essential” for me at this conference. Hope to see you there soon!
Yesterday, our paper on Megraft – a software tool to graft ribosomal small subunit (16S/18S) fragments onto full-length SSU sequences – became available as an accepted online early article in Research in Microbiology. Megraft is built upon the notion that when examining the depth of a community sequencing effort, researchers often use rarefaction analysis of the ribosomal small subunit (SSU/16S/18S) gene in a metagenome. However, the SSU sequences in metagenomic libraries generally are present as fragmentary, non-overlapping entries, which poses a great problem for this analysis. Megraft aims to remedy this problem by grafting the input SSU fragments from the metagenome (obtained by e.g. Metaxa) onto full-length SSU sequences. The software also uses a variability model which accounts for observed and unobserved variability. This way, Megraft enables accurate assessment of species richness and sequencing depth in metagenomic datasets.
The algorithm, efficiency and accuracy of Megraft are thoroughly described in the paper. It should be noted that this is not a panacea for species richness estimates in metagenomics, but it is a huge step forward over existing approaches. Megraft shares some similarities with EMIRGE (Miller et al., 2011), which is a software package for reconstruction of full-length ribosomal genes from paired-end Illumina sequences. Megraft, however, is set apart in that it has a strong focus on rarefaction, and also functions when the number of sequences is small, which is often the case in 454 and Sanger-based metagenomics studies. Thus, EMIRGE and Megraft seek to solve a roughly similar problem, but for different sequencing technologies and sequencing scales.
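For those unfamiliar with rarefaction analysis, the Python sketch below computes the expected number of species observed when drawing a subsample of n sequences without replacement, using the standard analytical rarefaction formula. This is the generic calculation only, not Megraft's grafting algorithm or its variability model, and the function name and the example counts are made up for illustration.

```python
from math import comb

def rarefied_richness(counts, n):
    """Expected species richness in a random subsample of n sequences.

    counts holds the number of sequences assigned to each species.
    For each species, 1 - C(N - c, n) / C(N, n) is the probability that
    at least one of its c sequences is included in the subsample.
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total number of sequences")
    all_subsamples = comb(total, n)
    return sum(1 - comb(total - c, n) / all_subsamples for c in counts)

# Hypothetical community: three species with 5, 3 and 2 sequences
curve = [rarefied_richness([5, 3, 2], n) for n in range(0, 11)]
```

Plotting such a curve against n shows whether richness is still climbing at the full sequencing depth or has started to level off.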
Bengtsson, J., Hartmann, M., Unterseher, M., Vaishampayan, P., Abarenkov, K., Durso, L., Bik, E.M., Garey, J.R., Eriksson, K.M., Nilsson, R.H. (2012). Megraft: A software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes and similar environmental datasets. Research in Microbiology, doi: 10.1016/j.resmic.2012.07.001.
Miller, C.S., Baker, B.J., Thomas, B.C., Singer, S.W., Banfield, J.F. (2011). EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biology, 12(5), R44. doi: 10.1186/gb-2011-12-5-r44.
I realized that I have been using a newer version of Metaxa than most of you for the last couple of months. This bug fix was written sometime in February or March, and we have kept it internal to make sure it works as it should. Then other things came up and we never got around to actually releasing it. But with testing passed and upcoming versions of Metaxa in the pipeline, I think it is about time that everyone gets their hands on the latest Metaxa version.
It’s only two small things this time:
- Slight tweaks to the new HMM scoring system, making Metaxa just a little bit faster
- Fixed a rarely occurring bug causing the --heuristics option to be ignored in certain circumstances
For the last few months I have been struggling (part time) with getting Metaxa to eat Illumina paired-end data. This is a pretty tricky task, mainly because Illumina reads are so much shorter than those obtained by Sanger and 454 sequencing. Therefore, I am more than happy to inform the community that today (the day before I go on vacation) I have a working prototype up and running. In fact, calling it a prototype is unfair; it is a fairly well-developed piece of software by now. Currently, I am running it on test data sets, and I will try to keep it running over the next couple of weeks. Thereafter, I hope to be able to release it sometime this autumn (but don’t expect a September release!), harnessing the power of Illumina sequencing for SSU identification. Stay tuned, and have a great summer!
For those of you who like to listen to (or look at) me, I will be giving a presentation at this year’s SocBiN conference in Stockholm. My presentation has the long and quite informative title: Comprehensive Analysis of Antibiotic Resistance Genes in River Sediment, Well Water and Soil Microbial Communities Using Metagenomic DNA Sequencing. The talk is scheduled in the Using Next generation sequence data session, right after Jeroen Raes and Christopher Quince… It’s a short talk, so I will probably need to keep it simple, but it will be the first time I present results generated in connection with my present position, which I personally feel is very nice. We’re moving forward!