You know the feeling when your assembler supports paired-end sequences, but your FASTQ quality filterer doesn’t care about what pairs that belong together? Meaning that you end up with a mess of sequences that you have to script together in some way. Gosh, that feeling is way too common. It is for situations like that I have put together the Paired-End ToolKit (PETKit), a collection of FASTQ/FASTA sequence handling programs written in Perl. Currently the toolkit contains three command-line tools that does sequence conversion, quality filtering, and ORF prediction, all adapted for paired-end sequences specifically. You can read more about the programs, which are released as open source software, on the PETKit page. At the moment they lack proper documentation, but running the software with the “–help” option should bring up a useful set of options for each tool. This is still considered beta-software, so any bug reports, and especially suggestions, are welcome.
Also, if you have an idea of another problem that is unsolved or badly executed for paired-end sequences, let me know, and I will see if I can implement it in PETKit.
I have co-authored a paper together with, among others, Henrik Nilsson that was published today in MycoKeys. The paper deals with checking quality of DNA sequences prior to using them for research purposes. In our opinion, a lot of the software available for sequence quality management is rather complex and resource intensive. Not everyone have the skills to master such software, and in addition computational resources might also be scarce. Luckily, there’s a lot that can be done in quality control of DNA sequences just using manual means and a web browser. This paper puts these means together into one comprehensible and easy-to-digest document. Our targeted audience is primaily biologists who do not have a strong background in computer science, and still have a dataset requiring DNA sequence quality control.
We have chosen to focus on the fungal ITS barcoding region, but the guidelines should be pretty general and applicable to most groups of organisms. In very short our five guidelines spells:
- ￼￼￼Establish that the sequences come from the intended gene or marker
Can be done using a multiple alignment of the sequences and verifying that they all feature some suitable, conserved sub-region (the 5.8S gene in the ITS case)
- Establish that all sequences are given in the correct (5’ to 3’) orientation
Examine the alignment for any sequences that do not align at all to the others; re-orient these; re-run the alignment step; and examine them again
- Establish that there are no (at least bad cases of) chimeras in the dataset
Run the sequences through BLAST in one of the large sequence databases, e.g. at NCBI (or in the ITS case, use the UNITE database), to verify that the best match comprises more or less the full length of the query sequences
- Establish that there are no other major technical errors in the sequences
Examine the BLAST results carefully, particularly the graphical overview and the pairwise alignment, for anomalies (there are some nice figures in the paper on how it should and should not look like)
- Establish that any taxonomic annotations given to the sequences make sense
Examine the BLAST hit list to see that the species names produced make sense
A much more thorough description of these guidelines can be found in the paper itself, which is available under open access from MycoKeys. There’s simply no reason not to go there and at least take a look at it. Happy quality control!
Nilsson RH, Tedersoo L, Abarenkov K, Ryberg M, Kristiansson E, Hartmann M, Schoch CL, Nylander JAA, Bergsten J, Porter TM, Jumpponen A, Vaishampayan P, Ovaskainen O, Hallenberg N, Bengtsson-Palme J, Eriksson KM, Larsson K-H, Larsson E, Kõljalg U: Five simple guidelines for establishing basic authenticity and reliability of newly generated fungal ITS sequences. MycoKeys. Issue 4 (2012), 37–63. doi: 10.3897/mycokeys.4.3606 [Paper link]