PETKit – Paired-End ToolKit
Johan Bengtsson-Palme, 2012-2014
The Paired-End ToolKit (PETKit) is a set of tools that ease the handling of sequences from large-scale sequencing projects that generate paired-end reads. Many tools, such as the FASTX Toolkit, do not readily process paired-end sequences in any intelligent manner. The PETKit software currently lacks any documentation apart from this document and is still in a beta stage. However, I use these tools in my own research and find them quite handy, so I want to share them with the community. As of version 1.1, the toolkit consists of six programs, which are described below.
Download the PETKit (version 1.1b)
Peacat – Paired-end Exactly Alike Consensus Alignment Tool
This tool takes reads in FASTA format (split into two files, one for each read in a pair, and/or a single file of non-paired reads) and iteratively aligns them to a set of reference (seed) sequences. If an input read has a specified overlap with a seed sequence, the read is used to extend that seed sequence. The process is repeated until the seed sequences can no longer be extended. Peacat is unique in that it outputs ALL possible combinations in which reads can be used to merge seed sequences. The tool is useful for finalizing assemblies from e.g. metagenomic data with highly identical genes occurring in several different contexts. However, the user should also look into the TriMetAss package as an alternative for such applications.
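The iterative extension idea can be illustrated with a minimal Python sketch. This is not Peacat's actual code: the function names, the `min_overlap` parameter and the restriction to exact, right-hand, same-strand overlaps are all assumptions made for illustration. Unlike Peacat, this greedy version follows only one extension path instead of reporting all possible combinations.

```python
def extend_right(seed, read, min_overlap):
    """Return seed extended by read if a suffix of seed exactly matches
    a prefix of read of at least min_overlap bases; otherwise None."""
    # The overlap must leave at least one new base to add to the seed.
    max_len = min(len(seed), len(read) - 1)
    for k in range(max_len, min_overlap - 1, -1):
        if seed.endswith(read[:k]):
            return seed + read[k:]
    return None


def extend_seed(seed, reads, min_overlap=20):
    """Greedily extend a seed to the right until no unused read overlaps it.

    Hypothetical helper: each read is used at most once, which also
    guarantees termination on repetitive sequences.
    """
    remaining = list(reads)
    extended = True
    while extended:
        extended = False
        for i, read in enumerate(remaining):
            longer = extend_right(seed, read, min_overlap)
            if longer is not None:
                seed = longer
                del remaining[i]
                extended = True
                break  # restart the scan with the longer seed
    return seed
```

Reporting every possible path, as Peacat does, would require branching whenever more than one read can extend the same seed, rather than taking the first match.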
Pearf – Paired-End-Aware Read Filterer
This tool takes reads in FASTQ format (split into two files, one for each read in a pair) and filters them according to certain quality criteria. Pearf looks at the quality of both reads at once and then determines whether the pair should be kept or discarded. This makes it much easier to keep track of which pairs remain when doing e.g. paired-end based assemblies.
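To make the idea of pair-aware filtering concrete, here is a minimal Python sketch. It does not reproduce Pearf's actual criteria or options: the mean-quality rule, the `min_mean_quality` threshold and all function names are assumptions chosen for illustration. The quality offset defaults to 33, matching the PETKit default since version 1.1b.

```python
from statistics import mean


def phred_scores(quality_line, offset=33):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def read_fastq(lines):
    """Yield (header, sequence, quality) tuples from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def filter_pairs(reads1, reads2, min_mean_quality=20, offset=33):
    """Keep a pair only if BOTH mates reach the (hypothetical) mean quality.

    reads1 and reads2 must list the mates in the same order.
    """
    kept = []
    for r1, r2 in zip(reads1, reads2):
        if all(mean(phred_scores(r[2], offset)) >= min_mean_quality
               for r in (r1, r2)):
            kept.append((r1, r2))
    return kept
```

Because the decision is made per pair, the two output files stay in sync and no orphaned reads are produced.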
Pefcon – Paired-End Format Converter
Pefcon can be used to convert between FASTQ and FASTA format, and also to convert between single-file paired-end format (with all reads in one file, every second read being the “paired” one) and dual-file paired-end format (two files: one containing the first reads, and one containing the paired reads). Pefcon can also use and create .qual files for quality scores associated with FASTA format.
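The two paired-end layouts and the FASTQ-to-FASTA conversion can be sketched in a few lines of Python. This is only an illustration of the formats, not Pefcon's implementation; the function names and the in-memory `(header, sequence, quality)` tuples are assumptions.

```python
def interleave(first_reads, second_reads):
    """Dual-file to single-file layout: every second record is the mate."""
    records = []
    for r1, r2 in zip(first_reads, second_reads):
        records.extend([r1, r2])
    return records


def deinterleave(records):
    """Single-file to dual-file layout: split first reads from their mates."""
    return records[0::2], records[1::2]


def fastq_to_fasta(record):
    """Format one (header, sequence, quality) FASTQ record as FASTA text.

    The quality string is dropped; a .qual file would be needed to keep it.
    """
    header, seq, _quality = record
    return ">" + header[1:] + "\n" + seq + "\n"
```

Converting in the other direction (FASTA to FASTQ) needs quality scores from somewhere, which is where .qual files come in.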
Pemap – Paired-End Mapper
Pemap uses vmatch (external software required by Pemap; it can be downloaded from http://vmatch.de) to map reads to contigs. It then reports whether there is support for the contig being circular. The tool is highly useful for determining whether a certain contig could come from e.g. a plasmid.
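One common way to look for circularity support is to check for read pairs whose mates face outward across the contig ends. The Python sketch below illustrates that general idea only; it is an assumption and not a description of the criteria Pemap actually applies. The function name, the coordinate conventions and the `max_insert` value are hypothetical.

```python
def supports_circularity(fwd_start, rev_end, contig_length, max_insert=500):
    """Test whether one mapped pair is consistent with a circular contig.

    fwd_start: 0-based start of the mate mapped on the forward strand.
    rev_end:   0-based exclusive end of the mate mapped on the reverse strand.
    A pair counts as support when the mates point outward (reverse mate
    lies entirely before the forward mate) and the fragment would fit
    within max_insert if the contig ends were joined.
    """
    if rev_end > fwd_start:
        return False  # inward-facing or overlapping: an ordinary linear pair
    wrapped_span = (contig_length - fwd_start) + rev_end
    return wrapped_span <= max_insert
```

In practice one would count such pairs over the whole contig and compare that number against the expected coverage before drawing any conclusion.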
Pepp – Paired-End Protein Predictor
Pepp is a simple ORF predictor that takes into account all possible ORFs that can be created by a pair of reads. This is useful to do before scanning large paired-end data sets against a protein database, using e.g. HMMER.
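As an illustration of what "all possible ORFs from a pair of reads" can mean, the following Python sketch translates both reads in all six reading frames and keeps the stretches between stop codons. This is a sketch under stated assumptions, not Pepp's algorithm: the ORF definition (any stop-free stretch, no start codon required), the `min_aa` cutoff and the function names are hypothetical. The standard genetic code is used.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(_BASES, repeat=3), _AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def orfs_in_read(seq, min_aa=10):
    """Collect stop-free peptide stretches from all six reading frames."""
    seq = seq.upper()
    found = []
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for piece in translate(strand[frame:]).split("*"):
                if len(piece) >= min_aa:
                    found.append(piece)
    return found


def orfs_in_pair(read1, read2, min_aa=10):
    """Gather candidate peptides from both mates of a read pair."""
    return orfs_in_read(read1, min_aa) + orfs_in_read(read2, min_aa)
```

The resulting peptides could then be written to a FASTA file and searched against protein profiles.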
Pesort – Paired-End Sorter
Pesort takes one or two input FASTA or FASTQ files containing paired-end reads (or single reads) and sorts them so that the read pairs occur in the same order. It also identifies reads that lack a pair and outputs them to a separate file.
Please note that this is a beta version of PETKit. Please report bugs to me: first name (dot) last name (at) microbiology (dot) se
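The pairing step can be sketched in Python as follows. This is an illustration rather than Pesort's code: it assumes mates share a read name that differs only by a trailing `/1` or `/2`, and the function names are hypothetical.

```python
def pair_key(header):
    """Reduce a FASTA/FASTQ header to a name shared by both mates.

    Assumes the mate is marked by a trailing /1 or /2 on the first
    whitespace-separated field of the header.
    """
    name = header.split()[0].lstrip("@>")
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def sort_pairs(reads1, reads2):
    """Return (paired1, paired2, singles) with mates in matching order."""
    index2 = {pair_key(r[0]): r for r in reads2}
    paired1, paired2, singles = [], [], []
    for read in reads1:
        mate = index2.pop(pair_key(read[0]), None)
        if mate is None:
            singles.append(read)  # first read without a mate
        else:
            paired1.append(read)
            paired2.append(mate)
    singles.extend(index2.values())  # second reads without a mate
    return paired1, paired2, singles
```

Holding one file in a dictionary is simple but memory-hungry; for very large data sets a disk-backed index or a pre-sort by read name would be a more realistic design.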
- 31st March 2014 – 1.1b – added the Peacat and Pemap programs, fixed a bug when using FASTA files in Pefcon, changed default offset value to 33.
- 14th January 2013 – 1.0.2b – added the Pesort program, fixed a critical bug with custom offset values.
- 19th October 2012 – 1.0.1b – support for other quality score offsets.
- 9th October 2012 – 1.0b – first public release.
Johan Bengtsson-Palme (firstname.lastname [at] microbiology.se)
University of Gothenburg
Department of Infectious Diseases, Institute of Biomedicine
Thank you for the kit. Very useful! I was searching for scripts like Pesort for a long time and I am glad that I found this.
However, I have a question to ask. This is sort of a general question, hope you don’t mind. I have used Pesort on many files and it worked fine. But some paired FASTQ files have no common reads between them. In such a case, is it valid to concatenate both files (forward and reverse) and align them as single-end reads? I am mainly dealing with SNPs, so it should not be a problem, right?
I am very happy to hear that you find use of the PETKit. I’ve been on vacation and have not been able to answer your question earlier, but I hope that you still might find my input useful.
I am somewhat surprised that you find many files with no matching pairs. Where is your input data coming from? I guess there is some quality filtering involved before you get the files; maybe the filtering is too strict? I am just curious.
In any case, I guess you could concatenate the files into one single file (though since I have never seen your problem, I haven’t really tried it). However, keep in mind that the reads are oriented in opposite directions. In general, that should not matter, but I would make a sanity check that reads from both ends actually end up mapped to your reference. And of course make sure that the mapper treats them as single-end reads.