PETKit

PETKit – Paired-End ToolKit
Johan Bengtsson-Palme, 2012-2014

The Paired-End ToolKit is a set of tools that ease the use of sequences from large-scale sequencing projects generating paired-end reads. Many tools, such as the FASTX Toolkit, do not readily process paired-end sequences in any intelligent manner. The PETKit software currently lacks any documentation apart from this document and is in a beta stage. However, I am using these tools myself in my research and find them quite handy, so I want to share them with the community. As of version 1.1, the toolkit consists of six programs, which are described below.

Download the PETKit (version 1.1b)

Peacat – Paired-end Exactly Alike Consensus Alignment Tool

This tool takes reads in the FASTA format (two files, one for each read of the pair, and/or a single file of non-paired reads) and iteratively aligns them to a set of reference (seed) sequences. If an input read has the specified overlap with a seed sequence, it is used to extend that seed sequence. The process is repeated until the seed sequences can no longer be extended. Peacat is unique in that it outputs ALL possible combinations in which reads can be used to merge seed sequences. The tool is useful for finalizing assemblies from e.g. metagenomic data with highly identical genes occurring in several different contexts. However, the user should also look into the TriMetAss package as an alternative for such applications.
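To illustrate the idea of overlap-based seed extension, here is a minimal Python sketch of a single extension round. It is not Peacat's actual code: the function name `extend_seed`, the `min_overlap` parameter, and the restriction to exact suffix/prefix matches on the right-hand end of the seed are all assumptions made for the example.

```python
def extend_seed(seed, reads, min_overlap=4):
    """Return every sequence obtained by extending `seed` one step to the right.

    Hypothetical helper: a read extends the seed if a prefix of the read of
    at least `min_overlap` bases matches the end of the seed exactly.
    """
    extensions = set()
    for read in reads:
        # Try the longest overlap first; the read must add at least one base.
        longest = min(len(seed), len(read) - 1)
        for k in range(longest, min_overlap - 1, -1):
            if seed.endswith(read[:k]):
                extensions.add(seed + read[k:])
                break
    # Every distinct extension is kept, not only the best-supported one.
    return sorted(extensions)


if __name__ == "__main__":
    reads = ["ACGGTTA", "ACGGCCA", "TTTTTTT"]
    for candidate in extend_seed("ACGTACGG", reads):
        print(candidate)
```

Keeping all distinct extensions, rather than picking one, mirrors the point that a gene may occur in several different contexts. Repeating the round on each result until nothing changes would give the iterative behaviour described above.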

Pearf – Paired-End-Aware Read Filterer

This tool takes reads in the FASTQ format (separated in two files, one for each read) and filters them according to certain quality criteria. Pearf looks at the quality of both reads at once, and then determines if the pair of reads should be kept or discarded. This makes it much easier to keep track of which pairs remain when doing e.g. paired-end based assemblies.
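A pair-aware filter can be sketched in a few lines of Python. The criterion used here (mean Phred quality of each read at or above a threshold) and the names `mean_quality`, `filter_pairs` and `min_mean_quality` are assumptions chosen for illustration, not Pearf's actual options. The offset of 33 matches the default mentioned in the version history.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (0.0 for an empty string)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_pairs(pairs, min_mean_quality=20, offset=33):
    """Keep a pair only if BOTH reads pass the (assumed) quality criterion.

    `pairs` is an iterable of (read1, read2), where each read is a
    (name, sequence, quality) tuple.
    """
    kept = []
    for read1, read2 in pairs:
        ok1 = mean_quality(read1[2], offset) >= min_mean_quality
        ok2 = mean_quality(read2[2], offset) >= min_mean_quality
        if ok1 and ok2:
            kept.append((read1, read2))
    return kept


if __name__ == "__main__":
    good = (("a/1", "ACGT", "IIII"), ("a/2", "TTGA", "IIII"))
    bad = (("b/1", "ACGT", "IIII"), ("b/2", "TTGA", "####"))
    for r1, r2 in filter_pairs([good, bad]):
        print(r1[0], r2[0])
```

Because the decision is made per pair, the two output files stay in step with each other and no read is left without its mate.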

Pefcon – Paired-End Format Converter

Pefcon can be used to convert between FASTQ and FASTA format, and also to convert between single-file paired-end format (with all reads in one file, every second read being the “paired” one) and dual-file paired-end format (two files: one containing the first reads, and one containing the paired reads). Pefcon can also use and create .qual files for quality scores associated with FASTA format.
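The relationship between the single-file and dual-file layouts can be shown with a small Python sketch that works on lists of already parsed records. The function names `split_interleaved` and `interleave` are hypothetical and this is not how Pefcon itself is implemented.

```python
def split_interleaved(records):
    """Split a single-file layout (read, mate, read, mate, ...) into two lists."""
    if len(records) % 2 != 0:
        raise ValueError("interleaved input must contain an even number of reads")
    first = records[0::2]   # records 1, 3, 5, ... are the first reads
    second = records[1::2]  # records 2, 4, 6, ... are the paired reads
    return first, second


def interleave(first, second):
    """Merge a dual-file layout into a single list with mates alternating."""
    if len(first) != len(second):
        raise ValueError("both inputs must contain the same number of reads")
    merged = []
    for read, mate in zip(first, second):
        merged.extend([read, mate])
    return merged


if __name__ == "__main__":
    single = ["a/1", "a/2", "b/1", "b/2"]
    first, second = split_interleaved(single)
    print(first, second)
    print(interleave(first, second) == single)
```

The two functions are inverses of each other, so a round trip through both layouts leaves the read order unchanged.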

Pemap – Paired-End Mapper

Pemap uses vmatch (external software required by Pemap; it can be downloaded from http://vmatch.de) to map reads to contigs. It then reports whether the mapped reads support the contig being circular. The tool is highly useful for determining whether a certain contig could come from e.g. a plasmid.

Pepp – Paired-End Protein Predictor

Pepp is a simple ORF predictor that takes into account all possible ORFs that can be created by a pair of reads. This is useful to do before scanning large paired-end data sets against a protein database, using e.g. HMMER.
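As a rough illustration of ORF prediction on read pairs, the Python sketch below translates each read in all six reading frames and collects the stop-free stretches from both reads. The function names, the `min_length` parameter, and the choice to treat any stop-free fragment as an open reading frame (without requiring a start codon) are assumptions for the example; Pepp's own rules may differ.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_orfs(seq, min_length=3):
    """Stop-free peptide fragments from all six frames of one read."""
    seq = seq.upper()
    orfs = set()
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for fragment in translate(strand[frame:]).split("*"):
                if len(fragment) >= min_length:
                    orfs.add(fragment)
    return orfs


def pair_orfs(read1, read2, min_length=3):
    """Union of candidate ORFs from both reads of a pair (assumed behaviour)."""
    return six_frame_orfs(read1, min_length) | six_frame_orfs(read2, min_length)


if __name__ == "__main__":
    for peptide in sorted(pair_orfs("ATGGCCAAATAA", "TTATTTGGCCAT")):
        print(peptide)
```

The resulting peptide fragments could then be written to a FASTA file and searched against a protein database.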

Pesort – Paired-End Sorter

Pesort takes one or two input FASTA or FASTQ files containing paired-end reads (or single reads) and sorts them so that the read pairs occur in the same order. It also identifies reads that don’t have a pair and outputs them to a separate file.
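The pairing logic can be sketched in Python as follows. This is only an illustration under stated assumptions: reads are (name, sequence) tuples, mates are recognised by identical names after stripping a trailing `/1` or `/2`, and the names `pair_key` and `sort_pairs` are hypothetical rather than taken from Pesort.

```python
def pair_key(name):
    """Read name without a trailing /1 or /2 (assumed naming convention)."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def sort_pairs(first, second):
    """Reorder `second` to follow `first` and collect unpaired reads.

    Returns (paired_first, paired_second, orphans).
    """
    mates = {pair_key(name): (name, seq) for name, seq in second}
    paired_first, paired_second, orphans = [], [], []
    matched = set()
    for name, seq in first:
        key = pair_key(name)
        if key in mates:
            paired_first.append((name, seq))
            paired_second.append(mates[key])
            matched.add(key)
        else:
            orphans.append((name, seq))
    # Reads in the second file whose mate never appeared are orphans too.
    orphans.extend(rec for rec in second if pair_key(rec[0]) not in matched)
    return paired_first, paired_second, orphans


if __name__ == "__main__":
    first = [("a/1", "AA"), ("b/1", "CC"), ("c/1", "GG")]
    second = [("c/2", "TT"), ("a/2", "AC"), ("d/2", "GT")]
    p1, p2, orphans = sort_pairs(first, second)
    print(p1, p2, orphans, sep="\n")
```

After sorting, the read at position i in the first list is always the mate of the read at position i in the second list, which is what most paired-end tools expect.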

Please note that this is a beta version of PETKit. Please report bugs to me: first name (dot) last name (at) microbiology (dot) se

Version history

  • 31st March 2014 – 1.1b – added the Peacat and Pemap programs, fixed a bug when using FASTA files in Pefcon, changed default offset value to 33.
  • 14th January 2013 – 1.0.2b – added the Pesort program, fixed a critical bug with custom offset values.
  • 19th October 2012 – 1.0.1b – support for other quality score offsets.
  • 9th October 2012 – 1.0b – first public release.

Contact information

Johan Bengtsson-Palme (firstname.lastname [at] microbiology.se)
University of Gothenburg
Department of Infectious Diseases, Institute of Biomedicine
