A long time ago, we (Martin Eriksson, Martin Hartmann, Henrik Nilsson and me) were invited to write an overview on Metaxa for the Encyclopedia of Metagenomics. I guess the workload for pulling such a project off is huge, so there’s no surprise that it has taken a while for it to be accepted, but now it is available for consumption.
Meanwhile, Metaxa have been getting regular updates, and I hope to soon be able to show you a new major update to the software, bringing it up to the next generation of metagenomics. More on that soon.
I have co-authored a paper together with, among others, Henrik Nilsson that was published today in MycoKeys. The paper deals with checking quality of DNA sequences prior to using them for research purposes. In our opinion, a lot of the software available for sequence quality management is rather complex and resource intensive. Not everyone have the skills to master such software, and in addition computational resources might also be scarce. Luckily, there’s a lot that can be done in quality control of DNA sequences just using manual means and a web browser. This paper puts these means together into one comprehensible and easy-to-digest document. Our targeted audience is primaily biologists who do not have a strong background in computer science, and still have a dataset requiring DNA sequence quality control.
We have chosen to focus on the fungal ITS barcoding region, but the guidelines should be pretty general and applicable to most groups of organisms. In very short our five guidelines spells:
- ￼￼￼Establish that the sequences come from the intended gene or marker
Can be done using a multiple alignment of the sequences and verifying that they all feature some suitable, conserved sub-region (the 5.8S gene in the ITS case)
- Establish that all sequences are given in the correct (5’ to 3’) orientation
Examine the alignment for any sequences that do not align at all to the others; re-orient these; re-run the alignment step; and examine them again
- Establish that there are no (at least bad cases of) chimeras in the dataset
Run the sequences through BLAST in one of the large sequence databases, e.g. at NCBI (or in the ITS case, use the UNITE database), to verify that the best match comprises more or less the full length of the query sequences
- Establish that there are no other major technical errors in the sequences
Examine the BLAST results carefully, particularly the graphical overview and the pairwise alignment, for anomalies (there are some nice figures in the paper on how it should and should not look like)
- Establish that any taxonomic annotations given to the sequences make sense
Examine the BLAST hit list to see that the species names produced make sense
A much more thorough description of these guidelines can be found in the paper itself, which is available under open access from MycoKeys. There’s simply no reason not to go there and at least take a look at it. Happy quality control!
Nilsson RH, Tedersoo L, Abarenkov K, Ryberg M, Kristiansson E, Hartmann M, Schoch CL, Nylander JAA, Bergsten J, Porter TM, Jumpponen A, Vaishampayan P, Ovaskainen O, Hallenberg N, Bengtsson-Palme J, Eriksson KM, Larsson K-H, Larsson E, Kõljalg U: Five simple guidelines for establishing basic authenticity and reliability of newly generated fungal ITS sequences. MycoKeys. Issue 4 (2012), 37–63. doi: 10.3897/mycokeys.4.3606 [Paper link]
Yesterday, our paper on Megraft – a software tool to graft ribosomal small subunit (16S/18S) fragments onto full-length SSU sequences – became available as an accepted online early article in Research in Microbiology. Megraft is built upon the notion that when examining the depth of a community sequencing effort, researchers often use rarefaction analysis of the ribosomal small subunit (SSU/16S/18S) gene in a metagenome. However, the SSU sequences in metagenomic libraries generally are present as fragmentary, non-overlapping entries, which poses a great problem for this analysis. Megraft aims to remedy this problem by grafting the input SSU fragments from the metagenome (obtained by e.g. Metaxa) onto full-length SSU sequences. The software also uses a variability model which accounts for observed and unobserved variability. This way, Megraft enables accurate assessment of species richness and sequencing depth in metagenomic datasets.
The algorithm, efficiency and accuracy of Megraft is thoroughly described in the paper. It should be noted that this is not a panacea for species richness estimates in metagenomics, but it is a huge step forward over existing approaches. Megraft shares some similarities with EMIRGE (Miller et al., 2011), which is a software package for reconstruction of full-length ribosomal genes from paired-end Illumina sequences. Megraft, however, is set apart in that it has a strong focus on rarefaction, and functions also when the number of sequences is small, which is often the case in 454 and Sanger-based metagenomics studies. Thus, EMIRGE and Megraft seek to solve a roughly similar problem, but for different sequencing technologies and sequencing scales.
Bengtsson, J., Hartmann, M., Unterseher, M., Vaishampayan, P., Abarenkov, K., Durso, L., Bik, E.M., Garey, J.R., Eriksson, K.M., Nilsson R.H. (2012). Megraft: A software package to graftribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes and similar environmental datasets. Research in Microbiology, doi: 10.1016/j.resmic.2012.07.001.
- Miller, C. S., Baker, B. J., Thomas, B. C., Singer, S. W., & Banfield, J. F. (2011). EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biology, 12(5), R44. doi:10.1186/gb-2011-12-5-r44
One potential use for Metaxa (paper) is to include it in a pipeline for classification of SSU rRNA in metagenomic data (or other environmental sequencing sets). However, as Metaxa is provided from this site, it only classifies SSUs to the domain level (archaea, bacteria and eukaryotes, with the addition of chloroplasts and mitochondria). It is also able to do some (pretty rough) species guesses using the “
--guess_species T” option. An easy solution to implement would be to pass the Metaxa output, e.g. “metaxa_output.bacteria.fasta” to BLAST, and compare all these sequences to the sequences in e.g. the SILVA or GreenGenes database. There is, however, a way to improve this, which uses Metaxa’s ability to compares sequences to custom databases. In this tutorial, I will show you how to achieve this.
Before we start, you will of course need to download and install Metaxa, and its required software packages (BLAST, HMMER, MAFFT). When you have done this, we can get going with the database customization. I will in this tutorial use the SILVA database for SSU classification. However, the basic idea for the tutorial should be easily applicable to GreenGenes and other rRNA databases as well.
- Visit SILVA through this link, and download the file named “SSURef_106_tax_silva.fasta.tgz”. The file is pretty big so it may take a while to download it. If you’re running Metaxa on a server, you’ll have to get the SILVA-file to the server somehow.
- Unzip and untar the file (Mac OS X makes this neatly by doubleclicking the file, on linux you can do it on the command line by typing “
tar -xvzf SSURef_106_tax_silva.fasta.tgz“). This will give you a FASTA-file.
- The FASTA-file needs to be prepared a bit for Metaxa usage. First, we need to give Metaxa identifiers it can understand. Metaxa identifies sequences’ origins by the last character in their identifier, e.g. “>A16379.1.1496.B”. Here, “.B” indicates that this is a bacterial sequence. We are now going to use the unix command sed to process the file and insert the appropriate identifiers.
- We begin with the archaeal sequences. To get those straight, we type:
sed "s/ Archaea;/.A - Archaea;/" SSURef_106_tax_silva.fasta > temp1
Notice that we direct the output to a temporary file. It is bad practice to replace the input file with the output file, so we work with two temp-files instead.
- The next step is also easy, now we find all eukaryote sequences and add E:s to the identifiers:
sed "s/ Eukaryota;/.E - Eukaryota;/" temp1 > temp2
- Now it becomes a little more complicated, as SILVA classes mitochondrial and chloroplast SSU sequences as subclasses of bacteria. However, there is a neat little trick we can use. First we do the same with the bacterial sequences as with the archaeal and eukaryote:
sed "s/ Bacteria;/.B - Bacteria;/" temp2 > temp1
- Now, we can use two a little more complicated commands to annotate the mitochondrial and chloroplast sequences:
sed "s/\.B - \(Bacteria;.*;[Mm]itochondria;\)/.M - \1/" temp1 > temp2
sed "s/\.B - \(Bacteria;.*;[Cc]hloroplast;\)/.C - \1/" temp2 > temp1
- We also need to get “rid” of the unclassified sequences, by assigning them to the “other” origin (O):
sed "s/ Unclassified;/.O - Unclassified;/" temp1 > temp2
- We begin with the archaeal sequences. To get those straight, we type:
- That wasn’t too complicated, was it? We can now check the number of different sequences in the file by typing the pretty complicated command:
grep ">" temp2 | cut -f 1 -d " " | rev | cut -f 1 -d "." | sort | uniq -c
If you have been working with the same files as me, you should now see the following numbers:
- At this stage, we need to remove the full taxonomy from the FASTA headers, as Metaxa cannot handle species names of this length. We do this by typing:
sed "s/ - .*;/ - /" temp2 > temp1
- We can now change the temp-file into a FASTA file, and delete the other temp-file:
mv temp1 SSURef.fasta
- We now need to configure Metaxa to use the database. First, we format a BLAST-database from the FASTA-file we just created:
formatdb -i SSURef.fasta -t "SSURef Metaxa DB" -o T -p F
- With that done, we can now run Metaxa using this database instead of the classification database that comes with the program. By specifying that we want to guess the species origin of sequences, we can get (as accurate as SILVA lets us be) which species each sequence in our set come from. We do this by using the
metaxa -i test.fasta -d SSURef.fasta -o TEST --guess_species T --cpu 2
The input in this case was the test file that comes with Metaxa. Note also that we’re using two CPUs to get multithreaded speeds. Remember that you must provide the full (or relative) path to the database files we just created, if you are not running Metaxa from the same directory as the database resides in.
- The output should now look like this (taken from the bacterial file):
>coryGlut_Bielefeld_dna Bacterial 16S SSU rRNA, best species guess: Corynebacterium glutamicum
>gi|116668568:792344-793860 Bacterial 16S SSU rRNA, best species guess: Arthrobacter sp. J3.40
>gi|117927211:c1399163-1397655 Bacterial 16S SSU rRNA, best species guess: Acidothermus cellulolyticus
And so on. As you can see the species names are now located at the end of each definition line, and can easily be extracted using sed, e.g. “
grep ">" TEST.bacteria.fasta | sed "s/.*: //"“.
And that’s it. It’s pretty simple, and can easily be scripted. In fact, I have already made the bash script for you. That means that the short version is, download the script, download the sequence file from SILVA, move into the directory you have downloaded the file to and run the script by typing:
A few notes at the end. The benefit of using this approach is that we maintain the sorting capabilities, marking of uncertain sequences and error checking of Metaxa, but we don’t have to add another BLAST step after Metaxa has finished. However, as this database we create is a lot bigger than the database that comes with Metaxa, the running time of the classification step will be substantially longer. This is in most cases acceptable, as that time is the same as the time it would have taken to run BLAST on the Metaxa output. It should also be noted that this approach limits Metaxa’s ability of classifying 12S sequences, as there are no such sequences in SILVA. Good luck with classifying your metagenome SSUs (and if you use Metaxa in your research, remember to cite the paper)!