You might remember that I a long time ago promised a minor update to Megraft. I then forgot about actually posting the update. So it’s very much about time, the updated 1.0.2 version of Megraft. The new thing in this version is improved handling of sequences with N’s (unknown bases) in them, and improved handling of sequences with strange sequence IDs (which sometimes have confused Megraft 1.0.1). The update can be downloaded here.
Some users have asked me to fix a table output bug in Metaxa, and I have finally got around to do so. The fix is released today in the 1.1.2 Metaxa package (download here). This version also brings an updated manual (finally), as the User’s Guide has lagged behind since version 1.0. Please continue to report bugs to metaxa [at sign] microbiology [dot] se
Bengtsson J, Hartmann M, Unterseher M, Vaishampayan P, Abarenkov K, Durso L, Bik EM, Garey JR, Eriksson KM, Nilsson RH: Megraft: A software package to graft ribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes. Research in Microbiology. Volume 163, Issues 6–7 (2012), 407–412, doi: 10.1016/j.resmic.2012.07.001. [Paper link]
Megraft is currently at version 1.0.1, but I have a slightly updated version in the pipeline which will be made available later this fall.
Yesterday, our paper on Megraft – a software tool to graft ribosomal small subunit (16S/18S) fragments onto full-length SSU sequences – became available as an accepted online early article in Research in Microbiology. Megraft is built upon the notion that when examining the depth of a community sequencing effort, researchers often use rarefaction analysis of the ribosomal small subunit (SSU/16S/18S) gene in a metagenome. However, the SSU sequences in metagenomic libraries generally are present as fragmentary, non-overlapping entries, which poses a great problem for this analysis. Megraft aims to remedy this problem by grafting the input SSU fragments from the metagenome (obtained by e.g. Metaxa) onto full-length SSU sequences. The software also uses a variability model which accounts for observed and unobserved variability. This way, Megraft enables accurate assessment of species richness and sequencing depth in metagenomic datasets.
The algorithm, efficiency and accuracy of Megraft is thoroughly described in the paper. It should be noted that this is not a panacea for species richness estimates in metagenomics, but it is a huge step forward over existing approaches. Megraft shares some similarities with EMIRGE (Miller et al., 2011), which is a software package for reconstruction of full-length ribosomal genes from paired-end Illumina sequences. Megraft, however, is set apart in that it has a strong focus on rarefaction, and functions also when the number of sequences is small, which is often the case in 454 and Sanger-based metagenomics studies. Thus, EMIRGE and Megraft seek to solve a roughly similar problem, but for different sequencing technologies and sequencing scales.
Bengtsson, J., Hartmann, M., Unterseher, M., Vaishampayan, P., Abarenkov, K., Durso, L., Bik, E.M., Garey, J.R., Eriksson, K.M., Nilsson R.H. (2012). Megraft: A software package to graftribosomal small subunit (16S/18S) fragments onto full-length sequences for accurate species richness and sequencing depth analysis in pyrosequencing-length metagenomes and similar environmental datasets. Research in Microbiology, doi: 10.1016/j.resmic.2012.07.001.
- Miller, C. S., Baker, B. J., Thomas, B. C., Singer, S. W., & Banfield, J. F. (2011). EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data. Genome Biology, 12(5), R44. doi:10.1186/gb-2011-12-5-r44
I realized that I have been using a newer version of Metaxa than most of you for the last couple of months. This bug fix was written sometime in February or March, and we have kept it internal to make sure it works as it should. Then other things came across and we never got around to actually release it. But with testing passed and upcoming versions of Metaxa in the pipeline, I think it is about time that everyone gets their hands on the latest Metaxa version.
It’s only two small things this time:
- Slight tweaks to the new HMM scoring system, making Metaxa just a little bit faster
- Fixed a rarely occurring bug causing the –heuristics options to be ignored in certain circumstances
For the last months I have been (part time) struggling with getting Metaxa to eat Illumina paired-end data. This is a pretty tricky task, mainly due to the fact that Illumina reads are so much shorter than those obtained by Sanger and 454 sequencing. Therefore, I am more than happy to inform the community that today (the day before I go on vacation) I have a working prototype up and running. In fact, calling it a prototype is unfair, it is a quite far gone piece of software by now. Currently, I am running it on test data sets, and I will try to keep it running over the next couple of weeks. Thereafter, I hope to be able to release it sometime this autumn (but don’t expect a September release!), harnessing the power of Illumina sequencing for SSU identification. Stayed tuned, and have a great summer!
The guys at Pfam recently introduced a new database, called AntiFam, which will provide HMM profiles for some groups of sequences that seemingly formed larger protein families, although they were not actually real proteins. For example, rRNA sequences could contain putative ORFs, that seems to be conserved over broad lineages; with the only problem being that they are not translated into proteins in real life, as they are part of an rRNA .
With this initiative the Xfam team wants to “reduce the number of spurious proteins that make their way into the protein sequence databases.” I have run into this problem myself at some occasions with suspicious sequences in GenBank, and I highly encourage this development towards consistency and correctness in sequence databases. It is of extreme importance that databases remain reliable if we want bioinformatics to tell us anything about organismal or community functions. The Antifam database is a first step towards such a cleanup of the databases, and as such I would like to applaud Pfam for taking actions in this direction.
To my knowledge, GenBank are doing what they can with e.g. barcoding data (SSU, LSU, ITS sequences), but for bioinformatics and metagenomics (and even genomics) to remain viable, these initiatives needs to come quickly; and automated (but still very sensitive) tools for this needs to get our focus immediately. For example, Metaxa  could be used as a tool to clean up SSU sequences of misclassified origin. More such tools are needed, and a lot of work remains to be done in the area of keeping databases trustworthy in the age of large-scale sequencing.
- Tripp, H. J., Hewson, I., Boyarsky, S., Stuart, J. M., & Zehr, J. P. (2011). Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies. Nucleic Acids Research, 39(20), 8792–8802. doi:10.1093/nar/gkr576
- Bengtsson, J., Eriksson, K. M., Hartmann, M., Wang, Z., Shenoy, B. D., Grelet, G.-A., Abarenkov, K., et al. (2011). Metaxa: a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequencing datasets. Antonie van Leeuwenhoek, 100(3), 471–475. doi:10.1007/s10482-011-9598-6
One thing that I find slightly annoying is when people do not get the basic concepts right – or when debatable concepts are used without discussion of their implications. This further annoys me when it is done by senior scientists, who should know better. Sometimes, I guess this happens out of ignorance, and sometimes to be able to stick your subject to a certain buzzword concept. Neither is good, even though the former reason is little more forgivable then the latter. One area where this problem becomes agonizingly evident is when molecular biologists or medical scientists moves into ecology, as has happened with the advent of metagenomics. When the study of the human gut microflora turned into a large-scale sequencing effort, people who had previously studied bacteria grown on plates started facing a world of community ecology. However, I get the impression that way too often these people do not ask ecologists for advice, or even read up on the ecological literature. Which, I suppose, is the reason why medical scientists can talk about how the human gut microflora can “evolve” into a stable community a couple years after birth, even though words such as “development” or “succession” would be much more accurate to describe this change.
The marker gene flaw
To set what I mean straight, let us compare the human gut to a forest. If an open field is left to itself, larger plants will slowly inhabit it, and gradually different species will replace each other, until we have a fully developed forest. Similarly, the human gut microflora is at birth rather unstable, but stabilizes relatively quickly and within a few years we have a microbial community with “adult-like” characteristics. To arrive at this conclusion, scientists generally use the 16S (small sub-unit) genetic marker to study the bacterial species diversity. This works in pretty much the same way as going out into the forest and count trees of different kinds.
Now, if I went out into the forest once and counted the tree species, waited for 50 years and then did the same thing again, I would presumably see that the forest species composition had changed. However, if I called this “evolution”, fellow scientists would laugh at me. Raspberry bushes do not evolve into birches, and birches do not evolve into firs. Instead, ecologists talk about “succession”; a progressive transformation of a community, going on until a stable community is formed. The concept of succession seems well-suited also to describe what is happening in the human gut, and should of course also be used in that setting. The most likely driver of the functional community changes is not that some bacterial species have evolved new functions, but rather that bacterial species performing these new functions have outcompeted the once previously present.
In fact, I would argue that it is impossible to study evolution through a genetic marker such as the 16S gene (except in the rare case when you study evolution of the 16S gene itself). Instead, the only thing we could assess using a marker gene is how the copy number of the different gene variants change over time (or space, or conditions). The copy number tells us about the species composition of the community at a given time, which can be used to measure successional changes. However, evolutionary changes would require heritable changes in the characteristics of biological populations, i.e. that their genetic material change in some way. Unless that change happens in the marker gene of choice, we cannot measure it, and the alterations of composition we measure will only reflect differences in species abundances. These differences might have arisen from genetic (i.e. evolutionary) changes, but we cannot assess that.
What are we studying with metagenomics?
This brings us to the next problem, which is not only a problem of semantics and me getting annoyed, but a problem with real implications. What are we really studying using metagenomics? When we apply an environmental sequencing approach to a microbial community, we get a snapshot of the genetic material at a given time and site; at specific conditions. Usually, we aim to characterize the community from a taxonomic or functional perspective, and we often have some other community which we want to compare to. However, if we only collect data from different communities at one time point, or if we only study a community before and after exposure, we have no way of telling if differences stem from selective pressures or from more a random succession progress. As most microbial habitats are not as well studied as the human gut, we know little about microbial community assembly and succession.
Also, in ecology a disturbance to a particular community is generally considered as a starting point for a new succession process. This process may, or may not, return the community to the same stable state. However, if the disturbance was of permanent nature, the new community will have to adapt to the new conditions, and the stable state will likely not have the same species distribution. Such an adaption could be caused by genetic changes (which would clearly be an evolutionary process), or by simple replacement of sensitive species with tolerant ones. The latter would be a selective process, but not necessarily an evolutionary one. If the selection does not alter the genetic material within the species, but only the species composition, I would argue that this is also a case of succession.
Complications with resistance
This complicates the work with metagenomic data. If we study antibiotic resistance genes, and say that bacteria in an environment have evolved antibiotic resistance, we base that assertion on that genes responsible for resistance have either evolved within the present bacteria, or have (more likely) been transferred into the genomes of the bacteria via horizontal gene transfer. However, if the resistance profile we see is simply caused by a replacement of sensitive species with resistant ones, we have not really discovered something new evolving, but are only witnessing spread of already resistant bacteria. In the gut, this would be a problem by itself, but say that we do the same study in the open environment. We already know that environmental bacteria have contained resistance genes for ages, so the real threat to human health here would be a spread from naturally resistant bacteria to human pathogens. However, as mentioned earlier, without extremely well thought-through methodology we cannot really see such transmissions of resistance genes. Here, the search for mobile elements, and large-scale takes on community composition vs. resistance profiles in contaminated and non-polluted areas can play a huge role in shedding light on the question of spreading. However, this will require larger and better planned experiments using metagenomics than what is generally performed at the moment. The questions of microbial community assembly, dispersal, succession and adaption are still largely unanswered, and our metagenomic and environmental sequencing approaches have just started to tinker around with the lid of the jar.
I am extremely happy to announce that Metaxa 1.1 (first announced back in July) has finally left the beta stage, and is now designated as a feature complete 1.1 update. We consider this update stable for production use. The 1.1 update utilize hmmsearch instead of hmmscan for higher extraction speeds and better accuracy. This clever trick was inspired by a blog post by HMMER’s creator Sean Eddy on hmmscan vs hmmsearch (http://selab.janelia.org/people/eddys/blog/?p=424). As the speedup comes from the extraction step, the speed increase will be largest for huge data sets with only a small proportion of actual SSU sequences (typically large 454 metagenomes).
What took so long, you might ask, as I promised an imminent release already in August. Well, during testing a difference in scoring was discovered. This difference did not have any implications for long sequences (> ~350 bp), but caused Metaxa to have problems on short reads (most evident on ~150 bp and shorter). Therefore, the scoring system had to be redesigned, which in turn required more extensive testing. Now, however, Metaxa 1.1 has a fine-tuned scoring system, which by default is based on scores instead of E-values, and in some instances have even better detection accuracy than the old Metaxa version. We encourage everyone to try out this new version of Metaxa (although the 1.0.2 version will remain available for download). It should be bug free, but we cannot ensure 100% compatibility in all usage scenarios. Therefore, we are happy if you report any bugs or inconsistencies to the e-mail address: metaxa (at] microbiology [dot) se.
The new version of Metaxa can be downloaded here: https://microbiology.se/software/metaxa/ Please note that the manual has not yet been updated yet, so use the help feature for the up-to-date options. Happy SSU detecting!
One potential use for Metaxa (paper) is to include it in a pipeline for classification of SSU rRNA in metagenomic data (or other environmental sequencing sets). However, as Metaxa is provided from this site, it only classifies SSUs to the domain level (archaea, bacteria and eukaryotes, with the addition of chloroplasts and mitochondria). It is also able to do some (pretty rough) species guesses using the “
--guess_species T” option. An easy solution to implement would be to pass the Metaxa output, e.g. “metaxa_output.bacteria.fasta” to BLAST, and compare all these sequences to the sequences in e.g. the SILVA or GreenGenes database. There is, however, a way to improve this, which uses Metaxa’s ability to compares sequences to custom databases. In this tutorial, I will show you how to achieve this.
Before we start, you will of course need to download and install Metaxa, and its required software packages (BLAST, HMMER, MAFFT). When you have done this, we can get going with the database customization. I will in this tutorial use the SILVA database for SSU classification. However, the basic idea for the tutorial should be easily applicable to GreenGenes and other rRNA databases as well.
- Visit SILVA through this link, and download the file named “SSURef_106_tax_silva.fasta.tgz”. The file is pretty big so it may take a while to download it. If you’re running Metaxa on a server, you’ll have to get the SILVA-file to the server somehow.
- Unzip and untar the file (Mac OS X makes this neatly by doubleclicking the file, on linux you can do it on the command line by typing “
tar -xvzf SSURef_106_tax_silva.fasta.tgz“). This will give you a FASTA-file.
- The FASTA-file needs to be prepared a bit for Metaxa usage. First, we need to give Metaxa identifiers it can understand. Metaxa identifies sequences’ origins by the last character in their identifier, e.g. “>A16379.1.1496.B”. Here, “.B” indicates that this is a bacterial sequence. We are now going to use the unix command sed to process the file and insert the appropriate identifiers.
- We begin with the archaeal sequences. To get those straight, we type:
sed "s/ Archaea;/.A - Archaea;/" SSURef_106_tax_silva.fasta > temp1
Notice that we direct the output to a temporary file. It is bad practice to replace the input file with the output file, so we work with two temp-files instead.
- The next step is also easy, now we find all eukaryote sequences and add E:s to the identifiers:
sed "s/ Eukaryota;/.E - Eukaryota;/" temp1 > temp2
- Now it becomes a little more complicated, as SILVA classes mitochondrial and chloroplast SSU sequences as subclasses of bacteria. However, there is a neat little trick we can use. First we do the same with the bacterial sequences as with the archaeal and eukaryote:
sed "s/ Bacteria;/.B - Bacteria;/" temp2 > temp1
- Now, we can use two a little more complicated commands to annotate the mitochondrial and chloroplast sequences:
sed "s/\.B - \(Bacteria;.*;[Mm]itochondria;\)/.M - \1/" temp1 > temp2
sed "s/\.B - \(Bacteria;.*;[Cc]hloroplast;\)/.C - \1/" temp2 > temp1
- We also need to get “rid” of the unclassified sequences, by assigning them to the “other” origin (O):
sed "s/ Unclassified;/.O - Unclassified;/" temp1 > temp2
- We begin with the archaeal sequences. To get those straight, we type:
- That wasn’t too complicated, was it? We can now check the number of different sequences in the file by typing the pretty complicated command:
grep ">" temp2 | cut -f 1 -d " " | rev | cut -f 1 -d "." | sort | uniq -c
If you have been working with the same files as me, you should now see the following numbers:
- At this stage, we need to remove the full taxonomy from the FASTA headers, as Metaxa cannot handle species names of this length. We do this by typing:
sed "s/ - .*;/ - /" temp2 > temp1
- We can now change the temp-file into a FASTA file, and delete the other temp-file:
mv temp1 SSURef.fasta
- We now need to configure Metaxa to use the database. First, we format a BLAST-database from the FASTA-file we just created:
formatdb -i SSURef.fasta -t "SSURef Metaxa DB" -o T -p F
- With that done, we can now run Metaxa using this database instead of the classification database that comes with the program. By specifying that we want to guess the species origin of sequences, we can get (as accurate as SILVA lets us be) which species each sequence in our set come from. We do this by using the
metaxa -i test.fasta -d SSURef.fasta -o TEST --guess_species T --cpu 2
The input in this case was the test file that comes with Metaxa. Note also that we’re using two CPUs to get multithreaded speeds. Remember that you must provide the full (or relative) path to the database files we just created, if you are not running Metaxa from the same directory as the database resides in.
- The output should now look like this (taken from the bacterial file):
>coryGlut_Bielefeld_dna Bacterial 16S SSU rRNA, best species guess: Corynebacterium glutamicum
>gi|116668568:792344-793860 Bacterial 16S SSU rRNA, best species guess: Arthrobacter sp. J3.40
>gi|117927211:c1399163-1397655 Bacterial 16S SSU rRNA, best species guess: Acidothermus cellulolyticus
And so on. As you can see the species names are now located at the end of each definition line, and can easily be extracted using sed, e.g. “
grep ">" TEST.bacteria.fasta | sed "s/.*: //"“.
And that’s it. It’s pretty simple, and can easily be scripted. In fact, I have already made the bash script for you. That means that the short version is, download the script, download the sequence file from SILVA, move into the directory you have downloaded the file to and run the script by typing:
A few notes at the end. The benefit of using this approach is that we maintain the sorting capabilities, marking of uncertain sequences and error checking of Metaxa, but we don’t have to add another BLAST step after Metaxa has finished. However, as this database we create is a lot bigger than the database that comes with Metaxa, the running time of the classification step will be substantially longer. This is in most cases acceptable, as that time is the same as the time it would have taken to run BLAST on the Metaxa output. It should also be noted that this approach limits Metaxa’s ability of classifying 12S sequences, as there are no such sequences in SILVA. Good luck with classifying your metagenome SSUs (and if you use Metaxa in your research, remember to cite the paper)!