Microbiology, Metagenomics and Bioinformatics

Johan Bengtsson-Palme, University of Gothenburg | Wisconsin Institute for Discovery

One potential use for Metaxa (paper) is to include it in a pipeline for classification of SSU rRNA in metagenomic data (or other environmental sequencing sets). However, as Metaxa is provided from this site, it only classifies SSUs to the domain level (archaea, bacteria and eukaryotes, with the addition of chloroplasts and mitochondria). It is also able to do some (pretty rough) species guesses using the “--guess_species T” option. An easy solution to implement would be to pass the Metaxa output, e.g. “metaxa_output.bacteria.fasta”, to BLAST, and compare all these sequences to the sequences in e.g. the SILVA or GreenGenes database. There is, however, a way to improve on this, which uses Metaxa’s ability to compare sequences to custom databases. In this tutorial, I will show you how to achieve this.

Before we start, you will of course need to download and install Metaxa and its required software packages (BLAST, HMMER, MAFFT). When you have done this, we can get going with the database customization. In this tutorial, I will use the SILVA database for SSU classification. However, the basic idea should be easily applicable to GreenGenes and other rRNA databases as well.
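
Before going further, it can save some time to check that the required programs are actually found on your PATH. The following is a minimal sketch of such a check. The function name missing_tools is made up for illustration, and the executable names you pass to it are assumptions that depend on how the packages were installed on your system.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed command
# that cannot be found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        # "command -v" succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling it as “missing_tools metaxa formatdb hmmsearch mafft” prints one line for each name that could not be found, and nothing at all if everything is in place.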

  1. Visit SILVA through this link, and download the file named “SSURef_106_tax_silva.fasta.tgz”. The file is pretty big so it may take a while to download it. If you’re running Metaxa on a server, you’ll have to get the SILVA-file to the server somehow.
  2. Unzip and untar the file (on Mac OS X, double-clicking the file does this neatly; on Linux, you can do it on the command line by typing “tar -xvzf SSURef_106_tax_silva.fasta.tgz”). This will give you a FASTA-file.
  3. The FASTA-file needs to be prepared a bit for Metaxa usage. First, we need to give Metaxa identifiers it can understand. Metaxa identifies sequences’ origins by the last character in their identifier, e.g. “>A16379.1.1496.B”. Here, “.B” indicates that this is a bacterial sequence. We are now going to use the unix command sed to process the file and insert the appropriate identifiers.
    1. We begin with the archaeal sequences. To get those straight, we type:
      sed "s/ Archaea;/.A - Archaea;/" SSURef_106_tax_silva.fasta > temp1
      Notice that we direct the output to a temporary file. It is bad practice to replace the input file with the output file, so we work with two temp-files instead.
    2. The next step is also easy. We now find all eukaryote sequences and add E:s to the identifiers:
      sed "s/ Eukaryota;/.E - Eukaryota;/" temp1 > temp2
    3. Now it becomes a little more complicated, as SILVA classes mitochondrial and chloroplast SSU sequences as subclasses of bacteria. However, there is a neat little trick we can use. First we do the same with the bacterial sequences as with the archaeal and eukaryote:
      sed "s/ Bacteria;/.B - Bacteria;/" temp2 > temp1
    4. Now, we can use two slightly more complicated commands to annotate the mitochondrial and chloroplast sequences:
      sed "s/\.B - \(Bacteria;.*;[Mm]itochondria;\)/.M - \1/" temp1 > temp2
      sed "s/\.B - \(Bacteria;.*;[Cc]hloroplast;\)/.C - \1/" temp2 > temp1
    5. We also need to get “rid” of the unclassified sequences, by assigning them to the “other” origin (O):
      sed "s/ Unclassified;/.O - Unclassified;/" temp1 > temp2
  4. That wasn’t too complicated, was it? We can now check the number of different sequences in the file by typing the pretty complicated command:
    grep ">" temp2 | cut -f 1 -d " " | rev | cut -f 1 -d "." | sort | uniq -c
    If you have been working with the same files as me, you should now see the following numbers:
    23172 A
    471949 B
    3712 C
    55937 E
    534 M
    226 O
  5. At this stage, we need to remove the full taxonomy from the FASTA headers, as Metaxa cannot handle species names of this length. We do this by typing:
    sed "s/ - .*;/ - /" temp2 > temp1
  6. We can now change the temp-file into a FASTA file, and delete the other temp-file:
    mv temp1 SSURef.fasta
    rm temp2
  7. We now need to configure Metaxa to use the database. First, we format a BLAST-database from the FASTA-file we just created:
    formatdb -i SSURef.fasta -t "SSURef Metaxa DB" -o T -p F
  8. With that done, we can now run Metaxa using this database instead of the classification database that comes with the program. By specifying that we want to guess the species origin of sequences, we can find out (as accurately as SILVA lets us) which species each sequence in our set comes from. We do this by using the -d and the --guess_species options:
    metaxa -i test.fasta -d SSURef.fasta -o TEST --guess_species T --cpu 2
    The input in this case was the test file that comes with Metaxa. Note also that we’re using two CPUs to get multithreaded speeds. Remember that you must provide the full (or relative) path to the database files we just created, if you are not running Metaxa from the same directory as the database resides in.
  9. The output should now look like this (taken from the bacterial file):
    >coryGlut_Bielefeld_dna Bacterial 16S SSU rRNA, best species guess: Corynebacterium glutamicum
    CGAACGCTG...
    >gi|116668568:792344-793860 Bacterial 16S SSU rRNA, best species guess: Arthrobacter sp. J3.40
    TGAACGCTG...
    >gi|117927211:c1399163-1397655 Bacterial 16S SSU rRNA, best species guess: Acidothermus cellulolyticus
    CGAACGCTG...

    And so on. As you can see, the species names are now located at the end of each definition line, and can easily be extracted using sed, e.g. “grep ">" TEST.bacteria.fasta | sed "s/.*: //"”.
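
Building on that extraction command, the species guesses can also be counted to get a quick overview of a sample. This is a minimal sketch that assumes the definition lines look like the ones shown in step 9, i.e. that they contain the text “best species guess: ” followed by the name. The function name tally_species_guesses is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count how often each species guess occurs in a
# Metaxa output FASTA file, most frequent guess first.
tally_species_guesses() {
    # Keep only definition lines carrying a species guess and strip
    # everything up to and including the "best species guess: " label,
    # then count identical names and sort by descending count.
    sed -n 's/^>.*best species guess: //p' "$1" | sort | uniq -c | sort -rn
}
```

Run as “tally_species_guesses TEST.bacteria.fasta”, this prints one line per species guess, each preceded by the number of sequences assigned to it.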

And that’s it. It’s pretty simple, and can easily be scripted. In fact, I have already made the bash script for you. That means that the short version is, download the script, download the sequence file from SILVA, move into the directory you have downloaded the file to and run the script by typing: ./prepare_silva_for_metaxa.sh
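
If you would rather see the idea in one place, the header edits from steps 3 and 5 can be chained in a single sed call, which also avoids the two temp-files. Note that this is only a sketch of that idea and not the prepare_silva_for_metaxa.sh script itself. The function name relabel_silva_headers is made up for illustration, and the expressions are simply the ones used in the tutorial, applied in the same order.

```shell
#!/bin/sh
# Hypothetical helper: apply the tutorial's header edits to a SILVA
# FASTA file in one pass and write the result to standard output.
# The expressions run in order for every line:
#   1-3  tag archaeal (.A), eukaryote (.E) and bacterial (.B) entries
#   4-5  re-tag bacterial entries that are mitochondria (.M) or chloroplasts (.C)
#   6    tag unclassified entries as "other" (.O)
#   7    strip the full taxonomy, keeping only the last name
relabel_silva_headers() {
    sed \
        -e 's/ Archaea;/.A - Archaea;/' \
        -e 's/ Eukaryota;/.E - Eukaryota;/' \
        -e 's/ Bacteria;/.B - Bacteria;/' \
        -e 's/\.B - \(Bacteria;.*;[Mm]itochondria;\)/.M - \1/' \
        -e 's/\.B - \(Bacteria;.*;[Cc]hloroplast;\)/.C - \1/' \
        -e 's/ Unclassified;/.O - Unclassified;/' \
        -e 's/ - .*;/ - /' \
        "$1"
}
```

Used as “relabel_silva_headers SSURef_106_tax_silva.fasta > SSURef.fasta”, it should give the same file as the step-by-step version, after which the formatdb command from step 7 can be run as before.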

A few notes at the end. The benefit of using this approach is that we maintain the sorting capabilities, marking of uncertain sequences and error checking of Metaxa, but we don’t have to add another BLAST step after Metaxa has finished. However, as the database we create is a lot bigger than the database that comes with Metaxa, the running time of the classification step will be substantially longer. This is in most cases acceptable, as that time is roughly the time it would have taken to run BLAST on the Metaxa output. It should also be noted that this approach limits Metaxa’s ability to classify 12S sequences, as there are no such sequences in SILVA. Good luck with classifying your metagenome SSUs (and if you use Metaxa in your research, remember to cite the paper)!

It seriously worries me that a number of recent indications suggest that the heavy use of antibiotics drives not only antibiotic resistance development, but also the development of more virulent and aggressive strains of pathogenic bacteria. First, the genome sequencing of the E. coli strain that caused the EHEC outbreak in Germany in May revealed not only antibiotic resistance genes, but also that the strain is able to make Shiga toxin, which causes the severe diarrhoea and kidney damage related to the haemolytic uremic syndrome (HUS). The genes encoding the Shiga toxin are not originally bacterial genes, but instead seem to originate from phages. When E. coli gets infected with a Shiga toxin-producing phage, it becomes a human pathogen [1]. David Acheson, managing director for food safety at consulting firm Leavitt Partners, says that exposure to antibiotics might be enhancing the spread of Shiga toxin-producing phage. Some antibiotics trigger what is referred to as the SOS response, which induces the phage to start replicating. The replication of the phage causes the bacteria to burst, releasing the phages, and with them the toxin [1].

Second, there is apparently an ongoing outbreak of scarlet fever in Hong Kong. Kwok-Yung Yuen, microbiologist at the University of Hong Kong, has analyzed the draft sequence of the genome, and suggests that the bacteria acquired greater virulence and drug resistance by picking up one or more genes from bacteria in the human oral and urogenital tracts. He believes that the overuse of antibiotics is driving the emergence of drug resistance in these bacteria [2].

Now, both of these cases are just indications, but if they hold true, that would be an alarming development, where the use of antibiotics promotes not only the spread of resistance genes, impairing our ability to treat bacterial infections, but also the development of far more virulent and aggressive strains. Combining increasing untreatability with increasing aggressiveness seems to me like the ultimate weapon against our relatively high standards of treatment of common infections. Good thing hand hygiene still seems to help [3].

References

  1. Phage on the rampage (http://www.nature.com/news/2011/110609/full/news.2011.360.html), Published online 9 June 2011, Nature, doi:10.1038/news.2011.360
  2. Mutated Bacteria Drives Scarlet Fever Outbreak (http://news.sciencemag.org/scienceinsider/2011/06/mutated-bacteria-drives-scarlet.html?etoc&elq=cd94aa347dca45b3a82f144b8213e82b), Published online 27 June 2011.
  3. Luby SP, Halder AK, Huda T, Unicomb L, Johnston RB (2011) The Effect of Handwashing at Recommended Times with Water Alone and With Soap on Child Diarrhea in Rural Bangladesh: An Observational Study. PLoS Med 8(6): e1001052. doi:10.1371/journal.pmed.1001052 (http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.1001052)

So Metaxa has gone into the wild, which means that I have started to get feedback from users using it in ways I had not foreseen. This is the best and the worst thing about having your software exposed to real-world usage; it makes it possible to improve it in a variety of ways, but it also gives you severe headaches at times. Luckily, I could fix a smaller bug in the Metaxa code within a matter of hours and issue an update to version 1.0.2. The interesting thing here was that I would never have discovered the bug myself, as I never would have called the Metaxa program in the way required for the bug to happen. But once I saw the command given, and the output, which the user kindly sent me, I pretty quickly realized what was wrong, and how to fix it. Therefore, I would like to ask all of you who use Metaxa to send me your questions, problems and bug reports. The feedback is highly appreciated, and I can (at least currently) promise to issue fixes as fast as possible. We are really committed to making Metaxa work for everyone.

If you have suggestions for improvements, those are welcome as well (though it will take significantly more time to implement new features than to fix bugs). I am currently compiling a FAQ, and all questions are welcome. Finally, I would like to thank everybody who has downloaded and tried the Metaxa package. I can see in the server logs that there are quite a few of you, which of course makes us happy.

I was informed by a Metaxa user of a bug in the current Metaxa version (1.0.1). This bug caused problems when Metaxa output was directed to a directory other than the one Metaxa was run from. I have fixed this issue as fast as I could, as it could cause problems when Metaxa is included in larger analysis pipelines. The update to 1.0.2 is therefore strongly recommended for all Metaxa users. The update also introduces better handling of input files created in Windows environments, and improves the handling of extremely long sequence identifiers. The update can be downloaded using this link.

New features:
  • Improves import of sequence sets from Windows environments.
Fixed bugs:
  • Fixed a bug causing trouble with sequences with extremely long identifiers.
  • Fixed an output-related bug causing problems with output directed to another directory.

It is a pleasure to announce that the paper on Metaxa is now available as an Online early article in Antonie van Leeuwenhoek. In short, the paper describes a software tool that is able to extract small subunit (SSU) rRNA sequences from large data sets, such as metagenomes and environmental PCR libraries, and classify them according to bacterial, archaeal, eukaryote, chloroplast or mitochondrial origin. The program makes it easy to distinguish between e.g. the bacterial SSU sequences you would like to analyze, and the SSU sequences you would like to remove prior to the analysis (e.g. mitochondrial and chloroplast sequences). This task is particularly important in metagenomics, where sequences can potentially derive from a variety of origins, but bacterial diversity is often the desired target for analysis. The software can be downloaded here, and the article can be read here. I would like to thank all the co-authors on this paper for a brilliant collaboration, and hope to be working with them again.

A random sample of things from this week’s scientific news I think are worth sharing:

Britain is apparently shutting down many of its climate change outreach efforts. I find this very saddening, and see it as an indication of our extreme short-sightedness. We need to put more effort and funding into preserving the environment – not less. In addition, the economic benefits of taking care of the nature around us will probably be much larger than the small sums we save in the short term by not doing anything. We clearly need better incentives to look beyond the next budget and the next election.

The editorial of Nature Reviews Microbiology shines a light on the need for research within basic microbiology, pointing out that “the functions of many genes in the genomes of even the best studied organisms, such as Escherichia coli and Bacillus subtilis, remain unknown. Often these genes do not resemble other, characterized, genes in the databases, allowing for the possibility that interesting new pathways remain to be discovered. (…) if we want to understand how life works at the molecular level, it is crucial to continue and expand basic microbiology research.” I would like to add that a more complete understanding of at least one model organism would drastically increase the accuracy of genome (and metagenome) annotation in new sequencing projects, which today is patchy, to say the least.

Just a short note; Metaxa has been updated to version 1.0.1. This incremental version brings two small new features, and a minimal bug fix.

  • Added the option to select whether HMMER’s heuristic filtering should be used or not. This can be configured using the --heuristics option:
    --heuristics {T or F} : Selects whether to use HMMER’s heuristic filtering, off (F) by default
  • Removed some redundant information written to the screen, as output to the screen was a bit cluttered.

Bug fix:

  • Fixed a rare bug affecting detection sensitivity of some SSU sequences.

Of course I would recommend it to every Metaxa user as it fixes a small bug, but the update is not in any way critical for normal use. The updated version can be downloaded using this link.

I proudly announce that today Metaxa has been officially released. Metaxa is a software tool for automated detection and discrimination among ribosomal small subunit (12S/16S/18S) sequences of archaea, bacteria, eukaryotes, mitochondria, and chloroplasts in metagenomes and environmental sequence datasets. We have been working on Metaxa for quite some time, and it has now been in beta for about two months. It now seems to be stable enough for public consumption. In addition, the software package is presented today in a talk at the SocBiN conference in Helsinki.

A more thorough post on the rationale behind Metaxa, and how it works, will follow when I am no longer occupied by the SocBiN conference. A paper on Metaxa is to be published in the journal Antonie van Leeuwenhoek. The software can be downloaded from here.

For those of you who are not already fed up with my writings on biology stuff on the web site, two opportunities to hear me talk in real life have popped up in May. The first is already on May 2nd, on the Open Day in Life Sciences, arranged by the Science Faculty at the University of Gothenburg. I will talk about the search for detoxification systems in metagenomic sequence data (from a collections point of view, as that is the theme for the day). There will also be an opportunity to be guided through the herbarium and the botanical garden, plus lunch and an optional after-work drink at Botaniska Paviljongen. But hurry, the last day for registration is tomorrow! Register here.

The second opportunity will be at the SocBiN-2011 bioinformatics conference in Helsinki, on the 12th of May. I will present in the session called “Bioinformatics of Metagenomics”, and talk about a software tool for rRNA classification. I really look forward to this conference; there are a number of highly prominent and interesting speakers, and I have heard that Helsinki in May is very beautiful. Besides, I am going there with extremely nice people, so it may well add up to the best biology venue I will attend this spring.

So, last week I started my Ph.D. in Joakim Larsson’s group at the Sahlgrenska Academy. While I am very happy about how things have evolved, I will also miss the ecotox group and the functional genomics group a lot (though both do their research within 10 minutes walking distance from my new place…) I spent last week getting through the usual administrative hassle; getting keys and cards, signing papers, installing bioinformatics software on my new monster of a computer etc. Slowly, the new room starts to feel like it is mine (after nailing phylogenetic trees, my favorite map of the amino acids, and my remember-why-Cytoscape-visualisation-might-not-be-a-good-idea-for-all-network-like-structures poster to the billboard).

So what will this change of positions mean? Will I quit doing research on microbial communities? Of course not! In my new position, my subject of investigation will be bacterial communities subjected to antibiotics. We will look for resistance genes in such communities, and try to answer questions like: How does a high antibiotic selection pressure affect the abundance of resistance genes and mobile elements that could facilitate their transfer between bacteria? Can resistance genes found in environmental bacteria be transferred to the microbes of the human gut? Can the environmental bacteria tell us which resistance genes will be present in clinical situations in the near future? All these questions could, at least partially, be answered by metagenomic approaches and good bioinformatics tools, and my role will be to come up with the solutions that provide answers to them.

I am excited about this new project, which involves my favorite subject – metagenomics and community analysis – as well as important factors, such as the clinical connections, the possibility to add pieces to the antibiotic resistance puzzle, and the role of gene and species transfer in resistance development. I also like the fact that I will need to handle high-throughput sequence data, meaning that there will be many opportunities to develop tools, a task I highly enjoy. I think the next couple of years will be an exciting time.