Here are the problems associated with the sequences given in the exercises:
- This protein is actually a spurious ORF within the 18S ribosomal RNA (rRNA) gene. Because it sits in a conserved non-coding region, it is unfortunately perceived as a conserved protein, although the protein has no function and is only conserved because of the rRNA gene. I don’t know where this error originated, but it has obviously propagated through misannotations in a number of sequencing projects, and there now exist a great number of conserved “senescence-associated protein” entries. Because some study assigned a putative function to the (probably non-existent) protein, the problem has only been further exacerbated.
- The problem here is not really the annotation per se but the interpretation of it. The sequence in the example is the gyrA gene, a ubiquitous bacterial gene involved in DNA replication. This gene is one of the targets of fluoroquinolone antibiotics, and mutations at certain positions in it can therefore render bacteria resistant. However, when mapping short reads to this sequence (as in the exercise), the reads will map to all parts of the sequence, including those that have nothing to do with resistance and are identical between resistant and sensitive variants. Therefore, you cannot really tell from typical mapping data whether this is a resistant variant or not. (In addition, resistance mutations are context-dependent, so which species carries them also matters, for example. But that is a much more complicated issue.) There is actually software that can distinguish between mutated and wild-type variants, such as the Mumame tool we have developed in my lab.
- In this case, the problem is that the protein consists of two conserved protein domains. One of the domains gets better scores in the BLAST report, and therefore “took over” the annotation. Unfortunately, this domain is shared by many different sequence types (in this case it is a Forkhead-associated domain, a phosphopeptide-binding module common to many proteins), but it has very little to do with the protein’s function. Instead, the functional domain in the example is ABC_tran, the ATP-binding domain of ABC transporters, which unfortunately ends up much further down in the BLAST search because the first domain has a higher degree of conservation.
- This protein is simply misannotated. This is obviously a Pseudomonas protein, and the Klebsiella sequencing project that submitted this sequence has made an error. This can easily be spotted by clicking the “identical sequences” button in the NCBI Protein database.
- Yes, this annotation does make sense, or at least it may! This DNA sequence encodes an antibiotic resistance gene carried on a plasmid, a mobile genetic element that has been found in a range of (actually quite distantly related) species. So while this could be an error, that is not necessarily the case. This shows how tricky it can be to spot errors and how context-dependent functional annotation is! (And why it is complicated to do annotation on a large scale.)
- This is another blatant annotation error in GenBank. The sequence in question here is actually not a 16S rRNA gene, but (part of) the rpoB gene, another ubiquitous gene in bacteria which has also been proposed for species barcoding (but is not commonly used). Unfortunately, there are several such examples in GenBank, for example: