About AlterORF
Introduction to alternate ORFs.
Each gene has six potential reading frames, three on the forward DNA strand and three on the reverse strand (figure). Usually only one of these frames is translated into a protein, because it is associated with a ribosome binding site (RBS) and a start codon (usually ATG in E. coli) and has an open reading frame (ORF) that is terminated by one of the three stop codons. However, extensive ORFs (i.e. potentially encoding at least 100 amino acids) can occur in alternate reading frames, although they are generally not associated with properly positioned RBSs and are probably not translated. These alternate ORFs are surprisingly common, especially in high G+C genomes (e.g. 70% G+C) (read more). Computational programs that automatically annotate genomes can usually identify the correct ORF because it is more likely to follow the average codon usage of the genome in question and usually exhibits other identifiable characteristics of a gene, such as a predicted RBS and appropriate dinucleotide distributions. In addition, its computationally translated product may have a significant BLAST hit to a known protein. However, alternate ORFs are sometimes misannotated as real genes, especially those in frame -1, because this frame shares some computationally identifiable features with the real gene, such as codon usage, amino acid content and percentage of predicted α-helix (read more).
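The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build AlterORF: it assumes the standard genetic code, treats ATG as the only start codon, ignores RBS signals entirely, and numbers frames relative to the input sequence rather than relative to an annotated gene. The function names `reverse_complement` and `find_orfs` are hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_aa=100):
    """Yield (frame, start, end, protein) for ATG-to-stop ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand being
    scanned (the reverse complement for frames -1, -2, -3). The protein
    string excludes the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            residues = []
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None:
                    if codon == "ATG":  # assumption: ATG is the only start
                        start = i
                        residues = ["M"]
                    continue
                aa = CODON_TABLE.get(codon, "X")  # X for ambiguous bases
                if aa == "*":
                    if len(residues) >= min_aa:
                        yield (f"{strand}{offset + 1}", start, i + 3,
                               "".join(residues))
                    start = None
                else:
                    residues.append(aa)
```

Calling `list(find_orfs(genome_sequence))` with the default `min_aa=100` would list every candidate ORF of at least 100 amino acids. Deciding which one is the real gene still requires the additional evidence discussed above (RBS, codon usage, BLAST hits).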
Using AlterORF as a tool to clean up potential genome annotation errors.
Some of the alternate ORFs predicted to be genes by automatic annotation programs are subsequently culled by human curators; however, many escape even expert curation. AlterORF provides a database of such potentially misannotated ORFs. It has warehoused all alternate ORFs in fully sequenced microbial genomes that have significant hits to one or more of the protein features described in CDD, COG, KOG, PFAM, PRK and SMART, where the corresponding annotated genes deposited in the databases have no such hits. In such instances, it is suggested that the alternate ORF, rather than the annotated gene, may be the real gene (example). It is hoped that the AlterORF database will provide a platform of such potential errors that expert curators can review to determine whether the existing annotation should be revised. If the database is successful, these potential misannotations should be corrected, and any such changes will be tracked in future updates as a metric of the success of AlterORF.
Using AlterORF as a tool to find new genes.
Some of the alternate ORFs that have significant protein features (CDD, COG, etc.) are associated with annotated genes that also have significant protein features, making it unclear which is the real gene. Some of these instances are potentially dual-function genes in which both ORFs may be expressed (e.g. ref). AlterORF identifies these possibilities for further computational and experimental investigation.
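The two cases described in this section and the previous one can be summarised as a simple decision rule. The sketch below is only an illustration of that rule; the `OrfPair` record, its field names and the category labels are hypothetical and do not reflect the actual AlterORF schema.

```python
from dataclasses import dataclass, field


@dataclass
class OrfPair:
    """An annotated gene paired with one overlapping alternate ORF (hypothetical record)."""
    gene_id: str
    alt_frame: str
    gene_hits: set = field(default_factory=set)  # significant feature hits of the annotated gene
    alt_hits: set = field(default_factory=set)   # significant feature hits of the alternate ORF


def classify(pair):
    """Apply the decision rule described in the text to one gene/alternate-ORF pair."""
    if pair.alt_hits and not pair.gene_hits:
        # Only the alternate ORF has protein-feature evidence.
        return "possible_misannotation"
    if pair.alt_hits and pair.gene_hits:
        # Both have evidence: unclear which is real, or both may be expressed.
        return "ambiguous_or_dual_function"
    # The alternate ORF has no feature evidence, so it is not of interest here.
    return "no_alternate_evidence"
```

For example, `classify(OrfPair("gene1", "-1", alt_hits={"domain_A"}))` returns `"possible_misannotation"`, flagging the pair for review by a curator.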
Potential role of alternate ORFs in the generation of new proteins.
A particularly exciting direction that can be explored using AlterORF is the search for genes that may have arisen by the capture of alternate ORFs. This could occur if an alternate ORF gains signals for its transcription and translation. It could also occur by recombination or transposition that places an alternate ORF, or a portion of one, inside an existing gene (figure). This would allow the generation of new folds or domains within an existing protein and would leave a molecular fossil of the original gene that travelled with the alternate ORF.
An inspection of all the proteins that can be generated computationally from all alternate ORFs shows that frame -1 alternate ORFs have amino acid compositions and predicted α-helix contents that are similar to those of real genes, and so their products might be expected to fold correctly and escape degradation by proteasomes. Therefore, if they were to be expressed, as outlined above, they might be expected to survive long enough to be subjected to natural selection. On the other hand, the other four alternate frames (+2, +3, -2 and -3) potentially generate proteins that have unusual amino acid compositions and α-helix contents that might promote incorrect folding and thus serve as targets for proteolytic destruction by proteasomes.
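One simple way to make the comparison of amino acid compositions concrete is sketched below. It is not the analysis behind the statements above, only a toy measure: each protein is reduced to its amino acid fractions, and two compositions are compared by summing the absolute differences. The function names are hypothetical, and α-helix prediction is not covered.

```python
from collections import Counter


def composition(protein):
    """Return the fraction of each amino acid in a protein string."""
    counts = Counter(protein)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}


def composition_distance(a, b):
    """Sum of absolute differences between two composition dicts.

    0.0 means identical compositions; 2.0 means no amino acid in common.
    """
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)
```

Under this measure, a frame -1 translation that resembles real genes would give a small distance to the average composition of annotated proteins, whereas translations from frames +2, +3, -2 and -3 would be expected to give larger values.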
It is widely accepted that novel genetic information can be generated by gene duplication followed by divergence of the copies via mutation. However, the original sequence limits the ways in which the copies can subsequently mutate and restricts the evolutionary space that can be explored. In contrast, “captured” alternate ORFs represent a novel source of genetic information that has not previously been subjected to direct selection at the amino acid level, although, of course, they are linked in sequence to the real gene that is subjected to such pressures. Therefore, alternate ORFs can be considered a reservoir of novel genetic information that may play an important role in gene evolution, and the AlterORF database is a useful repository of potential examples of such events.
New microbial genomes will be added to the database periodically.