Frequently Asked Questions
How is an Open Reading Frame defined?
An open reading frame (ORF) is a segment of a genome that
potentially encodes a protein and is bounded by a start codon
(usually ATG) and a stop codon (TAG, TGA or TAA). In prokaryotic
genomes each ORF usually has a ribosome binding site near the ATG
start codon.
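The definition above can be illustrated with a minimal sketch that scans the three forward frames of a sequence for segments running from a start codon to the first in-frame stop codon. The function name `find_orfs` and its parameters are hypothetical and are not part of AlterORF; only ATG is treated as a start codon here, and ribosome binding sites are not considered.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(seq, start_codon="ATG", min_len=0):
    """Return (start, end) pairs (0-based, end-exclusive) for ORFs on one strand.

    Each ORF begins at the first start codon seen in a frame and ends with
    the first in-frame stop codon that follows it (stop codon included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

For the toy sequence `CCATGAAATAGGG`, the only segment found is `ATGAAATAG`, at positions 2 to 11.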
How is an Alternate Open Reading Frame defined?
An alternate open reading frame (ORF) is one of the five alternate
ORFs of a gene where the coding sequence for the gene resides in
frame +1 (see Figure 1). Many such alternate ORFs are very short and
unlikely to encode proteins. As a working definition for the
construction of the AlterORF database only alternate ORFs with 300
or more nucleotides uninterrupted by a stop codon were accepted for
analysis. This is sufficient to encode a protein of 100 amino acids
or more. Under this strict definition, the AlterORF database may
miss shorter proteins. An alternate ORF need not start
with one of the three most common start codons (ATG, GTG and CTG), so
the analyzed sequence may be longer than the biologically significant one.
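As an illustration of the working definition, the sketch below measures the longest stretch uninterrupted by a stop codon in each of the five alternate frames of a gene and keeps those of 300 nucleotides or more. It is not the AlterORF pipeline: the function names are hypothetical, and the frame convention is an assumption (frames +2 and +3 are read from offsets 1 and 2 of the coding strand; frames -1, -2 and -3 from offsets 0, 1 and 2 of its reverse complement).

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def longest_open_stretch(seq, offset):
    """Length in nt of the longest run of whole codons with no stop codon."""
    best = run = 0
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            run = 0
        else:
            run += 3
            best = max(best, run)
    return best

def alternate_orfs(gene_seq, min_len=300):
    """Map alternate frame label -> longest stop-free stretch (nt) >= min_len.

    gene_seq is the annotated coding sequence, assumed to be in frame +1.
    """
    fwd = gene_seq.upper()
    rev = reverse_complement(fwd)
    # Assumed convention: offsets on the coding strand and its reverse complement.
    frames = {"+2": (fwd, 1), "+3": (fwd, 2),
              "-1": (rev, 0), "-2": (rev, 1), "-3": (rev, 2)}
    result = {}
    for label, (strand, offset) in frames.items():
        length = longest_open_stretch(strand, offset)
        if length >= min_len:
            result[label] = length
    return result
```

Lowering `min_len` would admit the shorter alternate ORFs that the strict 300-nucleotide threshold excludes.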

Figure 1. Frame +1 is the ORF predicted in the database to encode a
protein. +2 and +3 are the other two potential ORFs in the same
strand and -1, -2 and -3 are the three potential ORFs in the
antisense strand (after Veloso et al. 2005).
What is gene mis-annotation?
Gene mis-annotation is the incorrect identification of a gene. This can
occur by incorrect identification of the function of the encoded
protein or by incorrect assignment of an alternate ORF as the coding
ORF. In this latter case, AlterORF can serve as a platform for
suggesting such mis-annotations. Mis-annotation of one of the
alternate ORFs is not uncommon in prokaryotic genomes, especially in
G+C rich organisms, where large alternate ORFs are very common
(Veloso et al. 2005).
How can I use AlterORF to find gene mis-annotation?
A mis-annotation can be suggested when the annotated ORF appears in
a database as a hypothetical gene with no known function and
AlterORF predicts a known protein (by BLAST or domain and motif
searching) in one of the alternate ORFs of this gene. This raises the
possibility that the alternate ORF is really the coding sequence
(see example in tutorial). Don’t let conservation of hypothetical
genes mislead you into thinking that they cannot be mis-annotated
genes. High G+C genomes are particularly rich in alternate ORFs, for
reasons discussed in Veloso et al. 2005, and frequently exhibit
conserved alternate ORFs.
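The rule of thumb described above can be sketched as a simple filter. The record layout (`protein_id`, `description`, `alt_orf_hits`) and the keyword test for "hypothetical" are assumptions made for illustration only and do not reflect the actual AlterORF tables or export format. A flagged gene is only a candidate for closer inspection, not a confirmed mis-annotation.

```python
def is_hypothetical(description):
    """Crude keyword test for an annotation with no known function."""
    text = description.lower()
    return "hypothetical" in text or "unknown function" in text

def misannotation_candidates(genes):
    """Flag hypothetical genes whose alternate ORFs hit a known protein.

    genes: iterable of dicts with the (hypothetical) keys
      'protein_id'   - identifier of the annotated gene
      'description'  - annotation of the frame +1 product
      'alt_orf_hits' - dict mapping frame label to best hit description or None
    """
    flagged = []
    for gene in genes:
        if not is_hypothetical(gene["description"]):
            continue
        known = [(frame, hit) for frame, hit in gene["alt_orf_hits"].items()
                 if hit and not is_hypothetical(hit)]
        if known:
            flagged.append((gene["protein_id"], sorted(known)))
    return flagged
```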
Can alternate ORFs sometimes be real genes (i.e. bona fide coding
sequences)?
Cases are known where both the annotated ORF and an
alternate ORF within the same gene exhibit strong predictions for
function (by BLAST or domain and motif searching) (see example in
tutorial). Most of these remain to be experimentally validated, and
one of the goals of AlterORF is to help detect such cases. A few
cases are known where a gene and an alternate ORF have both been
experimentally validated (see example in tutorial).
Searching AlterORF.
Searches can be performed by protein ID (as used in the source sequence
database, the NCBI Genome database), by organism, and by sequence using
the BLAST web sequence search service.
In addition, pre-analyzed alternate ORFs are available for many
completely sequenced prokaryotic genomes and new genomes are
continually being added. If you wish to analyze a complete genome
not present in the database, please e-mail us (Contact us) and we
can discuss the possibility of carrying out the analysis for you.
Searching by protein ID.
This search can be performed from the home page. In this case, the
user can explore a table with the results of BLAST and domain
searches. You can perform this search from this page and look at the
tutorial to see how to search by protein ID.
Searching by Organism.
In this case the user can select the organism of interest, for
which a list of annotated protein coding genes with alternate ORFs
is provided. A gene of interest can be analyzed by clicking on its
protein ID. You can perform this search from this page.
Searching by sequence.
A sequence search against all sequences stored in AlterORF can be
performed using the BLAST software here.
How are protein families built in the AlterORF Database?
The cross-genera conservation of some alternate ORFs suggests that
they might represent new protein families or domains. Hierarchical
clustering was used to build sequence families using the hcluster_sg
software, developed as part of the TreeFam project, because it is fast
and avoids loading the matrix into memory (our matrix is a square
matrix of ~3 million elements). BLAST e-values were normalized from 0
to 100 (with 100 meaning e-value = 0) with the formula
(-log10(e-value))/2. We plan to provide a method to evaluate the
significance of each family, as well as manual curation of the most
significant families, in the next release of the AlterORF Database.
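The normalization formula can be written out as a small function. This is a sketch rather than the code used to build the database: the function name is hypothetical, and the clamping of results to the range 0 to 100 is an assumption, since the text gives only the formula and states that 100 corresponds to an e-value of 0. With this formula, an e-value of 1e-200 already yields 100 and any e-value of 1 or more yields 0.

```python
import math

def normalize_evalue(evalue):
    """Map a BLAST e-value to a 0-100 score using (-log10(e-value)) / 2.

    Assumptions: an e-value of exactly 0 maps to 100, and results are
    clamped to the interval [0, 100].
    """
    if evalue < 0:
        raise ValueError("e-value must be non-negative")
    if evalue == 0:
        return 100.0
    score = -math.log10(evalue) / 2
    return min(100.0, max(0.0, score))
```

An e-value of 1e-50, for example, maps to a score of about 25.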
Downloading the data.
All data present in AlterORF, as well as database files and tables, can be
downloaded as a compressed file from here.
Please take into account that the complete database is ~300 GB.
Description of Alternative ORFs.
Alternative ORFs were analyzed for predicted domains and motif
content using Pfam,
CDD,
COG,
KOG,
UNIPROT,
SMART,