Frequently Asked Questions

 

How is an Open Reading Frame defined?

An open reading frame (ORF) is a segment of a genome that potentially encodes a protein and is bounded by a start codon (usually ATG) and a stop codon (TAG, TGA or TAA). In prokaryotic genomes, each ORF usually has a ribosome binding site near the ATG start codon.

 

 How is an Alternate Open Reading Frame defined?

An alternate open reading frame (ORF) is any of the five reading frames of a gene other than frame +1, the frame in which the annotated coding sequence of the gene resides (see Figure 1). Many such alternate ORFs are very short and unlikely to encode proteins. As a working definition for the construction of the AlterORF database, only alternate ORFs with 300 or more nucleotides uninterrupted by a stop codon were accepted for analysis. This is sufficient to encode a protein of 100 amino acids or more. Because of this strict definition, the AlterORF database may miss shorter proteins. An alternate ORF need not start with one of the three most common start codons (ATG, GTG and CTG), so the analyzed sequence may be longer than the part that is biologically significant.
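The working definition above can be sketched in Python. This is a minimal illustration and not the AlterORF pipeline itself: the function names (`reverse_complement`, `stop_free_runs`, `alternate_orfs`), the numbering of the antisense frames, and the choice to scan only within the boundaries of the annotated gene are all assumptions made for the example. Only the three stop codons and the 300-nucleotide threshold come from the text.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}
MIN_LENGTH = 300  # nucleotides; threshold stated in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def stop_free_runs(seq, offset):
    """Yield (start, end) index pairs of codon runs without a stop codon,
    reading seq in steps of three beginning at offset."""
    run_start = offset
    pos = offset
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos > run_start:
                yield run_start, pos
            run_start = pos + 3
        pos += 3
    if pos > run_start:
        yield run_start, pos


def alternate_orfs(gene_seq, min_length=MIN_LENGTH):
    """Return (frame, start, end) for every stop-free stretch of at least
    min_length nucleotides in the five alternate frames.

    gene_seq is the annotated coding sequence, so frame +1 starts at index 0.
    Coordinates for the -1, -2 and -3 frames refer to the reverse-complemented
    sequence (an assumed convention for this sketch).
    """
    gene_seq = gene_seq.upper()
    rc = reverse_complement(gene_seq)
    frames = {
        "+2": (gene_seq, 1),
        "+3": (gene_seq, 2),
        "-1": (rc, 0),
        "-2": (rc, 1),
        "-3": (rc, 2),
    }
    hits = []
    for label, (strand, offset) in frames.items():
        for start, end in stop_free_runs(strand, offset):
            if end - start >= min_length:
                hits.append((label, start, end))
    return hits
```

Note that the sketch does not require a start codon at the beginning of a stretch, which mirrors the statement that an alternate ORF need not start with ATG, GTG or CTG.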

 

 

Figure 1. Frame +1 is the ORF predicted in the database to encode a protein. Frames +2 and +3 are the other two potential ORFs on the same strand, and -1, -2 and -3 are the three potential ORFs on the antisense strand (after Veloso et al. 2005).

 

What is gene mis-annotation?

Gene mis-annotation is the incorrect identification of a gene. It can arise from incorrect identification of the function of the encoded protein or from incorrect assignment of an alternate ORF as the coding ORF. In the latter case, AlterORF can serve as a platform for suggesting such mis-annotations. Mis-annotation of one of the alternate ORFs is not uncommon in prokaryotic genomes, especially in G+C rich organisms, where large alternate ORFs are very common (Veloso et al. 2005).

 

How can I use AlterORF to find gene mis-annotation?

A mis-annotation can be suggested when the annotated ORF appears in a database as a hypothetical gene with no known function, while AlterORF predicts a known protein (by BLAST or by domain and motif searching) in one of the alternate ORFs of this gene. This raises the possibility that the alternate ORF is really the coding sequence (see example in tutorial).

 

Don’t let conservation of hypothetical genes misguide you into thinking that they cannot be mis-annotated genes. High G+C genomes are particularly rich in alternate ORFs for reasons discussed in Veloso et al. 2005 and frequently exhibit conserved alternate ORFs.

 

 

Can alternate ORFs sometimes be real genes (i.e. bona fide coding sequences)?

Cases are known where both the annotated ORF and an alternate ORF within the same gene exhibit strong predictions for function (by BLAST or by domain and motif searching) (see example in tutorial). Most of these remain to be experimentally validated, and one of the goals of AlterORF is to help detect such cases. A few cases are known where a gene and an alternate ORF have both been experimentally validated (see example in tutorial).

 

Searching AlterORF.

Searches can be performed by protein ID from the source sequence database (the Genome Database of NCBI), by organism, and by sequence using the BLAST web sequence search service. In addition, pre-analyzed alternate ORFs are available for many completely sequenced prokaryotic genomes, and new genomes are continually being added. If you wish to analyze a complete genome not present in the database, please e-mail us (Contact us) and we can discuss the possibility of carrying out the analysis for you.

 

Searching by protein ID.

This search can be performed from the home page. The results are presented as a table of BLAST and domain searches that the user can explore. You can perform this search from this page; see the tutorial for how to search by protein ID.

 

Searching by Organism.

In this case, the user selects the organism of interest, and a list of its annotated protein-coding genes with alternate ORFs is provided. A gene of interest can be analyzed by clicking on its protein ID. You can perform this search from this page.

 

Searching by sequence.

A sequence search against all sequences stored in AlterORF can be performed using the BLAST software here.

 

How are protein families built in the AlterORF Database?

The cross-genera conservation of some alternate ORFs suggests that they might represent new protein families or domains. Hierarchical clustering was used to build sequence families using the hcluster_sg software, developed as part of the TreeFam project, because it is fast and avoids loading the matrix into memory (our matrix is a square matrix of ~3 million elements). BLAST E-values were normalized to a range from 0 to 100 (with 100 meaning E-value = 0) with the formula (-log10(E-value))/2. We plan to provide a method to evaluate the significance of each family, as well as manual curation of the most significant families, in the next release of the AlterORF Database.
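The normalization described above can be illustrated with a short Python sketch. The function name `normalize_evalue` is hypothetical, and the clamping to the 0 to 100 range (including the special case for an E-value of exactly 0, where the logarithm is undefined) is an assumption inferred from the stated range rather than a documented detail of the AlterORF pipeline.

```python
import math


def normalize_evalue(evalue, cap=100.0):
    """Map a BLAST E-value to a score using (-log10(E-value)) / 2.

    Assumptions for this sketch: an E-value of 0 maps to the cap (100),
    and scores outside the 0..cap range are clamped to it.
    """
    if evalue < 0:
        raise ValueError("E-value must be non-negative")
    if evalue == 0.0:
        return cap  # log10(0) is undefined; the text assigns 100 to E-value = 0
    score = -math.log10(evalue) / 2.0
    return max(0.0, min(cap, score))
```

Under this formula an E-value of 1e-200 already reaches the maximum score of 100, so any smaller E-value is treated the same as 0 in this sketch.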

 

Downloading the data.

All data present in AlterORF, including database files and tables, can be downloaded as a compressed file from here. Please take into account that the complete database is ~300 GB in size.

Description of Alternative ORFs.

Alternative ORFs were analyzed for predicted domains and motif content using Pfam, CDD, COG, KOG, UNIPROT, SMART,