GO Annotation Guide





Description  |  GO Annotation Conventions  |  Annotation File Format  |  Computational Annotation Methods

 

Description

This document describes the use of GO terms for annotating gene products. It is more useful if you have first read the general documentation and usage guide, which give background information about the GO project and how the ontology works.

Collaborating databases annotate their gene products (or genes) with GO terms, according to two general principles: First, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based.

The Annotation Conventions section contains guidelines; they apply to all annotation methods and are particularly useful for manual literature-based annotation. The Annotation File Format section describes the content of the "gene association files" (i.e. association between a database object and a GO term) in which annotation data are stored. The Computational Annotation Methods section describes automated annotation methods that have been used by various contributing databases.

 

GO Annotation Conventions

Database objects (the level of attribution)

Because a single gene may encode very different products with very different attributes, GO recommends associating GO terms with database objects representing gene products rather than genes. At present, however, many participating databases are unable to associate GO terms to gene products, and therefore use genes instead. If the database object is a gene, it is associated with all GO terms applicable to any of its products. See the Annotation File Format section for more information.


References and evidence

  • Every annotation must be attributed to a source, which may be a literature reference, another database or a computational analysis.
  • The annotation must indicate what kind of evidence is found in the cited source to support the association between the gene product and the GO term. A simple controlled vocabulary is used to record evidence:
    IMP inferred from mutant phenotype
    IGI inferred from genetic interaction [with <database:gene_symbol[allele_symbol]>]
    IPI inferred from physical interaction [with <database:protein_name>]
    ISS inferred from sequence similarity [with <database:sequence_id>]
    IDA inferred from direct assay
    IEP inferred from expression pattern
    IEA inferred from electronic annotation [to <database:id>]
    TAS traceable author statement
    NAS non-traceable author statement
    ND no biological data available
    IC inferred by curator

More information on the meaning and use of the evidence codes can be found in the GO Evidence Codes documentation.
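
As a minimal sketch, the controlled vocabulary above can be held in a lookup table so that an annotation tool can reject codes that are not in the list. The names EVIDENCE_CODES and describe_evidence are illustrative assumptions and are not part of any GO software.

```python
# Evidence codes and their meanings, copied from the list above.
# The dictionary and function names are hypothetical.
EVIDENCE_CODES = {
    "IMP": "inferred from mutant phenotype",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "ISS": "inferred from sequence similarity",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IEA": "inferred from electronic annotation",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "ND": "no biological data available",
    "IC": "inferred by curator",
}


def describe_evidence(code):
    """Return the meaning of an evidence code, or raise ValueError if unknown."""
    try:
        return EVIDENCE_CODES[code]
    except KeyError:
        raise ValueError(f"unknown evidence code: {code!r}") from None
```

For example, describe_evidence("IMP") returns the text for a mutant phenotype, while a code that is not in the vocabulary raises an error instead of being silently accepted.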


General recommendations

  • A gene product can be annotated to zero or more nodes of each ontology.
  • Annotation of a gene product to one ontology is independent of its annotation to other ontologies.
  • Annotate gene products in each species database to the most detailed level in the ontology that correctly describes the biology of the gene product.
  • Keep the True Path Rule in mind: annotating to a term implies annotation to all parents via any path, so it is a good idea to check the parentage of a term before annotating (and request new terms or path corrections if necessary).
  • There is an important distinction between a gene/gene product annotated to "unknown" function, process, and/or component, and one that has not been annotated. "Unknown" means that someone has tried annotating the gene, but didn't find any information. Absence of annotation implies that no one has looked.
  • Curators are encouraged to annotate to terms from all three ontologies, using "unknown" liberally if necessary (see item above).
  • For annotations to "unknown" from any of the three ontologies, curators should cite a reference within their database that explains that they found no relevant biological information in the literature (or any other sources they may have considered). The evidence code is ND, for "no data." Exception: if a paper explicitly says that something is unknown, the paper can be cited as the reference, with TAS or NAS as evidence.
  • Uncertain knowledge of where a gene product operates should be denoted by annotating it to two nodes, one of which can be a parent of the other. For instance, a yeast gene product known to be in the nucleolus, but also experimentally observed in the nucleus generally, can be annotated to both nucleolus and nucleus in the cellular component ontology. Even though annotation to nucleolus alone implies that a gene product is also in the nucleus, annotate to both to indicate explicitly that it has been reported in the two locations. The two annotations may have the same or different supporting evidence. Similar reports of general and specific molecular function or biological process for a gene product can be handled the same way. You can also annotate to multiple nodes that conflict with each other if there are conflicting claims in the literature.
  • A gene product should be annotated with terms reflecting its normal activity and location. A function, process, or localization (component) observed only in a mutant or disease state is therefore not usually included. In some circumstances, however, what is "normal" depends on the point of view taken by the annotator. For example, many viruses use host proteins to carry out viral processes. The host protein is then doing something abnormal from the perspective of the host, but completely normal from the perspective of the virus. GO annotators handle these cases by including two taxon IDs in the "Taxon" column of the gene association file (see Annotation File Format, below). When two taxon IDs appear, the first is that of the organism that encodes the gene product, and the second ID is that of the organism that uses the gene product, and whose perspective is considered "normal" for that annotation.
  • An individual gene product that is part of a complex can be annotated to terms that describe the action (function or process) of the complex. This is colloquially known as annotating 'to the potential of the complex', and is a way to capture information about what a complex does in the absence of database objects and identifiers representing complexes. It is especially useful in cases where a complex has an activity, but the individual subunits do not. One example is the subunits of nuclear RNA polymerases, where none of the individual subunits has RNA polymerase activity, yet all of these subunits are annotated to 'DNA-dependent RNA polymerase activity', to capture the activity of the complex. Another example is ATP citrate lyase (ACL) in Arabidopsis: it is a heterooctamer, composed of two types of subunits, ACLA and ACLB, in an A(4)B(4) stoichiometry. Neither of the subunits expressed alone gives ACL activity, but co-expression results in ACL activity. Both subunits can be annotated to ATP citrate lyase activity.
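
The True Path Rule mentioned in the recommendations above can be illustrated with a small sketch. The parent table below is a toy fragment built around the nucleolus and nucleus example; its structure and the function name implied_terms are assumptions for illustration and are not read from the real ontology files.

```python
# Toy fragment of the cellular component ontology: term -> set of parents.
# Illustrative only; the real ontology has many more terms and paths.
PARENTS = {
    "nucleolus": {"nucleus"},
    "nucleus": {"cell"},
    "cell": set(),
}


def implied_terms(term, parents=PARENTS):
    """Return the term plus every ancestor reachable via any path."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Follow every parent, so all paths to the root are covered.
        stack.extend(parents.get(current, ()))
    return seen
```

Calling implied_terms("nucleolus") yields nucleolus, nucleus and cell, which is why it is worth checking the parentage of a term before annotating to it.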

 

Annotation File Format

Collaborating databases export to GO a tab-delimited file, known informally as a "gene association file," of links between database objects and GO terms. Despite the jargon, the database object may represent a gene or a gene product (transcript or protein). Columns in the file are described below; a table showing the columns in order, with examples, is available.

The entry in the DB_Object_ID field (see below) of the association file is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to a protein object (protein ID in DB_Object_ID field).

The entry in the DB_Object_Symbol field should be a symbol that means something to a biologist, wherever possible (gene symbol, for example). It is not an ID or an accession number (the second column, DB_Object_ID, provides the unique identifier), although IDs can be used in DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).

The object type (gene, transcript, protein, or protein_structure) listed in the DB_Object_Type field MUST match the database entry identified by DB_Object_ID. The text entered in the DB_Object_Name and DB_Object_Symbol can refer to the same database entry (recommended), or to a "broader" entity. For example, several alternative transcripts from one gene may be annotated separately, each with a unique transcript DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.


Annotation File Fields

The flat file format comprises 15 tab-delimited fields (another table is available displaying the fields horizontally, as they would appear in the flat file). Required fields are marked "yes":

Column  Content                        Required  Example
1       DB                             yes       SGD
2       DB_Object_ID                   yes       S0000296
3       DB_Object_Symbol               yes       PHO3
4       NOT                            no
5       GO ID                          yes       GO:0015888
6       DB:Reference(|DB:Reference)    yes       SGD:8789|PMID:2676709
7       Evidence                       yes       IMP
8       With (or) From                 no
9       Aspect                         yes       P
10      DB_Object_Name(|Name)          no        acid phosphatase
11      DB_Object_Synonym(|Synonym)    no        YBR092C
12      DB_Object_Type                 yes       gene
13      taxon(|taxon)                  yes       taxon:4932
14      Date                           yes       20010118
15      Assigned_by                    yes       SGD

Where:

DB
the database contributing the gene_association file
one of the values in the table of database abbreviations. [Database abbreviations explanation]
this field is mandatory, cardinality 1
DB_Object_ID
a unique identifier in DB for the item being annotated
this field is mandatory, cardinality 1
DB_Object_Symbol
a (unique and valid) symbol to which DB_Object_ID is matched
can use ORF name for otherwise unnamed gene or protein
if gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol
this field is mandatory, cardinality 1
NOT
prefixing a GOid with the string NOT allows annotators to state that a particular gene product is NOT associated with a particular GO term.
Note: This field should be used when a cited reference explicitly says the gene product is not associated with the GO term (e.g. "our favorite protein is not found in the nucleus"). It was introduced to allow curators to document conflicting claims in the literature. NOT can also be used in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method). For example, if a protein has sequence similarity to an enzyme (whose activity is GO:nnnnnnn), but has been shown experimentally not to have the enzymatic activity, it can be annotated as NOT GO:nnnnnnn.
this field is not mandatory, cardinality 0, 1
GOid
the GO identifier for the term attributed to the DB_Object_ID
this field is mandatory, cardinality 1
DB:Reference
the unique identifier appropriate to DB for the authority for the attribution of the GOid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
Note that only one reference can be cited on a single line. If a reference has identifiers in more than one database, multiple identifiers can be included on a single line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
this field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD:8789|PMID:2676709).
Evidence
one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC
this field is mandatory, cardinality 1
With (or) From
one of:
DB:gene_symbol
DB:gene_symbol[allele_symbol]
DB:gene_id
DB:protein_name
DB:sequence_id
GO:GO_id
this field is not mandatory, cardinality 0, 1, >1
Note: Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality should be >1 only for IGI and IPI evidence codes (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).

"GO:GO_id" is used only when "Evidence = IC" and refers to the GO term(s) used as the basis of a curator inference. In these cases the "DB:Reference" will be that used to assign the GO term(s) from which the inference is made.

Aspect
one of P (biological process), F (molecular function) or C (cellular component)
this field is mandatory, cardinality 1
DB_Object_Name
name of gene or gene product
this field is not mandatory, cardinality 0, 1 [white space allowed]
DB_Object_Synonym
Gene_symbol [or other text]
this field is not mandatory, cardinality 0, 1, >1 [white space allowed]
DB_Object_Type
what kind of thing is being annotated
one of gene, transcript, protein, protein_structure
this field is mandatory, cardinality 1
Taxon
taxonomic identifier(s)
For cardinality 1, the ID of the species encoding the gene product.
For cardinality 2, the first ID is that of the species encoding the gene product; the second ID is that of the species using the gene product.
this field is mandatory, cardinality 1, 2
Date
Date on which the annotation was made; format is YYYYMMDD
this field is mandatory, cardinality 1
Assigned_by
The database which made the annotation
one of the values in the table of database abbreviations. [Database abbreviations explanation]
Used for tracking the source of an individual annotation.
Default value is the value entered in column 1 (DB).
Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
this field is mandatory, cardinality 1
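
The field definitions above can be turned into a small parser. The following is a minimal sketch, assuming one annotation per line with exactly 15 tab-separated columns and pipes between multiple values; the dictionary keys and the function name parse_association_line are illustrative choices, not an official GO tool. The example line uses the values from the column table above.

```python
# Column names in file order (keys are illustrative, not official).
FIELD_NAMES = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "NOT", "GO_ID",
    "DB_Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
]

# Fields described above as mandatory.
MANDATORY = {
    "DB", "DB_Object_ID", "DB_Object_Symbol", "GO_ID", "DB_Reference",
    "Evidence", "Aspect", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
}

# Fields that may hold several pipe-separated values.
PIPE_SEPARATED = ("DB_Reference", "With", "DB_Object_Synonym", "Taxon")


def parse_association_line(line):
    """Split one gene association line into a dict keyed by field name."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, found {len(fields)}"
        )
    record = dict(zip(FIELD_NAMES, fields))
    missing = sorted(name for name in MANDATORY if not record[name])
    if missing:
        raise ValueError("empty mandatory field(s): " + ", ".join(missing))
    for name in PIPE_SEPARATED:
        # An empty column becomes an empty list (cardinality 0).
        record[name] = [value for value in record[name].split("|") if value]
    return record


# The example values from the column table, joined with tabs.
EXAMPLE_LINE = "\t".join([
    "SGD", "S0000296", "PHO3", "", "GO:0015888", "SGD:8789|PMID:2676709",
    "IMP", "", "P", "acid phosphatase", "YBR092C", "gene", "taxon:4932",
    "20010118", "SGD",
])
record = parse_association_line(EXAMPLE_LINE)
```

Empty optional columns (NOT and With in the example) are kept as empty values rather than dropped, so that column positions never shift.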

Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GOid (where dbname is always GO), DB:Reference, With, Taxon (where dbname is always taxon).
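
A minimal sketch of how the dbname:dbaccession format and the cardinality notes for the With field might be checked is given below. The function names split_dbxref and check_with are hypothetical, and the warnings simply restate the rules given in the field descriptions above; they are not the output of any official validator.

```python
def split_dbxref(xref):
    """Split 'dbname:dbaccession' at the first colon."""
    dbname, sep, accession = xref.partition(":")
    if not sep or not dbname or not accession:
        raise ValueError(f"not a dbname:dbaccession cross-reference: {xref!r}")
    return dbname, accession


def check_with(evidence, with_values):
    """Return a list of warnings about the With field for one annotation."""
    warnings = []
    if evidence in {"IGI", "IPI", "ISS"} and not with_values:
        warnings.append(
            f"{evidence} with an empty With field should link to an explanation"
        )
    if len(with_values) > 1 and evidence not in {"IGI", "IPI"}:
        warnings.append(
            f"more than one With entry is expected only for IGI and IPI, not {evidence}"
        )
    for value in with_values:
        dbname, _ = split_dbxref(value)
        if dbname == "GO" and evidence != "IC":
            warnings.append("GO:GO_id in With is used only when Evidence = IC")
    return warnings
```

For instance, split_dbxref("PMID:2676709") gives the pair ("PMID", "2676709"), and check_with("IGI", ["FB:FBgn1111111", "FB:FBgn2222222"]) returns no warnings because two With entries are permitted for IGI.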


Computational Annotation Methods

This section includes descriptions of automated annotation methods used by participating databases (descriptions have been provided by each group listed).

EBI | MGI | TIGR
EBI GOA Electronic Annotation

The large-scale assignment of GO terms to SWISS-PROT and TrEMBL entries involves electronic techniques. This strategy exploits existing properties within database entries, including keywords, Enzyme Commission (EC) numbers, and cross-references to InterPro (a database of protein motifs), which are manually mapped to GO. SWISS-PROT keyword and InterPro to GO mappings are maintained in-house and shared on the GO home page for local database updates. Electronically combining these mappings with a table of matching SWISS-PROT and TrEMBL entries generates a table of associations. For each GOA association, an evidence code that summarizes how the association was made is provided. Associations that are made electronically are labelled as 'inferred from electronic annotation' (IEA).

Submitted by Evelyn Camon, 2002-09-03
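
The mapping strategy described in the EBI GOA paragraph above can be sketched as a join between mapping tables and entry properties. This is an illustration under stated assumptions, not the GOA pipeline itself: the function name, the entry identifiers, the property names and the GO identifiers (written in the GO:nnnnnnn placeholder style) are all invented for the example.

```python
def electronic_associations(entries, mappings):
    """Combine property-to-GO mappings with entry properties.

    entries:  entry_id -> {source_name: [property, ...]}
    mappings: source_name -> {property: [go_id, ...]}
    Returns a sorted list of (entry_id, go_id, "IEA") tuples.
    """
    associations = set()
    for entry_id, properties in entries.items():
        for source, values in properties.items():
            table = mappings.get(source, {})
            for value in values:
                # Properties without a mapping simply produce no association.
                for go_id in table.get(value, ()):
                    associations.add((entry_id, go_id, "IEA"))
    return sorted(associations)


# Hypothetical data: none of these identifiers are real mappings.
entries = {
    "ENTRY_1": {"keyword": ["KW_A"], "interpro": ["IPR_X"]},
    "ENTRY_2": {"keyword": ["KW_B"], "ec": ["EC_Y"]},
}
mappings = {
    "keyword": {"KW_A": ["GO:nnnnnn1"]},
    "interpro": {"IPR_X": ["GO:nnnnnn1", "GO:nnnnnn2"]},
    "ec": {"EC_Y": ["GO:nnnnnn3"]},
}
associations = electronic_associations(entries, mappings)
```

Using a set means that a GO term reached through both a keyword and an InterPro cross-reference is reported once per entry.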
 
MGI Electronic Annotation Methods
Every object in the MGI databases (markers, seqids, references, etc.) has an MGI: accession ID. In the items listed below, the J number refers to the reference.
 
Title Gene Ontology Annotation by the MGI Curatorial Staff
MGI Accession ID MGI:2152096
J Number J:72245
Authors Mouse Genome Informatics Scientific Curators
Journal NULL
Year 2001
Review Status Reviewed by MGI Editorial Staff
Abstract Enzyme Commission numbers that had been assigned to genes in MGI were annotated to GO terms based on the inclusion of EC#s within GO terms from the molecular function ontology. Details of this strategy can be found in Hill et al, Genomics (2001) 74:121-128.
 
Title Gene Ontology Annotation by the MGI Curatorial Staff
MGI Accession ID MGI:2152097
J Number J:72246
Authors Mouse Genome Informatics Scientific Curators
Journal NULL
Year 2001
Review Status Reviewed by MGI Editorial Staff
Abstract For annotations documented via this citation, curators used the information in the Mouse Locus Catalog in MGI to assign GO terms. The GO terms were assigned based on MLC textual descriptions of genes that could not be traced to the primary literature. Details of this strategy can be found in Hill et al, Genomics (2001) 74:121-128.
 
Title Gene Ontology Annotation by the MGI Curatorial Staff
MGI Accession ID MGI:2152098
J Number J:72247
Authors Mouse Genome Informatics Scientific Curators
Journal NULL
Year 2001
Review Status Reviewed by MGI Editorial Staff
Abstract For annotations documented via this citation, GO terms were assigned to MGI genes through InterPro protein domain assignments. InterPro protein domains are assigned to MGI genes as part of an ongoing curatorial collaboration between the SwissProt database and MGI (see J:53168). GO terms are associated with MGI genes using a translation table of InterPro protein domains to GO terms generated by Nicola Mulder at EBI.
 
Title Gene Ontology Annotation by the MGI Curatorial Staff
MGI Accession ID MGI:2154458
J Number J:73065
Authors Mouse Genome Informatics Scientific Curators
Journal NULL
Year 2001
Review Status Reviewed by MGI Editorial Staff
Abstract The sequence conservation that permits the establishment of orthology between mouse and rat or mouse and human genes is a strong predictor of the conservation of function for the gene product across these species. Therefore, in instances where a mouse gene product has not been functionally characterized, but its human or rat orthologs have, Mouse Genome Informatics (MGI) curators append the GO terms associated with the orthologous gene(s) to the mouse gene. Only those GO terms assigned by experimental determination to the ortholog of the mouse gene will be adopted by MGI. GO terms that are assigned to the ortholog of the mouse gene computationally (i.e. IEA) will not be transferred to the mouse ortholog. The evidence code represented by this citation is Inferred from Sequence Similarity (ISS).
 
Title Gene Ontology Annotation by electronic association of SwissProt Keywords with GO terms
MGI Accession ID MGI:1354194
J Number J:60000
Authors Mouse Genome Informatics Scientific Curators
Journal NULL
Year 2000
Review Status Reviewed by MGI Editorial Staff
Abstract The Mouse Genome Informatics (MGI) curation of data includes annotating genes to three ontologies (Function, Cellular Component, and Process) using the Gene Ontology (GO) controlled vocabulary shared with other species genomic databases (http://www.geneontology.org/). Gene annotations in MGI citing this reference were assigned based on an electronic association of keywords from the SwissProt database with GO terms. The translation of SwissProt keywords to GO terms was carefully curated by MGI curators utilizing both SP and GO definitions to confirm that the associations were correct. The assignment of GO terms to individual genes was achieved electronically through database links. Users who discover an annotation error or inconsistency, or who require more detailed information about this process, should contact MGI at mgi-help@informatics.jax.org.
 
Title Gene Ontology Annotation by the MGI Curatorial Staff
MGI Accession ID MGI:1347124
J Number J:56000
Authors Mouse Genome Informatics Scientific Curators
Journal NULL
Year 1999
Review Status Reviewed by MGI Editorial Staff
Abstract For annotations documented via this citation, curators designed queries based on their knowledge of mouse gene nomenclature to group genes that shared common molecular functions, biological processes or cellular components. GO annotations were assigned to these genes in groups. Details of this strategy can be found in Hill et al, Genomics (2001) 74:121-128.
Submitted by Harold Drabkin, 2002-06-05
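
The orthology-based transfer described under J:73065 above can be sketched as a filter on the ortholog's annotations. The source states only that experimentally determined terms are transferred and that IEA terms are not; the exact set of codes treated as experimental below is an assumption, as are the function name and the identifiers in the example.

```python
# Assumption: these evidence codes count as "experimental determination".
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}


def transfer_from_orthologs(ortholog_annotations, experimental=EXPERIMENTAL_CODES):
    """Select ortholog annotations that may be copied to a mouse gene.

    ortholog_annotations: list of (ortholog_id, go_id, evidence) tuples
    Returns a list of (go_id, "ISS", ortholog_id) tuples, where the
    ortholog identifier is kept for the With field.
    """
    transferred = []
    for ortholog_id, go_id, evidence in ortholog_annotations:
        if evidence in experimental:
            transferred.append((go_id, "ISS", ortholog_id))
    return transferred


# Hypothetical ortholog annotations with placeholder identifiers.
ortholog_annotations = [
    ("HUMAN_GENE_1", "GO:nnnnnn1", "IDA"),
    ("HUMAN_GENE_1", "GO:nnnnnn2", "IEA"),
    ("RAT_GENE_1", "GO:nnnnnn3", "IMP"),
]
mouse_annotations = transfer_from_orthologs(ortholog_annotations)
```

In this sketch the IEA annotation is dropped, and the two experimentally supported terms are carried over with ISS as the evidence code.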

TIGR ISS Annotation (Arabidopsis, T. brucei)

For TIGR Arabidopsis or T. brucei annotations using 'Inferred from Sequence Similarity' (ISS) evidence, the reference is usually 'TIGR_Ath1:annotation' for Arabidopsis (author: TIGR Arabidopsis annotation team) and TIGR_Tba1:annotation for T. brucei (author: TIGR Trypanosoma brucei annotation team), which are defined as follows:

name: TIGR annotation based upon multiple sources of similarity evidence

description: TIGR_Ath1:annotation or TIGR_Tba1:annotation denotes a curator's interpretation of a combination of evidence. Our internal software tools present us with a great deal of evidence based on domains, sequence similarities, signal sequences, paralogous proteins, etc. The curator interprets the body of evidence to make a decision about a GO assignment when an external reference is not available. The curator places one or more accessions that informed the decision in the "with" field.

What this says is that we have used many sequence similarity hits, etc., to make our decision. However, we choose only 1-3 pieces of information as "with" information, as it is not practical to enter and submit many entries for each annotation. We also have internal calculations of paralogy and new domains we are identifying which have not yet been published, but which help inform our decisions.

Submitted by Linda Hannick, 2002-10-10


Copyright © 1999-2000 Gene Ontology Consortium. Permission to use the information contained in this database was given by the researchers/institutes who contributed or published the information. Users of the database are solely responsible for compliance with any copyright restrictions, including those applying to the author abstracts. Documents from this server are provided "AS-IS" without any warranty, expressed or implied.

Last modified October 2003. Report problems with this website to webmaster@geneontology.org.