GO Annotation Guide

This document describes the use of GO terms for annotating gene products. It will be more useful if you first read the general documentation and usage guide, which give background on the GO project and how the ontology works.

Collaborating databases annotate their gene products (or genes) with GO terms, according to two general principles: First, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based.

The Annotation Conventions section contains guidelines; they apply to all annotation methods and are particularly useful for manual literature-based annotation. The Annotation File Format section describes the content of the "gene association files" (i.e. association between a database object and a GO term) in which annotation data are stored. A forthcoming section will describe different Computational Annotation Methods that have been used by various contributing databases.

GO Annotation Conventions

Database objects (the level of attribution)

Because a single gene may encode very different products with very different attributes, GO recommends associating GO terms with database objects representing gene products rather than genes. At present, however, many participating databases are unable to associate GO terms with gene products, and therefore use genes instead. If the database object is a gene, it is associated with all GO terms applicable to any of its products. See the Annotation File Format section for more information.
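The gene-level rule above can be sketched as a set union. This is only an illustration: the function name, the product identifiers, and the placeholder GO IDs are hypothetical and do not come from any participating database.

```python
def terms_for_gene(terms_by_product):
    """Return the union of the GO terms annotated to each product of one gene."""
    gene_terms = set()
    for terms in terms_by_product.values():
        gene_terms.update(terms)
    return gene_terms


# Hypothetical product identifiers and placeholder GO IDs, for illustration only.
terms_by_product = {
    "product_A": {"GO:aaaaaaa", "GO:bbbbbbb"},
    "product_B": {"GO:bbbbbbb", "GO:ccccccc"},
}

print(sorted(terms_for_gene(terms_by_product)))
```

A gene-level database object would then carry every term in the returned set, even though no single product necessarily has all of them.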

References and evidence

More information on the meaning and use of the evidence codes can be found in the GO Evidence Codes documentation.

General recommendations

Annotation File Format

Collaborating databases export to GO a tab-delimited file, known informally as a "gene association file," of links between database objects and GO terms. Despite the jargon, the database object may represent a gene or a gene product (transcript or protein). The columns in the file are described below; a table showing the columns in order, with examples, is also available.

The entry in the DB_Object_ID field (see below) of the association file is the identifier for the database object, which may or may not correspond exactly to what is described in a paper. For example, a paper describing a protein may support annotations to the gene encoding the protein (gene ID in DB_Object_ID field) or annotations to a protein object (protein ID in DB_Object_ID field).

The entry in the DB_Object_Symbol field should be a symbol that means something to a biologist, wherever possible (gene symbol, for example). It is not an ID or an accession number (the second column, DB_Object_ID, provides the unique identifier), although IDs can be used in DB_Object_Symbol if there is no more biologically meaningful symbol available (e.g., when an unnamed gene is annotated).

The object type (gene, transcript, protein, or protein_structure) listed in the DB_Object_Type field MUST match the database entry identified by DB_Object_ID. The text entered in the DB_Object_Name and DB_Object_Symbol can refer to the same database entry (recommended), or to a "broader" entity. For example, several alternative transcripts from one gene may be annotated separately, each with a unique transcript DB_Object_ID, but list the same gene symbol in the DB_Object_Symbol column.

Annotation File Fields

The flat file format comprises 15 tab-delimited fields (another table is available displaying the fields horizontally, as they would appear in the flat file). Required fields are identified as mandatory in the field descriptions that follow this table:

Column	Content	Example
1.	DB	SGD
2.	DB_Object_ID	S0000296
3.	DB_Object_Symbol	PHO3
4.	NOT
5.	GO ID	GO:0015888
6.	DB:Reference(|DB:Reference)	SGD:8789|PMID:2676709
7.	Evidence	IMP
8.	With (or) From
9.	Aspect	P
10.	DB_Object_Name(|Name)	acid phosphatase
11.	DB_Object_Synonym(|Synonym)	YBR092C
12.	DB_Object_Type	gene
13.	taxon(|taxon)	taxon:4932
14.	Date	20010118
15.	Assigned_by	SGD

Where:

DB: the database contributing the gene_association file
one of the values in the table of database abbreviations. [Database abbreviations explanation]
this field is mandatory, cardinality 1
DB_Object_ID: a unique identifier in DB for the item being annotated
this field is mandatory, cardinality 1
DB_Object_Symbol: a (unique and valid) symbol to which DB_Object_ID is matched
can use ORF name for otherwise unnamed gene or protein
if gene products are annotated, can use gene product symbol if available, or many gene product annotation entries can share a gene symbol
this field is mandatory, cardinality 1
NOT: prefixing a GOid with the string NOT allows annotators to state that a particular gene product is NOT associated with a particular GO term.
Note: This field should be used when a cited reference explicitly says the gene product is not associated with the GO term (e.g. "our favorite protein is not found in the nucleus"). It was introduced to allow curators to document conflicting claims in the literature. NOT can also be used in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method). For example, if a protein has sequence similarity to an enzyme (whose activity is GO:nnnnnnn), but has been shown experimentally not to have the enzymatic activity, it can be annotated as NOT GO:nnnnnnn.
this field is not mandatory, cardinality 0, 1
GOid: the GO identifier for the term attributed to the DB_Object_ID
this field is mandatory, cardinality 1
DB:Reference: the unique identifier appropriate to DB for the authority for the attribution of the GOid to the DB_Object_ID. This may be a literature reference or a database record. The syntax is DB:accession_number.
Note that only one reference can be cited on a single line. If that reference has identifiers in more than one database, all of its identifiers can be included on the same line. For example, if the reference is a published paper that has a PubMed ID, we strongly recommend that the PubMed ID be included, as well as an identifier within a model organism database.
this field is mandatory, cardinality 1, >1; for cardinality >1 use a pipe to separate entries (e.g. SGD:8789|PMID:2676709).
Evidence: one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC
this field is mandatory, cardinality 1
With (or) From: one of:
DB:gene_symbol
DB:gene_symbol[allele_symbol]
DB:gene_id
DB:protein_name
DB:sequence_id
GO:GO_id
this field is not mandatory, cardinality 0, 1, >1
Note: Cardinality = 0 is not recommended, but is permitted because cases can be found in literature where no database identifier can be found (e.g. physical interaction or sequence similarity to a protein, but no ID provided). Annotations where evidence is IGI, IPI, or ISS and 'with' cardinality = 0 should link to an explanation of why there is no entry in 'with.' Cardinality should be >1 only for IGI and IPI evidence codes (see evidence documentation for more information). For cardinality >1 use a pipe to separate entries (e.g. FB:FBgn1111111|FB:FBgn2222222).

"GO:GO_id" is used only when "Evidence = IC" and refers to the GO term(s) used as the basis of a curator inference. In these cases the "DB:Reference" will be that used to assign the GO term(s) from which the inference is made.
Aspect: one of P (biological process), F (molecular function) or C (cellular component)
this field is mandatory, cardinality 1
DB_Object_Name: name of gene or gene product
this field is not mandatory, cardinality 0, 1 [white space allowed]
DB_Object_Synonym: gene symbol [or other text]
this field is not mandatory, cardinality 0, 1, >1 [white space allowed]
DB_Object_Type: what kind of thing is being annotated
one of gene, transcript, protein, protein_structure
this field is mandatory, cardinality 1
Taxon: taxonomic identifier(s)
For cardinality 1, the ID of the species encoding the gene product.
For cardinality 2, the first ID is that of the species encoding the gene product; the second ID is that of the species using the gene product.
this field is mandatory, cardinality 1, 2
Date: Date on which the annotation was made; format is YYYYMMDD
this field is mandatory, cardinality 1
Assigned_by: The database which made the annotation
one of the values in the table of database abbreviations. [Database abbreviations explanation]
Used for tracking the source of an individual annotation.
Default value is value entered in column 1 (DB).
Value will differ from column 1 for any annotation that is made by one database and incorporated into another.
this field is mandatory, cardinality 1
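
The field descriptions above can be turned into a small parser and checker. The following is a minimal sketch, not an official validator: the dictionary key spellings (GO_ID, DB_Reference, With), the function names, and the assumption that column 4 is either empty or the literal string NOT are choices made for this illustration. The date check only looks at the shape of the value (eight digits), not at whether it is a real calendar date.

```python
# Field names follow the field table in this guide; their use as dictionary
# keys (and the spellings GO_ID, DB_Reference, With) is an assumption.
FIELDS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "NOT", "GO_ID",
    "DB_Reference", "Evidence", "With", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon", "Date", "Assigned_by",
]
OPTIONAL = {"NOT", "With", "DB_Object_Name", "DB_Object_Synonym"}
MULTI_VALUED = {"DB_Reference", "With", "DB_Object_Synonym", "Taxon"}
EVIDENCE_CODES = {"IMP", "IGI", "IPI", "ISS", "IDA", "IEP",
                  "IEA", "TAS", "NAS", "ND", "IC"}
OBJECT_TYPES = {"gene", "transcript", "protein", "protein_structure"}


def parse_line(line):
    """Split one association line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, found {len(values)}")
    record = {}
    for name, value in zip(FIELDS, values):
        if name in MULTI_VALUED:
            # Pipe-separated entries become a list; an empty field becomes [].
            record[name] = value.split("|") if value else []
        else:
            record[name] = value
    return record


def check_record(record):
    """Return a list of problems found in a parsed record (empty if none)."""
    problems = []
    for name in FIELDS:
        if name not in OPTIONAL and not record[name]:
            problems.append(f"{name} is mandatory")
    if record["NOT"] not in ("", "NOT"):
        problems.append("NOT column must be empty or 'NOT'")
    if record["Evidence"] not in EVIDENCE_CODES:
        problems.append(f"unknown evidence code: {record['Evidence']!r}")
    if record["Aspect"] not in ("P", "F", "C"):
        problems.append("Aspect must be one of P, F, C")
    if record["DB_Object_Type"] not in OBJECT_TYPES:
        problems.append("unknown DB_Object_Type")
    if len(record["Taxon"]) > 2:
        problems.append("Taxon allows at most two identifiers")
    date = record["Date"]
    if not (len(date) == 8 and date.isdigit()):
        problems.append("Date must be in YYYYMMDD format")
    if len(record["With"]) > 1 and record["Evidence"] not in ("IGI", "IPI"):
        problems.append("more than one With entry is expected only for IGI and IPI")
    return problems


# The example row from the field table, joined with tabs.
EXAMPLE_LINE = "\t".join([
    "SGD", "S0000296", "PHO3", "", "GO:0015888", "SGD:8789|PMID:2676709",
    "IMP", "", "P", "acid phosphatase", "YBR092C", "gene", "taxon:4932",
    "20010118", "SGD",
])

record = parse_line(EXAMPLE_LINE)
print(record["DB_Reference"])
print(check_record(record))
```

Returning a list of problems rather than raising on the first one lets a curator see every issue in a line at once. The rule about more than one With entry is a recommendation in this guide, so a real tool might report it as a warning rather than an error.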

Note that several fields contain database cross-references (dbxrefs) in the format dbname:dbaccession. The fields are: GOid (where dbname is always GO), DB:Reference, With, Taxon (where dbname is always taxon).
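
As a rough illustration of the dbname:dbaccession convention, a dbxref can be split at its first colon. The helper name below is hypothetical, and splitting at the first colon (so that any later colons stay in the accession) is an assumption of this sketch rather than a rule stated in this guide.

```python
def split_dbxref(dbxref):
    """Split a 'dbname:dbaccession' string into (dbname, dbaccession)."""
    dbname, separator, accession = dbxref.partition(":")
    if not separator or not dbname or not accession:
        raise ValueError(f"not in dbname:dbaccession format: {dbxref!r}")
    return dbname, accession


# Values taken from the example row in the field table.
for value in ["GO:0015888", "PMID:2676709", "taxon:4932"]:
    print(split_dbxref(value))
```

A checker could then compare the returned dbname against GO for the GOid field and against taxon for the Taxon field.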

Computational Annotation Methods

This section includes descriptions of automated annotation methods used by participating databases (descriptions have been provided by each group listed).

Copyright © 1999-2000 Gene Ontology Consortium. Permission to use the information contained in this database was given by the researchers/institutes who contributed or published the information. Users of the database are solely responsible for compliance with any copyright restrictions, including those applying to the author abstracts. Documents from this server are provided "AS-IS" without any warranty, expressed or implied.



Last modified October 2003