An introduction to Gene Ontology



What does the Gene Ontology Consortium do?

Biologists currently waste a lot of time and effort searching for all of the available information about each small area of research. The search is hampered further by the wide variations in terminology that may be in common usage at any given time, which inhibit effective searching by computers as well as by people. For example, if you were searching for new targets for antibiotics, you might want to find all the gene products that are involved in bacterial protein synthesis and that have significantly different sequences or structures from those in humans. But if one database describes these molecules as being involved in 'translation', whereas another uses the phrase 'protein synthesis', it will be difficult for you (and even harder for a computer) to find functionally equivalent terms.

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases. The project began in 1998 as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include many databases, including several of the world's major repositories for plant, animal and microbial genomes. See the GO web page for a full list of member organizations.

The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner. There are three separate aspects to this effort: first, we write and maintain the ontologies themselves; second, we make associations between the ontologies and the genes and gene products in the collaborating databases; and third, we develop tools that facilitate the creation, maintenance and use of ontologies.

The use of GO terms by several collaborating databases facilitates uniform queries across them. The controlled vocabularies are structured so that you can query them at different levels: for example, you can use GO to find all the gene products in the mouse genome that are involved in signal transduction, or you can zoom in on all the receptor tyrosine kinases. This structure also allows annotators to assign properties to gene products at different levels, depending on how much is known about a gene product.
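To make the idea of querying at different levels concrete, here is a minimal sketch in Python. It is not how any GO tool or database is implemented. The term 'receptor tyrosine kinase signaling', the gene product names gpA, gpB and gpC, and the function names are all made up for illustration.

```python
# Toy parent -> children map. The term names are illustrative only and
# are not taken from the real GO graph.
CHILDREN = {
    "signal transduction": ["receptor tyrosine kinase signaling"],
    "receptor tyrosine kinase signaling": [],
}

# Hypothetical direct annotations: gene product -> terms assigned by a curator.
ANNOTATIONS = {
    "gpA": {"receptor tyrosine kinase signaling"},
    "gpB": {"signal transduction"},
    "gpC": set(),
}


def descendants(term):
    """Return the term together with every more specialized term below it."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(CHILDREN.get(current, []))
    return seen


def products_for(term):
    """Gene products annotated to the term or to any of its descendants."""
    wanted = descendants(term)
    return sorted(gp for gp, terms in ANNOTATIONS.items() if terms & wanted)
```

Asking for 'signal transduction' returns both gpA and gpB, whereas asking for the narrower term returns only gpA. The product gpB, about which less is known, is still found by the broad query.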

What GO is NOT

  1. GO is not a database of gene sequences, nor a catalog of gene products. Rather, GO describes how gene products behave in a cellular context.
  2. GO is not a way to unify biological databases (i.e. GO is not a 'federated solution'). Sharing vocabulary is a step towards unification, but is not, in itself, sufficient. Reasons for this include the following.
    1. Knowledge changes and updates lag behind.
    2. Individual curators evaluate data differently. While we can agree to use the word 'kinase', we must also agree to support this by stating how and why we use 'kinase', and consistently apply it. Only in this way can we hope to compare gene products and determine whether they are related.
    3. GO does not attempt to describe every aspect of biology. For example, domain structure, 3D structure, evolution and expression are not described by GO.
  3. GO is not a dictated standard, mandating nomenclature across databases. Groups participate because of self-interest, and cooperate to arrive at a consensus.

The ontologies

The three organizing principles of GO are molecular function, biological process and cellular component. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For example, the gene product cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane. Before we go any further, here are some definitions that should help you to distinguish a gene product from what it does.

Gene products

It is easy to confuse a gene product and its molecular function, because very often these are described in exactly the same words. For example, 'alcohol dehydrogenase' can describe what you can put in an Eppendorf tube (the gene product) or it can describe the function of this stuff. There is, however, a formal difference: a single gene product might have several molecular functions, and many gene products can share a single molecular function. For example, there are many gene products that have the function 'alcohol dehydrogenase'. Some, but by no means all, of these are encoded by genes with the name alcohol dehydrogenase. A particular gene product might have both the functions 'alcohol dehydrogenase' and 'acetaldehyde dismutase', and perhaps other functions as well. It's important to grasp that, whenever we use terms such as 'alcohol dehydrogenase activity' in GO, we mean the function, not the entity. (For this reason, most GO molecular function terms include the word 'activity'.)
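The many-to-many relationship between gene products and molecular functions can be sketched as a pair of lookups. This is an illustration only. The gene product names product_1 and product_2 are hypothetical, and the function names are the ones used in the paragraph above.

```python
from collections import defaultdict

# Hypothetical gene products mapped to their molecular functions.
# One product can have several functions.
FUNCTIONS_OF = {
    "product_1": [
        "alcohol dehydrogenase activity",
        "acetaldehyde dismutase activity",
    ],
    "product_2": ["alcohol dehydrogenase activity"],
}


def products_by_function(functions_of):
    """Invert the mapping: one function can be shared by many products."""
    index = defaultdict(list)
    for product, functions in functions_of.items():
        for function in functions:
            index[function].append(product)
    return dict(index)
```

Looking up 'alcohol dehydrogenase activity' in the inverted index returns both products, which is why the term has to name the function and not any one entity.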

Many gene products associate into entities that function as complexes, or 'gene product groups', which often include small molecules. They range in complexity from the relatively simple (for example, hemoglobin contains the gene products alpha-globin and beta-globin, and the small molecule heme) to complex assemblies of numerous different gene products, such as the ribosome.

At present, small molecules are not represented in GO. In the future, we might be able to create cross products by linking GO to existing databases of small molecules such as Klotho or LIGAND.

Molecular function

Molecular function describes activities, such as catalytic or binding activities, at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where or when, or in what context, the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity, transporter activity, or binding; examples of narrower functional terms are adenylate cyclase activity or Toll receptor binding.

It is easy to confuse a gene product with its molecular function, and for that reason many GO molecular function terms end with the word 'activity'. The section on gene products (above) explains this confusion in more depth.

Biological process

A biological process is accomplished by one or more ordered assemblies of molecular functions. Examples of broad biological process terms are cell growth and maintenance or signal transduction. Examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport. It can be difficult to distinguish between a biological process and a molecular function, but the general rule is that a process must have more than one distinct step.

A biological process is not equivalent to a pathway. We are specifically not capturing or trying to represent any of the dynamics or dependencies that would be required to describe a pathway.

Cellular component

A cellular component is just that: a component of a cell, with the proviso that it is part of some larger object, which may be an anatomical structure (e.g. rough endoplasmic reticulum or nucleus) or a gene product group (e.g. ribosome, proteasome or a protein dimer).

What do the ontologies look like?

GO terms are organized in structures called directed acyclic graphs (DAGs), which differ from hierarchies in that a 'child' (more specialized term) can have many 'parents' (less specialized terms).

For example, the biological process term hexose biosynthesis has two parents, hexose metabolism and monosaccharide biosynthesis. This is because biosynthesis is a subtype of metabolism, and a hexose is a type of monosaccharide. When any gene involved in hexose biosynthesis is annotated to this term, it is automatically annotated to both hexose metabolism and monosaccharide biosynthesis, because every GO term must obey the 'true path rule': if the child term describes the gene product, then all its parent terms must also apply to that gene product.
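The true path rule can be sketched in a few lines of Python. This is a minimal illustration and not the Consortium's own software. The terms hexose biosynthesis, hexose metabolism and monosaccharide biosynthesis come from the example above. The shared grandparent 'monosaccharide metabolism' and the function names are assumptions added to show propagation over more than one level.

```python
# Child -> parents. A term may have more than one parent, which is what
# makes the structure a directed acyclic graph and not a simple hierarchy.
PARENTS = {
    "hexose biosynthesis": ["hexose metabolism", "monosaccharide biosynthesis"],
    # Assumed for illustration: both parents share one grandparent.
    "hexose metabolism": ["monosaccharide metabolism"],
    "monosaccharide biosynthesis": ["monosaccharide metabolism"],
    "monosaccharide metabolism": [],
}


def ancestors(term, parents=PARENTS):
    """Return every less specialized term reachable from `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def implied_annotations(direct_terms, parents=PARENTS):
    """Apply the true path rule: a direct annotation implies all its ancestors."""
    implied = set(direct_terms)
    for term in direct_terms:
        implied |= ancestors(term, parents)
    return implied
```

A gene product annotated directly to hexose biosynthesis ends up with four terms in total. The grandparent is reached along both paths but is recorded only once.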

Annotation and tools

How do the terms in GO become associated with their appropriate gene products? Collaborating databases annotate their gene products (or genes) with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information can be found in the GO Annotation Guide.

If you browse any of the contributing databases, you'll find that each gene or gene product has a list of associated GO terms. Each database also publishes a table of these associations, and these are freely available from the GO ftp site. You can also browse the ontologies using a range of web-based browsers. A full list of these, and other tools for analyzing gene function using GO, is available on the GO Tools page.

In addition, the GO consortium has prepared GO slims, 'slimmed down' versions of the ontologies that allow you to annotate genomes or sets of gene products to gain a high-level view of gene functions. Using GO slims you can, for example, work out what proportion of a genome is involved in signal transduction, biosynthesis or reproduction. More information on GO slims can be found in the GO slims readme file, and the GO slims can be downloaded from the GO_slims directory.
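As a rough sketch of how a GO slim gives a high-level view, the Python below maps each annotation up to the nearest slim terms and reports the fraction of gene products under each one. It is a simplification and does not reproduce the Consortium's own slim tools. The toy graph, the gene product names g1 to g4 and the function names are all hypothetical. The three slim terms are the ones mentioned above.

```python
from collections import Counter

# Toy child -> parents map. The single edge is a simplifying assumption.
TOY_PARENTS = {
    "hexose biosynthesis": ["biosynthesis"],
}

# Slim terms taken from the example in the text.
SLIM = {"signal transduction", "biosynthesis", "reproduction"}

# Hypothetical annotations: gene product -> directly assigned terms.
TOY_ANNOTATIONS = {
    "g1": {"hexose biosynthesis"},
    "g2": {"signal transduction"},
    "g3": {"hexose biosynthesis", "signal transduction"},
    "g4": set(),
}


def slim_terms(term, parents, slim):
    """Return the slim terms reachable from `term` by walking up the graph."""
    hits, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            hits.add(current)
        stack.extend(parents.get(current, []))
    return hits


def slim_proportions(annotations, parents, slim):
    """Fraction of all gene products that map to each slim term."""
    counts = Counter()
    for terms in annotations.values():
        mapped = set()
        for term in terms:
            mapped |= slim_terms(term, parents, slim)
        counts.update(mapped)  # each gene product counts once per slim term
    total = len(annotations)
    return {term: counts[term] / total for term in sorted(slim)}
```

With the toy data, half of the gene products fall under biosynthesis, half under signal transduction and none under reproduction. The unannotated product g4 still counts towards the total.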

File formats

The GO datasets are freely available. You can download the GO files in three different formats: flat files (updated daily), XML (updated monthly) and MySQL (updated monthly). For more information on the syntax of these formats, see the GO File Format Guide.

There is a separate flat file for each ontology and a single text file containing the definitions of terms in all of the ontologies. You can download the ontology files and the definitions file from the GO ftp site.

If you need lists of the genes or gene products that have been associated with a particular GO term, consult the table that tracks the number of annotations and links to the gene-association files for each of the collaborating databases.

The XML and MySQL files are stored in a separate archive on the GO Database website. More information on these formats can be found in the GO File Format Guide.

Beyond GO

A family of open-source ontologies

GO allows us to annotate genes and their products with a limited set of attributes. For example, GO does not allow us to describe genes in terms of which cells or tissues they're expressed in, which developmental stages they're expressed at, or their involvement in disease. It is not necessary for GO to do these things because other ontologies are being developed for these purposes. The GO consortium supports the development of other ontologies and makes its tools for editing and curating ontologies freely available. A list of freely available ontologies that are relevant to genomics and proteomics and are structured similarly to GO can be found at the OBO (open biology ontologies) website. A larger list, which includes the ontologies listed at OBO and also other controlled vocabularies that do not fulfil the OBO criteria, is available at the Ontology Working Group page of the Microarray Gene Expression Data Society (MGED).


Cross-products

The existence of several ontologies will also allow us to create 'cross-products' that maximize the utility of each ontology while avoiding redundancy. For example, by combining the developmental terms in the GO process ontology with a second ontology that describes Drosophila anatomical structures, we could create an ontology of fly development. We could repeat this process for other organisms without having to clutter up GO with large numbers of species-specific terms. Similarly, we could create an ontology of biosynthetic pathways by combining the biosynthesis terms in the GO process ontology with a chemical ontology.
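The idea of a cross-product can be sketched as a simple combination of two term lists. This is an illustration only and does not describe any existing cross-product file or tool. The process terms, the anatomical terms, the naming template and the function name are all assumptions.

```python
from itertools import product


def cross_product(process_terms, structure_terms, template="{structure} {process}"):
    """Combine every process term with every anatomical structure term."""
    return [
        template.format(process=process, structure=structure)
        for process, structure in product(process_terms, structure_terms)
    ]


# Illustrative inputs: generic process terms and fly anatomical structures.
PROCESSES = ["development", "morphogenesis"]
STRUCTURES = ["wing", "eye"]

fly_terms = cross_product(PROCESSES, STRUCTURES)
```

Here fly_terms holds four combined terms, such as 'wing development', while neither source list contains a species-specific term. A real cross-product would need curation, because not every mechanical combination is biologically meaningful.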

Mappings to other classification systems

GO is not the only attempt to build structured controlled vocabularies for genome annotation. Nor is it the only such series of catalogs in current use. We have attempted to make translation tables between these catalogs and GO. We caution that these mappings are neither complete nor exact; they are to be used as a guide. One reason for this is the absence of definitions from many of the other catalogs and of a complete set of definitions in GO itself. More information on the syntax of these mappings can be found in the GO File Format Guide.
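Because the mappings are incomplete and inexact, code that uses a translation table should report what it could not translate instead of guessing. The sketch below shows one way to do that. The external catalog entries (prefixed 'ext:'), the table contents and the function name are hypothetical and do not reproduce the syntax or content of the real mapping files.

```python
# Hypothetical translation table: external catalog entry -> GO term names.
# An external entry may map to several GO terms, or to none at all.
EXTERNAL_TO_GO = {
    "ext:protein synthesis": ["translation"],
    "ext:sugar metabolism": ["hexose metabolism", "monosaccharide biosynthesis"],
}


def translate(external_terms, table=EXTERNAL_TO_GO):
    """Split external terms into those with a GO mapping and those without."""
    mapped, unmapped = {}, []
    for term in external_terms:
        if term in table:
            mapped[term] = list(table[term])
        else:
            unmapped.append(term)
    return mapped, unmapped
```

Calling translate with a mix of known and unknown entries returns the mapped terms together with a list of entries that need manual review.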

Contributing to GO

The GO project is constantly evolving, and we welcome feedback from all users. If you need a new term or definition, or would like to suggest that we reorganize a section of one of the ontologies, please do so through our online request-tracking system. Any errors or omissions in annotations should be reported by writing to the GO mailing list.

You can also send questions or suggestions to the GO mailing list. More information on GO mailing lists is available on the GO contacts page.

Copyright © 1999-2003 Gene Ontology Consortium. Permission to use the information contained in this database was given by the researchers/institutes who contributed or published the information. Users of the database are solely responsible for compliance with any copyright restrictions, including those applying to the author abstracts. Documents from this server are provided 'AS-IS' without any warranty, expressed or implied.

Last modified October 2003