File Format Guide


How to use this Guide  |  Anatomy of a GO Entry  |  Anatomy of a Flat File  |  Anatomy of the Definitions File  |  Syntax for Commonly used Definitions  |  Syntax for Comments  |  Obsoletions Merges and Splits  |  Mappings to Other Classification Systems  |  XML Version  |  MySQL Version  |  FASTA version

 

How to use this guide

This GO™ File Format Guide serves to introduce new users to (and remind old users of) the structure and syntax of the GO files. Its main aim is to help users who need to write to or parse the GO files. You will find it more useful if you first read An introduction to GO for more general background information about the GO project and how the ontology works.

Information on annotating genes and gene products to GO can be found in the GO Annotation Guide.

Advice on GO editorial style can be found in the GO Editorial Style Guide.

Anatomy of a GO entry

When you create or edit a GO entry, you need to think about its component parts. Editing GO has become much more straightforward and much less error-prone since the introduction of an editing tool (see the DAG-Edit user guide for more information), but it still helps to familiarize yourself with the syntax of the GO flat files, as you will undoubtedly need to edit the raw data at some point.

Terms and their unique identifiers

The structure of GO is very simple. At its bare minimum, each GO entry consists of a term name (e.g. cell) and a unique, zero-padded seven-digit identifier (also known as its accession number, e.g. GO:0005623), which is used as a database cross-reference in the collaborating databases. The same number range is used across all three ontologies. If you are a GO curator and you haven't already done so, you'll need to define your own range of numbers in go/numbers/go_numbers.
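
As a minimal sketch in Python, the identifier format described above can be checked and generated as follows. The function names are illustrative only and are not part of any GO tool.

```python
import re

# A GO accession is the prefix "GO:" followed by exactly seven digits.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_valid_go_id(text):
    """Return True if text is a well-formed accession such as GO:0005623."""
    return GO_ID_PATTERN.fullmatch(text) is not None


def format_go_id(number):
    """Zero-pad an integer to seven digits and add the GO: prefix."""
    if not 0 <= number <= 9_999_999:
        raise ValueError("GO identifiers have at most seven digits")
    return f"GO:{number:07d}"
```

For example, format_go_id(5623) gives the accession for cell quoted above, GO:0005623.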

Synonyms

Any term may, but does not need to, include one or more synonyms (e.g. type I programmed cell death is a synonym of apoptosis). The syntax for synonyms is

 
synonym:[synonym1] ; synonym:[synonym2]

There are several different levels of similarity recognised between terms and their synonyms, as shown below:
    1.   ~    the terms are related (e.g. XXX complex and XXX)
    2.   =    the term is an exact synonym
    3.   <    the synonym is broader than the term name
    4.   >    the synonym is more precise than the term name
    5.   !=   the term is related, but the exact relationship is not specified
  
The relationships of these types of synonym to one another are represented as follows:
    related to
    is_a exact
    is_a broader
    is_a narrower
    is_a undefined
  
See the Synonym Guide for more information.

General database cross-references

Another optional extra is one or more general database cross-references (dbxrefs), which refer to an identical object in another database. For instance, the molecular function term protein kinase CK2 activity has the database cross-reference EC:2.7.1.37 -- the accession number of this enzyme activity in the Enzyme Commission database. A complete list of database cross-references, together with their accepted abbreviations, is available from the GO ftp and cvs archives. The syntax for database cross-references is

 database_abbreviation:identifier 

Definitions

We are in the process of writing a definition for each GO term. However, you won't find these in the GO flat files as they're kept in a separate file called GO.defs (see Anatomy of the definitions file).



Anatomy of a flat file

Now that we're clear on what an individual GO entry comprises, we can think about them in the context of an entire flat file. The structure described below holds true for each of the ontology flat files:

Biological Process (process.ontology)
Molecular Function (function.ontology)
Cellular Component (component.ontology)

Front matter

The beginning of each file contains comments (lines that begin with a !) about how and when the file was generated. The first lines always carry information about the version, the date of last update, (optionally) the source of the file, the name of the database, the domain of the file and the editors of the file (except *.html files).

Lines in which the first non-space character is a $ either reflect the domain and aspect of the ontology (i.e. $text) or the end of file (i.e. the $ character on a line by itself).

Here's an example of the front matter of a GO flat file:


!autogenerated-by:     DAG-Edit version 1.315
!saved-by:             midori
!date:                 Fri Jan 03 17:14:37 GMT 2003
!version: $Revision: 1.47 $
!type: % ISA Is a
!type: < PARTOF Part of
$Gene_Ontology ; GO:0003673
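
A minimal Python sketch of reading this front matter is shown here. It assumes only what is described above: comment lines start with ! and the domain line starts with $. The function name and return shape are illustrative.

```python
def read_front_matter(lines):
    """Split the leading lines of a flat file into comment tags and the $ line.

    Returns (tags, root): tags is a list of (name, value) pairs taken from
    lines starting with '!', and root is the text of the first '$' line.
    """
    tags = []
    root = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("!"):
            # Split on the first colon only, so values may contain colons.
            name, _, value = stripped[1:].partition(":")
            tags.append((name.strip(), value.strip()))
        elif stripped.startswith("$"):
            root = stripped[1:].strip()
            break
    return tags, root
```

Tags are kept as a list rather than a dictionary because a tag such as !type can appear more than once.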


Relationships between terms

In the GO flat files, the symbol % is used to represent an is-a relationship and the symbol < a part-of relationship. For more information on these relationships between terms, see the GO Editorial Style Guide. Parent-child relationships between terms are represented by indentation:


  parent_term
   child_term

Is-a relationships


  %term0
   %term1 % term2

means that term1 is a subclass of term0 and also a subclass of term2.

Part-of relationships


  %term0
    %term1 < term2 < term3

means that term1 is a subclass of term0 and also part of term2 and term3.
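
The indentation rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used by any GO tool: it assumes indentation is made of spaces only, skips lines starting with !, and recovers only the parent implied by indentation (additional parents listed on the same line after % or < are left in the line text).

```python
def primary_parents(lines):
    """Return (line_text, parent_text) pairs based on indentation depth.

    The parent of a line is the nearest preceding line with a smaller indent;
    the first line has no parent (None).
    """
    stack = []  # (indent, text) for the ancestors of the current line
    result = []
    for raw in lines:
        if not raw.strip() or raw.lstrip().startswith("!"):
            continue
        text = raw.strip()
        indent = len(raw) - len(raw.lstrip(" "))
        # Anything at the same depth or deeper cannot be this line's parent.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1] if stack else None
        result.append((text, parent))
        stack.append((indent, text))
    return result
```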

Line syntax

The order in which items appear on a line (where [] indicates an optional item) is as follows:


< | % term [; db cross ref]* [; synonym:text]*  [ < | % term]*

Here's a real example from the molecular function ontology (it would appear on a single line in the actual file):

%UDPsulfoquinovose synthase ; GO:0046507 ; EC:3.13.1.1 ;
synonym:sulfite\:UDP-glucose sulfotransferase
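
The line syntax can be illustrated with a short Python parser. This is a sketch under assumptions: the line has already been unwrapped onto a single line, a backslash escapes the following character (as in the \: above), fields are separated by semicolons, and additional parents are introduced by % or < surrounded by spaces. The dictionary keys are illustrative names.

```python
import re


def unescape(text):
    """Drop the backslash from escaped characters such as '\\:'."""
    return re.sub(r"\\(.)", r"\1", text)


def parse_term_line(line):
    """Parse one non-empty term line into a dictionary."""
    line = line.strip()
    symbols = {"%": "is_a", "<": "part_of"}
    # Additional parents follow the main entry, each after ' % ' or ' < '.
    head, *rest = re.split(r"\s+([%<])\s+", line[1:])
    fields = [f for f in re.split(r"\s*;\s*", head.strip()) if f]
    entry = {
        "relationship": symbols.get(line[0]),
        "name": unescape(fields[0]),
        "go_ids": [],
        "dbxrefs": [],
        "synonyms": [],
        "other_parents": [],
    }
    for field in fields[1:]:
        if field.startswith("synonym:"):
            entry["synonyms"].append(unescape(field[len("synonym:"):]))
        elif field.startswith("GO:"):
            # A comma-separated list holds the primary ID and any secondary IDs.
            entry["go_ids"].extend(i.strip() for i in field.split(","))
        else:
            entry["dbxrefs"].append(unescape(field))
    # rest alternates: symbol, parent text, symbol, parent text, ...
    for symbol, parent in zip(rest[0::2], rest[1::2]):
        parent_name = re.split(r"\s*;\s*", parent.strip())[0]
        entry["other_parents"].append((symbols[symbol], unescape(parent_name)))
    return entry
```

Applied to the example above, this yields the name UDPsulfoquinovose synthase, the ID GO:0046507, the cross-reference EC:3.13.1.1 and one synonym with the colon unescaped.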
  

Anatomy of the definitions file

As mentioned above, the definitions for terms in all three ontology files are stored in go/doc/GO.defs.

Mandatory tags

Each definition must contain the following:

  • term: the GO term to which the definition refers.
  • goid: the term's unique identifier.
  • definition: the definition of the term.
  • definition_reference: one or more references for the definition.
A definition may also have a comment:
  • comment: text (see comment syntax, below)
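
As a rough Python sketch, the tags listed above could be read like this. The layout assumptions are mine, not taken from the file specification: each tag starts its own line as "tag: value", a new record starts at each term: line, lines starting with ! are comments, and any other line continues the previous value.

```python
KNOWN_TAGS = ("term", "goid", "definition", "definition_reference", "comment")
MANDATORY_TAGS = ("term", "goid", "definition", "definition_reference")


def parse_definitions(lines):
    """Return a list of records; each maps a tag to a list of its values."""
    records = []
    current = None
    last_tag = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        tag, sep, value = line.partition(":")
        if sep and tag in KNOWN_TAGS:
            if tag == "term":
                current = {}
                records.append(current)
            if current is None:
                continue  # a tag before the first term: line is ignored
            current.setdefault(tag, []).append(value.strip())
            last_tag = tag
        elif current is not None and last_tag is not None:
            # Continuation of a value wrapped over several lines.
            current[last_tag][-1] += " " + line
    return records


def missing_tags(record):
    """List the mandatory tags that a record does not contain."""
    return [tag for tag in MANDATORY_TAGS if tag not in record]
```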

Syntax for commonly used definitions

A specific syntax (wording) is used for writing the more common definitions. These are:

  1. METABOLISM
    The chemical reactions and physical changes
    The chemical reactions and physical changes involving

  2. BIOSYNTHESIS
    The formation from simpler components of

  3. CATABOLISM
    The breakdown into simpler components of

  4. REGULATION
    Any process that modulates the frequency, rate or extent of

  5. NEGATIVE REGULATION
    Any process that stops, prevents or reduces the rate of

  6. POSITIVE REGULATION
    Any process that activates or increases the rate of


Where you are adding a definition with this kind of pre-set syntax, the Dbxref will be GO:your-initials.

Syntax for comments

Obsoletions

If you have made a term obsolete, you should explain why, and if possible suggest an alternative term or terms.

First add the word 'OBSOLETE.' at the beginning of the term definition. Then, with the term highlighted, click the 'obsolete term' button. Then you should add a comment to explain why you obsoleted the term and to redirect people who wish to make annotations.

Use the following syntax for the reason for obsoletion:

    comment: This term was made obsolete because [reason]. 
To suggest alternative terms, use one of the following:

A. If exact replacement is possible (i.e. it is safe to move all existing annotations, keyword mappings, etc. to one term), precede the suggested term with 'use':

    To update annotations, use the [ontology name] term '[text] ; GO:[id].'

    example:
      term: transfer RNA
      goid: GO:0005563
      comment: This term was made obsolete because it represents a gene
      product. To update annotations, use the molecular function term
      'triplet codon-amino acid adaptor ; GO:0030533.'
B. In cases where all existing annotations and mappings can't necessarily be transferred to one term, put 'consider' in front of the suggested terms. Syntax for different situations:

1. There is only one suggestion, but it may not work for all annotations:

    To update annotations, consider the [ontology name] term '[text] ; GO:[id].'

    example:
      term: activation of MAPK (mating sensu Fungi)
      goid: GO:0030456
      comment: This term was made obsolete because it is a gene product
      specific term. To update annotations, consider the process term
      'signal transduction during conjugation with cellular fusion ;
      GO:0000750.'
2. To make more than one specific suggestion:

a) from a single ontology, separate terms with commas:

    To update annotations, consider the [ontology name] terms '[text1] ; GO:[id1]',
    '[text2] ; GO:[id2]', '[text3] ; GO:[id3]'.

    example:
      term: allantoin/allantoate transport
      goid: GO:0006838
      comment: This term was made obsolete because it is a composite term
      that represents two individual processes. To update annotations,
      consider the biological process term 'allantoin transport ;
      GO:0015720', 'allantoate transport ; GO:0015719'.
b) from more than one ontology, separate terms from one ontology with commas, and use 'and' between ontology names:
    To update annotations, consider the [ontology name] term '[text1]
    ; GO:[id1]' and the [ontology name] term '[text2] ; GO:[id2]'.

    examples:
      term: expansin
      goid: GO:0009936
      comment: This term was made obsolete because it represents a gene
      product. To update annotations, consider the cellular component term
      'cell wall (sensu Magnoliophyta) ; GO:0009505' and the biological
      process term 'cell growth ; GO:0016049.'

      term: blue-sensitive opsin
      goid: GO:0015059
      comment: This term was made obsolete because it refers to a
      class of proteins. To update annotations, consider the molecular
      function terms 'photoreceptor ; GO:0009881',
      '3,4-didehydroretinal binding ; GO:0046876' and 'retinal binding
      ; GO:0016918' and its children, the cellular component term
      'integral to membrane ; GO:0016021' and the biological process
      terms 'phototransduction, visible light ; GO:0007603' and 'UV-A,
      blue light phototransduction ; GO:0009588'.

To suggest a term and all its children, say 'consider [term info as above] and its children' (as in the example above).
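
Because these comments follow a fixed pattern, the suggested terms can be pulled out mechanically. The Python sketch below assumes the comment has been joined onto one line and that term names contain no apostrophes. The function name is illustrative.

```python
import re

# A quoted suggestion looks like '[text] ; GO:[id]' with an optional full stop
# before the closing quote.
SUGGESTION = re.compile(r"'([^']*?)\s*;\s*(GO:\d{7})\.?'")


def suggested_terms(comment):
    """Return (mode, [(name, go_id), ...]) for an obsoletion comment.

    mode is 'use' for an exact replacement, 'consider' for a suggestion,
    or None if the comment does not follow the syntax.
    """
    match = re.search(r"To update annotations, (use|consider)\b", comment)
    mode = match.group(1) if match else None
    return mode, SUGGESTION.findall(comment)
```

A mode of 'use' signals that annotations can safely be moved to the single term returned. With 'consider', each suggestion still needs a curator's judgement.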

Restoring obsolete terms

If you need to reinstate an obsolete term in the ontologies, use the following:

    comment: Note that this term was reinstated from obsolete.
    
    

Splits

If you have split, for example, term GO:000000A into GO:000000B and GO:000000C, you should add the following comment to the entry for GO:000000B (and the inverse to the GO:000000C entry):

    comment: This term was split from '[term name] ; GO:000000A'
    (sibling term '[term name] ; GO:000000C').

 
(all on a single line)

Also see

If you want to refer to other terms in the ontologies, use this format:

    comment: Also see '[term name] ; GO:0000000'.

Any other comments

If there is any other comment you would like to make, prefix it with the following:

    comment: Note that [comment].


Obsoletions, merges and splits

When a term is defunct, then it is removed from its place in the graph and made a child of the meta term obsolete. Its definition (in GO.defs) remains.

When two terms are merged, e.g. terma and termb are merged as terma, the GO_id of termb is made a secondary GO_id. Usually, the ID that has existed longer is used as the primary ID, but exceptions can be made. Secondary GO_ids are presented as a comma-separated list, e.g.:


terma ; GO:idterma, GO:idtermb, GO:idtermc, ...

When a term is split, each new term gets a new GO_id, and the original GO_id is included in a comment (see comment syntax above).



Mappings to other classification systems

Mappings of GO have been made to many other classification systems. As cautioned in An introduction to GO, these mappings are neither complete nor exact, but are rather designed to be used as a guide. Here, we describe the syntax of these mapping files.

The line that begins:


!Uses:

for example:

!Uses:http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_reports.spl,
15 aug 2000.

gives the source of the external file. The line syntax is:
   
database:<identifier> > GO:<term> ; GO:<GO_id>

for example:

TIGR_role:11030 73 Amino acid biosynthesis Glutamate family >
GO:glutamine family amino-acid biosynthesis ; GO:0009084

all on a single line. A term from an external system can also map to more than one GO term (a one-to-many relationship); each additional GO term is added with a further >. For example:

MultiFun:1.5.1.18 Isoleucine/valine > GO:isoleucine biosynthesis ; GO:0009097 
> GO:valine biosynthesis ; GO:0009099

If no equivalent GO term exists for a term from another classification system, the following should be added as a mapping:

> GO:.

for example:

MultiFun:1.5 Building block biosynthesis > GO:.
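
A mapping line can be split along these rules with a few lines of Python. This sketch assumes the line has been unwrapped, that > surrounded by whitespace never occurs inside a term name, and that comment lines starting with ! have already been skipped. The function name is illustrative.

```python
import re


def parse_mapping_line(line):
    """Return (external_term, [(go_term_name, go_id), ...]) for one line.

    The list is empty when the line carries the 'no equivalent' marker GO:.
    """
    external, *targets = re.split(r"\s+>\s+", line.strip())
    mappings = []
    for target in targets:
        if target == "GO:.":
            continue  # explicit statement that no GO term corresponds
        name, _, go_id = target.rpartition(";")
        name = name.strip()
        if name.startswith("GO:"):
            name = name[len("GO:"):]
        mappings.append((name, go_id.strip()))
    return external, mappings
```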

XML version

The XML version of GO, which includes all three ontologies and the definitions, can be downloaded from the GO database archive. The document type definition (DTD) is available from the GO FTP archive.

The XML file is built from the flat files and the gene association files on a monthly basis.

Here's an XML snapshot (with some lines wrapped for legibility):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go>

<go:go xmlns:go="http://www.geneontology.org/xml-dtd/go.dtd#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <go:version timestamp="Wed May  9 23:55:02 2001" />
    <rdf:RDF>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0003673">
            <go:accession>GO:0003673</go:accession>
            <go:name>Gene_Ontology</go:name>
            <go:definition></go:definition>
        </go:term>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0003674">
            <go:accession>GO:0003674</go:accession>
            <go:name>molecular_function</go:name>
            <go:definition>The action characteristic of a gene
            product.</go:definition>
            <go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
            <go:dbxref>
                <go:database_symbol>go</go:database_symbol>
                <go:reference>curators</go:reference>
            </go:dbxref>
        </go:term>
        <go:term rdf:about="http://www.geneontology.org/go#GO:0016209">
            <go:accession>GO:0016209</go:accession>
            <go:name>antioxidant</go:name>
            <go:definition></go:definition>
            <go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
            <go:association>
                <go:evidence evidence_code="ISS">
                    <go:dbxref>
                        <go:database_symbol>fb</go:database_symbol>
                        <go:reference>fbrf0105495</go:reference>
                    </go:dbxref>
                </go:evidence>
                <go:gene_product>
                    <go:name>CG7217</go:name>
                    <go:dbxref>
                        <go:database_symbol>fb</go:database_symbol>
                        <go:reference>FBgn0038570</go:reference>
                    </go:dbxref>
                </go:gene_product>
            </go:association>
            <go:association>
                <go:evidence evidence_code="ISS">
                    <go:dbxref>
                        <go:database_symbol>fb</go:database_symbol>
                        <go:reference>fbrf0105495</go:reference>
                    </go:dbxref>
                </go:evidence>
                <go:gene_product>
                    <go:name>Jafrac1</go:name>
                    <go:dbxref>
                        <go:database_symbol>fb</go:database_symbol>
                        <go:reference>FBgn0040309</go:reference>
                    </go:dbxref>
                </go:gene_product>
            </go:association>
        </go:term>
    </rdf:RDF>
</go:go>


The basic unit of the GO XML database is go:term. Owing to limitations of the XML id and idref attributes (for instance, multiple parentage cannot be represented), the linking mechanism is RDF. RDF provides a much more flexible system for representing trees. To follow the links, note that term GO:0003674 has the attribute

rdf:about="http://www.geneontology.org/go#GO:0003674"

This is roughly equivalent to

id="http://www.geneontology.org/go#GO:0003674"

In RDF, URLs are used as IDs so that they are universally unique. Now, note that term GO:0016209 has the tag

<go:isa 
rdf:resource="http://www.geneontology.org/go#GO:0003674" />

This shows that its parent is GO:0003674. This tag represents the relationship "GO:0016209 isa GO:0003674" or, in plain English, "Antioxidant is a molecular function." The other type of parentage relationship is go:part-of. GO:0003674 has the tag

<go:part-of 
rdf:resource="http://www.geneontology.org/go#GO:0003673" />

This shows that "Molecular function is part of the Gene Ontology".

In addition, each term can have one go:name, one go:accession and one go:definition, plus any number of go:dbxref and go:association elements. go:name, go:accession and go:definition are self-explanatory. go:dbxref represents the term in an external database, and go:association represents the gene associations of each term. Each go:association can contain a go:evidence element, which holds a go:dbxref to the evidence supporting the association, and a go:gene_product element, which holds the gene symbol and a go:dbxref.
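
The links can be followed programmatically. The sketch below uses Python's standard xml.etree.ElementTree module and the two namespace URIs declared in the snapshot above. The function name and the shape of its result are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared on the go:go element in the snapshot.
NS = {
    "go": "http://www.geneontology.org/xml-dtd/go.dtd#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]


def term_parents(xml_text):
    """Map each accession to a list of (relationship, parent_accession)."""
    root = ET.fromstring(xml_text)
    parents = {}
    for term in root.iterfind(".//go:term", NS):
        accession = term.findtext("go:accession", namespaces=NS)
        links = []
        for tag in ("isa", "part-of"):
            for link in term.findall("go:" + tag, NS):
                # The parent accession is the fragment after '#' in the URL.
                url = link.get(RDF_RESOURCE)
                links.append((tag, url.rsplit("#", 1)[-1]))
        parents[accession] = links
    return parents
```

Run on the snapshot, this would report GO:0016209 as isa GO:0003674, and GO:0003674 as part-of GO:0003673.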

MySQL version

The MySQL version of GO is built monthly and can be downloaded from the MySQL archive. Four databases are built and made available for download:

  • termdb - ontologies, definitions and mappings to other dbs
  • assocdb - the above, plus associations to gene products
  • seqdb - the above, plus protein sequences for some of the gene products
  • seqdblite - the above, with IEA associations stripped out (this is the version that drives AmiGO)

There are two download options for each of these databases, giving 8 possible options. You only need to download one of these files. Do not attempt to parse these files yourself; they are meant to be loaded into a MySQL database. There is also a Perl API for advanced queries on the database. For full details, see the README file in the archive. To obtain documentation for the GO database, download either of the following two files from the archive:

go_YYYYMM-schema-mysql.sql - the MySQL table creation statements, plus documentation.

go_YYYYMM-schema-html - Designed for viewing with a web browser; does not contain full documentation.

FASTA Format

There is a FASTA version of the database, available from the database archive.

Last modified October 2003.