This GO File Format Guide introduces new users to (and reminds old users of) the structure and syntax of the GO files. Its main aim is to help users who need to write or parse the GO files. You will find it more useful if you first read An introduction to GO for more general background information about the GO project and how the ontology works.
Information on annotating genes and gene products to GO can be found in the GO Annotation Guide.
Advice on GO editorial style can be found in the GO Editorial Style Guide.
When you create or edit a GO entry, you need to think about its component parts. Editing GO has become much more straightforward and much less error-prone since the introduction of an editing tool (see the DAG-Edit user guide for more information), but it still helps to familiarize yourself with the syntax of the GO flat files, as you will undoubtedly need to edit the raw data at some point.
The structure of GO is very simple. At its bare minimum, each GO entry consists of a term name (e.g. cell) and a unique, zero-padded seven-digit identifier (also known as its accession number, e.g. GO:0005623), which is used as a database cross-reference in the collaborating databases. The same number range is used across all three ontologies. If you are a GO curator and you haven't already done so, you'll need to define your own range of numbers in go/numbers/go_numbers.
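As a minimal sketch of the identifier format described above ("GO:" followed by seven zero-padded digits), the following Python checks and builds identifiers. The function names are illustrative only and are not part of any GO tool.

```python
import re

# "GO:" followed by exactly seven digits, e.g. GO:0005623
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_valid_go_id(text):
    """Return True if text is a well-formed GO identifier."""
    return GO_ID_PATTERN.fullmatch(text) is not None


def format_go_id(number):
    """Zero-pad an integer to seven digits and add the GO: prefix."""
    if not 0 <= number <= 9_999_999:
        raise ValueError("a GO identifier has at most seven digits")
    return f"GO:{number:07d}"


print(format_go_id(5623))            # GO:0005623
print(is_valid_go_id("GO:0005623"))  # True
print(is_valid_go_id("GO:5623"))     # False
```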
Any term may, but does not need to, include one or more synonyms (e.g. type I programmed cell death is a synonym of apoptosis). The syntax for synonyms is
synonym:[synonym1] ; synonym:[synonym2]
There are several different levels of similarity recognised between terms and their synonyms, as shown below:
1. ~ the terms are related (e.g. XXX complex and XXX)
2. = the term is an exact synonym
3. < the synonym is broader than the term name
4. > the synonym is more precise than the term name
5. != the term is related, but the exact relationship is not specified
The relationships of these types of synonym to one another are represented as follows:
related to is_a exact is_a broader is_a narrower is_a undefined
See the Synonym Guide for more information.
Another optional extra is one or more general database cross references (dbxrefs), which refer to an identical object in another database. For instance, the molecular function term protein kinase CK2 activity has the database cross reference EC220.127.116.11 -- the accession number of this enzyme activity in the Enzyme Commission database. A complete list of database cross-references, together with their accepted abbreviations, is available from the GO ftp and cvs archives. The syntax for database cross-references is
We are in the process of writing a definition for each GO term. However, you won't find these in the GO flat files, as they're kept in a separate file called GO.defs (see Anatomy of the definitions file).
Now that we're clear on what an individual GO entry comprises, we can think about them in the context of an entire flat file. The structure described below holds true for each of the ontology flat files:
The beginning of each file contains comments (lines that begin with a !) about how and when the file was generated. The first lines always carry information about the version, the date of last update, (optionally) the source of the file, the name of the database, the domain of the file and the editors of the file (except *.html files).
Lines in which the first non-space character is a $ either reflect the domain and aspect of the ontology (i.e. $text) or the end of file (i.e. the $ character on a line by itself).
Here's an example of the front matter of a GO flat file:
!autogenerated-by: DAG-Edit version 1.315
!saved-by: midori
!date: Fri Jan 03 17:14:37 GMT 2003
!version: $Revision: 1.47 $
!type: % ISA Is a
!type: < PARTOF Part of
$Gene_Ontology ; GO:0003673
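For readers who need to parse this front matter, here is a rough Python sketch. It assumes each comment line has the form "!key: value" and that the front matter ends at the first line whose first non-space character is $. The function name parse_front_matter is hypothetical.

```python
def parse_front_matter(lines):
    """Collect "!key: value" comments up to the first "$" line."""
    comments = []  # list of (key, value); keys such as "type" may repeat
    root = None    # text of the first "$" line, without the "$"
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("!"):
            # Split on the first colon only, so dates and $Revision$ stay intact
            key, _, value = stripped[1:].partition(":")
            comments.append((key.strip(), value.strip()))
        elif stripped.startswith("$"):
            root = stripped[1:].strip()
            break
    return comments, root


header = [
    "!autogenerated-by: DAG-Edit version 1.315",
    "!saved-by: midori",
    "!date: Fri Jan 03 17:14:37 GMT 2003",
    "!version: $Revision: 1.47 $",
    "!type: % ISA Is a",
    "!type: < PARTOF Part of",
    "$Gene_Ontology ; GO:0003673",
]
comments, root = parse_front_matter(header)
print(root)  # Gene_Ontology ; GO:0003673
```

A list of pairs is used rather than a dictionary because the !type key appears more than once.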
In the GO flat files, the symbol % is used to represent an is-a relationship and the symbol < a part-of relationship. For more information on these relationships between terms, see the GO Editorial Style Guide. Parent-child relationships between terms are represented by indentation:
%term0
 %term1 % term2
means that term1 is a subclass of term0 and also a subclass of term2.
%term0
 %term1 < term2 < term3
means that term1 is a subclass of term0 and also part of term2 and term3.
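The indentation rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser used by any GO tool: it assumes indentation is made of spaces, that a deeper indent means "child of the nearest shallower line above", and it ignores the extra parents listed later on the same line. The name nesting_parents is hypothetical.

```python
import re


def nesting_parents(lines):
    """Return (child, relationship symbol, parent) triples implied by indentation."""
    stack = []   # (indent width, term name) for each currently open ancestor
    result = []
    for line in lines:
        text = line.lstrip(" ")
        if not text.strip() or text[0] not in "%<":
            continue  # skip blank lines, ! comments and $ lines
        indent = len(line) - len(text)
        symbol = text[0]
        # The term name ends at the first " ; " or at the next " % " / " < "
        name = re.split(r"\s;\s|\s[%<]\s", text[1:].strip(), maxsplit=1)[0].strip()
        # Close any ancestors that are at the same depth or deeper
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if stack:
            result.append((name, symbol, stack[-1][1]))
        stack.append((indent, name))
    return result


print(nesting_parents(["%term0", " %term1 < term2 < term3"]))
# [('term1', '%', 'term0')]
```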
The order in which items appear on a line (where [ ] indicates an optional item) is as follows:
< | % term [; db cross ref]* [; synonym:text]* [ < | % term]*
Here's a real example from the molecular function ontology (it would appear on a single line in the actual file):
%UDPsulfoquinovose synthase ; GO:0046507 ; EC:18.104.22.168 ; synonym:sulfite\:UDP-glucose sulfotransferase
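A single term line in this layout could be split into its parts roughly as shown below. This is a sketch under assumptions: fields are separated by ";", only the "\:" escape seen in the example is handled, any field starting with "GO:" is treated as an identifier (possibly a comma-separated list), and every other field is treated as a database cross-reference. The names unescape and parse_term_line are illustrative.

```python
import re


def unescape(text):
    # Only the "\:" escape shown in the example above is handled here.
    return text.replace("\\:", ":")


def parse_term_line(line):
    """Split one flat-file term line into relation, name, IDs, dbxrefs and synonyms."""
    body = line.strip()
    relation, rest = body[0], body[1:]
    # Additional parents are introduced by " % " or " < " after the main fields
    parts = re.split(r"\s([%<])\s", rest)
    fields = [field.strip() for field in parts[0].split(";")]
    term = {
        "relation": relation,
        "name": unescape(fields[0]),
        "go_ids": [],
        "dbxrefs": [],
        "synonyms": [],
        "other_parents": [],
    }
    for field in fields[1:]:
        if field.startswith("synonym:"):
            term["synonyms"].append(unescape(field[len("synonym:"):]))
        elif field.startswith("GO:"):
            term["go_ids"].extend(i.strip() for i in field.split(","))
        else:
            term["dbxrefs"].append(field)
    for symbol, text in zip(parts[1::2], parts[2::2]):
        term["other_parents"].append((symbol, text.strip()))
    return term


example = (r"%UDPsulfoquinovose synthase ; GO:0046507 ; EC:18.104.22.168 ; "
           r"synonym:sulfite\:UDP-glucose sulfotransferase")
parsed = parse_term_line(example)
print(parsed["name"], parsed["go_ids"], parsed["synonyms"])
```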
As mentioned above, the definitions for terms in all three ontology files are stored in go/doc/GO.defs.
Each definition must contain the following:
Obsoletions
If you have made a term obsolete, you should explain why, and if possible suggest an alternative term or terms.
First add the word 'OBSOLETE.' at the beginning of the term definition. Then, with the term highlighted, click the 'obsolete term' button. Then you should add a comment to explain why you obsoleted the term and to redirect people who wish to make annotations.
Use the following syntax for the reason for obsoletion:
comment: This term was made obsolete because [reason].
To suggest alternative terms, use one of the following:
A. If exact replacement is possible (i.e. it is safe to move all existing annotations, keyword mappings, etc. to one term), precede the suggested term with 'use':
To update annotations, use the [ontology name] term '[text] ; GO:[id].'
example:
term: transfer RNA
goid: GO:0005563
comment: This term was made obsolete because it represents a gene product. To update annotations, use the molecular function term 'triplet codon-amino acid adaptor ; GO:0030533.'
B. In cases where all existing annotations and mappings can't necessarily be transferred to one term, put 'consider' in front of the suggested terms. Syntax for different situations:
1. There is only one suggestion, but it may not work for all annotations:
To update annotations, consider the [ontology name] term '[text] ; GO:[id].'
example:
term: activation of MAPK (mating sensu Fungi)
goid: GO:0030456
comment: This term was made obsolete because it is a gene product specific term. To update annotations, consider the process term 'signal transduction during conjugation with cellular fusion ; GO:0000750.'
2. To make more than one specific suggestion:
a) from a single ontology, separate terms with commas:
To update annotations, consider the [ontology name] terms '[text1] ; GO:[id1]', '[text2] ; GO:[id2]', '[text3] ; GO:[id3]'.
example:
term: allantoin/allantoate transport
goid: GO:0006838
comment: This term was made obsolete because it is a composite term that represents two individual processes. To update annotations, consider the biological process term 'allantoin transport ; GO:0015720', 'allantoate transport ; GO:0015719'.
b) from more than one ontology, separate terms from one ontology with commas, and use 'and' between ontology names:
To update annotations, consider the [ontology name] term '[text1] ; GO:[id1]' and the [ontology name] term '[text2] ; GO:[id2]'.
examples:
term: expansin
goid: GO:0009936
comment: This term was made obsolete because it represents a gene product. To update annotations, consider the cellular component term 'cell wall (sensu Magnoliophyta) ; GO:0009505' and the biological process term 'cell growth ; GO:0016049.'
term: blue-sensitive opsin
goid: GO:0015059
comment: This term was made obsolete because it refers to a class of proteins. To update annotations, consider the molecular function terms 'photoreceptor ; GO:0009881', '3,4-didehydroretinal binding ; GO:0046876' and 'retinal binding ; GO:0016918' and its children, the cellular component term 'integral to membrane ; GO:0016021' and the biological process terms 'phototransduction, visible light ; GO:0007603' and 'UV-A, blue light phototransduction ; GO:0009588'.
To suggest a term and all its children, say 'consider [term info as above] and its children' (as in the example above).
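If you generate these comments from a script, the single-ontology cases (A, B.1 and B.2a) might be assembled as in the sketch below. The function obsoletion_comment is hypothetical, it does not cover suggestions drawn from several ontologies, and it simply follows the punctuation of the templates above.

```python
def obsoletion_comment(reason, ontology, terms, exact=False):
    """Build an obsoletion comment for suggestions from a single ontology.

    terms is a list of (term name, GO id) pairs.
    exact=True selects 'use' (case A); otherwise 'consider' (case B).
    """
    text = f"comment: This term was made obsolete because {reason}."
    if not terms:
        return text
    verb = "use" if exact else "consider"
    if len(terms) == 1:
        name, go_id = terms[0]
        # Single suggestion: the templates put the full stop inside the quotes
        return (f"{text} To update annotations, {verb} the {ontology} "
                f"term '{name} ; {go_id}.'")
    if exact:
        raise ValueError("an exact replacement names exactly one term")
    quoted = ", ".join(f"'{name} ; {go_id}'" for name, go_id in terms)
    return f"{text} To update annotations, {verb} the {ontology} terms {quoted}."


print(obsoletion_comment(
    "it represents a gene product",
    "molecular function",
    [("triplet codon-amino acid adaptor", "GO:0030533")],
    exact=True,
))
```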
Restoring obsolete terms
If you need to reinstate an obsolete term in the ontologies, use the following:
comment: Note that this term was reinstated from obsolete.
Splits
If you have split, for example, term GO:000000A into GO:000000B and GO:000000C, you should add the following comment to the entry for GO:000000B (and the inverse to the GO:000000C entry):
comment: This term was split from '[term name] ; GO:000000A' (sibling term '[term name] ; GO:000000C').
(all on a single line)
Also see
If you want to refer to other terms in the ontologies, use this format:
comment: Also see '[term name] ; GO:0000000'.
Any other comments
If there is any other comment you would like to make, prefix it with the following:
comment: Note that [comment].
When two terms are merged, e.g. terma and termb are merged as terma, then the GO_id of termb is made a secondary GO_id. Usually, the ID that has existed longer is used as the primary ID, but exceptions can be made. Secondary GO_ids are presented as a comma-separated list, e.g.:
terma ; GO:idterma, GO:idtermb, GO:idtermc, ...
When a term is split, each new term gets a new GO_id, and the original GO_id is included in a comment (see comment syntax above).
Mappings of GO have been made to many other classification systems. As cautioned in An introduction to GO, these mappings are neither complete nor exact, but are rather designed to be used as a guide. Here, we describe the syntax of these mapping files.
The line that begins:
!Uses:http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_reports.spl, 15 aug 2000.
gives the source of the external file. The line syntax is:
database:<identifier> > GO:<term> ; GO:<GO_id>
for example:
TIGR_role:11030 73 Amino acid biosynthesis Glutamate family > GO:glutamine family amino-acid biosynthesis ; GO:0009084
all on a single line. The relationship between terms from external systems and GO terms can also be one to many, and these should just be added with a further >. For example:
MultiFun:22.214.171.124 Isoleucine/valine > GO:isoleucine biosynthesis ; GO:0009097 > GO:valine biosynthesis ; GO:0009099
If no equivalent GO term exists for a term from another classification system, the following should be added as a mapping:
> GO:.
for example:
MultiFun:1.5 Building block biosynthesis > GO:.
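A mapping line in this syntax can be read with a few string operations. The sketch below assumes that " > " never occurs inside a term name and that comment lines beginning with ! have already been filtered out by the caller. The name parse_mapping_line is illustrative.

```python
def parse_mapping_line(line):
    """Return (database, identifier text, [(GO term name, GO id), ...])."""
    external, *targets = [part.strip() for part in line.split(" > ")]
    database, _, identifier = external.partition(":")
    go_terms = []
    for target in targets:
        if target == "GO:.":
            continue  # explicit marker for "no equivalent GO term"
        name, _, go_id = target.partition(" ; ")
        if name.startswith("GO:"):
            name = name[len("GO:"):]
        go_terms.append((name.strip(), go_id.strip()))
    return database, identifier, go_terms


print(parse_mapping_line(
    "TIGR_role:11030 73 Amino acid biosynthesis Glutamate family > "
    "GO:glutamine family amino-acid biosynthesis ; GO:0009084"
))
print(parse_mapping_line("MultiFun:1.5 Building block biosynthesis > GO:."))
```

An empty list in the third position corresponds to the "> GO:." case, where the external term has no GO equivalent.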
The XML version of GO, which includes all three ontologies and the definitions, can be downloaded from the GO database archive. The document type definition (DTD) is available from the GO FTP archive.
The XML file is built from the flat files and the gene association files on a monthly basis.
Here's an XML snapshot (with some lines wrapped for legibility):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go>
<go:go xmlns:go="http://www.geneontology.org/xml-dtd/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <go:version timestamp="Wed May 9 23:55:02 2001" />
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0003673">
      <go:accession>GO:0003673</go:accession>
      <go:name>Gene_Ontology</go:name>
      <go:definition></go:definition>
    </go:term>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0003674">
      <go:accession>GO:0003674</go:accession>
      <go:name>molecular_function</go:name>
      <go:definition>The action characteristic of a gene product.</go:definition>
      <go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
      <go:dbxref>
        <go:database_symbol>go</go:database_symbol>
        <go:reference>curators</go:reference>
      </go:dbxref>
    </go:term>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0016209">
      <go:accession>GO:0016209</go:accession>
      <go:name>antioxidant</go:name>
      <go:definition></go:definition>
      <go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
      <go:association>
        <go:evidence evidence_code="ISS">
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>fbrf0105495</go:reference>
          </go:dbxref>
        </go:evidence>
        <go:gene_product>
          <go:name>CG7217</go:name>
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>FBgn0038570</go:reference>
          </go:dbxref>
        </go:gene_product>
      </go:association>
      <go:association>
        <go:evidence evidence_code="ISS">
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>fbrf0105495</go:reference>
          </go:dbxref>
        </go:evidence>
        <go:gene_product>
          <go:name>Jafrac1</go:name>
          <go:dbxref>
            <go:database_symbol>fb</go:database_symbol>
            <go:reference>FBgn0040309</go:reference>
          </go:dbxref>
        </go:gene_product>
      </go:association>
    </go:term>
  </rdf:RDF>
</go:go>
The basic unit of the GO XML database is go:term. Owing to limitations of the XML id and idref attributes (for instance, multiple parentage cannot be represented), the linking mechanism is RDF. RDF provides a much more flexible system for representing trees. To follow the links, note that term GO:0003674 has the attribute
rdf:about="http://www.geneontology.org/go#GO:0003674"
This is roughly equivalent to an XML id attribute identifying the term.
In RDF, unique URLs are used as ids to make them universally unique. Now, note that term GO:0016209 has the tag
<go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
This shows that its parent is GO:0003674. This tag represents the relationship "GO:0016209 isa GO:0003674" or, in plain English, "Antioxidant is a molecular function." The other type of parentage relationship is go:part-of. GO:0003674 has the tag
<go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
This shows that "Molecular function is part of the Gene Ontology".
In addition, each term can have one go:name, one go:accession and one go:definition, as well as multiple go:dbxref and go:association elements. go:name, go:accession and go:definition are self-explanatory. go:dbxref represents the term in an external database, and go:association represents the gene associations of each term. go:association can have both a go:evidence, which holds a go:dbxref to the evidence supporting the association, and a go:gene_product, which has the gene symbol and a go:dbxref.
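To illustrate how the isa and part-of links can be followed in practice, here is a small sketch using Python's standard xml.etree.ElementTree module. The sample is a cut-down version of the snapshot above (XML declaration and DOCTYPE omitted), and the function name term_parents is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the snapshot above, in ElementTree's {uri}tag form
GO = "{http://www.geneontology.org/xml-dtd/go.dtd#}"
RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

SAMPLE = """
<go:go xmlns:go="http://www.geneontology.org/xml-dtd/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0003674">
      <go:accession>GO:0003674</go:accession>
      <go:name>molecular_function</go:name>
      <go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
    </go:term>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0016209">
      <go:accession>GO:0016209</go:accession>
      <go:name>antioxidant</go:name>
      <go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
    </go:term>
  </rdf:RDF>
</go:go>
"""


def term_parents(xml_text):
    """Map each accession to a list of (relationship, parent accession)."""
    root = ET.fromstring(xml_text)
    parents = {}
    for term in root.iter(GO + "term"):
        accession = term.findtext(GO + "accession")
        links = []
        for tag in ("isa", "part-of"):
            for link in term.findall(GO + tag):
                # The parent ID is the fragment after '#' in rdf:resource
                resource = link.get(RDF + "resource")
                links.append((tag, resource.rsplit("#", 1)[-1]))
        parents[accession] = links
    return parents


print(term_parents(SAMPLE))
```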
There are two download options for each of these databases, giving 8 possible options. You only need to download one of these files. You should not attempt to parse these files yourself; they are meant to be loaded into a MySQL database. There is also a Perl API for advanced queries on the database. For full details, see the README file in the archive. To obtain documentation for the GO database, you should download either of two files from the archive:
go_YYYYMM-schema-mysql.sql - the MySQL table creation statements, plus documentation.
go_YYYYMM-schema-html - Designed for viewing with a web browser; does not contain full documentation.