This GO File Format Guide introduces new users to (and reminds old users of) the structure and syntax of the GO files. Its main aim is to help users who need to write or parse the GO files. You will find it more useful if you first read An introduction to GO for more general background information about the GO project and how the ontology works.
Information on annotating genes and gene products to GO can be found in the GO Annotation Guide.
Advice on GO editorial style can be found in the GO Editorial Style Guide.
When you create or edit a GO entry, you need to think about its component parts. Editing GO has become much more straightforward and much less error-prone since an editing tool became available (see the DAG-Edit user guide for more information), but it helps to familiarize yourself with the syntax of the GO flat files, as you will undoubtedly need to edit the raw data at some point.
The structure of GO is very simple. At its bare minimum, each GO entry consists of a term name (e.g. cell) and a unique, zero-padded seven-digit identifier (also known as its accession number, e.g. GO:0005623), which is used as a database cross-reference in the collaborating databases. The same number range is used across all three ontologies. If you are a GO curator and you haven't already done so, you'll need to define your own range of numbers in go/numbers/go_numbers.
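As a minimal sketch of the identifier format described above (the prefix GO: followed by a zero-padded seven-digit number), the following Python helpers check and build accession numbers. The function names are hypothetical and are not part of any GO tool.

```python
import re

# "GO:" followed by exactly seven digits, as described in the text.
GO_ID_PATTERN = re.compile(r"GO:\d{7}")


def is_valid_go_id(text):
    """Return True if text looks like a GO accession such as GO:0005623."""
    return GO_ID_PATTERN.fullmatch(text) is not None


def format_go_id(number):
    """Build a zero-padded accession from an integer (hypothetical helper)."""
    if not 0 <= number <= 9_999_999:
        raise ValueError("GO identifiers hold at most seven digits")
    return f"GO:{number:07d}"


print(is_valid_go_id("GO:0005623"))
print(format_go_id(5623))
```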
Any term may, but does not need to, include one or more synonyms (e.g. type I programmed cell death is a synonym of apoptosis). The syntax for synonyms is
synonym:[synonym1] ; synonym:[synonym2]
There are several different levels of similarity recognised between terms and their synonyms, as shown below:
1. ~ the terms are related (e.g. XXX complex and XXX)
2. = the term is an exact synonym
3. < the synonym is broader than the term name
4. > the synonym is more precise than the term name
5. != the term is related, but the exact relationship is not specified
The relationships of these types of synonym to one another are represented as follows:
related to
is_a exact
is_a broader
is_a narrower
is_a undefined
See the Synonym Guide for more information.
Another optional extra is one or more general database cross references (dbxrefs), which refer to an identical object in another database. For instance, the molecular function term protein kinase CK2 activity has the database cross reference EC:2.7.1.37 -- the accession number of this enzyme activity in the Enzyme Commission database. A complete list of database cross-references, together with their accepted abbreviations, is available from the GO ftp and cvs archives. The syntax for database cross-references is database_abbreviation:identifier.
We are in the process of writing a definition for each GO term. However, you won't find these in the GO flat files as they're kept in a separate file called GO.defs (see Anatomy of the definitions file).
Now that we're clear on what an individual GO entry comprises, we can think about them in the context of an entire flat file. The structure described below holds true for each of the ontology flat files:
Biological Process (process.ontology)
Molecular Function (function.ontology)
Cellular Component (component.ontology)
The beginning of each file contains comments (lines that begin with a !) about how and when the file was generated. The first lines always carry information about the version, the date of last update, (optionally) the source of the file, the name of the database, the domain of the file and the editors of the file (except *.html files).
Lines in which the first non-space character is a $ either reflect the domain and aspect of the ontology (i.e. $text) or the end of file (i.e. the $ character on a line by itself).
Here's an example of the front matter of a GO flat file:
!autogenerated-by: DAG-Edit version 1.315
!saved-by: midori
!date: Fri Jan 03 17:14:37 GMT 2003
!version: $Revision: 1.47 $
!type: % ISA Is a
!type: < PARTOF Part of
$Gene_Ontology ; GO:0003673
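The front matter above can be read with a few lines of Python. This is only a sketch: it assumes every comment line has the form !key: value and that the first $ line carrying text marks the start of the ontology. The function name is hypothetical.

```python
def read_front_matter(lines):
    """Collect '!' comment lines and the first '$text' line of a flat file.

    Returns (comments, domain), where comments is a list of (key, value)
    pairs (a list, because keys such as 'type' can repeat) and domain is
    the text after '$', or None if no such line was found.
    """
    comments = []
    domain = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("!"):
            # Split on the first colon only, so values may contain colons.
            key, _, value = line[1:].partition(":")
            comments.append((key.strip(), value.strip()))
        elif line.startswith("$") and line != "$":
            domain = line[1:].strip()
            break
    return comments, domain


header = [
    "!autogenerated-by: DAG-Edit version 1.315",
    "!date: Fri Jan 03 17:14:37 GMT 2003",
    "!type: % ISA Is a",
    "!type: < PARTOF Part of",
    "$Gene_Ontology ; GO:0003673",
]
comments, domain = read_front_matter(header)
print(comments)
print(domain)
```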
In the GO flat files, the symbol % is used to represent an is-a relationship and the symbol < a part-of relationship. For more information on these relationships between terms, see the GO Editorial Style Guide. Parent-child relationships between terms are represented by indentation:
parent_term
 child_term
Is-a relationships
%term0
 %term1 % term2
means that term1 is a subclass of term0 and also a subclass of term2.
Part-of relationships
%term0
 %term1 < term2 < term3
means that term1 is a subclass of term0 and also a part of term2 and term3.
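The indentation rule can be turned into explicit parent-child pairs with a small stack. The sketch below assumes that each level of nesting is indented by exactly one space and that the first character of each term line is its relationship symbol. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def parent_child_pairs(lines):
    """Return (child, symbol, parent) triples from indented flat-file lines."""
    stack = []   # stack[d] holds the most recent term name seen at depth d
    pairs = []
    for raw in lines:
        text = raw.strip()
        # Skip blanks, '!' comments and the lone '$' end-of-file marker.
        if not text or text.startswith("!") or text == "$":
            continue
        depth = len(raw) - len(raw.lstrip(" "))
        symbol, name = text[0], text[1:].strip()
        del stack[depth:]            # forget terms at this depth or deeper
        if stack:
            pairs.append((name, symbol, stack[-1]))
        stack.append(name)
    return pairs


example = ["$root", " %term0", "  %term1", "  <term2", " %term3", "$"]
for child, symbol, parent in parent_child_pairs(example):
    print(child, symbol, parent)
```

Only the relationship given by indentation is recovered here; additional parents listed later on the same line need the line itself to be parsed.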
The order in which items appear on a line (where [] indicates an optional item) is as follows:
< | % term [; db cross ref]* [; synonym:text]* [ < | % term]*
Here's a real example from the molecular function ontology (it would appear on a single line in the actual file):
%UDPsulfoquinovose synthase ; GO:0046507 ; EC:3.13.1.1 ;
synonym:sulfite\:UDP-glucose sulfotransferase
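A line in this layout can be split into its parts roughly as follows. This is a sketch under stated assumptions, not a reference parser: it assumes fields are separated by " ; ", that additional parents are introduced by a % or < surrounded by spaces, that a colon inside a synonym is escaped as \: (as in the example above), and that secondary GO ids, if present, are comma-separated. The function name and dictionary keys are hypothetical.

```python
import re


def parse_term_line(line):
    """Split one flat-file term line into a dictionary of its parts."""
    text = line.strip()
    relationship, rest = text[0], text[1:]
    # A '%' or '<' surrounded by whitespace introduces an additional parent.
    segments = re.split(r"\s([%<])\s", rest)
    fields = [f for f in re.split(r"\s+;\s*", segments[0].strip()) if f]
    entry = {
        "relationship": relationship,
        "name": fields[0],
        "go_ids": [],
        "dbxrefs": [],
        "synonyms": [],
        "other_parents": [],
    }
    for field in fields[1:]:
        if field.startswith("synonym:"):
            synonym = field[len("synonym:"):]
            entry["synonyms"].append(synonym.replace("\\:", ":"))
        elif field.startswith("GO:"):
            entry["go_ids"].extend(re.split(r"\s*,\s*", field))
        else:
            entry["dbxrefs"].append(field)
    for symbol, parent in zip(segments[1::2], segments[2::2]):
        parent_name = re.split(r"\s+;\s*", parent.strip())[0]
        entry["other_parents"].append((symbol, parent_name))
    return entry


line = (r"%UDPsulfoquinovose synthase ; GO:0046507 ; EC:3.13.1.1 ; "
        r"synonym:sulfite\:UDP-glucose sulfotransferase")
print(parse_term_line(line))
```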
As mentioned above, the definitions for terms in all three ontology files are stored in go/doc/GO.defs.
Each definition must contain the following:
- term: the GO term to which the definition refers.
- goid: the term's unique identifier.
- definition: the definition of the term.
- definition_reference: one or more references for the definition.
A definition may also have a comment:
- comment: text (see comment syntax, below)
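The fields listed above suggest a simple key: value reader. The sketch below rests on assumptions this guide does not state: that each record starts with a term: line, that a long value may wrap onto following lines (as the comment examples later in this guide do), and that lines beginning with ! are comments. The function name is hypothetical.

```python
KNOWN_KEYS = ("term", "goid", "definition", "definition_reference", "comment")


def parse_definitions(lines):
    """Group 'key: value' lines into one dictionary per term."""
    records = []
    current = None
    last_key = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        key, sep, value = line.partition(":")
        if sep and key in KNOWN_KEYS:
            if key == "term":
                # Assumption: a 'term:' line opens a new record.
                current = {"definition_reference": []}
                records.append(current)
            if current is None:
                continue  # ignore fields that precede the first term
            if key == "definition_reference":
                current[key].append(value.strip())
            else:
                current[key] = value.strip()
            last_key = key
        elif current is not None and last_key is not None:
            # Anything else is treated as a wrapped continuation line.
            if last_key == "definition_reference":
                current[last_key][-1] += " " + line
            else:
                current[last_key] += " " + line
    return records


sample = """term: transfer RNA
goid: GO:0005563
comment: This term was made obsolete because it represents a gene
product. To update annotations, use the molecular function term
'triplet codon-amino acid adaptor ; GO:0030533.'
"""
print(parse_definitions(sample.splitlines()))
```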
A specific syntax (wording) is used for writing the more common definitions. These are:
- METABOLISM
  The chemical reactions and physical changes involving
- BIOSYNTHESIS
  The formation from simpler components of
- CATABOLISM
  The breakdown into simpler components of
- REGULATION
  Any process that modulates the frequency, rate or extent of
- NEGATIVE REGULATION
  Any process that stops, prevents or reduces the rate of
- POSITIVE REGULATION
  Any process that activates or increases the rate of
Where you are adding a definition with this kind of pre-set syntax the Dbxref will be GO:your-initials.
Obsoletions
If you have made a term obsolete, you should explain why and, if possible, suggest an alternative term or terms.
First add the word 'OBSOLETE.' at the beginning of the term definition. Then, with the term highlighted, click the 'obsolete term' button. Then you should add a comment to explain why you obsoleted the term and to redirect people who wish to make annotations.
Use the following syntax for the reason for obsoletion:
comment: This term was made obsolete because [reason].
To suggest alternative terms, use one of the following:
A. If exact replacement is possible (i.e. it is safe to move all existing annotations, keyword mappings, etc. to one term), precede the suggested term with 'use':
To update annotations, use the [ontology name] term '[text] ; GO:[id].'
example:
term: transfer RNA
goid: GO:0005563
comment: This term was made obsolete because it represents a gene
product. To update annotations, use the molecular function term
'triplet codon-amino acid adaptor ; GO:0030533.'
B. In cases where all existing annotations and mappings can't necessarily be transferred to one term, put 'consider' in front of the suggested terms. Syntax for different situations:
1. There is only one suggestion, but it may not work for all annotations:
To update annotations, consider the [ontology name] term '[text] ; GO:[id].'
example:
term: activation of MAPK (mating sensu Fungi)
goid: GO:0030456
comment: This term was made obsolete because it is a gene product
specific term. To update annotations, consider the process term
'signal transduction during conjugation with cellular fusion ;
GO:0000750.'
2. To make more than one specific suggestion:
a) from a single ontology, separate terms with commas:
To update annotations, consider the [ontology name] terms '[text1] ; GO:[id1]',
'[text2] ; GO:[id2]', '[text3] ; GO:[id3]'.
example:
term: allantoin/allantoate transport
goid: GO:0006838
comment: This term was made obsolete because it is a composite term
that represents two individual processes. To update annotations,
consider the biological process term 'allantoin transport ;
GO:0015720', 'allantoate transport ; GO:0015719'.
b) from more than one ontology, separate terms from one ontology with commas, and use 'and' between ontology names:
To update annotations, consider the [ontology name] term '[text1]
; GO:[id1]' and the [ontology name] term '[text2] ; GO:[id2]'.
examples:
term: expansin
goid: GO:0009936
comment: This term was made obsolete because it represents a gene
product. To update annotations, consider the cellular component term
'cell wall (sensu Magnoliophyta) ; GO:0009505' and the biological
process term 'cell growth ; GO:0016049.'
term: blue-sensitive opsin
goid: GO:0015059
comment: This term was made obsolete because it refers to a
class of proteins. To update annotations, consider the molecular
function terms 'photoreceptor ; GO:0009881',
'3,4-didehydroretinal binding ; GO:0046876' and 'retinal binding
; GO:0016918' and its children, the cellular component term
'integral to membrane ; GO:0016021' and the biological process
terms 'phototransduction, visible light ; GO:0007603' and 'UV-A,
blue light phototransduction ; GO:0009588'.
To suggest a term and all its children, say 'consider [term info as above] and its children' (as in the example above).
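The obsoletion comment wording for the single-ontology cases (A, B.1 and B.2a above) can be generated mechanically. The following is an illustrative sketch only. The function name and parameters are hypothetical, and it does not cover suggestions drawn from more than one ontology or the 'and its children' form.

```python
def obsoletion_comment(reason, ontology, suggestions, exact=False):
    """Build an obsoletion comment for suggestions from a single ontology.

    suggestions is a list of (term name, GO id) pairs. Use exact=True only
    when a single term is a safe replacement for all annotations.
    """
    if not suggestions:
        raise ValueError("at least one suggested term is required")
    if exact and len(suggestions) != 1:
        raise ValueError("'use' applies to exactly one replacement term")
    verb = "use" if exact else "consider"
    head = f"comment: This term was made obsolete because {reason}."
    if len(suggestions) == 1:
        name, go_id = suggestions[0]
        # In the single-term form the full stop sits inside the quotes.
        tail = (f"To update annotations, {verb} the {ontology} term "
                f"'{name} ; {go_id}.'")
    else:
        quoted = ", ".join(f"'{name} ; {go_id}'" for name, go_id in suggestions)
        tail = f"To update annotations, {verb} the {ontology} terms {quoted}."
    return f"{head} {tail}"


print(obsoletion_comment(
    "it represents a gene product",
    "molecular function",
    [("triplet codon-amino acid adaptor", "GO:0030533")],
    exact=True,
))
```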
Restoring obsolete terms
If you need to reinstate an obsolete term in the ontologies, use the following:
comment: Note that this term was reinstated from obsolete.
Splits
If you have split, for example, term GO:000000A into GO:000000B and GO:000000C, you should add the following comment to the entry for GO:000000B (and the inverse to the GO:000000C entry):
comment: This term was split from '[term name] ; GO:000000A'
(sibling term '[term name] ; GO:000000C').
(all on a single line)
Also see
If you want to refer to other terms in the ontologies, use this format:
comment: Also see '[term name] ; GO:0000000'.
Any other comments
If there is any other comment you would like to make, prefix it with the following:
comment: Note that [comment].
When a term is defunct, then it is removed from its place in the graph and made a child of the meta term obsolete. Its definition (in GO.defs) remains.
When two terms are merged, e.g. terma and termb are merged as terma, then the GO_id of termb is made a secondary GO_id. Usually, the ID that has existed longer is used as the primary ID, but exceptions can be made. Secondary GO_ids are presented as a comma-separated list, e.g.:
terma ; GO:idterma, GO:idtermb, GO:idtermc, ...
When a term is split, each new term gets a new GO_id, and the original GO_id is included in a comment (see comment syntax above).
Mappings of GO have been made to many other classification systems. As cautioned in An introduction to GO, these mappings are neither complete nor exact, but are rather designed to be used as a guide. Here, we describe the syntax of these mapping files.
The line that begins:
!Uses:
for example:
!Uses:http://www.tigr.org/docs/tigr-scripts/egad_scripts/role_reports.spl,
15 aug 2000.
gives the source of the external file. The line syntax is:
database:<identifier> > GO:<term> ; GO:<GO_id>
for example:
TIGR_role:11030 73 Amino acid biosynthesis Glutamate family >
GO:glutamine family amino-acid biosynthesis ; GO:0009084
all on a single line. The relationship between terms from external systems and GO terms can also be one-to-many; each additional GO term is added with a further >. For example:
MultiFun:1.5.1.18 Isoleucine/valine > GO:isoleucine biosynthesis ; GO:0009097
> GO:valine biosynthesis ; GO:0009099
If no equivalent GO term exists for a term from another classification system, the following should be added as a mapping:
> GO:.
for example:
MultiFun:1.5 Building block biosynthesis > GO:.
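The mapping syntax above, including one-to-many mappings and the 'GO:.' placeholder, can be read with a short function. This sketch assumes a whole mapping sits on one line and that a > surrounded by whitespace separates the external term from each GO term. Because the guide does not say where the external identifier ends, everything after the database abbreviation is kept as one string. The function name is hypothetical.

```python
import re


def parse_mapping_line(line):
    """Return (database, external_term, [(go_term_name, go_id), ...])."""
    parts = re.split(r"\s+>\s+", line.strip())
    database, _, external = parts[0].partition(":")
    targets = []
    for target in parts[1:]:
        target = target.strip()
        if target == "GO:.":
            continue  # 'GO:.' means no equivalent GO term exists
        name_part, _, go_id = target.rpartition(";")
        name = name_part.strip()
        if name.startswith("GO:"):
            name = name[len("GO:"):]
        targets.append((name, go_id.strip()))
    return database, external.strip(), targets


line = ("MultiFun:1.5.1.18 Isoleucine/valine "
        "> GO:isoleucine biosynthesis ; GO:0009097 "
        "> GO:valine biosynthesis ; GO:0009099")
print(parse_mapping_line(line))
```

Lines beginning with ! (such as the !Uses: line) are comments and would need to be skipped before calling this function.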
The XML version of GO, which includes all three ontologies and the definitions, can be downloaded from the GO database archive. The document type definition (DTD) is available from the GO FTP archive.
The XML file is built from the flat files and the gene association files on a monthly basis.
Here's an XML snapshot (with some lines wrapped for legibility):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE go:go>
<go:go xmlns:go="http://www.geneontology.org/xml-dtd/go.dtd#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<go:version timestamp="Wed May 9 23:55:02 2001" />
<rdf:RDF>
<go:term rdf:about="http://www.geneontology.org/go#GO:0003673">
<go:accession>GO:0003673</go:accession>
<go:name>Gene_Ontology</go:name>
<go:definition></go:definition>
</go:term>
<go:term rdf:about="http://www.geneontology.org/go#GO:0003674">
<go:accession>GO:0003674</go:accession>
<go:name>molecular_function</go:name>
<go:definition>The action characteristic of a gene
product.</go:definition>
<go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
<go:dbxref>
<go:database_symbol>go</go:database_symbol>
<go:reference>curators</go:reference>
</go:dbxref>
</go:term>
<go:term rdf:about="http://www.geneontology.org/go#GO:0016209">
<go:accession>GO:0016209</go:accession>
<go:name>antioxidant</go:name>
<go:definition></go:definition>
<go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
<go:association>
<go:evidence evidence_code="ISS">
<go:dbxref>
<go:database_symbol>fb</go:database_symbol>
<go:reference>fbrf0105495</go:reference>
</go:dbxref>
</go:evidence>
<go:gene_product>
<go:name>CG7217</go:name>
<go:dbxref>
<go:database_symbol>fb</go:database_symbol>
<go:reference>FBgn0038570</go:reference>
</go:dbxref>
</go:gene_product>
</go:association>
<go:association>
<go:evidence evidence_code="ISS">
<go:dbxref>
<go:database_symbol>fb</go:database_symbol>
<go:reference>fbrf0105495</go:reference>
</go:dbxref>
</go:evidence>
<go:gene_product>
<go:name>Jafrac1</go:name>
<go:dbxref>
<go:database_symbol>fb</go:database_symbol>
<go:reference>FBgn0040309</go:reference>
</go:dbxref>
</go:gene_product>
</go:association>
</go:term>
</rdf:RDF>
</go:go>
The basic unit of the GO XML database is go:term. Owing to limitations of the XML id and idref attributes (for instance, multiple parentage cannot be represented), the linking mechanism is RDF. RDF provides a much more flexible system for representing trees. To follow the links, note that term GO:0003674 has the attribute rdf:about="http://www.geneontology.org/go#GO:0003674"
This is roughly equivalent to id="http://www.geneontology.org/go#GO:0003674"
In RDF, unique URLs are used as ids to make them universally unique. Now, note that term GO:0016209 has the tag <go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
This shows that its parent is GO:0003674. This tag represents the relationship "GO:0016209 isa GO:0003674" or, in plain English, "Antioxidant is a molecular function." The other type of parentage relationship is go:part-of. GO:0003674 has the tag <go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
This shows that "Molecular function is part of the Gene Ontology".
In addition, each term can have one go:name, one go:accession and one go:definition, and any number of go:dbxref or go:association elements. go:name, go:accession and go:definition are self-explanatory. go:dbxref represents the term in an external database, and go:association represents the gene associations of each term. A go:association can contain both a go:evidence, which holds a go:dbxref to the evidence supporting the association, and a go:gene_product, which holds the gene symbol and a go:dbxref.
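The links described above can be followed with Python's standard xml.etree.ElementTree module. The sketch below uses a cut-down copy of the snapshot (two terms, no associations) and collects each term's name together with its is-a and part-of parents. The function name and the shape of the returned dictionary are hypothetical choices for illustration.

```python
import xml.etree.ElementTree as ET

GO_NS = "http://www.geneontology.org/xml-dtd/go.dtd#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

SAMPLE = """<go:go xmlns:go="http://www.geneontology.org/xml-dtd/go.dtd#"
       xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:RDF>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0003674">
      <go:accession>GO:0003674</go:accession>
      <go:name>molecular_function</go:name>
      <go:part-of rdf:resource="http://www.geneontology.org/go#GO:0003673" />
    </go:term>
    <go:term rdf:about="http://www.geneontology.org/go#GO:0016209">
      <go:accession>GO:0016209</go:accession>
      <go:name>antioxidant</go:name>
      <go:isa rdf:resource="http://www.geneontology.org/go#GO:0003674" />
    </go:term>
  </rdf:RDF>
</go:go>"""


def linked_ids(term, tag):
    """Return the GO ids that the given child tag of a term points to."""
    ids = []
    for element in term.findall(f"{{{GO_NS}}}{tag}"):
        resource = element.get(f"{{{RDF_NS}}}resource", "")
        # The id is the part of the rdf:resource URL after the '#'.
        ids.append(resource.rsplit("#", 1)[-1])
    return ids


def read_terms(xml_text):
    """Map each accession to its name and its is-a and part-of parents."""
    root = ET.fromstring(xml_text)
    terms = {}
    for term in root.iter(f"{{{GO_NS}}}term"):
        accession = term.findtext(f"{{{GO_NS}}}accession")
        terms[accession] = {
            "name": term.findtext(f"{{{GO_NS}}}name"),
            "isa": linked_ids(term, "isa"),
            "part_of": linked_ids(term, "part-of"),
        }
    return terms


print(read_terms(SAMPLE))
```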
The MySQL version of GO is built monthly and can be downloaded from the MySQL archive. Four databases are built and made available for download:
- termdb - ontologies, definitions and mappings to other dbs
- assocdb - the above, plus associations to gene products
- seqdb - the above, plus protein sequences for some of the gene products
- seqdblite - the above, with IEA associations stripped out (this is the version that drives AmiGO)
There are two download options for each of these databases, giving 8 possible options. You only need to download one of these files. You should not attempt to parse these files yourself; they are meant to be loaded into a MySQL database. There is also a Perl API for advanced queries on the database. For full details, see the README file in the archive. To obtain documentation for the GO database, you should download either of two files from the archive:
go_YYYYMM-schema-mysql.sql - the MySQL table creation statements, plus documentation.
go_YYYYMM-schema-html - Designed for viewing with a web browser; does not contain full documentation.
A FASTA version of the database is also available from the database archive.