General comments: This document is intended to help standardize the way the evidence codes are used for GO annotation of genes/gene products. Every GO annotation must indicate the type of evidence that supports it; these evidence codes correspond to broad categories of experimental or other support. The codes are listed along with examples (not exhaustive lists) of the kinds of experiments that would fall into each category.
Note that these evidence codes are intended for use in conjunction with GO terms, and should not be considered in isolation from the terms. In other words, an evidence code indicates how annotation to a particular term is supported, and is not necessarily a classification of an experiment.
For every evidence category, there is room for curators to exercise judgement about the quality of the evidence, and how well it supports annotation to a node within each ontology. The distinction between "TAS" and "NAS" is particularly sensitive to interpretation (see below).
IC inferred by curator
Comment: An example would be when there is evidence (be it direct assay, sequence similarity or even from electronic annotation) that a particular gene product has the function "transcription factor". There is no evidence whatsoever that this gene product has the cellular location "nucleus", but this would be a perfectly reasonable inference for a curator to make (if the curator is annotating a eukaryotic gene product, of course). This inference would be linked to the annotation "transcription factor" in two ways: (i) both annotations would share the same reference, and (ii) the inferred annotation would include one or more "from" statements pointing to the GO term(s) used by the curator for the inference.
To be used for those cases where an annotation is not supported by any evidence, but can be reasonably inferred by a curator from other GO annotations, for which evidence is available.
reference: Ashburner et al. 2006 J. irreprod. data 107:11989-11990
molecular_function: general RNA polymerase II transcription factor ; GO:0016251 | inferred from sequence similarity
cellular_location: nucleus ; GO:0005634 | inferred by curator from GO:0016251
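The two links described above (shared reference, plus a "from" statement) can be sketched as a small data structure. This is only an illustration: the `Annotation` class, its field names, and `infer_by_curator` are hypothetical and are not part of any GO file format or tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one GO annotation (field names are assumptions)."""
    go_id: str
    evidence: str                      # evidence code, e.g. "ISS" or "IC"
    reference: str
    from_terms: Tuple[str, ...] = ()   # only filled in for IC annotations


def infer_by_curator(source: Annotation, inferred_go_id: str) -> Annotation:
    """Build an IC annotation that shares the source annotation's reference
    and records the GO term the curator inferred it from."""
    return Annotation(
        go_id=inferred_go_id,
        evidence="IC",
        reference=source.reference,
        from_terms=(source.go_id,),
    )


# The transcription factor example from the text.
tf = Annotation("GO:0016251", "ISS", "Ashburner et al. 2006")
nucleus = infer_by_curator(tf, "GO:0005634")
print(nucleus)
```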
IDA inferred from direct assay
Comment: Important: this code is used to indicate a direct assay for the function, process, or component indicated by the GO term. Curators therefore need to be careful, because an experiment considered as direct assay for a term from one ontology may be a different kind of evidence for the other ontologies. In particular, we thought of more kinds of direct assays for cellular component than for function or process. For example, a fractionation experiment might provide "direct assay" evidence that a gene product is in the nucleus, but "protein interaction" evidence for its function or process. Binding assays can provide direct assay evidence for "... binding" molecular function terms.
In vitro reconstitution (e.g. transcription)
Immunofluorescence (for cellular component)
Cell fractionation (for cellular component)
Physical interaction/binding assay (sometimes appropriate for cellular component or molecular function)
IEA inferred from electronic annotation
Comment: Used for annotations that depend directly on computation or automated transfer of annotations from a database. The key feature that distinguishes this evidence code from others is what a curator has done--IEA is used when no curator has checked the annotation to verify its accuracy. The actual method used (BLAST search, SwissProt keyword mapping, etc.) doesn't matter.
Annotations based on "hits" in sequence similarity searches, if they have not been reviewed by curators (curator-reviewed hits would get ISS)
Annotations transferred from database records, if not reviewed by curators (curator-reviewed items may use NAS, or the reviewing process may lead to print references for the annotation)
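The rule in the two examples above depends only on whether a curator has reviewed the result, not on the method used. A minimal sketch of that decision follows; the function name and the `source_kind` labels are hypothetical, and the NAS branch reflects only the "may use NAS" option mentioned above (review may instead lead to a print reference and a different code).

```python
def code_for_computed_result(source_kind: str, curator_reviewed: bool) -> str:
    """Pick an evidence code for a computed or transferred annotation.

    source_kind is a hypothetical label: "sequence_similarity" for search
    hits, "database_record" for annotations transferred from a database.
    """
    if not curator_reviewed:
        # No curator has checked the annotation, whatever the method was.
        return "IEA"
    if source_kind == "sequence_similarity":
        return "ISS"
    if source_kind == "database_record":
        return "NAS"
    raise ValueError(f"unknown source kind: {source_kind!r}")


print(code_for_computed_result("sequence_similarity", curator_reviewed=False))
print(code_for_computed_result("sequence_similarity", curator_reviewed=True))
```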
IEP inferred from expression pattern
Comment: Covers cases where the annotation is inferred from the timing or location of expression of a gene. Expression data will be most useful for process annotation rather than function. For example, several of the heat shock proteins are thought to be involved in the process of stress response because they are upregulated during stress conditions. Use this category with caution! Also see the additional notes below.
Transcript levels (e.g. Northerns, microarray data)
Protein levels (e.g. Western blots)
Note: The "database identifier" column in the gene_association file should be filled in whenever possible, to help avoid circular annotations between GO and other databases.
IGI inferred from genetic interaction
Comment: Includes any combination of alterations in the sequence (mutation) or expression of more than one gene/gene product. This category can therefore cover any of the IMP experiments that are done in a non-wild-type background, though we prefer to use it only when all mutations are documented. When redundant copies of a gene must all be mutated to see an informative phenotype, that's IGI. (Yes, we know that means some organisms, such as mouse, will have far, far more IGI than IMP annotations.)
"Traditional" genetic interactions such as suppressors, synthetic lethals, etc.
Inference about one gene drawn from the phenotype of a mutation in a different gene
IMP also covers phenotypic similarity: a phenotype that is informative because it is similar to that of another independent phenotype (which may have been described earlier or documented more fully) is IMP (not IGI).
We have also decided to use this category for situations where a mutation in one gene (gene A) provides information about the function, process, or component of another gene (gene B; i.e. annotate gene B using IGI).
We recommend making an entry in the "with" column when using this evidence code (i.e. include an identifier for the "other" gene involved in the interaction). If more than one independent genetic interaction supports the association, use separate lines for each. In cases where the gene of interest interacts simultaneously with more than one other gene, put both/all of the interacting genes on the same line (separate identifiers by pipes in the "with" column). To help clarify:
GOterm IGI FB:gene1|FB:gene2
means that the GO term is supported by evidence from its interaction
with *both* of these genes; i.e. neither of these statements is true:
GOterm IGI FB:gene1
GOterm IGI FB:gene2
See the GO Annotation Guide for more information.
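To make the difference between one pipe-separated line and several separate lines concrete, here is a small sketch that turns "with" column values into groups of identifiers. The function name is hypothetical; only the pipe separator comes from the text above.

```python
from typing import FrozenSet, List


def interaction_groups(with_values: List[str]) -> List[FrozenSet[str]]:
    """Turn one "with" value per annotation line into one identifier set per line.

    Identifiers joined by pipes on a single line form one simultaneous
    interaction; separate lines are independent interactions.
    """
    groups = []
    for value in with_values:
        ids = frozenset(part.strip() for part in value.split("|") if part.strip())
        groups.append(ids)
    return groups


# One line: the term is supported by the interaction with both genes together.
joint = interaction_groups(["FB:gene1|FB:gene2"])
# Two lines: two independent interactions, each supporting the term alone.
separate = interaction_groups(["FB:gene1", "FB:gene2"])
print(len(joint), len(separate))
```

The two results are deliberately not equal: the first describes a single piece of evidence involving two genes, the second describes two pieces of evidence.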
IMP inferred from mutant phenotype
Comment: Anything that is concluded from looking at mutations or abnormal levels of the product(s) only of the gene of interest is IMP (compare IGI). Use this code for experiments that use antibodies or other specific inhibitors of RNA or protein activity, even though no gene may be mutated (the rationale is that IMP is used where an abnormal situation prevails in a cell or organism).
Any gene mutation/knockout
Overexpression/ectopic expression of wild-type or mutant genes
Specific protein inhibitors
IPI inferred from physical interaction
Comment: Covers physical interactions between the gene product of interest and another molecule (or ion, or complex). For functions such as protein binding or nucleic acid binding, a binding assay is simultaneously IPI and IDA; IDA is preferred because the assay directly detects the binding. For both IPI and IGI, it would be good practice to qualify them with the gene/protein/ion. We thought that antibody binding experiments were not suitable as evidence for function or process.
Ion/protein binding experiments
We recommend making an entry in the "with" column when using this evidence code (i.e. include an identifier for the "other" protein involved in the interaction). If more than one independent physical interaction supports the association, use separate lines for each. In cases where the gene product of interest interacts simultaneously with more than one other protein, put both/all of the interacting things on the same line (separate identifiers by pipes in the "with" column). To help clarify:
GOterm IPI DB:id1|DB:id2
means that the GO term is supported by evidence from its
interaction with *both* of these proteins; i.e. neither of these
statements is true:
GOterm IPI DB:id1
GOterm IPI DB:id2
See the GO Annotation Guide for more information.
ISS inferred from sequence or structural similarity
Comment: Use this code for BLAST (or other sequence similarity detection method) results that have been reviewed for accuracy by a curator. If the result has not been reviewed, use IEA. ISS can also be used for sequence similarities reported in published papers, if the curator thinks the result is reliable enough. When the gene is described as a "homologue of" another gene, a curator can infer fairly detailed function and location (cellular component), but should err on the side of low resolution for processes. For recognized domains, attribution to any of the ontologies will probably be at low resolution.
Sequence similarity (homologue of/most closely related to)
We recommend making an entry in the "with" column when using this evidence code (i.e. include an identifier for the similar sequence). The 'with' column can have more than one identifier, separated by pipes.
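The "with" column is recommended for IGI, IPI and ISS in this document. A curator might check a batch of annotations for missing entries along the following lines. This is a sketch only: the tuple layout and the function name are assumptions and do not mirror the actual gene_association file columns.

```python
# Evidence codes for which this document recommends a "with" entry.
WITH_RECOMMENDED = {"IGI", "IPI", "ISS"}


def rows_missing_with(rows):
    """Return the rows whose evidence code recommends a "with" entry but has none.

    rows: iterable of (go_id, evidence_code, with_value) tuples
    (a hypothetical layout chosen for this example).
    """
    return [
        row for row in rows
        if row[1] in WITH_RECOMMENDED and not row[2].strip()
    ]


rows = [
    ("GO:0005634", "IDA", ""),            # "with" not expected for IDA
    ("GO:0016251", "ISS", ""),            # flagged: no similar sequence given
    ("GO:0016251", "IPI", "DB:id1|DB:id2"),
]
print(rows_missing_with(rows))
```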
NAS non-traceable author statement
Comment: Formerly NA (not available). See TAS, and see notes below. Also, note that "author" can be interpreted quite loosely for this code; for example, one doesn't have to know which curator entered an untraceable statement that appears on a database record to use this code.
Database entries that don't cite a paper (e.g. SwissProt records, YPD protein reports)
Statements in papers (abstract, introduction, or discussion) that a curator cannot trace to another publication
ND no biological data available
Comment: This code is used only for annotations to "unknown," and it is the only evidence code recommended for annotations to unknown (except in cases where a cited source explicitly says that something is unknown). It should be accompanied by a reference that explains that curators looked but found no information. A web page is available that can serve as a generic reference to use with ND; to use it insert "GO_REF:nd" in the reference column of a gene_association file.
Used for annotations to "unknown" molecular function, biological process, or cellular component.
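The ND rules above can be written as a simple consistency check. In this sketch the function name and the `is_unknown_term` flag are hypothetical; the third message is worded as a reminder rather than an error because the text allows other codes when a cited source explicitly says something is unknown.

```python
def nd_problems(evidence: str, is_unknown_term: bool, reference: str) -> list:
    """List possible problems with how ND is used on one annotation.

    is_unknown_term is a hypothetical flag the caller sets when the GO term
    is one of the "unknown" terms.
    """
    problems = []
    if evidence == "ND" and not is_unknown_term:
        problems.append("ND used for a term other than 'unknown'")
    if evidence == "ND" and not reference.strip():
        problems.append("ND without a reference (GO_REF:nd is the generic one)")
    if is_unknown_term and evidence != "ND":
        problems.append("annotation to 'unknown' without ND; check the cited source")
    return problems


print(nd_problems("ND", True, "GO_REF:nd"))
print(nd_problems("ND", False, ""))
```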
TAS traceable author statement
Comment: Formerly ASS ("author said so"). TAS and NAS are both used for cases where the publication that a curator uses to support an annotation doesn't show the evidence (experimental results, sequence comparison, etc.). TAS is meant for the more reliable cases, such as reviews (presumably written by experts) or material sufficiently well established to appear in a text book, but there isn't really a sharp cutoff between TAS and NAS. Curator discretion is advised! Also see notes below.
Anything in a review article where the original experiments are traceable through that article (material from introductions to non-review papers will sometimes meet this standard)
Anything found in a text book or dictionary; usually text book material has become common knowledge (e.g. "everybody" knows that enolase is a glycolytic enzyme).
NR not recorded
Used for annotations done before curators began tracking evidence types (appears in SGD and FlyBase annotations). It should not be used for new annotations--use TAS or NAS.
More comments and miscellaneous thoughts:
The evidence fields can be thought of in a loose hierarchy:
This hierarchy should not be interpreted as a rigid ranking of evidence types; users can and should form their own conclusions as to the reliability of each type of evidence and each individual annotation. It is a loose hierarchy also partly because the strength of the evidence will also depend on to what resolution you are annotating, and because there is a range of reliability within each evidence category (e.g. 90% versus 20% identity for "sequence similarity" or a two-hybrid result versus co-purification over several columns for "physical interaction").
There may be different kinds of evidence available to support annotating a gene product to different levels within each ontology. For example, there might be a direct assay showing that a protein localizes to the mitochondrion, and a physical interaction suggesting localization to the mitochondrial matrix (more specific node, but less reliable evidence). Curators can annotate genes to both a parent and a child, and cite the same or different kinds of evidence for the annotations as appropriate.
Added 2000-11-08: Heather has seen cases where a paper presents several lines of evidence supporting a conclusion, of which each line of evidence alone is sufficient to annotate to a higher-level (more generic) node, but combining the lines of evidence gives the author (or curator) enough data to support annotating to a lower-level (more specific) node. We've decided to annotate each line of evidence singly, with the appropriate evidence code, for the higher node (e.g. have a line for IMP, another line for IPI, for one GO ID). The annotation to the lower node can then be included with 'TAS' as the evidence; cite the paper if the author draws the conclusion. If the curator draws the conclusion, keep some record of what went into the decision.
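The convention described in this note (one line per kind of evidence for the higher node, plus a TAS line for the lower node) might be laid out as follows. The function name, the tuple layout and the placeholder identifiers are hypothetical, and the sketch assumes the author of the cited paper draws the combined conclusion.

```python
def combined_evidence_rows(higher_go_id, lower_go_id, evidence_codes, reference):
    """Build annotation rows for several lines of evidence from one paper.

    Each line of evidence is annotated singly to the higher (more generic)
    node; the combined conclusion is annotated to the lower (more specific)
    node with TAS, citing the same paper.
    """
    rows = [(higher_go_id, code, reference) for code in evidence_codes]
    rows.append((lower_go_id, "TAS", reference))
    return rows


# Placeholder identifiers, not real GO IDs or a real paper.
for row in combined_evidence_rows("GO:higher", "GO:lower", ["IMP", "IPI"], "PaperX"):
    print(row)
```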
Notes on ASS versus NA (from Heather, 2000-02-26)
note added 2000-08-02: ASS is now TAS, and NA is now NAS (see above)
I previously used ASS for evidence from abstracts, and to me it indicated less reliable evidence. For review articles I tended to look up the references they cited and take the evidence from the original papers. I have not used NA. Midori and other SGD curators have used ASS for both traceable and non-traceable evidence, i.e. whether she had just the author's word for it or whether she had a respectable review article where all the necessary references are cited. Midori doesn't go to the original papers but adds the terms qualified by ASS. It seems that Midori's way of treating review articles makes annotating much faster and easier. If we are to use it in this way, then I think it is necessary to have a different evidence field for cases where there is no way of finding the real evidence; call this NA instead. This will mean that we are not at both ends of the reliability spectrum for ASS evidence.
Notes added 2000-03-01 (MAH): SGD curators annotated several genes before the evidence fields were used. For a while, these were given "NA" as the evidence codes. These have since been changed to ASS where reviews were used, or "NR" for other papers. Also, future SGD annotations will use ASS and NA as described in the list above, so that ASS is used for reviews, dictionaries, texts, etc.
Note added 2000-08-02: SGD gene associations have been updated to use TAS and NAS.
Notes on IEP (added 2000-03-08; updated 2000-03-09 MAH): Addition of the IEP category generated a lot of discussion via email. One theme that emerged is that curators and users will have to be careful when interpreting expression results, especially if there's no other kind of evidence linking a gene product with a process. For instance, we certainly don't want to look at a cluster of genes, and, based on previous knowledge of one of them being involved in protein folding, annotate the rest of the genes in that cluster to the same process. This is certainly a dangerous thing to do. But having the IEP code allows curators to include expression data when they deem it appropriate, and allows researchers to make their own decisions/judgements about the reliability of the annotation.
Another important theme, indeed one of the reasons we opted to add the category, is that systematic analysis will prove to be very informative. It was especially well stated by Richard Baldarelli of MGI, so I've included his message here:
It seems that expression data will be very useful for process and cellular component mapping, but caution should be used for function mapping (as Allan and Kara point out [in email messages]). While conventional expression assays will provide useful evidence in several cases, the real benefit will come from expression profiling. The rationale behind expression profiling from chip data is that genes that are coordinately regulated over a range of environments are likely to be involved in the same biological processes, and thus may have interrelated functions. As expression technology evolves to consider other aspects of gene expression (e.g. transcription and post-transcription chips, Mass-spec on 2D protein data), profiling will become an even more valuable tool for process implication. With the genome sequences here or on the way, the most significant information we may have for many genes will be expression profiling data (at least for a while anyway). Accuracy levels for process implication aside, this type of evidence is necessarily indirect. Having an evidence type "expression" takes this into account and remains fairly non-specific.
For more details, check the GO mail archive for messages with the subject "evidence code comment."
Document composed by Heather Butler and Midori Harris 2000-02-26
Updated 2000-03-09 MAH
Updated 2000-08-02 MAH
Updated 2000-11-08 MAH
Updated 2001-05-17 MAH
Updated 2001-06-22 MAH
Updated 2001-07-15 MAH
Updated 2001-10-15 MAH
Copyright © 1999-2000 Gene Ontology Consortium. Permission to use the information contained in this database was given by the researchers/institutes who contributed or published the information. Users of the database are solely responsible for compliance with any copyright restrictions, including those applying to the author abstracts. Documents from this server are provided "AS-IS" without any warranty, expressed or implied.