VARSPLIC utility and alternative isoform files
----------------------------------------------

INTRODUCTION

Many proteins exist in more than one isoform, one cause of which is
alternative (differential) splicing. Splice isoforms may differ
considerably from one another, with potentially less than 50% sequence
similarity between isoforms. In the UniProt Knowledgebase (UniProtKB),
which comprises the manually annotated database Swiss-Prot and its
automatically annotated supplement TrEMBL, one sequence (usually that of the
longest isoform) is displayed for each protein. Known variations of this sequence are
recorded in the feature table (using the 'VAR_SEQ' key), together with the 
name(s) of the isoform(s) in which each variant occurs.  With the release of 
UniProt 8.0, the same feature key is also used to describe variations in 
sequence resulting from alternative initiation (in previous releases, two 
separate feature keys, 'VARSPLIC' and 'INIT_MET', were used), alternative
promoter usage, and ribosomal frameshifting.
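
As an illustration of how an isoform sequence can be derived from the displayed
sequence and its 'VAR_SEQ' features, the following Python sketch may help. It
is not taken from varsplic.pl: the function name, the tuple layout used for a
feature, and the toy sequence are all assumptions made for this example. It
assumes that each feature gives 1-based, inclusive start and end positions in
the displayed sequence, and it uses an empty replacement string to stand for a
region that is missing from the isoform.

```python
def build_isoform(parent, features, isoform):
    """Return the sequence of one isoform (hypothetical helper).

    parent:   the displayed sequence of the entry.
    features: list of (start, end, replacement, isoform_names) tuples, with
              1-based inclusive positions; replacement is "" for a deletion.
    isoform:  name of the isoform to build.
    """
    selected = [f for f in features if isoform in f[3]]
    sequence = parent
    # Apply features from the end of the sequence backwards, so that the
    # coordinates of features nearer the start remain valid.
    for start, end, replacement, _ in sorted(
        selected, key=lambda f: f[0], reverse=True
    ):
        sequence = sequence[: start - 1] + replacement + sequence[end:]
    return sequence


# Toy data, not a real entry.
toy_parent = "ACDEFGHIKL"
toy_features = [
    (2, 3, "WWW", {"2"}),      # residues 2-3 replaced in isoform 2
    (8, 10, "", {"2", "3"}),   # residues 8-10 missing in isoforms 2 and 3
]
```

An isoform with no features of its own (here, isoform "1") is returned
unchanged, which corresponds to the displayed sequence.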

The results of database-wide sequence comparisons (such as FASTA or
BLAST) may, in some cases, depend on which isoform sequence is used
for each protein in the database. A more informative
set of results might be obtained if such comparisons were run against
a database containing all known isoforms.

The program varsplic.pl has been written to generate additional
records from the UniProtKB, one for each alternative isoform of
each protein. The default output is in FASTA format. There is also an
option to generate statistics on the number of new records generated,
and the percentage sequence change of each new record from its 'parent'
sequence.  Additionally, the user can choose to print new records only
for isoforms whose difference from their 'parent' exceeds a specified
threshold.

Subsequent versions of the program can also handle, on request, sequence
variants described in UniProtKB entries using the 'VARIANT' and 'CONFLICT'
feature keys.  The use of these capabilities is documented within the program
itself, and the documentation is available as perldoc.

varsplic.pl is written in Perl. It requires the Swissknife package.
Both are available from ftp://ftp.ebi.ac.uk/pub/software/uniprot/.  

A change log accompanying the program details developments between versions. 
Major changes are also described at the foot of this document.

DATA FILES

It is possible to download data files in which the current release of the
UniProt Knowledgebase has been expanded, with additional sequences generated
that represent all annotated splice variants.

The UniProt Knowledgebase has two sections, UniProtKB/Swiss-Prot and 
UniProtKB/TrEMBL.  Since UniProt release 5.5, varsplic annotation is only used 
in UniProtKB/Swiss-Prot. 

The file containing the additional sequences is named as follows:

uniprot_sprot_varsplic.fasta.gz

As indicated by the ".gz" extension, this file is compressed in gzip
format; when decompressed, it yields an ASCII file in FASTA format.
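
For users who prefer to read the file without decompressing it to disk first,
a minimal Python sketch is given below. It uses only the standard gzip module;
the function names are assumptions made for this example, and the FASTA parsing
is deliberately simple (a header line starting with ">", followed by one or
more sequence lines).

```python
import gzip


def read_fasta(handle):
    """Yield (header, sequence) pairs from a text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(source):
    """Return all FASTA records from a gzip-compressed file.

    source may be a file name (e.g. "uniprot_sprot_varsplic.fasta.gz")
    or an already opened binary file object.
    """
    with gzip.open(source, "rt", encoding="ascii") as handle:
        return list(read_fasta(handle))
```

For the full file it may be preferable to iterate over read_fasta() directly
rather than building a list, so that only one record is held in memory at a
time.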

This file is rebuilt at each four-weekly release
and is included with the other UniProt Knowledgebase files. It is available 
at the following locations:

ftp://ftp.expasy.org/databases/uniprot/knowledgebase/
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/

IMPORTANT WARNING

It is intended that the original UniProtKB entries should
remain the primary source of information about all isoforms of
each protein, and that the additional records generated by this
program will be used only in sequence similarity comparisons.  For
this reason, only output in FASTA format is being distributed with
this release. Please refer to the original UniProtKB records for
further details concerning each isoform.

USAGE

For usage of varsplic.pl, see the program documentation (embedded in the
program; to view it, type: perldoc varsplic.pl).

CURRENT DISTRIBUTION

The current version of the varsplic program is 2.2.0.  See the change log
accompanying the program for a description of changes since v2.0.

NOTES

(i) '-which' option

The 'which' option allows the user to specify one of three alternative
output options:

1) 'full', i.e., a new record is generated for every existing sequence
in the database, plus one new record for each alternative isoform.
2) 'allforms', i.e., a new record is generated for every existing
sequence for which alternative isoforms exist, plus one new record for each 
alternative.
3) default (no option specified): new records are NOT generated for
isoforms whose sequence is displayed in an existing record. New
records are only generated for alternative isoforms.
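
The three settings can be summarised by the following Python sketch, which
models only the selection rule described above and is not code from
varsplic.pl. The function name, the input layout (accession plus number of
alternative isoforms) and the accessions used in the example are assumptions.

```python
def records_to_print(entries, which=None):
    """Return (accession, label) pairs for the records that would be printed.

    entries: list of (accession, number_of_alternative_isoforms) pairs.
    which:   'full', 'allforms', or None for the default behaviour.
    """
    if which not in (None, "full", "allforms"):
        raise ValueError(f"unknown 'which' value: {which!r}")
    printed = []
    for accession, n_alternatives in entries:
        has_alternatives = n_alternatives > 0
        # The displayed sequence is printed for every entry under 'full',
        # and only for entries with alternative isoforms under 'allforms'.
        if which == "full" or (which == "allforms" and has_alternatives):
            printed.append((accession, "displayed"))
        # Alternative isoforms are printed under every setting.
        for number in range(1, n_alternatives + 1):
            printed.append((accession, f"alternative {number}"))
    return printed


# Hypothetical entries: one without and one with alternative isoforms.
example_entries = [("X00001", 0), ("X00002", 2)]
```

With these two hypothetical entries, the default setting prints two records,
'allforms' prints three, and 'full' prints four.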

(ii) IDs for child records

For new records in FASTA format, the default option for deriving IDs for the
child records is as follows:

1) for original records without alternative isoforms, no
changes have been made
2) for original records with alternative isoforms, the
original ID has been replaced by the primary accession number,
followed by a hyphen and the number '00'
3) for new variants, the accession number of the original record,
followed by a hyphen, followed by a number has been used as an ID.
A different number has been used for each isoform. For example,
record P05067 has three isoforms. If the user specifies 'full' or
'allforms', three FASTA records will be generated from this record,
with IDs P05067-00 (the FASTA record of the original sequence),
P05067-01 and P05067-02. Currently, the largest number of isoforms
described for any protein in Swiss-Prot or TrEMBL is 13.

For new records in FASTA format, the AC of the 'parent' record has
been retained as the AC number.

Alternative rules for deriving identifiers can be specified using various
command line options; see perldoc for details.
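
The default ID rules described above can be sketched in Python as follows.
This is an illustration only, not the program's own code: the function name is
an assumption, the entry ID in the example is a placeholder, and the two-digit
zero padding is inferred from the examples given in this section ('00', '01',
'02').

```python
def child_record_ids(entry_id, accession, n_alternatives):
    """Return the FASTA IDs for an entry: displayed sequence first.

    entry_id:       original ID of the entry (kept if there are no
                    alternative isoforms).
    accession:      primary accession number of the entry.
    n_alternatives: number of alternative isoforms (0 if none).
    """
    if n_alternatives == 0:
        # Rule 1: records without alternative isoforms are left unchanged.
        return [entry_id]
    # Rule 2: the displayed sequence becomes <accession>-00.
    # Rule 3: each alternative isoform receives its own number.
    return [f"{accession}-{number:02d}" for number in range(n_alternatives + 1)]
```

For example, child_record_ids("SOME_ID", "P05067", 2) returns the three IDs
listed above for P05067, where "SOME_ID" is a placeholder for the original
entry ID.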
 
(iii) Statistics 
 
The program can generate statistics, counting the number of variants whose
sequence change from the parental record falls within each of a set of ranges.
The 'statsfile' option allows the user to specify these ranges in a file;
otherwise default values are used.
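
Counting variants per range amounts to a simple binning step, sketched below in
Python. The function name, the way the ranges are expressed (a list of
ascending cut points) and the cut points in the example are assumptions; they
do not describe the format of the statistics file or the program's default
ranges, for which the perldoc should be consulted.

```python
from bisect import bisect_right


def count_by_range(changes, boundaries):
    """Count change values falling into each range.

    changes:    change values between 0 and 1, one per variant.
    boundaries: ascending cut points; [0.1, 0.5] defines the ranges
                [0, 0.1), [0.1, 0.5) and [0.5, 1].
    """
    counts = [0] * (len(boundaries) + 1)
    for change in changes:
        # bisect_right places a value equal to a cut point in the upper range.
        counts[bisect_right(boundaries, change)] += 1
    return counts
```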

The percentage change of each newly generated isoform from its parent
isoform is calculated as follows:

1 - (length of sequence common to both isoforms / length of their gapped
alignment)

The alignment length is determined as follows. A gap is introduced
into the alignment at each position where variation is recorded
between isoforms, i.e. mismatches are not allowed at any position, and
no attempt is made to align divergent sequences with one another.  See
diagram below.

Key: + Shared Sequence
     * Sequence unique to isoform 1
     ^ Sequence unique to isoform 2

Isoform 1: +++++******++++++*****+++
Isoform 2: +++++^^^^^^^^^^^^+++++++++^^^^

These two isoforms would be aligned as below:

Isoform 1: +++++******            ++++++*****+++
Isoform 2: +++++      ^^^^^^^^^^^^++++++     +++^^^^

and a change of 1 - (14/41) = 0.66 (i.e. 66%) would be recorded for isoform 2,
with respect to isoform 1.
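
The arithmetic of this example can be checked with the short Python sketch
below. It is an illustration of the formula only, not the program's own code;
the function name and the representation of the gapped alignment as a list of
(kind, length) segments are assumptions made for the example.

```python
def fraction_changed(segments):
    """Return 1 - (shared length / gapped alignment length).

    segments: list of (kind, length) pairs describing the gapped alignment
              from left to right, where kind is 'shared', 'first' (unique to
              isoform 1) or 'second' (unique to isoform 2).
    """
    total = sum(length for _, length in segments)
    if total == 0:
        raise ValueError("the alignment is empty")
    shared = sum(length for kind, length in segments if kind == "shared")
    return 1 - shared / total


# The alignment from the diagram above, read from left to right.
diagram = [
    ("shared", 5), ("first", 6), ("second", 12),
    ("shared", 6), ("first", 5), ("shared", 3), ("second", 4),
]
```

Here the shared segments sum to 14 and the whole alignment to 41 positions, so
round(fraction_changed(diagram), 2) gives 0.66.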

CHANGES FOR VERSION 2.2

The general behaviour of the program has been altered to reflect changes in how
alternative isoforms are represented in the CC and FT lines of UniProtKB
entries in UniProt release 8.0. The -varseq option has been introduced as the
successor to the -varsplic option (deprecated) and the -init_met option
(discontinued). By default, the program now generates variant sequences caused
by alternative splicing and alternative initiation together (and not separately
as previously).

CHANGES FOR VERSION 2.1

The program has been extended to handle INIT_MET features.

Use of the INIT_MET feature key in a UniProtKB record indicates one of
two possible phenomena, according to the position of the feature.  In position
0, it indicates the presence of an initiator methionine cleaved off the primary
translation during the production of the functional protein product, and not
included in the sequence displayed in the entry.  In any other position, it
indicates an initiator methionine within the displayed sequence that is an
alternative to the usual initiator methionine at position 0 or 1.
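
One possible reading of these two cases is sketched below in Python. It is an
interpretation for illustration, not a description of what varsplic.pl outputs:
the function name is an assumption, as is the choice to return the displayed
sequence with the cleaved methionine restored (position 0) or the sequence
starting at the alternative methionine (any other position). The sequence in
the usage note is a toy string.

```python
def sequence_from_init_met(displayed, position):
    """Return a sequence implied by an INIT_MET feature (hypothetical helper).

    displayed: the sequence displayed in the entry.
    position:  position of the INIT_MET feature (0, or 1-based within the
               displayed sequence).
    """
    if position == 0:
        # The initiator methionine was cleaved and is not displayed:
        # restore it in front of the displayed sequence.
        return "M" + displayed
    if not 1 <= position <= len(displayed):
        raise ValueError("position lies outside the displayed sequence")
    if displayed[position - 1] != "M":
        raise ValueError("no methionine at the given position")
    # Alternative initiation: the product starts at this methionine.
    return displayed[position - 1:]
```

With the toy sequence "ACDMEF", position 0 gives "MACDMEF" and position 4 gives
"MEF".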

Version 2.1 of varsplic.pl has been extended to provide options for using the
information contained in both types of this feature.  Full details of how to
run the revised program, and the output it produces, are available as perldoc.

CHANGES FOR VERSION 2.0

In Swiss-Prot release 41.01 (and in the accompanying TrEMBL release), a new
format was introduced for "CC ALTERNATIVE PRODUCTS" lines.  The new format is
more structured than the previous format.  Associated with these changes are
the introduction of stable identifiers for each named splice isoform in all
entries that describe more than one splice isoform, and the extension of
feature identifiers, previously only used for HUMAN VARIANT features, to
VARSPLIC features in entries from all species.

The effects of these changes on varsplic.pl are as follows:

(i)   the program has been substantially rewritten
(ii)  the new program relies on Swissknife v1.3 or later, which has also
      been rewritten to deal with the new format.  This is now available from
      the EBI's ftp site at ftp://ftp.ebi.ac.uk/pub/software/swissprot/
(iii) there are some slight changes in the output format of files produced by
      varsplic.pl.  Particularly, if only splice variants are expanded, the
      program now displays the (stable) isoform identifier associated with each
      isoform.
(iv)  there has been some cleaning of options and available output formats

Full details of how to run the revised program, and the output it produces, are
available as perldoc.

For detailed information about the format of the ALTERNATIVE PRODUCTS comment,
please refer to the user manual for the UniProt Knowledgebase,
http://web.expasy.org/docs/userman.html#CCAP.

BUG REPORTS

Please report any bugs to pkersey@ebi.ac.uk

REFERENCE

Kersey P., Hermjakob H., Apweiler R.
VARSPLIC: alternatively-spliced protein sequences derived from
Swiss-Prot and TrEMBL.
Bioinformatics 16:1048-1049(2000).

Documentation last updated 17th August 2005


--------------------------------------------------------------------------------
  LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution-NoDerivs License to all
copyrightable parts of our databases. This means that you are free to copy,
distribute, display and make commercial use of these databases in all
legislations, provided you give us credit. However, if you intend to distribute
a modified version of one of our databases, you must ask us for permission
first.

(c) 2002-2015 UniProt Consortium

--------------------------------------------------------------------------------
  DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and disclaim
liability for damages resulting from its use. We cannot provide unrestricted
permission regarding the use of the data, as some data may be covered by patents
or other rights.

Any medical or genetic information is provided for research, educational and
informational purposes only. It is not in any way intended to be used as a
substitute for professional medical advice, diagnosis, treatment or care.