VARSPLIC utility and alternative isoform files
----------------------------------------------

INTRODUCTION

Many proteins exist in more than one isoform, one cause of which is
alternative (differential) splicing. Splice isoforms may differ
considerably from one another, with potentially less than 50% sequence
similarity between isoforms. In the UniProt Knowledgebase (UniProtKB),
which comprises the manually annotated database Swiss-Prot and its
automatically annotated supplement TrEMBL, one sequence (usually that of
the longest isoform) is displayed for each protein. Known variations of
this sequence are recorded in the feature table (using the 'VAR_SEQ'
key), together with the name(s) of the isoform(s) in which each variant
occurs. As of UniProt release 8.0, the same feature key is also used to
describe variations in sequence resulting from alternative initiation
(in previous releases, two separate feature keys, 'VARSPLIC' and
'INIT_MET', were used), alternative promoter usage, and ribosomal
frameshifting.

The results of database-wide sequence comparisons (such as FASTA or
BLAST) may, in some cases, depend on which isoform sequence is used for
each protein in the database. A more informative set of results might be
obtained if such comparisons were run against a database containing all
known isoforms.

The program varsplic.pl has been written to generate additional records
from UniProtKB, one for each alternative isoform of each protein. The
default output is in FASTA format. There is also an option to generate
statistics on the number of new records generated and on the percentage
sequence change of each new record from its 'parent' sequence.
Additionally, the user can choose to print new records only for isoforms
whose difference from their 'parent' exceeds a specified threshold.

Subsequent versions of the program have extended its capabilities so
that it can handle (on request) sequence variants described in UniProtKB
entries using the 'VARIANT' and 'CONFLICT' feature keys. The use of
these capabilities is documented within the program and is available as
perldoc.

varsplic.pl is written in Perl. It requires the Swissknife package. Both
are available from ftp://ftp.ebi.ac.uk/pub/software/uniprot/. A change
log accompanying the program details developments between versions.
Major changes are also described at the foot of this document.

DATA FILES

It is possible to download data files in which the current release of
the UniProt Knowledgebase has been expanded with additional sequences
that represent all annotated splice variants.

The UniProt Knowledgebase has two sections, UniProtKB/Swiss-Prot and
UniProtKB/TrEMBL. Since UniProt release 5.5, varsplic annotation is only
used in UniProtKB/Swiss-Prot. The file containing the additional
sequences is named as follows:

    uniprot_sprot_varsplic.fasta.gz

As indicated by the ".gz" extension, this is a gzip-compressed file
which, when decompressed, produces an ASCII file in FASTA format.

This file is rebuilt at each four-weekly release and is included with
the other UniProt Knowledgebase files. It is available at the following
locations:

    ftp://ftp.expasy.org/databases/uniprot/knowledgebase/
    ftp://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/

IMPORTANT WARNING

It is intended that the original UniProtKB entries should remain the
primary source of information about all isoforms of each protein, and
that the additional records generated by this program will be used only
in sequence similarity comparisons. For this reason, only output in
FASTA format is being distributed with this release. Please refer to the
original UniProtKB records for further details concerning each isoform.
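
As an illustration of working with the file named under DATA FILES, the
following is a minimal Python sketch that reads a gzip-compressed FASTA
file without unpacking it to disk first. It is not part of varsplic.pl
or Swissknife; the function name read_fasta is hypothetical, and no
assumption is made about the layout of the header line beyond the
leading ">" that FASTA format requires.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    Hypothetical helper for illustration only. The header is returned
    as-is (minus the leading '>'); its internal layout is not parsed.
    """
    header, chunks = None, []
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                # Sequence lines may be wrapped; collect and join them.
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

A caller could, for example, count the records with
sum(1 for _ in read_fasta("uniprot_sprot_varsplic.fasta.gz")), assuming
the file has been downloaded to the working directory.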

USAGE

For usage of varsplic.pl, see the program documentation (in the program,
or type: perldoc varsplic.pl).

CURRENT DISTRIBUTION

The current version of the varsplic program is 2.2.0. See the changeLog
accompanying the program for a description of changes since v2.0.

NOTES

(i) '-which' option

The 'which' option allows the user to choose one of three output modes:

1) 'full': a new record is generated for every existing sequence in the
   database, plus one new record for each alternative isoform.

2) 'allforms': a new record is generated for every existing sequence for
   which alternative isoforms exist, plus one new record for each
   alternative.

3) default (no option specified): new records are NOT generated for
   isoforms whose sequence is displayed in an existing record. New
   records are only generated for alternative isoforms.

(ii) IDs for child records

For new records in FASTA format, the default rules for deriving IDs for
the child records are as follows:

1) for original records without alternative isoforms, the ID is left
   unchanged

2) for original records with alternative isoforms, the original ID is
   replaced by the primary accession number, followed by a hyphen and
   the number '00'

3) for new variants, the ID is the accession number of the original
   record, followed by a hyphen, followed by a number. A different
   number is used for each isoform.

For example, record P05067 has three isoforms. If the user specifies
'full' or 'allforms', three FASTA records will be generated from this
record, with IDs P05067-00 (the FASTA record of the original sequence),
P05067-01 and P05067-02. Currently, the largest number of isoforms
described for any protein in Swiss-Prot or TrEMBL is 13.

For new records in FASTA format, the AC of the 'parent' record is
retained as the AC number.

Alternative rules for deriving identifiers can be specified using
various command line options; see perldoc for details.
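
The default ID rules in note (ii), combined with the three output modes
in note (i), can be sketched in Python as below. This is an illustration
of the rules as described in this document, not code taken from
varsplic.pl: the function name child_ids, its parameters, and the mode
label 'default' (standing for "no option specified") are all
hypothetical.

```python
def child_ids(original_id, accession, n_isoforms, which="default"):
    """Return the FASTA IDs produced for one parent record.

    Hypothetical sketch of the default ID rules described above.
    n_isoforms counts the displayed isoform plus its alternatives.
    """
    if which not in ("default", "allforms", "full"):
        raise ValueError("unknown mode: %r" % which)
    if n_isoforms < 1:
        raise ValueError("a record has at least one isoform")

    if n_isoforms == 1:
        # Rule 1: no alternative isoforms, ID unchanged. Such records
        # are only written out in 'full' mode.
        return [original_id] if which == "full" else []

    ids = []
    if which in ("full", "allforms"):
        # Rule 2: the displayed sequence gets accession + '-00'.
        ids.append("%s-00" % accession)
    # Rule 3: each alternative isoform gets its own two-digit number.
    for number in range(1, n_isoforms):
        ids.append("%s-%02d" % (accession, number))
    return ids


# The P05067 example from the text (three isoforms, 'full' mode):
print(child_ids("ORIGINAL_ID", "P05067", 3, which="full"))
# ['P05067-00', 'P05067-01', 'P05067-02']
```

With no mode specified, the same call would return only the IDs ending
in -01 and -02, since the displayed sequence is then not re-emitted.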

(iii) Statistics

The program can generate statistics, counting the number of variants
whose sequence change from the parental record falls within each of a
set of ranges. The 'statsfile' option allows the user to specify these
ranges in a file; otherwise default values are used.

The change of each newly generated isoform from its parent isoform is
calculated as follows:

    1 - (length of sequence common to both isoforms /
         length of their gapped alignment)

The alignment length is determined as follows. A gap is introduced into
the alignment at each position where variation is recorded between
isoforms, i.e. mismatches are not allowed at any position, and no
attempt is made to align divergent sequences with one another. See the
diagram below.

Key:  +  Shared sequence
      *  Sequence unique to isoform 1
      ^  Sequence unique to isoform 2

Isoform 1: +++++******++++++*****+++
Isoform 2: +++++^^^^^^^^^^^^+++++++++^^^^

These two isoforms would be aligned as below:

Isoform 1: +++++******            ++++++*****+++
Isoform 2: +++++      ^^^^^^^^^^^^++++++     +++^^^^

The alignment has 41 columns, of which 14 are shared (11 are unique to
isoform 1 and 16 to isoform 2), so a change of 1 - (14/41) = 0.66, i.e.
66%, would be recorded for isoform 2 with respect to isoform 1.

CHANGES FOR VERSION 2.2

The general behaviour of the program has been altered to reflect changes
in how alternative isoforms are represented in the CC and FT lines of
UniProtKB entries in UniProt release 8.0.

The -varseq option has been introduced as the successor to the -varsplic
option (deprecated) and the -init_met option (discontinued). By default,
the program now generates variant sequences caused by alternative
splicing and alternative initiation together (and not separately, as
previously).

CHANGES FOR VERSION 2.1

The program has been extended to handle INIT_MET features. Use of the
INIT_MET feature key in a UniProtKB record indicates one of two possible
phenomena, according to the position of the feature.
In position 0, it indicates the presence of an initiator methionine that
is cleaved from the primary translation product during the production of
the functional protein product, and is not included in the sequence
displayed in the entry. In any other position, it indicates an initiator
methionine within the displayed sequence that is an alternative to the
usual initiator methionine at position 0 or 1. Version 2.1 of
varsplic.pl has been extended to provide options for using the
information contained in both types of this feature.

Full details of how to run the revised program, and the output it
produces, are available as perldoc.

CHANGES FOR VERSION 2.0

In Swiss-Prot release 41.01 (and in the accompanying TrEMBL release), a
new format was introduced for "CC ALTERNATIVE PRODUCTS" lines. The new
format is more structured than the previous format. Associated with
these changes are the introduction of stable identifiers for each named
splice isoform in all entries that describe more than one splice
isoform, and the extension of feature identifiers, previously only used
for HUMAN VARIANT features, to VARSPLIC features in entries from all
species.

The effects of these changes on varsplic.pl are as follows:

(i)   the program has been substantially rewritten

(ii)  the new program is now reliant on Swissknife v1.3 or later, which
      has also been rewritten to deal with the new format. This is now
      available from the EBI's ftp site at
      ftp://ftp.ebi.ac.uk/pub/software/swissprot/

(iii) there are some slight changes in the output format of files
      produced by varsplic.pl. In particular, if only splice variants
      are expanded, the program now displays the (stable) isoform
      identifier associated with each isoform.

(iv)  there has been some cleaning of options and available output
      formats

Full details of how to run the revised program, and the output it
produces, are available as perldoc.

For detailed information about the format of the ALTERNATIVE PRODUCTS
comment, please refer to the user manual for the UniProt Knowledgebase,
http://web.expasy.org/docs/userman.html#CCAP.

BUG REPORTS

Please report any bugs to pkersey@ebi.ac.uk

REFERENCE

Kersey P., Hermjakob H., Apweiler R.
VARSPLIC: alternatively-spliced protein sequences derived from
Swiss-Prot and TrEMBL.
Bioinformatics 16:1048-1049(2000).

Documentation last updated 17th August 2005

--------------------------------------------------------------------------------
LICENSE
--------------------------------------------------------------------------------
We have chosen to apply the Creative Commons Attribution-NoDerivs
License to all copyrightable parts of our databases. This means that you
are free to copy, distribute, display and make commercial use of these
databases in all legislations, provided you give us credit. However, if
you intend to distribute a modified version of one of our databases, you
must ask us for permission first.

(c) 2002-2015 UniProt Consortium

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------
We make no warranties regarding the correctness of the data, and
disclaim liability for damages resulting from its use. We cannot provide
unrestricted permission regarding the use of the data, as some data may
be covered by patents or other rights.

Any medical or genetic information is provided for research, educational
and informational purposes only. It is not in any way intended to be
used as a substitute for professional medical advice, diagnosis,
treatment or care.