######################################################################
 
README
 
######################################################################
 
=====================================
Directory Contents
=====================================
 
This directory includes sequence records and map data generated at
NCBI or used in NCBI resources.
 
Sequence data include chromosomes, contigs, RNAs, and proteins
generated through the NCBI Reference Sequence and NCBI Genome
Annotation projects.
See more details in:
  http://www.ncbi.nlm.nih.gov/genome/annotation_euk
and
  http://www.ncbi.nlm.nih.gov/books/NBK169439/
 
Map data presented in the Map Viewer resource are also provided here.
The NCBI Map Viewer provides graphical views of the genome data. See:
  http://www.ncbi.nlm.nih.gov/mapview/
 
The sections below include:
 
    README_CURRENT_RELEASE file
    annotation_report.xml
    Scaffold assembly & information files
      allcontig.agp.gz
      masking_coordinates.gz
      seq_contig.md.gz
      windowmasker_nmer.oascii.gz
      scaffold_names
    CHR_## - Chromosome directories
    Assembled_chromosomes directory & chr_NC_gi file
    RNA, protein and other directories
    GFF
    Gnomon
    mapview directory
    ARCHIVE directory
    File extensions
 
Sequence data are in the Chromosome, RNA, and protein directories.
 
 
=====================================
README_CURRENT_RELEASE file
=====================================
 
This file provides information specific to the current annotation
release, including data freeze dates, release date and release 
number, and the annotated assemblies. This file also indicates if 
updates are made to correct an error or to provide updated 
information.
 
 
==========================
annotation_report.xml file
==========================

This file is the XML version of the HTML report for the organism:
http://www.ncbi.nlm.nih.gov/genome/annotation_euk/{org_name}/{annotation_release_id}/
(e.g., http://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/106/)

It contains information on the annotation release, including:
* Important dates associated with the annotation
* Assemblies
* Gene and feature statistics
* Masking results
* Transcript and protein alignments used for the annotation
* Assembly-assembly alignments used to track genes from the previous
  assembly to the current, or from the reference to an alternate assembly
 

====================================================
Scaffold and chromosome assembly & information files
====================================================
 
allcontig.agp.gz file:
----------------------
This file provides detailed information about the scaffold assembly.
 
columns:
 
 1:   scaffold accession.version
  
 2:   beginning base on scaffold
  
 3:   ending base on scaffold
  
 4:   scaffold fragment number
  
 5:   fragment type (D=Draft, F=Finished, W=Whole genome shotgun
      (WGS), N=NN gap)
  
 6:   if sequence, value = accession.version of the component
      sequence from which bases are derived
      if N-gap, value = number of N's
  
 7:   if sequence, value = beginning base of component sequence
      if N-gap, value = keyword "fragment"
      {fragment keyword indicates gap between fragments within a
      clone or between fragments of overlapping clones}
  
 8:   if sequence, value = ending base of component sequence
      if N-gap, value = yes: order and orientation of the flanking
      fragments are supported by mRNA, EST or BAC end pair
      if N-gap, value = no: no order and orientation between
      flanking fragments
  
 9:   + if accession is positive orientation to scaffold,
      - otherwise
      (column 9 for sequence only)
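
As an illustration only, the column layout above can be read with a
few lines of Python. This sketch assumes the file is tab-delimited
(not stated above). The AgpRow type and the parse_agp_line function
are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    scaffold: str                # column 1
    start: int                   # column 2
    end: int                     # column 3
    fragment_number: int         # column 4
    fragment_type: str           # column 5 (D, F, W, or N)
    component: Optional[str]     # column 6, sequence rows only
    gap_length: Optional[int]    # column 6, N-gap rows only
    orientation: Optional[str]   # column 9, sequence rows only


def parse_agp_line(line: str) -> AgpRow:
    """Split one AGP line into named fields (tab-delimited is assumed)."""
    cols = line.rstrip("\n").split("\t")
    is_gap = cols[4] == "N"
    return AgpRow(
        scaffold=cols[0],
        start=int(cols[1]),
        end=int(cols[2]),
        fragment_number=int(cols[3]),
        fragment_type=cols[4],
        component=None if is_gap else cols[5],
        gap_length=int(cols[5]) if is_gap else None,
        orientation=None if is_gap else cols[8],
    )
```

Lines can be supplied from the compressed file with
gzip.open(path, "rt") from the Python standard library.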
 
 
windowmasker_nmer.oascii.gz
---------------------------
The windowmasker_nmer.oascii.gz file gives Nmer counts generated by
running the first phase of WindowMasker (Morgulis A, Gertz EM,
Schaffer AA, Agarwala R. 2006. Bioinformatics 22:134-41) on the
genomic sequences of the reference assembly. These counts can be used
as input for the second phase of WindowMasker to mask any nucleotide
sequence for the genome. N and other default parameter settings are
computed within WindowMasker depending on the input genome sequence.
The windowmasker_nmer.oascii.gz file is in WindowMasker optimized
ASCII format and is not human readable. Alternate human readable
formats are supported and can be generated by running WindowMasker.
WindowMasker is available at:
ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/
 
 
masking_coordinates.gz:
-----------------------
The masking_coordinates.gz file lists locations for segments of
repetitive sequence in the genomic scaffolds (determined using
RepeatMasker http://www.repeatmasker.org/). These coordinates can be
used to mask the repetitive sequences in the scaffolds.
 
columns:
 
 1.   scaffold accession.version
 
 2.   beginning base on scaffold
 
 3.   ending base on scaffold
 
 4.   class of repetitive sequence, or list of classes when
      overlapping repeats have been merged into a single
      span.
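
One possible way to apply these coordinates is sketched below. It
assumes a tab-delimited file and 1-based, inclusive coordinates
(neither is stated above). The function names read_masking_spans and
soft_mask are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]


def read_masking_spans(lines: Iterable[str]) -> Dict[str, List[Span]]:
    """Group (begin, end) spans by scaffold accession.version."""
    spans: Dict[str, List[Span]] = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        cols = line.rstrip("\n").split("\t")
        spans[cols[0]].append((int(cols[1]), int(cols[2])))
    return dict(spans)


def soft_mask(sequence: str, spans: Iterable[Span]) -> str:
    """Lower-case every base that falls inside a masking span."""
    chars = list(sequence)
    for begin, end in spans:
        # Convert 1-based inclusive coordinates to Python indices.
        for i in range(begin - 1, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)
```

Lower-casing the masked bases mirrors the convention used for the
*.mfa.gz files in this directory.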
 
 
seq_contig.md.gz file:
----------------------
The seq_contig.md file provides information on the order and
orientation of the scaffolds along the chromosome.
 
columns:
 
1. tax_id:       Taxonomy ID for the annotated organism
 
2. chromosome:   * or *|scaffold where * is the chromosome and
                 *|scaffold indicates the scaffold is associated
                 with the chromosome
 
3. from:         chromosome coordinate, reported in 1-based coordinates
 
4. to:           chromosome coordinate, reported in 1-based coordinates
 
5. orientation:  +, -, 0 - where 0 indicates uncertainty in
                 orientation
 
6. accession:    accession.version format
 
7. id:           internal ID
 
8. type:         designates the type of feature (e.g. scaffold)
 
9. assembly:     this value is used to associate scaffolds with a
                 particular assembly (e.g., reference assembly vs
                 alternate assemblies provided by other groups or
                 representing other strains)
 
10. weight:      weight value for object. For all maps, a lower
                 weight signifies a higher confidence value for the
                 map object.
                 1= finished sequence (Blue in MapViewer)
                 3= WGS sequence (Green in MapViewer)
                 5= Draft sequence (Orange in MapViewer)
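
As a rough sketch, the columns above can be used to list the
scaffolds of one chromosome in chromosome order. The code assumes a
tab-delimited file whose header or comment lines start with "#"; the
names read_seq_contig and scaffolds_in_order are hypothetical.

```python
from typing import Dict, Iterable, List

COLUMNS = ["tax_id", "chromosome", "from", "to", "orientation",
           "accession", "id", "type", "assembly", "weight"]


def read_seq_contig(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Turn each data line into a dict keyed by the column names above."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed header/comment lines
        rows.append(dict(zip(COLUMNS, line.rstrip("\n").split("\t"))))
    return rows


def scaffolds_in_order(rows: List[Dict[str, str]], chromosome: str,
                       assembly: str) -> List[Dict[str, str]]:
    """Select one chromosome of one assembly, sorted by 'from'."""
    picked = [r for r in rows
              if r["chromosome"] == chromosome and r["assembly"] == assembly]
    return sorted(picked, key=lambda r: int(r["from"]))
```

Filtering on the assembly column keeps scaffolds from the reference
and alternate assemblies apart.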
 
 
scaffold_names file:
--------------------
This file provides alternative names used for the genomic scaffolds in
each specified assembly.
 
columns:
 
1:   Assembly label
 
2:   Genome Center name or na
 
3:   Genomic RefSeq Accession.version
 
4:   GenBank Accession.version
 
5:   NCBI name
     (used prior to assignment of the RefSeq Accession.version).
 
na: not applicable. na in column 4 indicates that the scaffold
sequence was revised and that no GenBank version of the scaffold
exists. This can be due to replacement of foreign contaminants by gaps
in the RefSeq sequence or a difference in orientation.
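
A minimal sketch of how this file might be used to translate GenBank
scaffold accessions into RefSeq accessions follows. It assumes a
tab-delimited file with optional "#" comment lines; the function name
genbank_to_refseq is hypothetical.

```python
from typing import Dict, Iterable


def genbank_to_refseq(lines: Iterable[str]) -> Dict[str, str]:
    """Map GenBank accession.version (column 4) to RefSeq (column 3)."""
    mapping = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        cols = line.rstrip("\n").split("\t")
        refseq, genbank = cols[2], cols[3]
        if genbank != "na":  # na: no GenBank version of the scaffold
            mapping[genbank] = refseq
    return mapping
```

Rows with na in column 4 are skipped because they have no GenBank
counterpart to map from.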
 
 
=====================================
CHR_## - Chromosome directories
=====================================
 
The files in the chromosome directories provide concatenated sequence
data for scaffolds that have been assembled from individual GenBank
records.
 
The order of the scaffolds in these files does not represent their
order on the chromosome.
 
The scaffolds in the chromosome FTP directories are the same ones that
are presented on the NCBI Map Viewer; the sequences include the
reference assembly and may include alternate assemblies when
available.
 
The constructed scaffolds are reference sequences (RefSeq) and are not
part of the GenBank database. GenBank contains archival sequence
records as they were submitted by the producers of the data. See the
RefSeq web site for more information:
  http://www.ncbi.nih.gov/RefSeq/
 
 
=====================================
Assembled_chromosomes directory
=====================================
 
(directory available if at least some of the scaffolds are assembled
into chromosomes)
 
The files in this directory, and its sub-directories, provide data for
all the top-level objects in each assembly: assembled chromosomes,
unlocalized scaffolds (those scaffolds that are associated with a
specific chromosome but which cannot be ordered or oriented on that
chromosome), unplaced scaffolds (those scaffolds that are not
associated with any chromosome), and in some cases scaffolds from
alternate locus groups or genome patches (see the NCBI Assembly Model
web page for an explanation of these terms:
http://www.ncbi.nlm.nih.gov/genome/assembly/model).
 
The filenames include the assembly name. To obtain the complete set of
data for an assembly, download all the files for the desired format
that contain the same assembly name. Depending on the particular
assembly, this set may include multiple chromosomes files with names
including a "chr*" term, an unlocalized scaffold file with
"unlocalized" in its name, an unplaced scaffold file with "unplaced"
in its name, and an alternate scaffold file with "alts" in its name.
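
The selection described above can be sketched as follows, given a
list of filenames from this directory. The substring tests simply
follow the naming terms quoted above; the function name
group_assembly_files and the filenames shown in the comments are
hypothetical.

```python
from typing import Dict, Iterable, List


def group_assembly_files(filenames: Iterable[str], assembly_name: str,
                         suffix: str) -> Dict[str, List[str]]:
    """Collect the files of one assembly and one format, by group."""
    groups: Dict[str, List[str]] = {
        "chromosomes": [], "unlocalized": [], "unplaced": [], "alts": [],
    }
    for name in filenames:
        # Keep only files for the requested assembly and format,
        # e.g. assembly_name="ExampleAsm", suffix=".fa.gz" (invented).
        if assembly_name not in name or not name.endswith(suffix):
            continue
        if "unlocalized" in name:
            groups["unlocalized"].append(name)
        elif "unplaced" in name:
            groups["unplaced"].append(name)
        elif "alts" in name:
            groups["alts"].append(name)
        elif "chr" in name:
            groups["chromosomes"].append(name)
    return groups
```

A plain substring match on the assembly name is a simplification; it
would also match a longer assembly name that contains the shorter one.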
 
 
chr_NC_gi file:
---------------
The chr_NC_gi file provides the accession and gi for the reference
sequence (RefSeq) chromosome records, and any complete chromosomes
from alternate assemblies.
 
columns:
 1. chromosome
 2. chromosome accession.version
 3. chromosome gi
 4. assembly name
 5. assembly accession.version
 
 
chr_accessions_{assembly name} file:
------------------------------------
The chr_accessions_* file provides the correspondence between the
RefSeq and GenBank records for each chromosome in the assembly.
 
columns:
 1. Chromosome
 2. RefSeq Accession.version
 3. RefSeq gi
 4. GenBank Accession.version
 5. GenBank gi
 
na: not applicable. na in columns 4 and 5 indicates that the
chromosome sequence was revised and that no GenBank version of the
chromosome exists.
 
 
unlocalized_ and unplaced_accessions_{assembly name} files:
-----------------------------------------------------------
The unlocalized_* and unplaced_* files provide the correspondence
between the RefSeq and GenBank records for scaffolds that are,
respectively, unlocalized on a chromosome or unplaced. If an assembly
includes scaffolds from alternate locus groups or genome patches, then
accession, version and gi data for these scaffolds are provided in a
file named alts_accessions_{assembly name}.
 
columns:
 1. Chromosome (Un, if unplaced scaffold)
 2. RefSeq Accession.version
 3. RefSeq gi
 4. GenBank Accession.version
 5. GenBank gi
 
na: not applicable. na in columns 4 and 5 indicates that the
scaffold sequence was revised and that the GenBank and RefSeq
sequences differ or that no GenBank accession was assigned.
 
 
seq sub-directory:
------------------
The files in this directory provide assembled sequences for the
chromosomes and other top-level objects in FASTA format. Runs of Ns
are inserted into the chromosome sequence wherever there is a gap in
the scaffold layout, e.g. between scaffolds, at the centromere, at the
telomeres, or at large regions of heterochromatin. The chromosome
coordinates of features placed on chromosomes, as displayed in Map
Viewer or provided in the sequence based map files located in the
/mapview directory, correspond to positions on these assembled
chromosome sequences. The feature coordinates used for unlocalized or
unplaced scaffolds use the coordinate system of each scaffold.
 
Files with the suffix .fa.gz contain unmasked sequences; files with
the suffix .mfa.gz contain sequences masked using WindowMasker or
RepeatMasker (lower case) and the results of a screen against
foreign sequences (N's).
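
As an illustration of the masking convention above, the following
sketch converts a soft-masked sequence string (as found in a .mfa.gz
file) to a hard-masked one and reports how much of it is masked. The
function names hard_mask and masked_fraction are hypothetical.

```python
def hard_mask(sequence: str) -> str:
    """Replace lower-case (repeat-masked) bases with N."""
    return "".join("N" if base.islower() else base for base in sequence)


def masked_fraction(sequence: str) -> float:
    """Fraction of positions that are lower case or N."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)
```

Note that N's in these files can also come from assembly gaps, so the
fraction counts gap positions together with masked positions.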
 
Each file is named according to the abbreviation for the species,
whether the assembly is the reference assembly (_ref_) or an alternate
assembly (_alt_), the assembly name, and either the chromosome label
or the scaffold group (unlocalized, unplaced, or alts).
 
 
agp sub-directory:
------------------
Files describing, in AGP format, how the chromosomes and other
top-level objects are assembled from their component sequence
records. Filenames follow the convention described for the seq
sub-directory and have the suffix .agp.gz.
 
columns:
 
 1:   chromosome, as chr+chromosome designation, or scaffold name
  
 2:   beginning base on chromosome or scaffold
  
 3:   ending base on chromosome or scaffold
  
 4:   fragment number
  
 5:   fragment type (D=Draft, F=Finished, W=Whole genome shotgun
      (WGS), N=NN gap)
  
 6:   if sequence, value = accession.version of the component
      sequence from which bases are derived
      if N-gap, value = number of N's
  
 7:   if sequence, value = beginning base of component sequence
      if N-gap, value = keyword "fragment"
      {fragment keyword indicates gap between fragments within a
      clone or between fragments of overlapping clones}
  
 8:   if sequence, value = ending base of component sequence
      if N-gap, value = yes: order and orientation of the flanking
      fragments are supported by mRNA, EST or BAC end pair
      if N-gap, value = no: no order and orientation between
      flanking fragments
  
 9:   + if accession is positive orientation to chromosome
      - otherwise
      (column 9 for sequence only)
 
 
gbs sub-directory:
------------------
Files providing annotation, in GenBank flat file format, for the
chromosomes and other top-level objects. Filenames follow the
convention described for the seq sub-directory and have the suffix
.gbs.gz.
 
 
=====================================
RNA, protein and other directories
=====================================
 
The RNA and protein directories provide sequence files in three
formats representing all of the mRNA, non-coding transcript, and
protein model reference sequences (RefSeq) exported as part of the
genome annotation process.
 
In addition, FASTA files containing the comprehensive set of Gnomon
predictions are provided. These correspond to the Map Viewer
'Model Transcripts' map and include a supported subset that is
instantiated as model RefSeq records (with accession prefix XM_,
XR_, or XP_) and an 'Ab initio' subset that is not instantiated into
model RefSeq. These purely 'Ab initio' models are neither assigned
accession numbers nor tracked between annotation releases. They are an
experimental dataset. Additional information about this prediction
program is available at:
  http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.shtml
 
 
RNA directory:
--------------
File Name             Format         Contents
---------------------------------------------------------------------
Gnomon_mRNA.fsa.gz    FASTA          transcript predictions
rna.asn.gz            ASN.1          annotated transcripts
rna.fa.gz             FASTA          annotated transcripts
rna.gbk.gz            Flat File      annotated transcripts
 
protein directory:
------------------
File Name             Format         Contents
--------------------------------------------------------------------
Gnomon_prot.fsa.gz    FASTA          protein predictions
protein.fa.gz         FASTA          annotated proteins
protein.gbk.gz        Flat File      annotated proteins
 
 
Accession Format      Molecule       Type
----------------------------------------------------
NM_xxxxxx             mRNA           curated RefSeq*
NR_xxxxxx             transcript     curated RefSeq*
NP_xxxxxx             protein        curated RefSeq*
YP_xxxxxx             protein        curated RefSeq*
XM_xxxxxx             mRNA           model@
XR_xxxxxx             transcript     model@
XP_xxxxxx             protein        model@
 
* curated RefSeq= these RefSeq records are subject to review and
curation by NCBI's RefSeq staff, and may be updated between annotation
releases. The curation process is ongoing. The accession prefix may be
followed by either 6 or 9 digits (e.g., NM_123456 and NM_123456789).
 
@ model RefSeq= these RefSeq records are products of the genome
annotation processing and are not subject to curation and updates
between annotation releases. Model RefSeqs represent Gnomon
predictions that are supported by transcript and/or protein homology.
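
The accession formats in the table above can be recognized with a
short Python sketch such as the one below. It covers only the seven
prefixes listed here and the 6- or 9-digit forms mentioned above, and
it allows an optional ".version" suffix. The names PREFIX_INFO and
classify_accession are hypothetical.

```python
import re
from typing import Optional, Tuple

# Prefix -> (molecule, type), copied from the table above.
PREFIX_INFO = {
    "NM": ("mRNA", "curated RefSeq"),
    "NR": ("transcript", "curated RefSeq"),
    "NP": ("protein", "curated RefSeq"),
    "YP": ("protein", "curated RefSeq"),
    "XM": ("mRNA", "model"),
    "XR": ("transcript", "model"),
    "XP": ("protein", "model"),
}

# Two-letter prefix, underscore, 6 or 9 digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d{6}|\d{9})(?:\.\d+)?$")


def classify_accession(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule, type) for a listed prefix, else None."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        return None
    return PREFIX_INFO.get(match.group(1))
```

Accessions with other prefixes (for example genomic records) return
None because they are not part of this table.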
 
Additional information about the curated RefSeqs (NM_, NR_, NP_
accession prefix) is available at:
  http://www.ncbi.nlm.nih.gov/RefSeq/
  ftp://ftp.ncbi.nih.gov/refseq/
 
Additional information about the gene models is available at
  http://www.ncbi.nlm.nih.gov/genome/annotation_euk/process/
 
 
other directory:
----------------
File Name                     Format     Contents
---------------------------------------------------------------------
pseudo_without_product.fa.gz  FASTA      pseudogenes without products
 
This file provides the genomic sequence corresponding to pseudogene
and other gene regions which do not have any associated transcribed
RNA products or translated protein products. It includes annotated
gene regions that require rearrangement to provide the final product,
e.g. immunoglobulin segments. These sequences are not assigned
accession numbers, and are derived directly from the assembled genomic
sequences.
 
 
====
GFF
====
The files in this directory provide the features annotated on the
genomic sequences of the assembly(ies) in GFF version 3 format,
according to specifications version 1.20 at:
http://www.sequenceontology.org/gff3.shtml
 
{alt,ref}_{assembly_name}_scaffolds.gff3.gz
-------------------------------------------
Features annotated on {assembly_name} in scaffold coordinates.
 
{alt,ref}_{assembly_name}_top_level.gff3.gz
-------------------------------------------
Features annotated on {assembly_name} in top-level object coordinates.
The top-level objects are: assembled chromosomes, unlocalized
scaffolds (those scaffolds that are associated with a specific
chromosome but which cannot be ordered or oriented on that
chromosome), unplaced scaffolds (those scaffolds that are not
associated with any chromosome), and in some cases scaffolds from
alternate locus groups or genome patches (see the NCBI Assembly Model
web page for an explanation of these terms:
http://www.ncbi.nlm.nih.gov/genome/assembly/model).
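
For readers new to the format, a minimal sketch of reading one GFF3
feature line is given below. It follows the nine tab-delimited columns
and the URL-escaped key=value attributes of the GFF3 specification;
the function name parse_gff3_line is hypothetical, and a "##FASTA"
section, if present, is not handled.

```python
from typing import Optional
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase"]


def parse_gff3_line(line: str) -> Optional[dict]:
    """Parse one feature line; return None for comments and blanks."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(GFF3_COLUMNS, cols[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        # Multiple values are comma-separated; unescape after splitting.
        attributes[unquote(key.strip())] = [unquote(v)
                                            for v in value.split(",")]
    record["attributes"] = attributes
    return record
```

Attribute values are returned as lists because GFF3 allows several
comma-separated values per key.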
 
 
======
Gnomon
======
The files in this directory provide the Gnomon models predicted on
the genomic sequences of the assembly(ies) in GFF version 3 format,
according to specifications version 1.20 at:
http://www.sequenceontology.org/gff3.shtml
 
These models correspond to the Map Viewer 'Model Transcripts' map and
include a supported subset that is instantiated as model RefSeq
records (with accession prefix XM_, XR_, or XP_) and an 'Ab initio'
subset that is not instantiated into model RefSeq. These purely
'Ab initio' models are neither assigned accession numbers nor tracked
between annotation releases. They are an experimental dataset.
Additional information about this prediction program is available at:
  http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.shtml
 
{alt,ref}_{assembly_name}_gnomon_scaffolds.gff3.gz
--------------------------------------------------
Gnomon models predicted on {assembly_name} in scaffold coordinates.
 
{alt,ref}_{assembly_name}_gnomon_top_level.gff3.gz
--------------------------------------------------
Gnomon models predicted on {assembly_name} in top-level object
coordinates. The top-level objects are: assembled chromosomes,
unlocalized scaffolds (those scaffolds that are associated with a
specific chromosome but which cannot be ordered or oriented on that
chromosome), unplaced scaffolds (those scaffolds that are not
associated with any chromosome), and in some cases scaffolds from
alternate locus groups or genome patches (see the NCBI Assembly Model
web page for an explanation of these terms:
http://www.ncbi.nlm.nih.gov/genome/assembly/model).
 
 
=====================================
mapview directory
=====================================
 
This directory contains assembly and annotation data used to provide
the displays available for this organism in Map Viewer:
http://www.ncbi.nlm.nih.gov/mapview
 
Most of the files in this directory contain headers that document the
content of the fields in each file. Additional information on some
files is provided below.
 
org_transcript.gff.gz and zoo_transcript.gff.gz files
-----------------------------------------------------
These files provide cDNA-to-genomic (spliced) sequence
alignments. These files include same-species and cross-species
alignments, respectively. Alignments are generated via the Splign
alignment tool:
http://www.ncbi.nlm.nih.gov/sutils/splign
Information on indels has not been included.
 
The file format is GFF version 3 according to specifications version
1.07:
http://song.sourceforge.net/gff3.shtml
 
The content is in chromosomal coordinates or scaffold coordinates for
unplaced scaffolds. The accession.version of a genomic reference
sequence (NCBI RefSeq) is used as the value of the GTF/GFF 'seqid'
column. (Examples of accession.version are NC_* or AC_* for
chromosomes and NW_* or NT_* for scaffolds.) The genome assembly and
chromosome names for the chromosome sequences can be obtained from the
file Assembled_chromosomes/chr_NC_gi. Likewise, the file
mapview/seq_contig.md.gz provides the genome assembly and chromosome
assignment, if any, for the unplaced scaffolds.
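
The lookup described above can be sketched as follows: the chr_NC_gi
file is read into a table keyed by accession.version, which can then
be used to translate a 'seqid' value into a chromosome and assembly
name. The code assumes a tab-delimited file with optional "#" comment
lines and the column order given in the chr_NC_gi section; the
function name read_chr_nc_gi is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def read_chr_nc_gi(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map chromosome accession.version to (chromosome, assembly name)."""
    lookup = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        cols = line.rstrip("\n").split("\t")
        # Column 2 is the accession; columns 1 and 4 are the
        # chromosome and the assembly name.
        lookup[cols[1]] = (cols[0], cols[3])
    return lookup
```

A seqid that is missing from the table is most likely a scaffold
accession rather than a chromosome accession.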
 
These files replace org_transcript.gtf.gz and zoo_transcript.gtf.gz
which were in a format compatible with GFF version 2 and GTF.
 
 
=====================================
Mapping_data directory
=====================================
 
This directory is a link to the UniSTS FTP site, which contains
non-sequence-based mapping information for this organism's STSs.
 
 
=====================================
ARCHIVE directory
=====================================
 
This directory is provided to maintain archival annotation release
data.
 
 
======================================
FOSMIDS directory
======================================
 
Directory for FOSMID sequence data.
 
 
=====================================
File extensions
=====================================
 
File extensions impart information about the file format as follows:
 
*.asn.gz = ASN.1 file, print form
 
*.fa.gz  = FASTA file format, compressed
 
*.fsa.gz = FASTA file format, compressed
 
*.mfa.gz = masked FASTA file format, compressed
           (repeats identified with WindowMasker or RepeatMasker are
           lower case and foreign spans are replaced with N's).
 
*.gbk.gz = GenBank flat file format (annotation + sequence),
           compressed
 
*.gbs.gz = GenBank summary file format (annotation only), compressed
           The *.gbs file format does not contain sequence data,
           but instead contains a "CONTIG" field showing how the
           scaffold or chromosome is assembled from its components.
 
*.gff3.gz = GFF version 3 file format, compressed
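
For scripted downloads, the extension list above can be expressed as
a small lookup table. This is only a convenience sketch; the names
FORMAT_BY_SUFFIX and describe_format are hypothetical.

```python
from typing import Optional

# Suffix -> format, copied from the list above.
FORMAT_BY_SUFFIX = {
    ".asn.gz": "ASN.1, print form",
    ".fa.gz": "FASTA",
    ".fsa.gz": "FASTA",
    ".mfa.gz": "masked FASTA",
    ".gbk.gz": "GenBank flat file (annotation + sequence)",
    ".gbs.gz": "GenBank summary file (annotation only)",
    ".gff3.gz": "GFF version 3",
}


def describe_format(filename: str) -> Optional[str]:
    """Return the format for a known extension, else None."""
    for suffix, description in FORMAT_BY_SUFFIX.items():
        if filename.endswith(suffix):
            return description
    return None
```

Extensions that are not in the list above, such as .agp.gz, return
None.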
 
 
=====
Notes
=====
* The annotations in the *.gbk and *.gbs files currently include genes,
conserved protein domains, as well as microRNAs, defined by sequence
obtained from miRBase (Griffiths-Jones S, Grocock RJ, van Dongen S,
Bateman A, Enright AJ. 2006. Nucleic Acids Res. 34:D140-D144) and
placed by Splign (Kapustin Y, Souvorov A, Tatusova T and Lipman D.
2008. Biology Direct 3:20), and tRNA features annotated by
tRNAscan-SE (Lowe TM, Eddy SR. 1997. Nucleic Acids Res. 25:955-64).
 
* Variation data from the most recent dbSNP build can be obtained from
the dbSNP FTP site:
 ftp://ftp.ncbi.nih.gov/snp/
 
* Gene symbols in this directory are not updated with every update
to Entrez Gene. Suggestions for how to convert a set of GeneIDs into
current symbols and names are provided in this FAQ from Entrez Gene:
 
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genefaq.html#faq_g4
 
######################################################################