######################################################################

README_ASSEMBLIES from ftp://ftp.ncbi.nih.gov/genbank/genomes

Updated: December 27, 2012

######################################################################

========
Outline
========

1. Introduction
2. Organization of directories from taxonomic group to assembly
3. Organization of data under the assembly directory
4. File names, contents and formats
5. Definitions

===============
1. Introduction
===============

The FTP directory ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes
contains eukaryote genome assemblies that have been released after 
having been submitted to GenBank. The files provided include sequences
for chromosomes and scaffolds in FASTA format, and AGP format files 
that describe how the chromosomes and scaffolds were assembled from 
the component sequences.


===============================================================
2. Organization of directories from taxonomic group to assembly
===============================================================

There are six directories beneath the Eukaryotes directory, each named
for a broad taxonomic group:
Eukaryotes/fungi
Eukaryotes/invertebrates
Eukaryotes/plants
Eukaryotes/protozoa
Eukaryotes/vertebrates_mammals
Eukaryotes/vertebrates_other

The next directory level will be species, named by scientific name. 
e.g. vertebrates_mammals/Bos_taurus
     vertebrates_mammals/Homo_sapiens

The directory level below species will be assembly. Names for assembly
directories will be based on the submitter's assembly name with any 
spaces converted to underscores. 
e.g. vertebrates_mammals/Bos_taurus/Btau_3.1
     vertebrates_mammals/Bos_taurus/Btau_4.0
     vertebrates_mammals/Homo_sapiens/Celera
     vertebrates_mammals/Homo_sapiens/CRA_TCAGchr7v2
     vertebrates_mammals/Homo_sapiens/GRCh37
     vertebrates_mammals/Homo_sapiens/HuRef
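
As a rough illustration of the naming rules above, the following
Python sketch builds the relative path of an assembly directory. The
function name assembly_dir is hypothetical. Replacing spaces in the
species name with underscores is inferred from the examples (e.g.
Homo_sapiens) and is not stated explicitly in this README.

```python
def assembly_dir(group, species, assembly_name):
    """Build the path below Eukaryotes/ for one assembly (sketch)."""
    # Assumption: species directories use the scientific name with
    # underscores in place of spaces, as in the examples above.
    species_dir = species.strip().replace(" ", "_")
    # Stated rule: the submitter's assembly name with any spaces
    # converted to underscores.
    assembly = assembly_name.strip().replace(" ", "_")
    return "/".join([group, species_dir, assembly])


print(assembly_dir("vertebrates_mammals", "Bos taurus", "Btau_4.0"))
# prints vertebrates_mammals/Bos_taurus/Btau_4.0
```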


====================================================
3. Organization of data under the assembly directory
====================================================

Each assembly directory will contain an ASSEMBLY_INFO file and one or 
more assembly-unit directories. Many assemblies consist of a single 
assembly-unit, the Primary Assembly; other assemblies may consist of
multiple assembly-units. For example, the human GRCh37 assembly 
contains ten assembly-units, and the assembly directory content is:
vertebrates_mammals/Homo_sapiens/GRCh37/
     ASSEMBLY_INFO
     RepeatMasker.run
     windowmasker_nmer.oascii.gz
     genomic_regions_definitions.txt
     Primary_Assembly/
     ALT_REF_LOCI_1/
     ALT_REF_LOCI_2/
     ALT_REF_LOCI_3/
     ALT_REF_LOCI_4/
     ALT_REF_LOCI_5/
     ALT_REF_LOCI_6/
     ALT_REF_LOCI_7/
     ALT_REF_LOCI_8/
     ALT_REF_LOCI_9/

Each assembly-unit directory contains the following files:
     component_localID2acc
     scaffold_localID2acc
     join_certificate.xml (only present for some assemblies from the 
                           Genome Reference Consortium)

Each assembly-unit directory will also contain one or more of the 
following directories (depending on the particular assembly):
     assembled_chromosomes/
     placed_scaffolds/
     unlocalized_scaffolds/
     unplaced_scaffolds/
     alt_scaffolds/ (only in alternate loci and patch assembly-units)
     pseudoautosomal_region/ (only for mammals)

The content of the assembled_chromosomes, placed_scaffolds,
unlocalized_scaffolds, unplaced_scaffolds, alt_scaffolds and
pseudoautosomal_region directories is:

assembled_chromosomes/
     chr2acc
     FASTA/
          chr?.fa.gz
          chr?.rm.out.gz
     AGP/
          chr?.comp.agp.gz
          chr?.agp.gz

placed_scaffolds/
     FASTA/
          chr?.placed.scaf.fa.gz
          chr?.placed.scaf.rm.out.gz
     AGP/
          chr?.placed.scaf.agp.gz

unlocalized_scaffolds/
     unlocalized.chr2scaf
     FASTA/
          chr?.unlocalized.scaf.fa.gz
          chr?.unlocalized.scaf.rm.out.gz
     AGP/
          chr?.unlocalized.scaf.agp.gz

unplaced_scaffolds/
     FASTA/
          unplaced.scaf.fa.gz
          unplaced.scaf.rm.out.gz
     AGP/
          unplaced.scaf.agp.gz

alt_scaffolds/
     FASTA/
          alt.scaf.fa.gz
          alt.scaf.rm.out.gz
     AGP/
          alt.scaf.agp.gz
     alt_scaffold_placement.txt
     alignments/
          {scaffold accession.version}_{chromosome accession.version}.asn
          {scaffold accession.version}_{chromosome accession.version}.gff

pseudoautosomal_region/
     par.txt
     par_align.asn
     par_align.gff

Notes
-----
1. The sequences of the placed scaffolds are redundant with the
sequences of the assembled chromosomes. The placed scaffolds are 
provided for users who prefer to work with scaffolds rather than with
chromosomes.

2. Eukaryote genome assemblies may include an assembly-unit named
"non-nuclear" which contains data from organelle genomes, for example
the mitochondrion or chloroplast.
 
3. If the assembly consists of more than one assembly-unit, the
names for the assembly-units, other than a "non-nuclear"
assembly-unit, are supplied by the submitter.

4. The chromosome-from-scaffold AGP file (chr?.agp.gz), and the
placed_scaffolds directory, may be omitted if the chromosome is
assembled directly from components, or if the chromosome is a complete
sequence with no gaps.


===================================
4. File names, contents and formats
===================================

------------------
File name prefixes
------------------
The prefix chr? indicates one file for each chromosome or linkage 
group.
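
The sketch below shows one way to expand the chr? prefix into the
per-chromosome file paths listed in section 3. It assumes that "?" is
replaced by the chromosome or linkage group name (for example chrX);
the function name chromosome_files is hypothetical. Some of these
files may be absent for a given assembly (see Note 4 in section 3).

```python
def chromosome_files(chrom):
    """Return the per-chromosome paths within an assembly-unit (sketch)."""
    # Assumption: '?' in 'chr?' stands for the chromosome name.
    prefix = "chr" + str(chrom)
    return [
        f"assembled_chromosomes/FASTA/{prefix}.fa.gz",
        f"assembled_chromosomes/FASTA/{prefix}.rm.out.gz",
        f"assembled_chromosomes/AGP/{prefix}.comp.agp.gz",
        f"assembled_chromosomes/AGP/{prefix}.agp.gz",
        f"placed_scaffolds/FASTA/{prefix}.placed.scaf.fa.gz",
        f"placed_scaffolds/FASTA/{prefix}.placed.scaf.rm.out.gz",
        f"placed_scaffolds/AGP/{prefix}.placed.scaf.agp.gz",
        f"unlocalized_scaffolds/FASTA/{prefix}.unlocalized.scaf.fa.gz",
        f"unlocalized_scaffolds/FASTA/{prefix}.unlocalized.scaf.rm.out.gz",
        f"unlocalized_scaffolds/AGP/{prefix}.unlocalized.scaf.agp.gz",
    ]


for path in chromosome_files("X"):
    print(path)
```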

------------------
File name suffixes
------------------
.gz         - file compressed with the Unix gzip program
.fa.gz      - sequence in FASTA format

.rm.out.gz  - repeat coordinates in RepeatMasker .out format 
              Smit, AFA, Hubley, R & Green, P. 
              RepeatMasker Open-3.0.
              1996-2004 <http://www.repeatmasker.org>.

.agp.gz     - AGP files (for format specification see 
              http://www.ncbi.nlm.nih.gov/genome/assembly/agp/
              AGP_Specification.shtml)

----------------------------------
Files containing genomic sequences
----------------------------------
FILENAME                         CONTENT
chr?.fa.gz                       chromosome sequence
chr?.placed.scaf.fa.gz           placed scaffold sequences
chr?.unlocalized.scaf.fa.gz      unlocalized scaffold sequences
unplaced.scaf.fa.gz              unplaced scaffold sequences
alt.scaf.fa.gz                   alternate loci or patch scaffold 
                                 sequences
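
The sequence files can be read without decompressing them first. The
following is a minimal sketch using only the Python standard library;
the function name read_fasta_gz and the file name in the usage comment
are illustrative, and the sketch does not interpret the header line.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage (hypothetical local copy of a downloaded file):
# for header, seq in read_fasta_gz("unplaced.scaf.fa.gz"):
#     print(header, len(seq))
```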

---------
AGP files
---------
The AGP files in this directory tree use GenBank accession.versions as
the identifiers for components, scaffolds, and chromosomes.

FILENAME                         CONTENT
chr?.comp.agp.gz                 chromosome from component AGP
chr?.agp.gz                      chromosome from scaffold AGP
chr?.placed.scaf.agp.gz          placed scaffold from component AGP
chr?.unlocalized.scaf.agp.gz     unlocalized scaffold from component AGP
unplaced.scaf.agp.gz             unplaced scaffold from component AGP
alt.scaf.agp.gz                  alternate loci or patch scaffold from
                                 component AGP
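
As a hedged example of working with these files, the sketch below
lists the component accession.versions named in one AGP file. It
relies on details taken from the AGP specification referenced above
rather than from this README: lines are tab-delimited, comment lines
start with "#", column 5 is the component type, gap lines have type N
or U, and column 6 of a non-gap line holds the component identifier.
The function name agp_component_ids is hypothetical.

```python
import gzip

# Component types that mark gap lines (from the AGP specification).
GAP_TYPES = {"N", "U"}


def agp_component_ids(path):
    """Return component identifiers from a gzipped AGP file, in file order."""
    ids = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and blank lines
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 6 or fields[4] in GAP_TYPES:
                continue  # skip malformed lines and gap lines
            ids.append(fields[5])
    return ids
```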

------------------------------------
RepeatMasker repeat coordinate files
------------------------------------
FILENAME                         CONTENT
chr?.rm.out.gz                   repeats in chromosomes
chr?.placed.scaf.rm.out.gz       repeats in placed scaffolds
chr?.unlocalized.scaf.rm.out.gz  repeats in unlocalized scaffolds
unplaced.scaf.rm.out.gz          repeats in unplaced scaffolds
alt.scaf.rm.out.gz               repeats in alternate loci or patch
                                 scaffolds

-----------
Other files
-----------
1. ASSEMBLY_INFO
The ASSEMBLY_INFO file contains assembly metadata.
The file structure is as in this example from the GRCh37 assembly.

DATE:<tab>24-FEB-2009 (date assembly was submitted)
ORGANISM:<tab>Homo sapiens
TAXID:<tab>9606
ASSEMBLY LONG NAME:<tab>Genome Reference Consortium Human Reference 37
ASSEMBLY SHORT NAME:<tab>GRCh37
ASSEMBLY SUBMITTER:<tab>Genome Reference Consortium
ASSEMBLY TYPE:<tab>Haploid + alternate loci
NUMBER OF ASSEMBLY-UNITS:<tab>10
Assembly Accession:<tab>GCA_000001405.1
##Below is a 2 column list with assembly-unit id and name.
##The Primary Assembly unit is listed first.
GCA_000001305.1<tab>Primary Assembly
GCA_000001315.1<tab>ALT_REF_LOCI_1
GCA_000001325.1<tab>ALT_REF_LOCI_2
GCA_000001335.1<tab>ALT_REF_LOCI_3
GCA_000001345.1<tab>ALT_REF_LOCI_4
GCA_000001355.1<tab>ALT_REF_LOCI_5
GCA_000001365.1<tab>ALT_REF_LOCI_6
GCA_000001375.1<tab>ALT_REF_LOCI_7
GCA_000001385.1<tab>ALT_REF_LOCI_8
GCA_000001395.1<tab>ALT_REF_LOCI_9
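
A file with this structure could be parsed along the following lines.
This is a sketch only: it assumes that <tab> in the example stands for
a literal tab character, that metadata keys always end with a colon,
and that lines beginning with "#" are comments. The function name
parse_assembly_info is hypothetical.

```python
def parse_assembly_info(text):
    """Split ASSEMBLY_INFO text into a metadata dict and a unit list."""
    meta, units = {}, []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '##' comment lines
        key, _, value = line.partition("\t")
        if key.endswith(":"):
            # Metadata line such as 'TAXID:<tab>9606'.
            meta[key[:-1]] = value.strip()
        else:
            # Assembly-unit line: id, then name.
            units.append((key, value.strip()))
    return meta, units


# Usage (hypothetical local copy of the file):
# with open("ASSEMBLY_INFO") as handle:
#     meta, units = parse_assembly_info(handle.read())
```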

2. component_localID2acc
A two column file associating the submitter component ID with the 
accession.version. 'na' is shown in the ID column if the submitter
did not provide a name for the component.

3. scaffold_localID2acc
A two column file associating the submitter scaffold ID with the 
accession.version. (Named localID2acc in some older directories.)
'na' is shown in the ID column if the submitter did not provide a 
name for the scaffold.

4. chr2acc
A two column file associating the chromosome, or linkage group name,
with the accession.version.

5. unlocalized.chr2scaf
A two column file giving the chromosome or linkage group assignment 
for each unlocalized scaffold.
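
The two column files in items 2 to 5 can be read with one small
helper, sketched below. It assumes the columns are separated by a tab
and that any line starting with "#" is a header; neither detail is
stated for these files in this README. The function name
read_two_column is hypothetical. A first-column value of 'na' is
returned as None.

```python
def read_two_column(lines):
    """Return (first, second) column pairs; 'na' in column 1 becomes None."""
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a header line
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a two column row
        first, second = fields[0], fields[1]
        pairs.append((None if first == "na" else first, second))
    return pairs


# Usage (hypothetical local copy of the file):
# with open("chr2acc") as handle:
#     chrom_to_acc = dict(read_two_column(handle))
```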

6. join_certificate.xml
This file provides data on joins in the assembly that were curated by
the Genome Reference Consortium (GRC). This file will not be present 
for assemblies submitted by other groups.

7. alt_scaffold_placement.txt
A file associating alternate loci or patch scaffolds with the 
corresponding primary assembly chromosome, providing the location on 
the chromosome, the genomic region name, and the length of any
unaligned tails.
The file is tab delimited (including a #header) with the following 
columns:
col 1: alt_asm_name: name of the assembly-unit that includes the 
       alternate scaffold
col 2: prim_asm_name: name of the primary assembly-unit on which the
       alternate scaffold is being placed
col 3: alt_scaf_name: name of the alternate scaffold being placed
col 4: alt_scaf_acc: accession.version of the alternate scaffold being
       placed
col 5: parent_type: type of object on which the alternate scaffold is
       being placed, either CHROMOSOME or SCAFFOLD
col 6: parent_name: name of the object on which the alternate scaffold
       is being placed (can be either a chromosome or a scaffold)
col 7: parent_acc: accession.version of the sequence on which the
       alternate scaffold is being aligned
col 8: region_name: name of the genomic region on the parent within 
       which the alternate scaffold is placed
col 9: ori: orientation of the alignment, '+', '-' or 'b' (mixed)
col10: alt_scaf_start: start of the placement on the alternate 
       scaffold (in 1 base coordinates)
col11: alt_scaf_stop: end of the placement on the alternate scaffold
       (in 1 base coordinates)
col12: parent_start: start of the placement on the parent sequence
       (in 1 base coordinates)
col13: parent_stop: end of the placement on the parent sequence 
       (in 1 base coordinates)
col14: alt_start_tail: number of bases at the start of the alternate 
       scaffold not involved in the placement
col15: alt_stop_tail: number of bases at the end of the alternate 
       scaffold not involved in the placement

Note: Every alternate scaffold associated with the assembly-unit will
be listed in this file. Any alternate scaffold that has no placement
will have 'na' in columns 5 to 15. Any alternate scaffold that has a 
chromosome assignment, but no alignment, would have the chromosome 
name in column 6 and 'na' in columns 7 to 15.
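
The column layout above can be turned into a small parser, sketched
here. Column names are taken from the list above; the function names
parse_placements and unplaced are hypothetical. The sketch converts
'na' to None and the coordinate and tail columns to integers.

```python
COLUMNS = [
    "alt_asm_name", "prim_asm_name", "alt_scaf_name", "alt_scaf_acc",
    "parent_type", "parent_name", "parent_acc", "region_name", "ori",
    "alt_scaf_start", "alt_scaf_stop", "parent_start", "parent_stop",
    "alt_start_tail", "alt_stop_tail",
]
INT_COLUMNS = set(COLUMNS[9:])  # columns 10 to 15 hold numbers


def parse_placements(lines):
    """Return one dict per data row of alt_scaffold_placement.txt."""
    rows = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip the #header and blank lines
        row = {}
        for name, value in zip(COLUMNS, line.split("\t")):
            if value == "na":
                row[name] = None
            elif name in INT_COLUMNS:
                row[name] = int(value)
            else:
                row[name] = value
        rows.append(row)
    return rows


def unplaced(rows):
    """Accessions of alternate scaffolds with 'na' in column 5."""
    return [r["alt_scaf_acc"] for r in rows if r.get("parent_type") is None]
```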

8. alignments/{scaffold accession.version}_{chromosome accession.version}.asn
Files providing alignments of the alternate loci or patch scaffolds to
the corresponding primary assembly chromosome, in ASN.1 format. These
alignments indicate how the alternate loci and patch scaffold
sequences differ from the chromosomes of the primary assembly.
[Note: some older files do not have versions in the file names.]

9. alignments/{scaffold accession.version}_{chromosome accession.version}.gff
Files providing alignments of the alternate loci or patch scaffolds to
the corresponding primary assembly chromosome, in CIGAR format 
embedded within a GFF format file. These alignments indicate how the
alternate loci and patch scaffold sequences differ from the 
chromosomes of the primary assembly.
[Note: some older files do not have versions in the file names.]

10. RepeatMasker.run
A file providing details on which version of RepeatMasker, and which
command line parameters, were used to generate the repeat data.

11. windowmasker_nmer.oascii.gz
A file of Nmer counts generated by running the first phase of
WindowMasker (Morgulis A, Gertz EM, Schaffer AA, Agarwala
R. 2006. Bioinformatics 22:134-41) on the genomic sequences of the
Primary Assembly. These counts can be used as input for the second
phase of WindowMasker to mask any nucleotide sequence for the
genome. N and other default parameter settings are computed within
WindowMasker depending on the input genome sequence.  The
windowmasker_nmer.oascii.gz file is in WindowMasker optimized ASCII
format and is not human readable. Alternate human readable formats are
supported and can be generated by running WindowMasker.

12. genomic_regions_definitions.txt
A file defining the regions on the primary assembly for which 
alternate loci or patch scaffolds are available.
The file is tab delimited (including a #header) with the following 
columns:
col 1: region_name: name for the genomic region
col 2: chromosome: accession.version for the chromosome or 
       unlocalized/unplaced scaffold
col 3: start: the starting position on the chromosome or scaffold
       (in 1 base coordinates)
col 4: stop: the ending position on the chromosome or scaffold
       (in 1 base coordinates)
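
One possible use of this file is to look up which genomic regions
cover a given position, as in the sketch below. It assumes that start
and stop are both inclusive, which this README does not state
explicitly. The function names parse_regions and regions_at are
hypothetical.

```python
def parse_regions(lines):
    """Return (region_name, accession, start, stop) tuples."""
    regions = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip the #header and blank lines
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a complete row
        name, acc, start, stop = fields[:4]
        regions.append((name, acc, int(start), int(stop)))
    return regions


def regions_at(regions, acc, pos):
    """Names of regions on sequence acc that contain 1-based position pos."""
    # Assumption: start and stop are both inclusive.
    return [n for n, a, s, e in regions if a == acc and s <= pos <= e]
```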

13. patch_type
A file providing the patch type for each of the scaffolds in a patch
assembly-unit.
The file is tab delimited (including a #header) with the following 
columns:
col 1: alt_scaf_name: local name for the patch scaffold
col 2: alt_scaf_acc: the accession.version for the patch scaffold
col 3: patch_type: FIX or NOVEL (defined below)

14. par.txt
A file defining the pseudoautosomal regions (PARs) when the sequences
of the sex chromosomes in a mammalian genome assembly are known to
include the pseudoautosomal regions.
The file is tab delimited (including a #header) with the following 
columns:
col 1: Chr: chromosome name
col 2: PAR-Name: name of the PAR region
col 3: Start: the starting position of the PAR on the chromosome
       (in 1 base coordinates)
col 4: Stop: the ending position of the PAR on the chromosome
       (in 1 base coordinates)

15. par_align.asn
A file providing alignments between each pseudoautosomal region (PAR) 
on the X chromosome and the corresponding PAR on the Y chromosome, in 
ASN.1 format. 

16. par_align.gff
A file providing alignments between each pseudoautosomal region (PAR) 
on the X chromosome and the corresponding PAR on the Y chromosome, in 
CIGAR format embedded within a GFF format file.

17. alt_locus_scaf2primary.pos
[Deprecated. Replaced by alt_scaffold_placement.txt]
A file associating alternate loci scaffolds with the corresponding
primary assembly chromosome, providing the location on the chromosome
and the confidence of the placement.
The file is tab delimited (including a #header) with the following 
columns:
col 1: alt_loci_name: local scaffold name for the alternate locus
col 2: alt_loci_acc: the accession.version for the alternate locus
col 3: chr_name: chromosome to which the alternate locus is aligned
col 4: chrom_start: the starting position on the chromosome
       (in 1 base coordinates)
col 5: is_fuzzy: is the chrom_start defined by an alignment or an 
       estimate
col 6: chrom_end: the ending position on the chromosome
       (in 1 base coordinates)
col 7: is_fuzzy: is the chrom_end defined by an alignment or an
       estimate
col 8: alt_loci_start: start of the alignment on the alternate locus
       scaffold
col 9: alt_loci_end: end of the alignment on the alternate locus
       scaffold


==============
5. Definitions
==============

Assembly:
A set of chromosome assemblies, unlocalized and unplaced sequences and
alternate loci used to represent an organism's genome. Most current 
assemblies are a haploid representation of an organism's genome, 
although some loci may be represented more than once (see Alternate 
locus, below). This representation may be obtained from a single 
individual (e.g. chimp or mouse) or multiple individuals (e.g. human 
Genome Reference Consortium assembly). Except in the case of organisms
which have been bred to homozygosity, the haploid assembly does not 
typically represent a single haplotype, but rather a mixture of 
haplotypes.

Chromosome Assembly: 
A relatively complete pseudo-molecule assembled from smaller sequences
(components) that represent a biological chromosome. Relatively 
complete implies that some gaps may still be present in the assembly, 
but independent measures suggest that most of the sequence is 
represented by sequenced bases. Completeness is submitter defined.

Unlocalized sequence:
A sequence found in an assembly that is associated with a specific 
chromosome but cannot be ordered or oriented on that chromosome. 

Unplaced sequence:
A sequence found in an assembly that is not associated with any 
chromosome.  

Primary assembly:
An assembly-unit representing the collection of assembled chromosomes,
unlocalized and unplaced sequences that, when combined, should 
represent a non-redundant haploid genome. This excludes any alternate
loci.

Alternate locus:
A sequence that provides an alternate representation of a locus found
in the primary assembly. These sequences do not represent a complete 
chromosome sequence although there is no hard limit on the size of the
alternate locus; currently these are less than 1 Mb.

Alternate locus group:
An assembly-unit consisting of scaffolds from different loci that are
considered to be part of the same haplotype (e.g. mouse 129/Sv group).

Genomic region:
A defined span on the primary assembly for which alternate loci or
patch scaffolds are available. Genomic regions may be named after a
gene or gene cluster, or may be given arbitrary region numbers.

Major release:
The formal release of a genome assembly, e.g. GRCh37.

Minor release:
A release of a genome assembly including patches that occurs between
major releases.

Genome Patch:
A sequence contig/scaffold that corrects sequence in a major release
of the genome, or adds sequence to it.

FIX patch:
A patch that corrects sequence or reduces an assembly gap in a given
major release. FIX patch sequences are meant to be incorporated into
the primary or existing alt-loci assembly units at the next major
release, and their accessions will then be deprecated.

NOVEL patch:
A patch that adds sequence to a major release. Typically, NOVEL patch
sequences are meant to be incorporated into the assembly as new
alternate loci at the next major release, and their accessions will
not be deprecated.

######################################################################