######################################################################
README_ASSEMBLIES from ftp://ftp.ncbi.nih.gov/genbank/genomes

Updated: December 27, 2012
######################################################################

========
Outline
========
1. Introduction
2. Organization of directories from taxonomic group to assembly
3. Organization of data under the assembly directory
4. File names, contents and formats
5. Definitions

===============
1. Introduction
===============
The FTP directory ftp://ftp.ncbi.nih.gov/genbank/genomes/Eukaryotes
contains eukaryote genome assemblies that have been released after
having been submitted to GenBank. The files provided include sequences
for chromosomes and scaffolds in FASTA format, and AGP format files
that describe how the chromosomes and scaffolds were assembled from
the component sequences.

===============================================================
2. Organization of directories from taxonomic group to assembly
===============================================================
There are six directories beneath the Eukaryotes directory, each named
for a broad taxonomic group:
    Eukaryotes/fungi
    Eukaryotes/invertebrates
    Eukaryotes/plants
    Eukaryotes/protozoa
    Eukaryotes/vertebrates_mammals
    Eukaryotes/vertebrates_other

The next directory level is species, named by scientific name.
e.g.
    vertebrates_mammals/Bos_taurus
    vertebrates_mammals/Homo_sapiens

The directory level below species is assembly. Names for assembly
directories are based on the submitter's assembly name, with any
spaces converted to underscores.
e.g.
    vertebrates_mammals/Bos_taurus/Btau_3.1
    vertebrates_mammals/Bos_taurus/Btau_4.0
    vertebrates_mammals/Homo_sapiens/Celera
    vertebrates_mammals/Homo_sapiens/CRA_TCAGchr7v2
    vertebrates_mammals/Homo_sapiens/GRCh37
    vertebrates_mammals/Homo_sapiens/HuRef

====================================================
3. Organization of data under the assembly directory
====================================================
Each assembly directory contains an ASSEMBLY_INFO file and one or more
assembly-unit directories. Many assemblies consist of a single
assembly-unit, the Primary Assembly; other assemblies may be composed
of multiple assembly-units. For example, the human GRCh37 assembly
contains ten assembly-units, and the assembly directory content is:

    vertebrates_mammals/Homo_sapiens/GRCh37/
        ASSEMBLY_INFO
        RepeatMasker.run
        windowmasker_nmer.oascii.gz
        genomic_regions_definitions.txt
        Primary_Assembly/
        ALT_REF_LOCI_1/
        ALT_REF_LOCI_2/
        ALT_REF_LOCI_3/
        ALT_REF_LOCI_4/
        ALT_REF_LOCI_5/
        ALT_REF_LOCI_6/
        ALT_REF_LOCI_7/
        ALT_REF_LOCI_8/
        ALT_REF_LOCI_9/

Each assembly-unit directory contains the following files:
    component_localID2acc
    scaffold_localID2acc
    join_certificate.xml (only present for some assemblies from the
                          Genome Reference Consortium)

Each assembly-unit directory also contains one or more of the
following directories (depending on the particular assembly):
    assembled_chromosomes/
    placed_scaffolds/
    unlocalized_scaffolds/
    unplaced_scaffolds/
    alt_scaffolds/ (only in alternate loci and patch assembly-units)
    pseudoautosomal_region/ (only for mammals)

The content of the assembled_chromosomes, placed_scaffolds,
unlocalized_scaffolds, unplaced_scaffolds, alt_scaffolds and
pseudoautosomal_region directories is:

    assembled_chromosomes/
        chr2acc
        FASTA/
            chr?.fa.gz
            chr?.rm.out.gz
        AGP/
            chr?.comp.agp.gz
            chr?.agp.gz

    placed_scaffolds/
        FASTA/
            chr?.placed.scaf.fa.gz
            chr?.placed.scaf.rm.out.gz
        AGP/
            chr?.placed.scaf.agp.gz

    unlocalized_scaffolds/
        unlocalized.chr2scaf
        FASTA/
            chr?.unlocalized.scaf.fa.gz
            chr?.unlocalized.scaf.rm.out.gz
        AGP/
            chr?.unlocalized.scaf.agp.gz

    unplaced_scaffolds/
        FASTA/
            unplaced.scaf.fa.gz
            unplaced.scaf.rm.out.gz
        AGP/
            unplaced.scaf.agp.gz

    alt_scaffolds/
        FASTA/
            alt.scaf.fa.gz
            alt.scaf.rm.out.gz
        AGP/
            alt.scaf.agp.gz
        alt_scaffold_placement.txt
        alignments/
            {scaffold accession.version}_{chromosome accession.version}.asn
            {scaffold accession.version}_{chromosome accession.version}.gff

    pseudoautosomal_region/
        par.txt
        par_align.asn
        par_align.gff

Notes
-----
1. The sequences of the placed scaffolds are redundant with the
   sequences of the assembled chromosomes. The placed scaffolds are
   provided for users who prefer to work with scaffolds rather than
   with chromosomes.
2. Eukaryote genome assemblies may include an assembly-unit named
   "non-nuclear" which contains data from organelle genomes, for
   example the mitochondrion or chloroplast.
3. If the assembly is composed of more than one assembly-unit, the
   names for the assembly-units, other than a "non-nuclear"
   assembly-unit, are supplied by the submitter.
4. The chromosome-from-scaffold AGP file (chr?.agp.gz), and the
   placed_scaffolds directory, may be omitted if the chromosome is
   assembled directly from components, or if the chromosome is a
   complete sequence with no gaps.

===================================
4. File names, contents and formats
===================================

------------------
File name prefixes
------------------
The prefix chr? indicates one file for each chromosome or linkage
group.

------------------
File name suffixes
------------------
.gz         - file compressed with the unix gzip program
.fa.gz      - sequence in FASTA format
.rm.out.gz  - repeat coordinates in RepeatMasker .out format
              Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-3.0.
              1996-2004.
.agp.gz     - AGP files (for format specification see
              http://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml)

----------------------------------
Files containing genomic sequences
----------------------------------
FILENAME                          CONTENT
chr?.fa.gz                        chromosome sequence
chr?.placed.scaf.fa.gz            placed scaffold sequences
chr?.unlocalized.scaf.fa.gz       unlocalized scaffold sequences
unplaced.scaf.fa.gz               unplaced scaffold sequences
alt.scaf.fa.gz                    alternate loci or patch scaffold
                                  sequences

---------
AGP files
---------
The AGP files in this directory tree use GenBank accession.versions as
the identifiers for components, scaffolds, and chromosomes.

FILENAME                          CONTENT
chr?.comp.agp.gz                  chromosome from component AGP
chr?.agp.gz                       chromosome from scaffold AGP
chr?.placed.scaf.agp.gz           placed scaffold from component AGP
chr?.unlocalized.scaf.agp.gz      unlocalized scaffold from component
                                  AGP
unplaced.scaf.agp.gz              unplaced scaffold from component AGP
alt.scaf.agp.gz                   alternate loci or patch scaffold from
                                  component AGP

------------------------------------
RepeatMasker repeat coordinate files
------------------------------------
FILENAME                          CONTENT
chr?.rm.out.gz                    repeats in chromosomes
chr?.placed.scaf.rm.out.gz        repeats in placed scaffolds
chr?.unlocalized.scaf.rm.out.gz   repeats in unlocalized scaffolds
unplaced.scaf.rm.out.gz           repeats in unplaced scaffolds
alt.scaf.rm.out.gz                repeats in alternate loci or patch
                                  scaffolds

-----------
Other files
-----------
1. ASSEMBLY_INFO
   The ASSEMBLY_INFO file contains assembly metadata. The file
   structure is as in this example from the GRCh37 assembly.

   DATE:24-FEB-2009 (date assembly was submitted)
   ORGANISM:Homo sapiens
   TAXID:9606
   ASSEMBLY LONG NAME:Genome Reference Consortium Human Reference 37
   ASSEMBLY SHORT NAME:GRCh37
   ASSEMBLY SUBMITTER:Genome Reference Consortium
   ASSEMBLY TYPE:Haploid + alternate loci
   NUMBER OF ASSEMBLY-UNITS:10
   Assembly Accession:GCA_000001405.1

   ##Below is a 2 column list with assembly-unit id and name.
   ##The Primary Assembly unit is listed first.
   GCA_000001305.1    Primary Assembly
   GCA_000001315.1    ALT_REF_LOCI_1
   GCA_000001325.1    ALT_REF_LOCI_2
   GCA_000001335.1    ALT_REF_LOCI_3
   GCA_000001345.1    ALT_REF_LOCI_4
   GCA_000001355.1    ALT_REF_LOCI_5
   GCA_000001365.1    ALT_REF_LOCI_6
   GCA_000001375.1    ALT_REF_LOCI_7
   GCA_000001385.1    ALT_REF_LOCI_8
   GCA_000001395.1    ALT_REF_LOCI_9

2. component_localID2acc
   A two column file associating the submitter component ID with the
   accession.version. 'na' is shown in the ID column if the submitter
   did not provide a name for the component.

3. scaffold_localID2acc
   A two column file associating the submitter scaffold ID with the
   accession.version. (Named localID2acc in some older directories.)
   'na' is shown in the ID column if the submitter did not provide a
   name for the scaffold.

4. chr2acc
   A two column file associating the chromosome, or linkage group
   name, with the accession.version.

5. unlocalized.chr2scaf
   A two column file giving the chromosome or linkage group assignment
   for each unlocalized scaffold.

6. join_certificate.xml
   This file provides data on joins in the assembly that were curated
   by the Genome Reference Consortium (GRC). This file will not be
   present for assemblies submitted by other groups.

7. alt_scaffold_placement.txt
   A file associating alternate loci or patch scaffolds with the
   corresponding primary assembly chromosome, providing the location
   on the chromosome, the genomic region name, and the length of any
   unaligned tails.
   The file is tab delimited (including a #header) with the following
   columns:
   col 1: alt_asm_name: name of the assembly-unit that includes the
          alternate scaffold
   col 2: prim_asm_name: name of the primary assembly-unit on which
          the alternate scaffold is being placed
   col 3: alt_scaf_name: name of the alternate scaffold being placed
   col 4: alt_scaf_acc: accession.version of the alternate scaffold
          being placed
   col 5: parent_type: type of object on which the alternate scaffold
          is being placed, either CHROMOSOME or SCAFFOLD
   col 6: parent_name: name of the object on which the alternate
          scaffold is being placed (can be either a chromosome or a
          scaffold)
   col 7: parent_acc: accession.version of the sequence on which the
          alternate scaffold is being aligned
   col 8: region_name: name of the genomic region on the parent within
          which the alternate scaffold is placed
   col 9: ori: orientation of the alignment, '+', '-' or 'b' (mixed)
   col10: alt_scaf_start: start of the placement on the alternate
          scaffold (in 1-based coordinates)
   col11: alt_scaf_stop: end of the placement on the alternate
          scaffold (in 1-based coordinates)
   col12: parent_start: start of the placement on the parent sequence
          (in 1-based coordinates)
   col13: parent_stop: end of the placement on the parent sequence
          (in 1-based coordinates)
   col14: alt_start_tail: number of bases at the start of the
          alternate scaffold not involved in the placement
   col15: alt_stop_tail: number of bases at the end of the alternate
          scaffold not involved in the placement

   Note: Every alternate scaffold associated with the assembly-unit is
   listed in this file. Any alternate scaffold that has no placement
   will have 'na' in columns 5 to 15. Any alternate scaffold that has
   a chromosome assignment, but no alignment, will have the chromosome
   name in column 6 and 'na' in columns 7 to 15.

8. alignments/{scaffold accession.version}_{chromosome accession.version}.asn
   Files providing alignments of the alternate loci or patch scaffolds
   to the corresponding primary assembly chromosome, in ASN.1 format.
   These alignments indicate how the alternate loci and patch scaffold
   sequences differ from the chromosomes of the primary assembly.
   [Note: some older files do not have versions in the file names.]

9. alignments/{scaffold accession.version}_{chromosome accession.version}.gff
   Files providing alignments of the alternate loci or patch scaffolds
   to the corresponding primary assembly chromosome, in CIGAR format
   embedded within a GFF format file. These alignments indicate how
   the alternate loci and patch scaffold sequences differ from the
   chromosomes of the primary assembly.
   [Note: some older files do not have versions in the file names.]

10. RepeatMasker.run
   A file providing details on which version of RepeatMasker, and
   which command line parameters, were used to generate the repeat
   data.

11. windowmasker_nmer.oascii.gz
   A file of Nmer counts generated by running the first phase of
   WindowMasker (Morgulis A, Gertz EM, Schaffer AA, Agarwala R. 2006.
   Bioinformatics 22:134-41) on the genomic sequences of the Primary
   Assembly. These counts can be used as input for the second phase of
   WindowMasker to mask any nucleotide sequence for the genome. N and
   other default parameter settings are computed within WindowMasker
   depending on the input genome sequence. The
   windowmasker_nmer.oascii.gz file is in WindowMasker optimized ASCII
   format and is not human readable. Alternate human readable formats
   are supported and can be generated by running WindowMasker.

12. genomic_regions_definitions.txt
   A file defining the regions on the primary assembly for which
   alternate loci or patch scaffolds are available.
   The file is tab delimited (including a #header) with the following
   columns:
   col 1: region_name: name for the genomic region
   col 2: chromosome: accession.version for the chromosome or
          unlocalized/unplaced scaffold
   col 3: start: the starting position on the chromosome or scaffold
          (in 1-based coordinates)
   col 4: stop: the ending position on the chromosome or scaffold
          (in 1-based coordinates)

13. patch_type
   A file providing the patch type for each of the scaffolds in a
   patch assembly-unit.
   The file is tab delimited (including a #header) with the following
   columns:
   col 1: alt_scaf_name: local name for the patch scaffold
   col 2: alt_scaf_acc: the accession.version for the patch scaffold
   col 3: patch_type: FIX or NOVEL (defined below)

14. par.txt
   A file defining the pseudoautosomal regions (PARs) when the
   sequences of the sex chromosomes in a mammalian genome assembly are
   known to include the pseudoautosomal regions.
   The file is tab delimited (including a #header) with the following
   columns:
   col 1: Chr: chromosome name
   col 2: PAR-Name: name of the PAR region
   col 3: Start: the starting position of the PAR on the chromosome
          (in 1-based coordinates)
   col 4: Stop: the ending position of the PAR on the chromosome
          (in 1-based coordinates)

15. par_align.asn
   A file providing alignments between each pseudoautosomal region
   (PAR) on the X chromosome and the corresponding PAR on the Y
   chromosome, in ASN.1 format.

16. par_align.gff
   A file providing alignments between each pseudoautosomal region
   (PAR) on the X chromosome and the corresponding PAR on the Y
   chromosome, in CIGAR format embedded within a GFF format file.

17. alt_locus_scaf2primary.pos
   [Deprecated. Replaced by alt_scaffold_placement.txt]
   A file associating alternate loci scaffolds with the corresponding
   primary assembly chromosome, providing the location on the
   chromosome and the confidence of the placement.
   The file is tab delimited (including a #header) with the following
   columns:
   col 1: alt_loci_name: local scaffold name for the alternate locus
   col 2: alt_loci_acc: the accession.version for the alternate locus
   col 3: chr_name: chromosome to which the alternate locus is aligned
   col 4: chrom_start: the starting position on the chromosome
          (in 1-based coordinates)
   col 5: is_fuzzy: whether chrom_start is defined by an alignment or
          is an estimate
   col 6: chrom_end: the ending position on the chromosome
          (in 1-based coordinates)
   col 7: is_fuzzy: whether chrom_end is defined by an alignment or is
          an estimate
   col 8: alt_loci_start: start of the alignment on the alternate
          locus scaffold
   col 9: alt_loci_end: end of the alignment on the alternate locus
          scaffold

==============
5. Definitions
==============
Assembly: A set of chromosome assemblies, unlocalized and unplaced
sequences and alternate loci used to represent an organism's genome.
Most current assemblies are a haploid representation of an organism's
genome, although some loci may be represented more than once (see
Alternate locus, below). This representation may be obtained from a
single individual (e.g. chimp or mouse) or multiple individuals (e.g.
human Genome Reference Consortium assembly). Except in the case of
organisms which have been bred to homozygosity, the haploid assembly
does not typically represent a single haplotype, but rather a mixture
of haplotypes.

Chromosome Assembly: A relatively complete pseudo-molecule assembled
from smaller sequences (components) that represent a biological
chromosome. Relatively complete implies that some gaps may still be
present in the assembly, but independent measures suggest that most of
the sequence is represented by sequenced bases. Completeness is
submitter defined.

Unlocalized sequence: A sequence found in an assembly that is
associated with a specific chromosome but cannot be ordered or
oriented on that chromosome.

Unplaced sequence: A sequence found in an assembly that is not
associated with any chromosome.

Primary assembly: An assembly-unit representing the collection of
assembled chromosomes, unlocalized and unplaced sequences that, when
combined, should represent a non-redundant haploid genome. This
excludes any alternate loci.

Alternate locus: A sequence that provides an alternate representation
of a locus found in the primary assembly. These sequences do not
represent a complete chromosome sequence although there is no hard
limit on the size of the alternate locus; currently these are less
than 1 Mb.

Alternate locus group: An assembly-unit consisting of scaffolds from
different loci that are considered to be part of the same haplotype
(e.g. mouse 129/Sv group).

Genomic region: A defined span on the primary assembly for which
alternate loci or patch scaffolds are available. Genomic regions may
be named after a gene or gene cluster, or may be given arbitrary
region numbers.

Major release: The formal release of a genome assembly, e.g. GRCh37.

Minor release: A release of a genome assembly including patches that
occurs between major releases.

Genome Patch: A sequence contig/scaffold that corrects sequence in a
major release of the genome, or adds sequence to it.

FIX patch: A patch that corrects sequence or reduces an assembly gap
in a given major release. FIX patch sequences are meant to be
incorporated into the primary or existing alt-loci assembly units at
the next major release, and their accessions will then be deprecated.

NOVEL patch: A patch that adds sequence to a major release. Typically,
NOVEL patch sequences are meant to be incorporated into the assembly
as new alternate loci at the next major release, and their accessions
will not be deprecated.

######################################################################
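
Example: reading alt_scaffold_placement.txt
-------------------------------------------
The column layout given for alt_scaffold_placement.txt in section 4
can be read with a few lines of Python. The following is a minimal
sketch, not an NCBI tool. It assumes the file is exactly as described
above: tab delimited, one header line starting with '#', 15 columns,
and 'na' for missing values. The function name parse_placements and
the sample rows (including the accessions in them) are made up for
illustration and do not come from a real assembly.

```python
import csv
import io

# Column names as listed for alt_scaffold_placement.txt in section 4.
COLUMNS = [
    "alt_asm_name", "prim_asm_name", "alt_scaf_name", "alt_scaf_acc",
    "parent_type", "parent_name", "parent_acc", "region_name", "ori",
    "alt_scaf_start", "alt_scaf_stop", "parent_start", "parent_stop",
    "alt_start_tail", "alt_stop_tail",
]
# Columns 10 to 15 hold coordinates and tail lengths.
INT_COLUMNS = COLUMNS[9:]


def parse_placements(handle):
    """Yield one dict per alternate scaffold; 'na' values become None."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the #header line
        if len(row) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(row)}")
        record = {key: (None if value == "na" else value)
                  for key, value in zip(COLUMNS, row)}
        for key in INT_COLUMNS:
            if record[key] is not None:
                record[key] = int(record[key])
        yield record


# Hypothetical sample data: one placed scaffold and one scaffold with
# no placement ('na' in columns 5 to 15).
SAMPLE = "\n".join([
    "#" + "\t".join(COLUMNS),
    "\t".join(["ALT_X", "Primary Assembly", "SCAF_A", "XX000001.1",
               "CHROMOSOME", "6", "YY000006.1", "REGION1", "+",
               "1", "5000", "100001", "105200", "0", "25"]),
    "\t".join(["ALT_X", "Primary Assembly", "SCAF_B", "XX000002.1"]
              + ["na"] * 11),
]) + "\n"

records = list(parse_placements(io.StringIO(SAMPLE)))
placed = [r for r in records if r["parent_acc"] is not None]
for r in placed:
    print(r["alt_scaf_acc"], r["parent_name"],
          r["parent_start"], r["parent_stop"])
```

To read a downloaded file instead of the sample, pass an open text
handle, for example open(path, newline=""), to parse_placements.
Turning 'na' into None keeps unplaced scaffolds in the result while
making them easy to filter out.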