################################################################################
README for assembly_structure directories under:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Last updated: August 9, 2016
################################################################################

============================
Assembly structure directory
============================
An assembly structure directory is provided to report detailed information
about the internal structure of the assembly. The directory contains AGP files
that define how component sequences are organized into scaffolds and/or
chromosomes. Other files define how scaffolds and chromosomes are organized
into non-nuclear and other assembly-units, and how any alternate or patch
scaffolds are placed relative to the chromosomes.

The assembly structure directory is named as:
[assembly accession.version]_[assembly name]_assembly_structure

=============
Organization:
=============
Each assembly_structure directory will contain one or more assembly-unit
directories. Many assemblies consist of a single assembly-unit, the Primary
Assembly; other assemblies may be comprised of multiple assembly-units. For
example, the human GRCh38 assembly contains a Primary Assembly, a series of
ALT_REF_LOCI assembly-units and a non-nuclear (organelle) assembly-unit.
   Primary_Assembly/
   ALT_REF_LOCI_*/
   non-nuclear/

Each assembly-unit directory contains the following files:
   component_localID2acc
   scaffold_localID2acc
   join_certificate.xml  (optional - only present for some assemblies from
                          the Genome Reference Consortium)

Each assembly-unit directory also contains one or more of the following
directories (depending on the particular assembly):
   assembled_chromosomes/
   placed_scaffolds/
   unlocalized_scaffolds/
   unplaced_scaffolds/
   alt_scaffolds/           (only in alternate loci and patch assembly-units)
   pseudoautosomal_region/  (only for mammals)

The content of the assembled_chromosomes, placed_scaffolds,
unlocalized_scaffolds, unplaced_scaffolds, alt_scaffolds and
pseudoautosomal_region directories is:

   assembled_chromosomes/
      chr2acc
      FASTA/
         chr*.fna.gz
      AGP/
         chr*.comp.agp.gz
         chr*.agp.gz
   placed_scaffolds/
      FASTA/
         chr*.placed.scaf.fna.gz
      AGP/
         chr*.placed.scaf.agp.gz
   unlocalized_scaffolds/
      unlocalized.chr2scaf
      FASTA/
         chr*.unlocalized.scaf.fna.gz
      AGP/
         chr*.unlocalized.scaf.agp.gz
   unplaced_scaffolds/
      FASTA/
         unplaced.scaf.fna.gz
      AGP/
         unplaced.scaf.agp.gz
   alt_scaffolds/
      FASTA/
         alt.scaf.fna.gz
      AGP/
         alt.scaf.agp.gz
      alt_scaffold_placement.txt
      alignments/
         {scaffold accession.version}_{chromosome accession.version}.asn
         {scaffold accession.version}_{chromosome accession.version}.gff
   pseudoautosomal_region/
      par_align.asn
      par_align.gff

assembly_structure directories for assemblies with alternate loci or patch
scaffolds may also contain some additional files:

   genomic_regions_definitions.txt
      This file is provided for assemblies that have defined REGIONS on the
      Primary Assembly for which alternative loci or patch scaffolds have been
      provided. The file reports:
      region name | chromosome accession.version | start | stop

   all_alt_scaffold_placement.txt
      A file providing the alternate locus scaffold placements for all the
      alternate assembly-units. See below for the file format.

------
Notes:
------
1.
   The sequences of the placed scaffolds are redundant with the sequences of
   the assembled chromosomes. The placed scaffolds are provided for users who
   prefer to work with scaffolds rather than with chromosomes.
2. Eukaryote genome assemblies may include an assembly-unit named
   "non-nuclear" which contains data from organelle genomes, for example the
   mitochondrion or chloroplast.
3. If the assembly is comprised of more than one assembly-unit, the names for
   the assembly-units, other than a "non-nuclear" assembly-unit, are supplied
   by the submitter.
4. The chromosome-from-scaffold AGP file (chr?.agp.gz), and the
   placed_scaffolds directory, may be omitted if the chromosome is assembled
   directly from components, or if the chromosome is a complete sequence with
   no gaps.
5. The file suffix .agp.gz indicates AGP files. See format specification:
   https://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml

=====================
Description of files:
=====================
1. Files containing genomic sequences in nucleotide fasta format (unmasked)

   FILENAME                        CONTENT
   chr?.fna.gz                     chromosome sequence
   chr?.placed.scaf.fna.gz         placed scaffold sequences
   chr?.unlocalized.scaf.fna.gz    unlocalized scaffold sequences
   unplaced.scaf.fna.gz            unplaced scaffold sequences
   alt.scaf.fna.gz                 alternate loci or patch scaffold sequences

2. AGP files
   See format specification:
   https://www.ncbi.nlm.nih.gov/genome/assembly/agp/AGP_Specification.shtml
   The AGP files in this directory tree use GenBank or RefSeq
   accession.versions as the identifiers for components, scaffolds, and
   chromosomes.

   FILENAME                        CONTENT
   chr?.comp.agp.gz                chromosome from component AGP
   chr?.agp.gz                     chromosome from scaffold AGP
   chr?.placed.scaf.agp.gz         placed scaffold from component AGP
   chr?.unlocalized.scaf.agp.gz    unlocalized scaffold from component AGP
   unplaced.scaf.agp.gz            unplaced scaffold from component AGP
   alt.scaf.agp.gz                 alternate loci or patch scaffold from
                                   component AGP

3.
   component_localID2acc
   A two-column file associating the submitter component ID with the
   accession.version. 'na' is shown in the ID column if the submitter did not
   provide a name for the component.

4. scaffold_localID2acc
   A two-column file associating the submitter scaffold ID with the
   accession.version. 'na' is shown in the ID column if the submitter did not
   provide a name for the scaffold.

5. chr2acc
   A two-column file associating the chromosome, or linkage group name, with
   the accession.version.

6. unlocalized.chr2scaf
   A two-column file giving the chromosome or linkage group assignment for
   each unlocalized scaffold.

7. join_certificate.xml
   This file provides data on joins in the assembly that were curated by the
   Genome Reference Consortium (GRC). This file will not be present for
   assemblies submitted by other groups.

8. genomic_regions_definitions.txt
   A file defining the regions on the primary assembly for which alternate
   loci or patch scaffolds are available. May also include pseudoautosomal
   regions, centromere regions or heterochromatin regions if these have been
   defined for an assembly.
   The file is tab-delimited (including a #header) with the following columns:
   col 1: region_name: name for the genomic region
   col 2: chromosome: accession.version for the chromosome or
          unlocalized/unplaced scaffold
   col 3: start: the starting position on the chromosome or scaffold
          (in 1-based coordinates)
   col 4: stop: the ending position on the chromosome or scaffold
          (in 1-based coordinates)

9. alt_scaffold_placement.txt & all_alt_scaffold_placement.txt
   Files associating alternate loci or patch scaffolds with the corresponding
   primary assembly chromosome, providing the location on the chromosome, the
   genomic region name, and the length of any unaligned tails.
   The file is tab-delimited (including a #header) with the following columns:
   col 1: alt_asm_name: name of the assembly-unit that includes the alternate
          scaffold
   col 2: prim_asm_name: name of the primary assembly-unit on which the
          alternate scaffold is being placed
   col 3: alt_scaf_name: name of the alternate scaffold being placed
   col 4: alt_scaf_acc: accession.version of the alternate scaffold being
          placed
   col 5: parent_type: type of object on which the alternate scaffold is being
          placed, either CHROMOSOME or SCAFFOLD
   col 6: parent_name: name of the object on which the alternate scaffold is
          being placed (can be either a chromosome or a scaffold)
   col 7: parent_acc: accession.version of the sequence on which the alternate
          scaffold is being aligned
   col 8: region_name: name of the genomic region on the parent within which
          the alternate scaffold is placed
   col 9: ori: orientation of the alignment, '+', '-' or 'b' (mixed)
   col10: alt_scaf_start: start of the placement on the alternate scaffold
          (in 1-based coordinates)
   col11: alt_scaf_stop: end of the placement on the alternate scaffold
          (in 1-based coordinates)
   col12: parent_start: start of the placement on the parent sequence
          (in 1-based coordinates)
   col13: parent_stop: end of the placement on the parent sequence
          (in 1-based coordinates)
   col14: alt_start_tail: number of bases at the start of the alternate
          scaffold not involved in the placement
   col15: alt_stop_tail: number of bases at the end of the alternate scaffold
          not involved in the placement

   Note: Every alternate scaffold associated with the assembly-unit will be
   listed in this file. Any alternate scaffold that has no placement will have
   'na' in columns 5 to 15. Any alternate scaffold that has a chromosome
   assignment, but no alignment, will have the chromosome name in column 6 and
   'na' in columns 7 to 15.

10.
   alignments/{scaffold accession.version}_{chromosome accession.version}.asn
   Files providing alignments of the alternate loci or patch scaffolds to the
   corresponding primary assembly chromosome, in ASN.1 format. These
   alignments indicate how the alternate loci and patch scaffold sequences
   differ from the chromosomes of the primary assembly.

11. alignments/{scaffold accession.version}_{chromosome accession.version}.gff
   Files providing alignments of the alternate loci or patch scaffolds to the
   corresponding primary assembly chromosome, in CIGAR format embedded within
   a GFF format file. These alignments indicate how the alternate loci and
   patch scaffold sequences differ from the chromosomes of the primary
   assembly.

12. par_align.asn
   A file providing alignments between each pseudoautosomal region (PAR) on
   the X chromosome and the corresponding PAR on the Y chromosome, in ASN.1
   format.

13. par_align.gff
   A file providing alignments between each pseudoautosomal region (PAR) on
   the X chromosome and the corresponding PAR on the Y chromosome, in CIGAR
   format embedded within a GFF format file.

14. patch_type
   A file providing the patch type for each of the scaffolds in a patch
   assembly-unit.
   The file is tab-delimited (including a #header) with the following columns:
   col 1: alt_scaf_name: local name for the patch scaffold
   col 2: alt_scaf_acc: the accession.version for the patch scaffold
   col 3: patch_type: FIX or NOVEL (defined below)

============
Definitions:
============
Assembly: A set of chromosome assemblies, unlocalized and unplaced sequences
and alternate loci used to represent an organism's genome. Most current
assemblies are a haploid representation of an organism's genome, although some
loci may be represented more than once (see Alternate locus, below). This
representation may be obtained from a single individual (e.g. chimp or mouse)
or multiple individuals (e.g. human Genome Reference Consortium assembly).
Except in the case of organisms that have been bred to homozygosity, the
haploid assembly does not typically represent a single haplotype, but rather a
mixture of haplotypes.

Chromosome Assembly: A relatively complete pseudo-molecule assembled from
smaller sequences (components) that represent a biological chromosome.
Relatively complete implies that some gaps may still be present in the
assembly, but independent measures suggest that most of the sequence is
represented by sequenced bases. Completeness is submitter defined.

Unlocalized sequence: A sequence found in an assembly that is associated with
a specific chromosome but cannot be ordered or oriented on that chromosome.

Unplaced sequence: A sequence found in an assembly that is not associated with
any chromosome.

Primary assembly: An assembly-unit representing the collection of assembled
chromosomes, unlocalized and unplaced sequences that, when combined, should
represent a non-redundant haploid genome. This excludes any alternate loci.

Alternate locus: A sequence that provides an alternate representation of a
locus found in the primary assembly. These sequences do not represent a
complete chromosome sequence although there is no hard limit on the size of
the alternate locus; currently these are less than 1 Mb.

Alternate locus group: An assembly-unit consisting of scaffolds from different
loci that are considered to be part of the same haplotype (e.g. mouse 129/Sv
group).

Genomic region: A defined span on the primary assembly for which alternate
loci or patch scaffolds are available. Genomic regions may be named after a
gene or gene cluster, or may be given arbitrary region numbers.

Major release: The formal release of a genome assembly, e.g. GRCh38.

Minor release: A release of a genome assembly including patches that occurs
between major releases.

Genome Patch: A sequence contig/scaffold that corrects sequence in a major
release of the genome, or adds sequence to it.
FIX patch: A patch that corrects sequence or reduces an assembly gap in a
given major release. FIX patch sequences are meant to be incorporated into the
primary or existing alt-loci assembly units at the next major release, and
their accessions will then be deprecated.

NOVEL patch: A patch that adds sequence to a major release. Typically, NOVEL
patch sequences are meant to be incorporated into the assembly as new
alternate loci at the next major release, and their accessions will not be
deprecated.

________________________________________________________________________________
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894, USA
tel: (301) 496-2475
fax: (301) 480-9241
e-mail: info@ncbi.nlm.nih.gov
________________________________________________________________________________
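
Appendix (illustrative only): reading alt_scaffold_placement.txt

The 15-column, tab-delimited layout described for alt_scaffold_placement.txt
and all_alt_scaffold_placement.txt under "Description of files" can be read
with a few lines of Python. The following is a minimal sketch, not an NCBI
tool: the function name parse_alt_scaffold_placement, the choice to map 'na'
to None, and every value in SAMPLE (assembly-unit, scaffold, accession and
region names) are assumptions made up for illustration. Only the column names
and the 'na' convention come from this README.

```python
import csv
import io

# Column names as listed for alt_scaffold_placement.txt in this README.
COLUMNS = [
    "alt_asm_name", "prim_asm_name", "alt_scaf_name", "alt_scaf_acc",
    "parent_type", "parent_name", "parent_acc", "region_name", "ori",
    "alt_scaf_start", "alt_scaf_stop", "parent_start", "parent_stop",
    "alt_start_tail", "alt_stop_tail",
]
# Columns 10 to 15 hold coordinates or base counts, so convert them to int.
INT_COLUMNS = COLUMNS[9:]


def parse_alt_scaffold_placement(handle):
    """Return one dict per data row; 'na' values are mapped to None."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip blank lines and the '#' header line.
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}"
            )
        row = {
            name: (None if value == "na" else value)
            for name, value in zip(COLUMNS, fields)
        }
        for name in INT_COLUMNS:
            if row[name] is not None:
                row[name] = int(row[name])
        rows.append(row)
    return rows


# Made-up sample: one placed scaffold and one scaffold with no placement
# ('na' in columns 5 to 15). None of these identifiers are real.
SAMPLE = (
    "#" + "\t".join(COLUMNS) + "\n"
    + "\t".join([
        "EXAMPLE_ALT_UNIT", "Primary Assembly", "EXAMPLE_SCAF_1",
        "XX000001.1", "CHROMOSOME", "1", "YY000001.1", "EXAMPLE_REGION",
        "+", "1", "1000", "5001", "6000", "0", "25",
    ]) + "\n"
    + "\t".join(
        ["EXAMPLE_ALT_UNIT", "Primary Assembly", "EXAMPLE_SCAF_2",
         "XX000002.1"] + ["na"] * 11
    ) + "\n"
)

placements = parse_alt_scaffold_placement(io.StringIO(SAMPLE))
# Keep only scaffolds that are actually aligned to a parent sequence.
placed = [p for p in placements if p["parent_acc"] is not None]
print(len(placements), "rows,", len(placed), "with a placement")
```

For a real file, pass an open text handle instead of the in-memory sample,
for example open("alt_scaffold_placement.txt", newline=""). Mapping 'na' to
None keeps unplaced scaffolds in the result, so they can be filtered or
reported explicitly rather than silently dropped.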