README

DATE: 30/12/2021
TITLE: De novo assembly and annotation of Bread wheat cv. Kariega
AUTHORS: Michael Abrouk, Naveenkumar Athiyannan, Simon Krattinger
CONTRIBUTORS: Willem H. P. Boshoff, Stéphane Cauet, Nathalie Rodde, David Kudrna, Nahed Mohammed, Jan Bettgenhaeuser, Kirsty Botha, Shannon Derman, Rod A. Wing, Renée Prins
CONTACT: michael.abrouk@kaust.edu.sa

The archive 'Kariega_v1_pseudomolecules.tar.gz' contains the unmasked and masked DNA sequences, in FASTA format, of the chromosomal pseudomolecules of bread wheat (Triticum aestivum) cv. Kariega. The primary contig assembly was built from PacBio HiFi reads with hifiasm. Contigs were scaffolded with Bionano data and arranged into chromosomal pseudomolecules with Omni-C data using the Juicer / 3D-DNA / Juicebox pipeline. An AGP file specifying the placement of sequence scaffolds in the pseudomolecules is provided.

The archive 'Kariega_v1_annotations.tar.gz' holds the structural gene annotation, which is based on evidence derived from protein homology, RNAseq and IsoSeq datasets, as well as on ab initio gene prediction incorporating TE annotations. It provides gene models in GFF3 format, their functional descriptions, and the coding and protein sequences of high- and low-confidence genes. De novo annotations were subjected to a confidence classification step, which considers homology to existing proteins and the robustness of the functional assessment, and which assigns genes to high- (HC) and low- (LC) confidence classes. A fixed cut-off threshold was applied for protein homology. Functional assessment was carried out using an InterProScan pipeline and was not manually curated.

The archive also contains a GFF file with the NLR genes predicted with the NLR-Annotator pipeline, GFF files specifying the positions of transposable elements, and a FASTA file with a transposable element library annotated with the EDTA software. The Yr27 CDS and genomic sequences are present in FASTA format.

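To illustrate how an AGP file relates scaffolds to pseudomolecules, the following minimal Python sketch lists the scaffolds placed on each pseudomolecule, in order and with their orientation. It assumes the standard tab-separated AGP layout, in which column 5 holds the component type and 'N' or 'U' marks a gap line. The function name and the sample records are hypothetical and are not taken from Kariega_v1.agp.

```python
from collections import defaultdict

# Component types that denote gaps in the AGP specification (assumption:
# the file follows the standard AGP layout).
GAP_TYPES = {"N", "U"}


def scaffolds_per_object(agp_lines):
    """Map each object (pseudomolecule) to its ordered list of
    (component_id, orientation) tuples, skipping comments and gap lines."""
    placement = defaultdict(list)
    for raw in agp_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.split("\t")
        if len(cols) < 5:
            raise ValueError(f"malformed AGP line: {line!r}")
        obj, comp_type = cols[0], cols[4]
        if comp_type in GAP_TYPES:
            continue  # gap between two scaffolds
        if len(cols) < 9:
            raise ValueError(f"component line needs 9 columns: {line!r}")
        # Column 6 is the component (scaffold) ID, column 9 its orientation.
        placement[obj].append((cols[5], cols[8]))
    return dict(placement)


# Hypothetical records for illustration only.
sample = [
    "# made-up example, not real Kariega data",
    "chr1A\t1\t1000\t1\tW\tscaffold_1\t1\t1000\t+",
    "chr1A\t1001\t1100\t2\tN\t100\tscaffold\tyes\tproximity_ligation",
    "chr1A\t1101\t1600\t3\tW\tscaffold_2\t1\t500\t-",
]
result = scaffolds_per_object(sample)
print(result)
```

To apply the sketch to the real file, pass an open file handle instead of the sample list, for example `scaffolds_per_object(open("Kariega_v1.agp"))`.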
Provided in these folders:
- Unmasked and masked DNA sequences
- Annotation in GFF3 format, including isoforms, UTRs and putative functional assignments in the description line
- Separate GFF3 files for high- (HC) / low- (LC) confidence genes and transposable elements (TE)
- CDS and protein sequences
- Transposable elements library
- NLR annotation
- Yr27 CDS and genomic sequences

Detailed list of files in '.tar.gz' folders:

Kariega_v1_pseudomolecules.tar.gz
  --> Kariega_v1.fasta (unmasked pseudomolecules)
  --> Kariega_v1-masked.fasta (masked pseudomolecules)
  --> Kariega_v1.agp (scaffold order in each pseudomolecule)
  --> Kariega_v1.length (pseudomolecule size)

Kariega_v1_annotations.tar.gz
  (High- and low-confidence gene models)
  --> Kariega_v1.gff3
  --> Kariega_v1-cds.fasta
  --> Kariega_v1-prot.fasta
  (High-confidence gene models)
  --> Kariega_v1_HC.gff3
  --> Kariega_v1_HC-cds.fasta
  --> Kariega_v1_HC-prot.fasta
  (Low-confidence gene models)
  --> Kariega_v1_LC.gff3
  --> Kariega_v1_LC-cds.fasta
  --> Kariega_v1_LC-prot.fasta
  (NLR gene prediction)
  --> Kariega_v1_NLR.gff3
  (Transposable elements)
  --> Kariega_v1_TE-intact.gff3
  --> Kariega_v1_TE-lib.fasta
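
After downloading, the archive contents can be compared with the file list above. The following is a minimal Python sketch that uses only the standard library. The function name `missing_members` is hypothetical, and because the internal directory layout of the archives is not stated in this README, the sketch assumes that matching on base file names is sufficient.

```python
import os
import tarfile

# File names copied from the list in this README.
EXPECTED = {
    "Kariega_v1_pseudomolecules.tar.gz": [
        "Kariega_v1.fasta",
        "Kariega_v1-masked.fasta",
        "Kariega_v1.agp",
        "Kariega_v1.length",
    ],
    "Kariega_v1_annotations.tar.gz": [
        "Kariega_v1.gff3", "Kariega_v1-cds.fasta", "Kariega_v1-prot.fasta",
        "Kariega_v1_HC.gff3", "Kariega_v1_HC-cds.fasta", "Kariega_v1_HC-prot.fasta",
        "Kariega_v1_LC.gff3", "Kariega_v1_LC-cds.fasta", "Kariega_v1_LC-prot.fasta",
        "Kariega_v1_NLR.gff3",
        "Kariega_v1_TE-intact.gff3", "Kariega_v1_TE-lib.fasta",
    ],
}


def missing_members(archive_path, expected_names):
    """Return the expected file names that are absent from the archive.

    Members are compared by base name (assumption: files may sit inside
    a sub-directory of the archive).
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        present = {os.path.basename(name) for name in tar.getnames()}
    return [name for name in expected_names if name not in present]


if __name__ == "__main__":
    for archive, names in EXPECTED.items():
        if not os.path.exists(archive):
            print(f"{archive}: not found in current directory, skipped")
            continue
        missing = missing_members(archive, names)
        print(f"{archive}: {'complete' if not missing else 'missing ' + ', '.join(missing)}")
```

Run the script from the directory that holds the two downloaded archives. Archives that are not present are skipped rather than reported as errors.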