Supporting data for "Genome sequence of Malania oleifera, a tree with great value for nervonic acid production"
===============================================================================================================

Xu C; Liu H; Zhou S; Zhang D; Zhao W; Wang S; Chen F; Sun Y; Nie S; Jia K; Jiao S; Zhang R; Yun Q; Guan W; Wang X; Bennetzen JL; Maghuly F; Porth I; de Peer YV; Wang X; Ma Y; Mao JF (2018): Supporting data for "Genome sequence of Malania oleifera, a tree with great value for nervonic acid production" GigaScience Database. http://dx.doi.org/10.5524/100549

Summary:
--------
Malania oleifera, a member of the Olacaceae family, is an IUCN Red Listed tree, endemic and restricted to the Karst region of southwest China. This tree's seed is valued for its high content of precious fatty acids (especially nervonic acid). However, studies on its genetic make-up and fatty acid biogenesis are severely hampered by a lack of molecular and genetic tools.

We generated 51 Gigabases (Gb) and 135 Gb of raw DNA sequences, using PacBio Single-Molecule Real-Time (SMRT) and 10x Genomics sequencing, respectively. A final genome assembly, with a scaffold N50 size of 4.65 Megabases (Mb) and a total length of 1.51 Gb, was obtained by primary assembly based on PacBio long reads plus scaffolding with 10x Genomics reads. Identified repeats constituted ~82% of the genome, and 24,064 protein-coding genes were predicted with high support. The genome has low heterozygosity and shows no evidence for recent whole-genome duplication. Metabolic pathway genes relating to the accumulation of long-chain fatty acids were identified and studied in detail.

Here, we provide the first genome assembly and gene annotation for M. oleifera. The availability of these resources will be of great importance for conservation biology, and for the functional genomics of nervonic acid biosynthesis.
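Note on the scaffold N50 quoted above: the sketch below shows how N50 and L50 are conventionally computed from a list of sequence lengths. It is an illustration only, not the pipeline used to produce the reported statistics. The function name n50_l50 and the toy lengths are hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    assembly length; L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example with made-up lengths (total 300, half 150).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)
```

With real data, the lengths would be taken from the sequences in genome.fa; reading that file is left out here to keep the sketch self-contained.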
Files:
------
genome.fa                                 assembly fasta
final.gene.cds.fa                         coding gene nucleotide sequences (fasta): CDS
final.gene.pep.fa                         coding gene translated sequences (protein fasta)
final.gene.pep.gff3                       coding gene annotations (GFF)
final.repeatmask.gff3                     repeats/transposable elements (GFF)
final.gene.ncRNA.gff3                     ncRNA, including tRNA, rRNA (GFF)
singlecopy_aligned.faa                    gene family alignments (multi-fasta)
singlecopy_aligned.phy_phyml_tree.tre     phylogenetic tree files (newick)
genome_BUSCO.tgz                          BUSCO output files
OrthoMcl2rawcafe.py                       python script used in phylogenetic analysis

Table_S1.csv    Summary of PacBio and Illumina sequencing data (10x Genomics and RNA sequencing) generated in the present study. IDs of the study, sample, library and accessions in NCBI SRA, the sequencing platform employed, material origins of the sequenced DNA or RNA, statistics of raw and cleaned data, and mapping rates are shown.

Table_S2.csv    Data summary from 10x Genomics based on GemCode index multiplicity. Read subsets are based on the number of reads associated with each index. For raw reads, all indices (including those with N's) are included in the count. For all other read sets, only the indices without N's were used for binning.

Table_S3.csv    Estimation of genome characteristics based on 17-mer statistics.

Table_S4.csv    Statistics of the different versions of the M. oleifera genome assembly in ascending order. N50: shortest sequence length at 50% of the genome; L50: smallest number of contigs whose length sum produces N50. NA: data not available; * statistics for contigs/scaffolds. Gene completeness was assessed with 1,440 single-copy orthologs from the BUSCO embryophyta_odb9 database.

Table_S5.csv    Summary of the annotated interspersed repeats in the genome assembly for M. oleifera.
    LTR: Long Terminal Repeat retrotransposons;
    LINE: Long Interspersed Nuclear Elements, a group of non-LTR retrotransposons;
    SINE: Short Interspersed Nuclear Elements, non-autonomous, non-coding transposable elements (TEs);
    RC: Rolling-circle transposons.

Table_S6.csv    Summary of transcriptome assemblies using three different analysis pipelines.

Table_S7.csv    Summary of annotated genes.

Table_S8.csv    Summary of BUSCO evaluation for gene prediction.

Table_S9.csv    Summary of functional annotation of predicted genes.

Table_S10.csv   Subgroups of Gypsy and Copia superfamily of LTR-RTs.

Table_S11.csv   Gene proximity of subgroups of Gypsy and Copia superfamily of LTR-RTs.

Table_S12.csv   Comparison of the number of original and filtered intact LTR-RT, solo-LTR and truncated LTR among 9 plant species.

Table_S13.csv   Genomic data used for phylogenomic and gene family analyses. Origins, download links, assembly versions, genome properties and references of 14 genomes are shown.

Table_S14.csv   Summary of gene family analyses. Unique groups and genes, single-copy and duplicated groups and genes are summarized for the 15 analyzed plant genomes.

Table_S15.csv   GO enrichment of expanded gene families. Significant at q < 0.05.
    (A) 'Category' is the Gene Ontology (GO) term ID;
    (B) 'P_value' is the overrepresentation p-value, testing the null hypothesis that the observed frequency of a given term among the analyzed genes equals the expected frequency based on the null distribution; lower p-values indicate stronger evidence for overrepresentation;
    (C) 'Q_value' is the Benjamini and Hochberg adjusted p-value;
    (D) 'numEPInCat' is the number of expanded gene families in the corresponding GO category;
    (E) 'numInCat' is the number of detected gene families in the corresponding GO category;
    (F) 'Term' is the GO term;
    (G) 'Ontology' indicates which ontology the term comes from.

Table_S16.csv   KEGG enrichment of expanded gene families. Significant at q < 0.05.
    (A) 'KO category' is the KEGG Orthology (KO) category ID;
    (B) 'P_value' is the overrepresentation p-value, testing the null hypothesis that the observed frequency of a given term among the analyzed genes equals the expected frequency based on the null distribution; lower p-values indicate stronger evidence for overrepresentation;
    (C) 'Q_value' is the Benjamini and Hochberg adjusted p-value;
    (D) 'numEPInCat' is the number of expanded gene families in the corresponding KO category;
    (E) 'numInCat' is the number of detected gene families in the corresponding KO category;
    (F) 'Pathway' is the KEGG pathway;
    (G) 'Class' indicates which KEGG class the pathway comes from.

Table_S17.csv   GO enrichment of contracted gene families. Significant at q < 0.05.
    (A) 'Category' is the Gene Ontology (GO) term ID;
    (B) 'P_value' is the overrepresentation p-value, testing the null hypothesis that the observed frequency of a given term among the analyzed genes equals the expected frequency based on the null distribution; lower p-values indicate stronger evidence for overrepresentation;
    (C) 'Q_value' is the Benjamini and Hochberg adjusted p-value;
    (D) 'numEPInCat' is the number of expanded gene families in the corresponding GO category;
    (E) 'numInCat' is the number of detected gene families in the corresponding GO category;
    (F) 'Term' is the GO term;
    (G) 'Ontology' indicates which ontology the term comes from.

Table_S18.csv   KEGG enrichment of contracted gene families. Significant at q < 0.05.
    (A) 'KO category' is the KEGG Orthology (KO) category ID;
    (B) 'P_value' is the overrepresentation p-value, testing the null hypothesis that the observed frequency of a given term among the analyzed genes equals the expected frequency based on the null distribution; lower p-values indicate stronger evidence for overrepresentation;
    (C) 'Q_value' is the Benjamini and Hochberg adjusted p-value;
    (D) 'numEPInCat' is the number of expanded gene families in the corresponding KO category;
    (E) 'numInCat' is the number of detected gene families in the corresponding KO category;
    (F) 'Pathway' is the KEGG pathway;
    (G) 'Class' indicates which KEGG class the pathway comes from.

Table_S19.csv   GO enrichment of fast evolving gene families. Significant at q < 0.05.
    (A) 'Category' is the Gene Ontology (GO) term ID;
    (B) 'P_value' is the overrepresentation p-value, testing the null hypothesis that the observed frequency of a given term among the analyzed genes equals the expected frequency based on the null distribution; lower p-values indicate stronger evidence for overrepresentation;
    (C) 'Q_value' is the Benjamini and Hochberg adjusted p-value;
    (D) 'numEPInCat' is the number of expanded gene families in the corresponding GO category;
    (E) 'numInCat' is the number of detected gene families in the corresponding GO category;
    (F) 'Term' is the GO term;
    (G) 'Ontology' indicates which ontology the term comes from.

Table_S20.csv   KEGG enrichment of fast evolving gene families. Significant at q < 0.05.
    (A) 'KO category' is the KEGG Orthology (KO) category ID;
    (B) 'P_value' is the overrepresentation p-value, testing the null hypothesis that the observed frequency of a given term among the analyzed genes equals the expected frequency based on the null distribution; lower p-values indicate stronger evidence for overrepresentation;
    (C) 'Q_value' is the Benjamini and Hochberg adjusted p-value;
    (D) 'numEPInCat' is the number of expanded gene families in the corresponding KO category;
    (E) 'numInCat' is the number of detected gene families in the corresponding KO category;
    (F) 'Pathway' is the KEGG pathway;
    (G) 'Class' indicates which KEGG class the pathway comes from.

Table_S21.csv   Summary of 23 metabolic gene clusters in the M. oleifera genome. Genomic coordinates, gene composition, core protein domains related to metabolism and pathway assignment are shown.
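
Note on the 'Q_value' column in Tables S15 to S20: the sketch below shows the standard Benjamini and Hochberg step-up adjustment, to make clear how an adjusted value relates to the raw 'P_value' column. It is an illustration only and is not the code used to generate the tables. The function name benjamini_hochberg and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), and a
    running minimum taken from the largest p-value downwards keeps the
    adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Made-up p-values, not taken from any of the tables.
example_p = [0.01, 0.04, 0.03, 0.20]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]  # same threshold as the tables
print(example_q, significant)
```

In this toy input only the first term passes q < 0.05, even though three of the raw p-values are below 0.05, which is the reason the tables report the adjusted column alongside the raw one.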