Supporting data for "Hybrid de novo genome assembly of the Chinese herbal fleabane Erigeron breviscapus"
===============================================================================

Yang, J; Zhang, G; Zhang, J; Liu, H; Chen, W; Li, Y; Wang, X; Dong, Y; Yang, S (2017): GigaScience Database. http://dx.doi.org/10.5524/100290

Summary:
-------
Plants of the genus Erigeron in the Compositae (Asteraceae) family are commonly called fleabanes, possibly because of the belief that certain chemicals in these plants repel fleas. In traditional Chinese medicine, Erigeron breviscapus, which is native to China, was widely used to treat cerebrovascular disease. A handful of bioactive compounds, including scutellarin, 3,5-dicaffeoylquinic acid, and 3,4-dicaffeoylquinic acid, have been isolated from the plant. To find novel medicinal compounds and understand their biosynthetic pathways, we set out to sequence the genome of E. breviscapus.

We assembled the highly heterozygous E. breviscapus genome using a combination of PacBio single-molecule real-time sequencing and next-generation sequencing on the Illumina HiSeq platform. The final draft genome is approximately 1.2 Gb, with contig and scaffold N50 sizes of 18.8 kb and 31.5 kb, respectively. Further analyses predicted 37,504 protein-coding genes in the E. breviscapus genome and 8,172 gene families shared among Compositae species. The E. breviscapus genome provides a valuable resource for the investigation of novel bioactive compounds in this Chinese herb.

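The summary reports contig and scaffold N50 sizes. As a rough illustration of how such a figure can be recomputed from a FASTA file, here is a minimal Python sketch. It is not part of the deposited scripts; the function names `fasta_lengths` and `n50` are hypothetical, and the sketch assumes the common N50 definition (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length).

```python
from typing import Iterable, Iterator


def fasta_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None  # None until the first header line has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0
```

A possible use, assuming the assembly file sits in the working directory, would be `with open("DZH_genome.fasta") as fh: print(n50(fasta_lengths(fh)))`. Whether the result matches the published values depends on which sequences the file contains and on how gaps were counted, neither of which is stated here.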
Files:
-----
DZH_genome.fasta - genome assembly file
polished_contigs.fa - polished contigs file
DZH.evm.out.finally.filter.gff - gene annotation gff file
DZH.evm.out.finally.filter.cds - gene annotation cds fasta file
DZH.evm.out.finally.filter.pep - gene annotation pep fasta file
all_orthomcl.out - OrthoMCL output file
global.out.cafe - CAFE output file
fifteen_RNA-seq_fpkm - fifteen RNA-seq FPKM matrix file

repeat_annotations/DZH_denovo.gff - de novo repeat annotation gff
repeat_annotations/DZH_RepeatMasker.gff - RepeatMasker annotation gff
repeat_annotations/DZH_RepeatProteinMask.gff - RepeatProteinMask annotation gff
repeat_annotations/DZH_TRF.gff - Tandem Repeats Finder annotation gff

ncRNA_annotations/DZH_miRNA.gff - miRNA annotation gff
ncRNA_annotations/DZH_rRNA.gff - rRNA annotation gff
ncRNA_annotations/DZH_snRNA.gff - snRNA annotation gff
ncRNA_annotations/DZH_tRNA.gff - tRNA annotation gff

phylogenetic_analysis/389_cds.fasta - 389 single copy genes fasta file
phylogenetic_analysis/PhyML.nwk - phylogenetic tree newick file

scripts.dir/filter_adapter.pl
    #function: remove adapters
    #use: perl filter_adapter.pl -fq1 read1.fq -fq2 read2.fq -key filter.out

scripts.dir/filter_low_quality.pl
    #function: remove low-quality reads
    #use: perl filter_low_quality.pl -fq1 read1.fq -fq2 read2.fq -key filter.out

scripts.dir/filter_duplication.pl
    #function: remove duplicate reads
    #use: perl filter_duplication.pl read1.fq read2.fq clean1.fq clean2.fq dup.stat

scripts.dir/get_less_than_150bp.pl
    #function: list the IDs of genes whose cds is shorter than 150 bp
    #use: perl get_less_than_150bp.pl test.cds.fasta less_than_150bp.list

scripts.dir/cds_check.pl
    #function: list the IDs of genes whose cds is incorrect (genes with incomplete ORFs or with stop codons in the middle of the gene)
    #use: perl cds_check.pl -check test.cds.fasta > incorret.list

scripts.dir/del_filter_gene_ID_cds_from_gff.pl
    #function: remove the specified gene IDs from a gff file
    #use: perl del_filter_gene_ID_cds_from_gff.pl gene_ID.list test.gff filter.gff

scripts.dir/del_one_cds_from_gff.pl
    #function: remove one-cds genes from a gff file
    #use: perl del_one_cds_from_gff.pl test.gff out.gff
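
The CDS filtering scripts listed above are described only by their function. The following Python sketch illustrates the kind of checks those descriptions imply; it is not a port of the deposited Perl scripts, and their exact criteria may differ. The names `parse_fasta` and `cds_problems` are hypothetical. The sketch assumes the standard genetic code, and it assumes that an "incomplete ORF" means a CDS that lacks an ATG start codon, lacks a terminal stop codon, or has a length that is not a multiple of three.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def parse_fasta(lines):
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""  # ID is the first header token
            records[current] = []
        elif line and current is not None:
            records[current].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


def cds_problems(seq, min_length=150):
    """Return a list of problem labels for one CDS (empty list if none)."""
    problems = []
    if len(seq) < min_length:
        problems.append("short")
    if len(seq) % 3 != 0:
        problems.append("not_multiple_of_3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("no_start")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_terminal_stop")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems
```

To list flagged gene IDs for a CDS file, one could run `parse_fasta` on the open file and print every ID for which `cds_problems` returns a non-empty list. The resulting list could then serve as the gene ID input to a gff filtering step such as the one del_filter_gene_ID_cds_from_gff.pl performs.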