Supporting data for "Hybrid de novo genome assembly of the Chinese herbal fleabane Erigeron breviscapus"
===============================================================================

Yang, J; Zhang, G; Zhang, J; Liu, H; Chen, W; Li, Y; Wang, X; Dong, Y; Yang, S (2017): GigaScience Database. http://dx.doi.org/10.5524/100290

Summary:
-------
Plants in the genus Erigeron of the Compositae (Asteraceae) family are commonly called fleabanes, possibly due to the belief that certain chemicals in these plants repel fleas. In traditional Chinese medicine, Erigeron breviscapus, which is native to China, was widely used in the treatment of cerebrovascular disease. A handful of bioactive compounds, including scutellarin, 3,5-dicaffeoylquinic acid, and 3,4-dicaffeoylquinic acid, have been isolated from the plant. To find novel medicinal compounds and understand their biosynthetic pathways, we propose to sequence the genome of E. breviscapus.
We assembled the highly heterozygous E. breviscapus genome using a combination of PacBio single-molecule real-time sequencing and next-generation sequencing on the Illumina HiSeq platform. The final draft genome is approximately 1.2 Gb, with contig and scaffold N50 sizes of 18.8 kb and 31.5 kb, respectively. Further analyses predicted 37,504 protein-coding genes in the E. breviscapus genome, and 8,172 gene families shared among Compositae species.
The E. breviscapus genome provides a valuable resource for the investigation of novel bioactive compounds in this Chinese herb.
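
For readers who want to recompute the N50 figures quoted above from the assembly files, the following is a minimal Python sketch. It assumes the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length). The function names `n50` and `fasta_lengths` are illustrative and are not part of this dataset, and the tool the authors actually used to compute these statistics is not stated here.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half of the
        # total is the N50 under the definition assumed above.
        if running >= half_total:
            return length
    return 0


def fasta_lengths(lines):
    """Yield the length of each record from an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length
```

Under these assumptions, opening DZH_genome.fasta (or polished_contigs.fa) and passing the file handle through `fasta_lengths` into `n50` should give the scaffold (or contig) N50 in base pairs, although the result may differ slightly from the published values if the authors applied a minimum-length cutoff.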

Files:
-----
DZH_genome.fasta - genome assembly file
polished_contigs.fa - polished contigs file
DZH.evm.out.finally.filter.gff - gene annotation gff file
DZH.evm.out.finally.filter.cds - gene annotation cds fasta file
DZH.evm.out.finally.filter.pep - gene annotation pep fasta file
all_orthomcl.out - OrthoMCL output file
global.out.cafe - CAFE output file
fifteen_RNA-seq_fpkm - fifteen RNA-seq FPKM matrix file
repeat_annotations/DZH_denovo.gff - de novo repeat annotation gff
repeat_annotations/DZH_RepeatMasker.gff - RepeatMasker annotation gff
repeat_annotations/DZH_RepeatProteinMask.gff - RepeatProteinMask annotation gff
repeat_annotations/DZH_TRF.gff - Tandem Repeats Finder annotation gff
ncRNA_annotations/DZH_miRNA.gff - miRNA annotation gff
ncRNA_annotations/DZH_rRNA.gff - rRNA annotation gff
ncRNA_annotations/DZH_snRNA.gff - snRNA annotation gff
ncRNA_annotations/DZH_tRNA.gff - tRNA annotation gff
phylogenetic_analysis/389_cds.fasta - 389 single-copy genes fasta file
phylogenetic_analysis/PhyML.nwk - phylogenetic tree newick file
scripts.dir/filter_adapter.pl - #function: remove adapter #use: perl filter_adapter.pl -fq1 read1.fq -fq2 read2.fq -key filter.out
scripts.dir/filter_low_quality.pl - #function: remove low-quality reads #use: perl filter_low_quality.pl -fq1 read1.fq -fq2 read2.fq -key filter.out
scripts.dir/filter_duplication.pl - #function: remove duplication #use: perl filter_duplication.pl read1.fq read2.fq clean1.fq clean2.fq dup.stat
scripts.dir/get_less_than_150bp.pl - #function: list the IDs of genes whose cds is shorter than 150 bp #use: perl get_less_than_150bp.pl test.cds.fasta less_than_150bp.list
scripts.dir/cds_check.pl - #function: list the IDs of genes whose cds is incorrect (genes with incomplete ORFs or with stop codons in the middle of the gene) #use: perl cds_check.pl -check test.cds.fasta > incorret.list
scripts.dir/del_filter_gene_ID_cds_from_gff.pl - #function: remove specified gene IDs from gff file #use: perl del_filter_gene_ID_cds_from_gff.pl gene_ID.list test.gff filter.gff
scripts.dir/del_one_cds_from_gff.pl - #function: remove one cds gene from gff file #use: perl del_one_cds_from_gff.pl test.gff out.gff
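
To illustrate the kind of checks that get_less_than_150bp.pl and cds_check.pl are described as performing, here is a minimal Python sketch. It is not a port of those Perl scripts, whose exact rules are not documented here. It assumes that an "incomplete ORF" means a cds that does not start with ATG, does not end with a stop codon, or has a length that is not a multiple of three, and that the standard stop codons (TAA, TAG, TGA) apply. The names `read_fasta` and `cds_problems` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear stop codons (assumption)


def read_fasta(lines):
    """Yield (gene_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def cds_problems(seq, min_length=150):
    """Return a list of problem labels for one cds (empty list if none)."""
    problems = []
    if len(seq) < min_length:
        problems.append("short")
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("no_start_codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no_stop_codon")
    # Any stop codon before the final codon is an in-frame internal stop.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal_stop")
    return problems
```

Running `cds_problems` on each record returned by `read_fasta` for DZH.evm.out.finally.filter.cds would give a per-gene list of flags. Because that file is described as already filtered, few or no flags would be expected if the assumptions above match the authors' criteria.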