###############################################
# Date Palm Genome Draft Sequence Version 3.0 #
###############################################

March 28, 2010

Genomics Core Team:
Eman K. Al-Dous, Binu George, Maryam E. Al-Mahmoud, Yasmeen M. Salameh,
Eman K. Al-Azwani, Moneera Y. Al-Jaber, and Joel A. Malek

Weill Cornell Medical College in Qatar (WCMC-Q)
http://www.qatar-weill.cornell.edu

Contact: Joel Malek
jom2042@qatar-med.cornell.edu

1. INTRODUCTION
2. FILES
3. ADDITIONAL INFORMATION
____________________

1. PDK30 INTRODUCTION
---------------------

The files on this website are from the Version 3.0 draft assembly of the Date Palm Genome (Khalas female variety), generated using Next Generation DNA Sequencing (Illumina GAIIx). Please see the "README.txt" files from Versions 1.0 and 2.0 on the same website for more detail.

NOTE: CONTIG IDs from V3.0 and V2.0 are NOT interchangeable.

Major differences in this release are:

1. Doubling of sequence and clone coverage (to ~54X and ~30X, respectively)
2. Use of SOAPdenovo for assembly
3. Significantly improved contiguity (6kb N50 contig, 30kb N50 scaffold)
4. Significantly improved gene prediction, resulting in ~90% of genes captured
5. Sequencing of 8 additional genomes, including Deglet Noor and Medjool
6. Polymorphism data (SNP and CNV) from the 8 additional genomes
7. Use of BWA/SAMTOOLS for SNP calling

The PDK30 (Phoenix dactylifera 'Khalas' 3.0) assembly differs substantially from the earlier releases. In this round we doubled sequence coverage to approximately 54X from short reads and increased 2-5kb clone coverage to ~30X. We switched from VELVET to SOAPdenovo (http://soap.genomics.org.cn/soapdenovo.html) for the assembly stage, which we found to be more RAM efficient. We still used BAMBUS (part of the AMOS package: http://amos.sourceforge.net) to create scaffolds. We then created 'pseudocontigs', which are basically the scaffolds written out as single sequences in which a run of 60 "N"s represents each gap between contigs that have been linked by paired-end information.
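The pseudocontig layout described above can be reversed when the individual contigs are needed, for example to recompute contig-level statistics. The following is a minimal Python sketch, not part of the original pipeline. It assumes that any run of at least 60 "N" characters is a scaffold gap. The function names, the `min_gap` parameter, and the toy sequence are made up for illustration.

```python
import re


def split_pseudocontig(seq, min_gap=60):
    """Split a scaffold sequence into contig pieces at gap runs.

    Assumption: a gap is any run of at least `min_gap` N characters
    (the README states gaps are written as 60 N's).
    """
    gap_run = re.compile(r"N{%d,}" % min_gap, re.IGNORECASE)
    return [piece for piece in gap_run.split(seq) if piece]


def n50(lengths):
    """Return the N50: the length L such that pieces of length >= L
    together hold at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy scaffold (invented): three contigs of 20, 8 and 12 bases joined by 60-N gaps.
scaffold = "ACGT" * 5 + "N" * 60 + "GGCC" * 2 + "N" * 60 + "TTAA" * 3
contigs = split_pseudocontig(scaffold)
contig_lengths = [len(c) for c in contigs]
print(contig_lengths, n50(contig_lengths))
```

Isolated "N" bases or runs shorter than 60 are left inside the contig pieces, since the sketch treats only full-length runs as scaffold gaps.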
For SNP calling we switched from MAQ to BWA/SAMTOOLS, which allowed for better alignment in the highly polymorphic (including small indels) date palm genome.

2. FILES
--------

PDK30-mrna.fsa.gz: a gzipped multi-fasta DNA file with ~28,000 gene predictions (full and partial). Sequences are spliced and titles contain functional annotation.

PDK30-pep.fsa.gz: a gzipped multi-fasta amino acid file with ~28,000 translated gene predictions. Titles contain functional annotation.

PDK30.gbf.gz: a gzipped GENBANK format file with annotation information including sequences, mRNA sequences, protein sequences, Gene Ontology information and Enzyme Commission Numbers. SNPs are in a separate file.

PDK30_9genomes_SNPs.tab.txt.gz: a gzipped, tab delimited file of SNP calls at ~3.5M polymorphic sites for each of 9 genomes. 0 indicates no detected difference from the Khalas female reference (PdactyKAsm30_r20101206.fasta.gz). A/C/G/T indicates a homozygous difference from the Khalas reference. Heterozygous calls are indicated with IUPAC codes.

PDK30_CNVs.tab.txt.gz: a gzipped, tab delimited file of CNVs/ISCRs (Imbalanced Sequence Count Regions) that overlap gene regions.

PDK30_README.txt: this file.

PdactyKAsm30_r20101206.fasta.gz: a gzipped multi-fasta file with all scaffold sequences.

3. ADDITIONAL INFORMATION
-------------------------

For details on the assembly methods, annotation, and polymorphism calling please see the publication and supplementary methods (Al-Dous et al., Nature Biotechnology, 2011).

The 9 sequenced genomes listed in the SNP file are as follows:

Khsfem: The Khalas female genome from Qatar, sequenced and assembled for this project.
KhBC2fem: A Khalas BC2 female from USDA, California (originally documented as male but phenotyped as female).
DNfem: Deglet Noor female from USDA, California.
DNBC5male: Deglet Noor male backcrossed 5 generations from USDA, California.
Mdjlfem: Medjool female from USDA, California.
MdjlBC4male: Medjool male backcrossed 4 generations from USDA, California.
Alrjfem: Seed grown female with no pedigree from Qatar.
KhltMale: Seed grown male with no pedigree from Qatar.
KhFxfem: A female tree resulting from a cross with Khalas female from USDA, California.

##########################################################
###########Information from Version 2 Assembly############

1. INTRODUCTION
2. FILES
3. ASSEMBLY INFORMATION
4. ANNOTATION INFORMATION
5. POLYMORPHISM INFORMATION
____________________

1. PDK20 INTRODUCTION
---------------------

The files on this website are from the Version 2.0 draft assembly of the Date Palm Genome generated by whole genome shotgun next generation DNA sequencing. Please see the "README.txt" file from Version 1.0 on the same website for more detailed information.

NOTE: CONTIG IDs from V1.0 and V2.0 are NOT interchangeable.

The main difference in the PDK20 (which stands for Phoenix dactylifera 'Khalas' 2.0) assembly is that the contigs from Version 1.0 have been scaffolded. This was accomplished using paired-end sequences from 1.4-4kb inserts from a Type III restriction enzyme library (EcoP15I library). We then added the linking information to BAMBUS (part of the AMOS package: http://amos.sourceforge.net) to create scaffolds. We then created 'pseudocontigs', which are basically the scaffolds written out as single sequences in which a run of 60 "N"s represents each gap between contigs that have been linked by paired-end information.

2. PDK20 FILES
--------------

PDK20.fsa.gz: a gzipped multi-fasta file with all scaffold sequences.

PDK20.gbf.gz: a gzipped GENBANK format file with ALL annotation information including sequences, mRNA sequences, protein sequences, SNPs, Enzyme Commission Numbers, Gene Ontology annotation, etc. NOTE: THIS EXPANDS to ~1.5 GB.

PDK20.mRNA.fsa: a multi-fasta file of all 19,414 predicted genes (full and partial). The sequences are spliced and titles contain functional annotation.

PDK20.pep.fsa: a multi-fasta file of all 19,414 predicted genes translated as proteins.
The titles contain functional annotations.

PDK20.snp.txt.gz: a gzipped, tab delimited text file of all SNP locations with PDK20 assembly coordinates. Please do not confuse these with V1.0 coordinates. See the MAQ documentation or the V1.0 README for more detail. Essentially the columns are:

 1. Scaffold name
 2. Position
 3. Reference base
 4. Consensus base
 5. Phred-like consensus quality
 6. Read depth
 7. The average number of hits of reads covering this position
 8. The highest mapping quality of the reads covering the position
 9. The minimum consensus quality in the 3bp flanking regions at each side of this site
10. The second best call
11. The log likelihood ratio of the second best and the third best call
12. The third best call

3. ASSEMBLY INFORMATION
-----------------------

This project utilized a whole genome shotgun approach. We used VELVET version 0.7.27 (http://www.ebi.ac.uk/~zerbino/velvet) to assemble the shotgun reads into contigs. PDK20 (V2.0) essentially uses the same contigs as V1.0 (VELVET output) and further scaffolds them using BAMBUS. This improves the overall contiguity of the assembly and the chances of finding an entire gene on a single scaffold.

4. ANNOTATION INFORMATION
-------------------------

The PDK20 annotation is a substantial improvement over the V1.0 annotation. Gene finding was done with FGENESH++ (http://www.softberry.com) using the Plant REFSEQ protein database for homology searching. The functional annotation, EC numbers, Gene Ontology information, etc. were generated using BLAST2GO. We saw a significant increase in the number of full length gene predictions from the V1.0 to the V2.0 assembly. A number of genes are still not full length, as should be expected in a draft sequence. We also expect a certain level of mitochondrial/chloroplast genes within the predictions listed here, though we attempted to remove most of them.

5. POLYMORPHISM INFORMATION
---------------------------

SNP calling was done with MAQ as in V1.0. The only significant change is that SNP positions are now reported against V2.0 scaffold names.
We used lower and upper coverage cutoffs of 5 and 40, respectively, to avoid low-quality or repeat-based SNPs. See the V1.0 README for more details on SNP calling.
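
As an illustration of the PDK20.snp.txt.gz column layout and the coverage cutoffs described above, here is a minimal Python sketch that reads the first six columns and keeps rows whose read depth lies between the lower and upper cutoffs. It is not the code used for this release. It assumes the file has no header line and that the cutoffs are inclusive. The names `SnpRow`, `parse_snp_lines` and `read_snp_file`, and the toy rows with their scaffold names, are invented for the example.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class SnpRow(NamedTuple):
    scaffold: str
    position: int
    ref_base: str
    consensus_base: str
    consensus_quality: int
    read_depth: int


def parse_snp_lines(lines: Iterable[str],
                    min_depth: int = 5,
                    max_depth: int = 40) -> Iterator[SnpRow]:
    """Yield SNP rows whose read depth is within [min_depth, max_depth].

    Assumption: cutoffs are inclusive; the README does not say either way.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank, truncated, or non-data lines.
        if len(fields) < 6 or not fields[1].isdigit():
            continue
        row = SnpRow(fields[0], int(fields[1]), fields[2], fields[3],
                     int(fields[4]), int(fields[5]))
        if min_depth <= row.read_depth <= max_depth:
            yield row


def read_snp_file(path: str, **cutoffs: int) -> Iterator[SnpRow]:
    """Stream rows from a gzipped SNP file such as PDK20.snp.txt.gz."""
    with gzip.open(path, "rt") as handle:
        yield from parse_snp_lines(handle, **cutoffs)


# Toy rows (invented values, 12 tab-separated columns each).
toy_lines = [
    "scaffold_A\t101\tA\tG\t45\t12\t1.00\t63\t40\tA\t20\tC\n",
    "scaffold_A\t250\tC\tT\t30\t3\t1.00\t63\t25\tC\t10\tG\n",
    "scaffold_B\t77\tG\tR\t50\t85\t2.50\t63\t45\tG\t15\tA\n",
]
kept = list(parse_snp_lines(toy_lines))
print([(row.scaffold, row.position, row.read_depth) for row in kept])
```

With the real file, `read_snp_file("PDK20.snp.txt.gz")` would stream rows in the same way. Since the released file was already filtered with these cutoffs, the depth check mainly serves as a sanity check or as a hook for stricter thresholds.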