###############################################
# Date Palm Genome Draft Sequence Version 2.0 #
###############################################

October 1, 2009

Genomics Core Team:
Eman K. Al-Dous, Binu George, Yasmeen M. Salameh, Eman K. Al-Azwani,
Moneera Y. Al-Jaber, and Joel A. Malek

Weill Cornell Medical College in Qatar
WCMC-Q
http://www.qatar-weill.cornell.edu

Contact: Joel Malek jom2042@qatar-med.cornell.edu

1.INTRODUCTION
2.FILES
3.ASSEMBLY INFORMATION
4.ANNOTATION INFORMATION
5.POLYMORPHISM INFORMATION
____________________

1.PDK20 INTRODUCTION
--------------------
The files on this website are from the Version 2.0 draft assembly of the Date Palm Genome, generated by whole genome shotgun next-generation DNA sequencing. Please see the "README.txt" file from Version 1.0 on the same website for more detailed information.

NOTE: CONTIG IDs from V1.0 and V2.0 are NOT interchangeable.

The main difference in the PDK20 assembly (PDK20 stands for Phoenix dactylifera 'Khalas' 2.0) is that the contigs from Version 1.0 have been scaffolded. This was accomplished using paired-end sequences from 1.4-4kb inserts from a Type III restriction enzyme library (EcoP15I library). We then supplied the linking information to BAMBUS (part of the AMOS package: http://amos.sourceforge.net) to create scaffolds. Finally, we created 'pseudocontigs', which are the scaffolds with a run of 60 "N" characters representing each gap between contigs that have been linked by paired-end information.

2.PDK20 FILES
-------------
PDK20.fsa.gz: a gzipped multi-FASTA file with all scaffold sequences.

PDK20.gbf.gz: a gzipped GenBank-format file with ALL annotation information, including sequences, mRNA sequences, protein sequences, SNPs, Enzyme Commission numbers, Gene Ontology annotation, etc. NOTE: THIS EXPANDS TO ~1.5 GB.

PDK20.mRNA.fsa: a multi-FASTA file of all 19,414 predicted genes (full and partial). The sequences are spliced and the titles contain functional annotation.

PDK20.pep.fsa: a multi-FASTA file of all 19,414 predicted genes translated into proteins. The titles contain functional annotations.

PDK20.snp.txt.gz: a gzipped, tab-delimited text file of all SNP locations in PDK20 assembly coordinates. Please do not confuse these with V1.0 coordinates. See the MAQ documentation or the V1.0 README for more detail. The columns are:
   1. Scaffold name
   2. Position
   3. Reference base
   4. Consensus base
   5. Phred-like consensus quality
   6. Read depth
   7. Average number of hits of reads covering this position
   8. Highest quality of reads covering the position
   9. Minimum consensus quality in the 3bp flanking regions at each side of this site
  10. Second best call
  11. Log likelihood ratio of the second best and the third best call
  12. Third best call

3.ASSEMBLY INFORMATION
----------------------
This project utilized a whole genome shotgun approach. We used VELVET version 0.7.27 (http://www.ebi.ac.uk/~zerbino/velvet) to assemble the shotgun reads into contigs. PDK20 (V2.0) uses essentially the same contigs as V1.0 (the VELVET output) and further scaffolds them using BAMBUS. This improves the overall contiguity of the assembly and improves the chances of finding an entire gene on a single scaffold.

4.ANNOTATION INFORMATION
------------------------
The PDK20 annotation is a substantial improvement over the V1.0 annotation. Gene finding was done with FGENESH++ (http://www.softberry.com) using the Plant REFSEQ protein database for homology searching. The functional annotation, EC numbers, Gene Ontology information, etc. were generated using BLAST2GO. We saw a significant increase in the number of full-length gene predictions from the V1.0 to the V2.0 assembly. A number of genes are still not full length, as should be expected in a draft sequence. We also expect a certain level of mitochondrial/chloroplast genes within the predictions listed here, though we attempted to remove most of them.

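For readers who want to sanity-check a download, the following is a minimal Python sketch, not part of the original release. It reads a multi-FASTA file such as PDK20.mRNA.fsa or PDK20.pep.fsa, counts the records (19,414 are expected for those two files), and optionally splits a pseudocontig at its gaps. The function names are hypothetical, and treating any run of 60 or more "N" characters as a scaffold gap is an assumption: such a rule would also split at long N runs already present inside a V1.0 contig, if any exist.

```python
import re
from typing import Iterator, List, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from a multi-FASTA text stream."""
    title = None
    chunks: List[str] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None and line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


def count_records(handle: TextIO) -> int:
    """Count the FASTA records in a stream."""
    return sum(1 for _ in read_fasta(handle))


def split_at_gaps(sequence: str, gap_len: int = 60) -> List[str]:
    """Split a pseudocontig at runs of at least gap_len N characters.

    Assumption: every such run is a scaffold gap, as described in the
    introduction. Shorter N runs are left inside the pieces.
    """
    pieces = re.split(r"[Nn]{%d,}" % gap_len, sequence)
    return [piece for piece in pieces if piece]
```

A typical use would be to open PDK20.mRNA.fsa in text mode and compare count_records(handle) against the expected gene count, or to open PDK20.fsa.gz with gzip.open(path, "rt") and call split_at_gaps on each scaffold sequence.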
5.POLYMORPHISM INFORMATION
--------------------------
SNP calling was done with MAQ as in V1.0, and there were no significant changes other than switching to the V2.0 scaffold names. We used lower and upper coverage cutoffs of 5 and 40, respectively, to avoid low-quality or repeat-based SNPs. See the V1.0 README for more details on SNP calling.
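
As an illustration of the column layout and the coverage cutoffs, here is a minimal Python sketch, not the pipeline used for the release. It parses lines of PDK20.snp.txt.gz using the twelve columns listed in the FILES section and tests read depth against the 5 and 40 cutoffs. The field names are hypothetical, the file is assumed to have no header line, and treating both cutoffs as inclusive is an assumption. Only position and read depth are converted to integers; the other columns are kept as text because their exact formatting is not stated here.

```python
import gzip
from typing import Iterator, List, NamedTuple, TextIO


class SnpRecord(NamedTuple):
    # Hypothetical field names, in the column order given in the FILES section.
    scaffold: str
    position: int
    ref_base: str
    cons_base: str
    cons_quality: str
    read_depth: int
    avg_hits: str
    max_read_quality: str
    min_flank_quality: str
    second_call: str
    llr_second_third: str
    third_call: str


def parse_snp_line(line: str) -> SnpRecord:
    """Parse one tab-delimited SNP line into a SnpRecord."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 columns, found %d" % len(fields))
    return SnpRecord(
        fields[0], int(fields[1]), fields[2], fields[3], fields[4],
        int(fields[5]), fields[6], fields[7], fields[8], fields[9],
        fields[10], fields[11],
    )


def read_snps(handle: TextIO) -> Iterator[SnpRecord]:
    """Yield a SnpRecord for every non-blank line in a text stream."""
    for line in handle:
        if line.strip():
            yield parse_snp_line(line)


def within_depth(record: SnpRecord, low: int = 5, high: int = 40) -> bool:
    """True if read depth lies within the cutoffs (inclusive by assumption)."""
    return low <= record.read_depth <= high


def load_filtered(path: str) -> List[SnpRecord]:
    """Read a gzipped SNP file and keep records within the depth cutoffs."""
    with gzip.open(path, "rt") as handle:
        return [rec for rec in read_snps(handle) if within_depth(rec)]
```

Calling load_filtered("PDK20.snp.txt.gz") would return the records whose depth falls within the cutoffs. Since the released file was produced with these cutoffs already, this serves mainly as a consistency check or as a starting point for stricter filtering.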