###############################################
# Date Palm Genome Draft Sequence Version 2.0 #
###############################################

October 1, 2009
Genomics Core Team:
Eman K. Al-Dous, Binu George, Yasmeen M. Salameh, Eman K. Al-Azwani, Moneera Y. Al-Jaber,
and Joel A. Malek
Weill Cornell Medical College in Qatar
WCMC-Q
http://www.qatar-weill.cornell.edu


Contact:
Joel Malek
jom2042@qatar-med.cornell.edu

1.INTRODUCTION
2.FILES
3.ASSEMBLY INFORMATION
4.ANNOTATION INFORMATION
5.POLYMORPHISM INFORMATION
____________________


1.PDK20 INTRODUCTION
--------------
The files on this website are from the Version 2.0 draft assembly of 
the Date Palm Genome generated by whole genome shotgun next generation 
DNA sequencing.

Please see the "README.txt" file from Version 1.0 on the same website
for more detailed information.

NOTE: CONTIG IDs from V1.0 and V2.0 are NOT interchangeable.

The main difference in the PDK20 (which stands for Phoenix dactylifera 'Khalas'
2.0) assembly is that the contigs from Version 1.0 have been scaffolded.  This
was accomplished using paired-end sequences from 1.4-4kb inserts from a Type III
restriction enzyme library (EcoP15I library).  We then supplied the linking information
to BAMBUS (part of the AMOS package: http://amos.sourceforge.net) to create scaffolds.
Finally, we created 'pseudocontigs', which are the scaffolds with runs of 60 "N"s
representing the gaps between contigs that have been linked by paired-end information.
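
The pseudocontig layout described above can be sketched in Python as follows.
This is a minimal illustration, not the pipeline we ran: it assumes the contigs
are already ordered and oriented (the job BAMBUS performs), and the function
names build_pseudocontig and split_pseudocontig are hypothetical.

```python
import re

# Gap length stated in this README: 60 N's per linked gap.
GAP = "N" * 60


def build_pseudocontig(contigs):
    """Join ordered, oriented contig sequences with a 60-N spacer between neighbours."""
    return GAP.join(contigs)


def split_pseudocontig(seq, min_gap=60):
    """Split a pseudocontig back into contig-like pieces at runs of >= min_gap N's."""
    pattern = re.compile("[Nn]{%d,}" % min_gap)
    return [piece for piece in pattern.split(seq) if piece]


# Toy sequences for illustration only (not real date palm contigs).
contigs = ["ACGTACGT", "GGCCTTAA", "TTTT"]
scaffold = build_pseudocontig(contigs)
pieces = split_pseudocontig(scaffold)
```

Splitting on runs of 60 or more N's recovers the original pieces only when the
contigs themselves do not begin or end with N; shorter runs of N inside a
contig are left untouched.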

2.PDK20 FILES
-------
PDK20.fsa.gz:	a gzipped multi-fasta file with all scaffold sequences.

PDK20.gbf.gz:	a gzipped GENBANK format file with ALL annotation information
		including sequences, mRNA sequences, protein sequences, SNPs,
		Enzyme Commission Numbers, Gene Ontology annotation, etc.
		NOTE: THIS EXPANDS to ~1.5 GB.

PDK20.mRNA.fsa:	a multi-fasta file of all 19,414 predicted genes (full and partial). 
		The sequences are spliced and titles contain functional annotation.

PDK20.pep.fsa:	a multi-fasta file of all 19,414 predicted genes translated as proteins.
		The titles contain functional annotations.

PDK20.snp.txt.gz: a tab-delimited text file of all SNP locations with PDK20 assembly
		  coordinates.  Please do not confuse these with V1.0 coordinates.  See MAQ
		  or the V1.0 README for more detail.  Essentially the columns are: scaffold
		  name, position, reference base, consensus base, Phred-like consensus quality,
		  read depth, the average number of hits of reads covering this position,
		  the highest quality of reads covering the position, the minimum consensus
		  quality in the 3bp flanking regions at each side of this site, the second best
		  call, the log likelihood ratio of the second best and the third best call, and
		  the third best call.
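
As a rough guide to reading PDK20.snp.txt.gz, the Python sketch below maps the
twelve columns listed above onto named fields.  The field names, the numeric
types chosen for each column, and the example scaffold name are assumptions
made for illustration; check them against the actual file before relying on
them.

```python
import gzip
from collections import namedtuple

# Field names are hypothetical labels for the twelve columns listed in this README.
SnpRecord = namedtuple("SnpRecord", [
    "scaffold", "position", "ref_base", "cons_base", "cons_quality",
    "read_depth", "avg_hits", "max_read_quality", "min_flank_quality",
    "second_call", "llr_second_third", "third_call",
])


def parse_snp_line(line):
    """Convert one tab-delimited SNP line into a SnpRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError("expected 12 columns, found %d" % len(f))
    return SnpRecord(
        scaffold=f[0], position=int(f[1]), ref_base=f[2], cons_base=f[3],
        cons_quality=int(f[4]), read_depth=int(f[5]), avg_hits=float(f[6]),
        max_read_quality=int(f[7]), min_flank_quality=int(f[8]),
        second_call=f[9], llr_second_third=float(f[10]), third_call=f[11],
    )


def read_snps(path):
    """Yield SnpRecord objects from a gzipped SNP file such as PDK20.snp.txt.gz."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_snp_line(line)


# Made-up example line; "scaffold_example" is not a real PDK20 scaffold name.
example = "scaffold_example\t1234\tA\tG\t45\t12\t1.00\t63\t40\tA\t30\tN\n"
record = parse_snp_line(example)
```

A typical use would be "for snp in read_snps('PDK20.snp.txt.gz'): ...", which
streams the file rather than loading it into memory at once.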


3.ASSEMBLY INFORMATION
----------------------			       
This project utilized a whole genome shotgun approach.

We used VELVET version 0.7.27 (http://www.ebi.ac.uk/~zerbino/velvet) to 
assemble the shotgun reads into contigs.

PDK20 (V2.0) essentially uses the same contigs from V1.0 (VELVET output) and
further scaffolds them using BAMBUS.  This improves overall contiguity of the 
assembly and improves the chances of finding an entire gene on a single scaffold.
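
One common way to quantify contiguity is the N50 statistic.  This README does
not report N50 values, so the Python sketch below is offered only as a way to
compare the V1.0 contigs with the V2.0 scaffolds yourself; the helper names
and the toy lengths are hypothetical.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of multi-fasta lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Toy example: four contigs, then the same bases joined into two scaffolds.
contig_n50 = n50([100, 200, 300, 400])
scaffold_n50 = n50([700, 300])
toy = fasta_lengths([">s1 demo", "ACGT", "AC", ">s2", "GG"])
```

For the real data, the lines could come from gzip.open("PDK20.fsa.gz", "rt");
note that scaffold lengths computed this way include the 60-N gap spacers.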


4.ANNOTATION INFORMATION
------------------------
The PDK20 annotation is a major improvement over the 1.0 annotation.  Gene finding
was done with FGENESH++ (http://www.softberry.com) using the Plant REFSEQ protein
database for homology searching.  The functional annotation, EC numbers, Gene 
Ontology information, etc. were generated using BLAST2GO.  We saw a significant
increase in the number of full length gene predictions from the V1.0 to the V2.0 assembly.
There are still a number of genes that are not full length, as should be 
expected in a draft sequence.  We also expect a certain level of mitochondrial/chloroplast
genes within the predictions listed here, though we attempted to remove most of them.


5.POLYMORPHISM INFORMATION
--------------------------
SNP calling was performed as in V1.0 with MAQ, and there were no significant changes 
other than switching to scaffold names from V2.0.  We used lower and upper
coverage cutoffs of 5 and 40 respectively to avoid low quality or repeat based
SNPs.  See the V1.0 README for more details on SNP calling.
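
The coverage cutoffs above can be illustrated with a short Python sketch.  It
is not the MAQ filtering step itself: it assumes the cutoffs are inclusive and
that read depth is the sixth tab-delimited column, as in the column list for
PDK20.snp.txt.gz; the function names are hypothetical.

```python
MIN_DEPTH = 5   # lower coverage cutoff stated in this README
MAX_DEPTH = 40  # upper coverage cutoff stated in this README


def passes_depth_cutoffs(read_depth, low=MIN_DEPTH, high=MAX_DEPTH):
    """Return True when read depth lies within the cutoffs (assumed inclusive)."""
    return low <= read_depth <= high


def filter_by_depth(rows, depth_column=5):
    """Yield only the rows whose read depth (0-based column 5) passes the cutoffs."""
    for row in rows:
        if passes_depth_cutoffs(int(row[depth_column])):
            yield row


# Made-up rows; only the depth column matters for this illustration.
rows = [
    ["s1", "10", "A", "G", "30", "4"],   # too shallow: likely low quality
    ["s1", "20", "C", "T", "30", "5"],
    ["s1", "30", "G", "A", "30", "40"],
    ["s1", "40", "T", "C", "30", "41"],  # too deep: likely repeat based
]
kept = list(filter_by_depth(rows))
```

Positions with very low depth have too little evidence for a reliable call,
while very high depth often signals collapsed repeats, which is why both ends
are cut.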