###############################################
# Date Palm Genome Draft Sequence Version 1.0 #
###############################################

April 8, 2009

Genomics Core Team:
Eman K. Al-Dous, Yasmeen M. Salameh, Eman K. Al-Azwani, Moneera Y. Al-Jaber, and Joel A. Malek

Weill Cornell Medical College in Qatar
WCMC-Q
http://www.qatar-weill.cornell.edu

Contact: Joel Malek jom2042@qatar-med.cornell.edu

1.INTRODUCTION
2.FILES
3.ASSEMBLY INFORMATION
4.ANNOTATION INFORMATION
5.POLYMORPHISM INFORMATION
6.HOW DO I USE THIS? A Case Study

____________________

1.INTRODUCTION
--------------
The files on this website are from the draft assembly of the Date Palm Genome generated by whole genome shotgun next generation DNA sequencing. Highlights include: a predicted genome size of ~550Mbp, a scaffold N50 of 4250bp with most ordered gaps being extremely short, ~45,000 scaffolds greater than 2kb, 850,000 novel high quality SNPs between parental alleles, 37% GC in the nuclear genome, 302Mb of assembled sequence with 18.5Mb of ordered gaps, ~292Mb of sequence unique at the 24mer level, and draft Chloroplast gene sequences.

The scientific name is Phoenix dactylifera L., while the variety name is 'Khalas'. Combining the two we get 'PdactyK' for short.

It is our hope that the results provided here will be a starting point for researchers doing genetic studies of date palm.

The assembly is a draft assembly using next generation sequencing reads and as such requires caution in its usage. While short range contiguity is of high quality, longer range contiguity (spanning gaps) is less certain. Manual inspection of contigs based on mate pair validity showed that contigs/scaffolds up to 12kb were consistently assembled correctly. The quality of these scaffolds is roughly equivalent to other plant draft sequences such as rice and papaya. Larger scaffolds are more likely to have errors.
As such, researchers should be careful in operations such as PCR primer design when spanning gaps in the assembly (denoted by N's in the sequence).

DNA for this project was obtained from leaves kindly provided by the Qatar Plant Tissue Culture Lab in the Dept of Agriculture and Water Research (Qatar Ministry of Municipal Affairs and Agriculture).

2.FILES
-------
PdactyKAssembly1.0.fasta.gz : A gzipped multi-fasta file with DNA sequence of all contigs/scaffolds as output by the VELVET assembler. In this file you will find the sequence of most Date Palm Genes.

README.txt : This file

PdactyKAnnotation1.1.fasta.gz : Full file coming soon! At present, a file with de novo protein predictions based on Assembly 1.0

PdactyKSNPs1.1.txt.gz : A gzipped file containing heterozygous positions in the 'Khalas' variety based on Assembly 1.0. These are single nucleotide differences between parental alleles as output by MAQ. Please see the MAQ documentation for a full description of this tab delimited file.

PdactyK1.0ChloroplastPseudo.gbk : A genbank file containing a pseudo molecule of the Date Palm Chloroplast genome with associated Annotation. See the file comment for more info.

3.ASSEMBLY INFORMATION
----------------------
This project utilized a whole genome shotgun approach. We used VELVET version 0.7.27 (http://www.ebi.ac.uk/~zerbino/velvet) to assemble the shotgun reads into contigs. The assembly is based on approximately 220M paired end sequence reads ranging in size from 36 to 64bp. They were generated from two paired libraries with inserts of ~140bp and ~350bp. Sequencing was on an Illumina GA2 in the genomics core at WCMC-Q.

After an initial assembly, a rough draft of the chloroplast genome and other high coverage contigs was generated and used to screen out matching reads. We estimate that chloroplast DNA content was on the order of 15-20% for our DNA preparation.
After high coverage contig screening we trimmed each read at the first occurrence of two consecutive bases with quality values less than or equal to 4, removing those two bases and everything after them, and kept only reads with at least 36bp remaining.

Assembly was conducted on the Cornell CAC and WCMC-Q cluster using one node with 64GB of RAM. We conducted a 'staged' assembly to conserve RAM. In the first pass, screened reads were assembled as single reads with relatively high stringency (kmer of 25). Contigs from this stage were then passed as 'short reads' to the next stage where ~60M paired reads from the 350bp library were used to scaffold the first stage contigs and fill in repeat gaps if possible.

Final scaffold N50 was 4243bp with a total assembled contig length of 302.7Mb including 18.6Mb of ordered gaps. Ordered gaps (gaps that we know how to bridge based on mate pair information) are very small (typically less than 50bp) and are denoted by N's in the sequence.

We are happy with this assembly as the N50 scaffold length is around the size of a typical gene (see the rice genome for a similar experience). Once again, this offers gene prediction capability across the spectrum of Date Palm Genes (as opposed to an EST approach).

Our goal is to improve the present assembly using larger insert libraries (in progress). As next-gen data improves (paired 100bp reads) we believe the quality of this assembly will improve to near finished form.

4.ANNOTATION INFORMATION
------------------------
All annotation files will be based on an Assembly version. So if they are based on Assembly version 1.0, they will be called 1.X, etc.

We have conducted an initial 'place holder' de novo gene prediction using AUGUSTUS (http://augustus.gobics.de) trained on Maize. This produced approximately 12,000 gene predictions of which 85% contain at least partial protein similarity to a rice gene.
We are in the process of generating a thorough annotation using protein similarity based methods and this will be available shortly.

The chloroplast pseudo molecule was generated using MUMmer (http://mummer.sourceforge.net) to align assembled contigs to a reference chloroplast genome. As such, contig order should be viewed cautiously. Annotation only included full length genes, though partial genes do exist in the assembly.

5.POLYMORPHISM INFORMATION
--------------------------
We report 850,000 high quality SNPs between parental alleles.

In cases of polymorphism between parental chromosomes the VELVET assembler will only report one parental allele in the consensus sequence. Documentation of polymorphic positions in the genome is a post assembly process. Essentially we match the sequences back to the assembly and look for reads that have high quality discrepancies. We accomplished a first pass of this using MAQ (http://maq.sourceforge.net) to map the reads and call polymorphic positions. We report the results in the SNP file mentioned in the FILES section.

See the MAQ documentation for more detail on the column descriptions in this tab delimited file. Essentially the columns are: Scaffold name, position, reference base, consensus base, Phred-like consensus quality, read depth, the average number of hits of reads covering this position, the highest quality of reads covering the position, the minimum consensus quality in the 3bp flanking regions at each side of this site, the second best call, the log likelihood ratio of the second best and the third best call, and the third best call.

This is only a first pass because we used only high quality reads with exactly 50bp of sequence. This yielded 14X coverage of the genome. To avoid calling SNPs that are really due to repetitive sequence polymorphism, we required that SNPs not be called in regions with greater than 40X coverage. This cutoff was based on Poisson modelling of 14X coverage.
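The reasoning behind the 40X cutoff can be illustrated with a short calculation. The following is a minimal sketch, assuming read depth at unique sites is Poisson distributed with a mean of 14; it is not necessarily the exact model used for this release. The function names (poisson_pmf, poisson_tail, keep_snp) are hypothetical, and the depth column index follows the column order listed above (read depth is the sixth column).

```python
import math

def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space for stability."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def poisson_tail(threshold, mean, extra_terms=500):
    """P(X > threshold), summed directly over the upper tail.

    Terms beyond threshold + extra_terms are negligible for the
    coverage values considered here.
    """
    return sum(poisson_pmf(k, mean)
               for k in range(threshold + 1, threshold + 1 + extra_terms))

def keep_snp(line, max_depth=40):
    """Return True if a tab-delimited SNP row has read depth <= max_depth.

    Assumes read depth is the sixth column, as in the column list above.
    """
    fields = line.rstrip("\n").split("\t")
    return int(fields[5]) <= max_depth

if __name__ == "__main__":
    # Chance that a truly unique site exceeds 40X when mean coverage is 14X.
    print(poisson_tail(40, 14.0))
```

Under this assumption a unique site almost never reaches 40X, so positions above the cutoff are far more likely to come from collapsed repeats than from single copy sequence.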
We know from unpublished results that thorough SNP calling with these next generation sequencing technologies requires on the order of 15-20X coverage for heterozygous detection. As such, the data provided is not complete but gives an initial set of 850,000 polymorphic positions distributed relatively randomly across the genome. It is a resource for genotyping varietal differences, especially in coding sequences as shown in the case study.

6.HOW DO I USE THIS? A Case Study
---------------------------------
People experienced in bioinformatics can skip this section, but I thought it might be useful to give an example of how I might use this data.

As I mentioned, the quality of this data is somewhere between that of a relatively deep EST sequencing project and a full blown whole genome assembly. What's nice is that this approach offers intronic and surrounding (such as promoter) sequences that an EST project would not.

As an example, let's say I'm very interested in Date Palm fruit ripening. In that case I head to the QuickGO website at http://www.ebi.ac.uk/QuickGO and find the Gene Ontology term 'ripening'. From that I obtain proteins found in other plants that are important to fruit ripening. Say I am most interested in the protein "1-aminocyclopropane-1-carboxylate synthase 3" or "ACC synthase 3". I find a tomato protein homolog and use tblastn from the BLAST suite (ftp://ftp.ncbi.nih.gov/blast) to blast the tomato homolog of "ACC synthase 3" against the Date Palm contigs provided here. I find that a contig in the Date Palm sequence has homology to that protein (the contig is called NODE_1284864 and it is ~5200bp long).

To better define the exact boundaries and sequence of the Date Palm homolog I plug the full contig DNA sequence and the tomato homolog protein sequence into the WISE2 (http://www.ebi.ac.uk/Wise2/) or FGENESH+ (http://www.softberry.com) website.
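To get the full contig sequence for that step, one option is to pull the record out of the gzipped multi-fasta file with a short script. The following is a minimal sketch, assuming the FASTA headers begin with the contig name followed by an underscore (for example "NODE_1284864_"); the header format in the released file may differ, and the function names read_fasta and fetch_contig are hypothetical.

```python
import gzip

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text-mode FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def fetch_contig(fasta_gz_path, name_prefix):
    """Return (header, sequence) of the first record whose header starts
    with name_prefix, or None if no record matches."""
    with gzip.open(fasta_gz_path, "rt") as handle:
        for header, seq in read_fasta(handle):
            if header.startswith(name_prefix):
                return header, seq
    return None

if __name__ == "__main__":
    # The trailing underscore avoids matching longer contig numbers
    # that merely start with the same digits.
    hit = fetch_contig("PdactyKAssembly1.0.fasta.gz", "NODE_1284864_")
    if hit is not None:
        header, seq = hit
        print(">" + header)
        print(seq)
```

The printed sequence can then be pasted into the gene prediction website together with the tomato protein.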
Using the FGENESH+ website gives a full length Date Palm mRNA and protein prediction as follows:

##########
FGENESH+ 2.6 Prediction of potential genes in Monocot genomic DNA
Time : Thu Apr 9 01:12:06 2009
Seq name: NODE_1284864_length_5147_cov_3.007771
Length of sequence: 5169
Homology: UniProtKB/SwissProt|Q42881|ACS3|1-aminocyclopropane-1-carboxylate synthase 3
Length of homolog: 469
Number of predicted genes 1: in +chain 1, in -chain 0.
Number of predicted exons 4: in +chain 4, in -chain 0.
Positions of predicted genes and exons: Variant 1 from 1, Score:1354.856836

   G Str   Feature   Start        End    Score           ORF           Len

   1 +      TSS       1599               -3.88
   1 +    1 CDSf      1715 -      1876  160.89      1715 -      1876    162     3    49   64
   1 +    2 CDSi      2022 -      2150  155.70      2022 -      2150    129    50    93   60
   1 +    3 CDSi      2283 -      2443  161.97      2283 -      2441    159    94   146   62
   1 +    4 CDSi      2543 -      3412  896.45      2544 -      3410    867   148   429   60

Predicted protein(s):
>FGENESH:[mRNA] 1 4 exon (s) 1715 - 3412 1320 bp, chain +
ATGGGGGTCGAGTTTGGTGTTCTGCTGTCGGAAATTGCAACCTCCGACGCACATGGTGAA
GACTCCCCTTATTTTGCTGGATGGAAAGCCTATGATGAAGATCCTTATGATGCTGTCAGC
AATCCTTCAGGAGTCATTCAGATGGGATTGGCAGAAAACCAAGTTTCATTTTATCTACTG
GAGAACTATTTGGAGCAACACCCAGAAATATCCAACTGGGAAAGTGGAATCTCTAGCTTT
AGAGAGAATGCCTTATTTCAAGACTACCATGGGCTCAAAACATTCAGAAAGGCAATGGCG
AGTTTTATGGAGCAAATAAGAGGGGGGAGAGCGAAATTTGACCCCGACCGCATTGTTCTC
ACCGCAGGCGCCACCGCAGCAAATGAGTTGCTGACCTTCATCTTAGCAGACCCAGGAGAT
GCTTTGCTAATTCCTATTCCTTACTACCCAGGATTCGATAGAGATCTAAGATGGCGAACT
GGAGTGCACATAATCCCAGTCCACTGCAACGGCTCGAATGGCTTCCAAATCACTGTGAAA
GCCTTGGAAGAAGCATATGCTGAAGCAGGAGCTGCGAACATCAGAGTCAGAGGACTTCTG
CTGACAAATCCATCGAACCCCCTAGGAACTGCAATCACAAGGTCTGTTCTTGAAGAGATC
CTAGACTTCGCTACGCAAAAGGACATCAACTTGATATCAGACGAGATCTACTCGGGTTCC
GTATTCTCCTCGGCCGAGTTTGTGAGCATGGCTGAGATTGTTGAAGCCCGGGGTTATGAA
AATTCTGACAGGGTTCACATTGTCTATAGCCTTTCCAAGGATCTTGGTCTGCCTGGCTTT
AGGGTGGGGACAATATACTCGTACAACAATAAAGTAGTGACGACGGCTAGAAGAATGTCC
AGCTTCACACTCGTCTCATCCCAGACTCAAAAGATGTTGGCCTCAATGCTATCTGATAGG
GAGTTCACGGAGAATTACATAAAGACAAATAGGGAGAGTCTTAGGAAGAGGCACGAGTAT
ATTACTGAAGGGCTAAAGAACGCCGGTATTGAGTGCTTGCAGGGGAATGCTGGTCTCTTT
TGCTGGATGAATCTTGGGCCATTGCTCGAAGAGCCCACAAGAGAAGGTGAACTGAGCCTT
TGGAATTTGATACTGCATGAGGTTAAACTCAACATATCCCCAGGATCTTCATGCCACTGT
TCTGAAGCTGGTTGGTTTAGGGTGTGCTTCGCTAATATGAGCCAGCAGACACTAGATATT
GCACTCAGGAGAATACATGCATTCATGGAGAAAAGGAAGACAACAAAAGGGCAAGCTTTG
>FGENESH: 1 4 exon (s) 1715 - 3412 440 aa, chain +
MGVEFGVLLSEIATSDAHGEDSPYFAGWKAYDEDPYDAVSNPSGVIQMGLAENQVSFYLL
ENYLEQHPEISNWESGISSFRENALFQDYHGLKTFRKAMASFMEQIRGGRAKFDPDRIVL
TAGATAANELLTFILADPGDALLIPIPYYPGFDRDLRWRTGVHIIPVHCNGSNGFQITVK
ALEEAYAEAGAANIRVRGLLLTNPSNPLGTAITRSVLEEILDFATQKDINLISDEIYSGS
VFSSAEFVSMAEIVEARGYENSDRVHIVYSLSKDLGLPGFRVGTIYSYNNKVVTTARRMS
SFTLVSSQTQKMLASMLSDREFTENYIKTNRESLRKRHEYITEGLKNAGIECLQGNAGLF
CWMNLGPLLEEPTREGELSLWNLILHEVKLNISPGSSCHCSEAGWFRVCFANMSQQTLDI
ALRRIHAFMEKRKTTKGQAL
##########

So I could take this protein and rename it as such:

>DatePalm Putative 1-aminocyclopropane-1-carboxylate synthase 3 homolog
MGVEFGVLLSEIATSDAHGEDSPYFAGWKAYDEDPYDAVSNPSGVIQMGLAENQVSFYLL
ENYLEQHPEISNWESGISSFRENALFQDYHGLKTFRKAMASFMEQIRGGRAKFDPDRIVL
TAGATAANELLTFILADPGDALLIPIPYYPGFDRDLRWRTGVHIIPVHCNGSNGFQITVK
ALEEAYAEAGAANIRVRGLLLTNPSNPLGTAITRSVLEEILDFATQKDINLISDEIYSGS
VFSSAEFVSMAEIVEARGYENSDRVHIVYSLSKDLGLPGFRVGTIYSYNNKVVTTARRMS
SFTLVSSQTQKMLASMLSDREFTENYIKTNRESLRKRHEYITEGLKNAGIECLQGNAGLF
CWMNLGPLLEEPTREGELSLWNLILHEVKLNISPGSSCHCSEAGWFRVCFANMSQQTLDI
ALRRIHAFMEKRKTTKGQAL

I now have the Date Palm homolog of ACC Synthase 3, the gene sequence, its location in the assembly, its exon/intron boundary coordinates and its protein translation.

I'm now ready for some genotyping to understand differences in fruit ripening among the Date Palm varieties. I go to the file "PdactyKSNPs1.1.txt" to find the location of SNPs in this scaffold. I search for all occurrences of the name "PdactyK1.0Scaffold_1284864". This returns a few lines documenting all detected polymorphisms in that scaffold. I find that there was a SNP detected at nucleotide 2928.
Specifically, there is a 'T' reported in the consensus sequence at bp 2928 while some reads have a 'C' at this position. The quality of the SNP is high, with a good number of reads covering it, so I look into what effect this SNP may have on the coding sequence. I find the SNP is indeed in the middle of exon 4 of the putative Date Palm ACC Synthase 3 gene. Furthermore, I find that the 'T' to 'C' change occurs in the first nucleotide of codon 280. This causes the codon to change from 'TTT' to 'CTT', which leads to an amino acid change from Phenylalanine (F) to Leucine (L). We denote the change as F280L.

So we have found that for the 'Khalas' variety of Date Palm, there are two parental alleles in a gene responsible for fruit ripening. The two alleles result in the production of two slightly different protein sequences. Could this be the cause of the 'Khalas' variety's favoured fruit? I can now design primers to genotype this amino acid changing polymorphism in other varieties based on the gene sequence provided in this assembly. I'll leave that to you!

I think this case study demonstrates the utility of the sequence. In 30 minutes of work I was able to find a fruit ripening gene homolog, get its DNA and protein sequence in Date Palm, and find an amino acid changing polymorphism in it.

Enjoy!
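The codon arithmetic in the case study can be checked with a short script. The following is a minimal sketch, assuming the coding sequence is the simple concatenation of the four predicted exons on the + strand, using the exon coordinates from the FGENESH+ output and the standard genetic code. The names genomic_to_codon and describe_change are hypothetical helpers, not part of any released tool.

```python
# Exon coordinates (1-based, inclusive, + strand) from the FGENESH+ output.
EXONS = [(1715, 1876), (2022, 2150), (2283, 2443), (2543, 3412)]

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def genomic_to_codon(pos, exons):
    """Map a genomic position to (exon number, codon number, position in codon).

    Returns None if the position is intronic or outside the gene.
    """
    offset = 0  # number of coding bases in the exons before the current one
    for number, (start, end) in enumerate(exons, start=1):
        if start <= pos <= end:
            cds_index = offset + (pos - start)  # 0-based index in spliced CDS
            return number, cds_index // 3 + 1, cds_index % 3 + 1
        offset += end - start + 1
    return None

def describe_change(ref_codon, pos_in_codon, alt_base, codon_number):
    """Return the amino acid change in the usual notation, e.g. 'F280L'."""
    alt_codon = ref_codon[:pos_in_codon - 1] + alt_base + ref_codon[pos_in_codon:]
    return f"{CODON_TABLE[ref_codon]}{codon_number}{CODON_TABLE[alt_codon]}"

if __name__ == "__main__":
    exon, codon, frame = genomic_to_codon(2928, EXONS)
    print(exon, codon, frame)                      # 4 280 1
    print(describe_change("TTT", frame, "C", codon))  # F280L
```

The same helper can be pointed at any other SNP position reported for this scaffold to see whether it falls in an exon and which codon it touches.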