SNP/INDEL Discovery Pipeline based on CAP3 assembly
How to find a thousand polymorphic sites in an EST assembly in 24 steps
Alexander Kozik, Brian Chan and Richard Michelmore
Genome Center, UC Davis California
One of the aims of the Compositae Genome Project
is to generate PCR-based markers for genetic mapping of lettuce and sunflower.
The Compositae Genome Project database (CGPDB)
represents over 19,000 lettuce and 12,000 sunflower unigenes.
We have developed a custom pipeline (a set of scripts written in Python and Tcl/Tk)
to find SNPs (Single Nucleotide Polymorphisms) and INDELs (INsertions/DELetions) in EST contigs
assembled by the CAP3 program.
Using this pipeline, we have found more than 2,500 SNP/INDEL candidates
among 12,500 lettuce and sunflower contigs.
These candidates will be used to generate molecular markers.
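To make the idea of a SNP/INDEL candidate concrete, here is a minimal Python sketch of column-wise polymorphism detection in one aligned contig. It is not the pipeline's actual code: the function name `find_polymorphic_sites`, the `min_support` threshold, and the input layout (gapped reads of equal length, `-` for a gap, a space for uncovered positions) are all assumptions made for illustration. A column is reported when at least two alleles are each supported by `min_support` reads, which is one common way to separate real polymorphisms from single-read sequencing errors.

```python
from collections import Counter


def find_polymorphic_sites(aligned_reads, min_support=2):
    """Return a list of (position, kind, allele_counts) candidates.

    aligned_reads: dict mapping read name -> gapped sequence; all
    sequences must have the same length (hypothetical input layout).
    '-' is a gap inside a read, ' ' is a position the read does not
    cover, and 'N' is an ambiguous base; the last two are ignored.
    Positions are 0-based column indices in the alignment.
    """
    lengths = {len(seq) for seq in aligned_reads.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned reads must have the same length")

    candidates = []
    for pos, column in enumerate(zip(*aligned_reads.values())):
        bases = (base.upper() for base in column)
        counts = Counter(b for b in bases if b not in (" ", "N"))
        # Keep only alleles seen in at least min_support reads.
        supported = {a: n for a, n in counts.items() if n >= min_support}
        if len(supported) >= 2:
            kind = "INDEL" if "-" in supported else "SNP"
            candidates.append((pos, kind, supported))
    return candidates


# Toy contig with five reads (made-up data, not from the project).
reads = {
    "est1": "ACGTACGT",
    "est2": "ACGTACGT",
    "est3": "ACCTA-GT",
    "est4": "ACCTA-GT",
    "est5": "ACGTACGA",
}
sites = find_polymorphic_sites(reads)
for pos, kind, alleles in sites:
    print(pos, kind, alleles)
```

In this toy contig, columns 2 and 5 are reported because two alleles each appear in at least two reads, while the single-read difference in the last column of `est5` is ignored as a likely sequencing error.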
To check whether our pipeline is suitable for any EST dataset, we tested it on tomato ESTs that are publicly
available in the NCBI database. We were able to detect about 1,000 SNP/INDEL candidates out of 3,821 tomato
contigs for three genotypes: Lycopersicon esculentum, Lycopersicon hirsutum and
The following web pages describe detailed protocols for using our pipeline, with tomato ESTs as an example.
Note: this pipeline was designed in 2003. Since then, a slightly different approach
and improved scripts have been developed. You can check the current protocol of EST selection and SNP discovery