COS (Conserved Ortholog Set) Markers Overview
To study synteny between different organisms, it is very important
to select a set of conserved ortholog markers (genes). Because of multiple gene duplications, it is
hard to distinguish orthologs from paralogs. The recently completed sequence of the Arabidopsis genome revealed
about 3,000 single-copy genes distributed evenly over all five chromosomes. These unique genes are distributed
over the genome regardless of segmentally duplicated regions. Using Lettuce/Sunflower as well as Tomato and
Corn ESTs, we have selected and characterized more than 2,000 Arabidopsis genes as candidates for COS
(Conserved Ortholog Set) markers. Using a computational approach,
we selected only those sequences that have just a single BLAST hit to the Arabidopsis genome.
Graphical representation
of Conserved Ortholog Set (COS) marker candidates
Strategy and Tools to select COS markers
We concentrated our efforts on identifying those Lettuce/Sunflower ESTs that
have just a single strong BLAST hit to the Arabidopsis genome. Based on our detailed and critical analysis of previous work by
Fulton TM, Van der Hoeven R, Eannetta NT and Tanksley SD
to identify Tomato COS markers, we realized that a simple BLAST search of ESTs against the Arabidopsis genome cannot reveal
the true set of COS with single hits to the target genome. See the details of our analysis here.
The hidden problem is genomic sequences with a multidomain structure. Consider the situation depicted in Figure 1:
Figure 1
"EST 1" has a single hit to "ORF A". "EST 2" has a single hit to "ORF B". However, they cannot be considered
potential COS because "EST 3" has hits to both "ORF A" and "ORF B". Clustering of the BLAST search results and
graph analysis are required to eliminate cases like this and to identify COS properly. We used the
tcl_blast_parser_123.tcl script and the
Graph9 program to identify EST-Arabidopsis clusters
with a single Arabidopsis node (sequence). The results of the search and analysis are displayed in the
CGP database as a
COS Table and graphically.
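The Figure 1 case can be illustrated with a minimal Python sketch. It is not the Graph9 program or the tcl_blast_parser_123.tcl script, whose internals are not described here; it only assumes that BLAST results have already been reduced to (EST, ORF) hit pairs. The function name single_orf_clusters and the extra pair "EST 4"/"ORF C" are hypothetical and added for illustration.

```python
from collections import defaultdict


def single_orf_clusters(hits):
    """Group (EST, ORF) hit pairs into connected clusters and
    keep only the clusters that contain exactly one ORF node."""
    # Undirected bipartite graph; nodes are tagged so that an EST
    # and an ORF with the same name can never be confused.
    graph = defaultdict(set)
    for est, orf in hits:
        graph[("EST", est)].add(("ORF", orf))
        graph[("ORF", orf)].add(("EST", est))

    seen = set()
    kept = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        orfs = sorted(name for kind, name in component if kind == "ORF")
        ests = sorted(name for kind, name in component if kind == "EST")
        if len(orfs) == 1:
            kept.append((orfs[0], ests))
    return sorted(kept)


hits = [
    ("EST 1", "ORF A"),
    ("EST 2", "ORF B"),
    ("EST 3", "ORF A"),
    ("EST 3", "ORF B"),
    ("EST 4", "ORF C"),  # hypothetical extra pair, not part of Figure 1
]
print(single_orf_clusters(hits))  # [('ORF C', ['EST 4'])]
```

Because "EST 3" links "ORF A" and "ORF B" into one cluster with two ORF nodes, "EST 1" and "EST 2" are rejected even though each has only a single hit of its own.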
The general strategy to identify COS candidates was:
- BLAST (tblastn) search of all predicted Arabidopsis ORFs against the whole EST set for each plant (NCBI EST database).
- Selection of the subset of ESTs with the best match for every translated Arabidopsis ORF.
- BLAST (blastx) search of the selected ESTs against Arabidopsis ORFs.
- Parsing the results of the previous search with the tcl_blast_parser_123.tcl script and selecting ESTs with a single hit to Arabidopsis.
- Clustering analysis using the Graph9 program and removal from the potential COS set of all EST-Arabidopsis clusters with
multiple Arabidopsis nodes.
- The final set of clusters, in which the Arabidopsis gene is represented as a single node, can be considered a true
Conserved Ortholog Set (COS).
- BLAST (tblastx) search of individual COS ESTs against the EST assembly to select those EST sequences that are unique within the given assembly.
Figure 2 - Pipeline to find clusters with a single Arabidopsis node.
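The single-hit selection step can be sketched in Python as follows. This is not the tcl_blast_parser_123.tcl script. It assumes the blastx results are available in the 12-column tab-separated layout that BLAST+ writes with -outfmt 6 (query ID in column 1, subject ID in column 2, E-value in column 11), which may differ from the output format actually parsed. The helper row, the function single_hit_queries and all sequence IDs are hypothetical; the 1e-20 cutoff is the expectation value quoted for the COS Table.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-20  # expectation value quoted for the COS Table


def single_hit_queries(tabular_lines, cutoff=EVALUE_CUTOFF):
    """Return {query: subject} for queries whose hits at or below
    the E-value cutoff all point to one and the same subject."""
    subjects = defaultdict(set)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= cutoff:
            subjects[query].add(subject)
    return {q: next(iter(s)) for q, s in subjects.items() if len(s) == 1}


def row(query, subject, evalue):
    # Hypothetical helper: only columns 1, 2 and 11 matter here,
    # the remaining columns hold placeholder values.
    return "\t".join([query, subject, "90.0", "100", "10", "0",
                      "1", "300", "1", "100", evalue, "200"])


sample = [
    row("est1", "orfA", "1e-50"),
    row("est2", "orfA", "1e-40"),
    row("est2", "orfB", "1e-30"),  # second subject: est2 is dropped
    row("est3", "orfC", "1e-5"),   # too weak: est3 is dropped
    row("est4", "orfC", "1e-25"),
    row("est4", "orfC", "1e-22"),  # second alignment, same subject
]
print(single_hit_queries(sample))  # {'est1': 'orfA', 'est4': 'orfC'}
```

Several alignments to the same ORF still count as a single hit in this sketch, while hits to two different ORFs disqualify the EST. ESTs that pass this filter would still need the clustering step before being treated as COS candidates.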
So far, we have identified 1130 potential COS markers for Lettuce, 426 for Sunflower,
1860 for Tomato and 1413 for Corn, with a total of 2185 Arabidopsis sequences. These numbers correspond to EST sequences
with a BLAST expectation value of 1e-20 or better from:
COS Table at CGPDB.