Lettuce/Sunflower COS (Conserved Ortholog Set) Candidates

A computational approach to selecting the set of ESTs with a single BLAST hit to the Arabidopsis genome

by Alexander Kozik and Richard Michelmore,
University of California at Davis

COS (Conserved Ortholog Set) Markers Overview

      To study synteny between different organisms, it is very important to select a set of conserved ortholog markers (genes). Because of multiple gene duplications, it is hard to distinguish orthologs from paralogs. The recently completed Arabidopsis genome sequence revealed about 3,000 single-copy genes distributed evenly over all five chromosomes. These unique genes are distributed over the genome regardless of regions of segmental duplication. Using Lettuce/Sunflower as well as Tomato and Corn ESTs, we have selected and characterized more than 2,000 Arabidopsis genes as candidates for COS (Conserved Ortholog Set) markers. Using a computational approach, we selected only those sequences that have just a single BLAST hit to the Arabidopsis genome.

Graphical representation of Conserved Ortholog Set (COS) marker candidates

Strategy and Tools to select COS markers

      We concentrated our efforts on identifying those Lettuce/Sunflower ESTs that have just a single strong BLAST hit to the Arabidopsis genome. Based on our detailed and critical analysis of previous work by Fulton TM, Van der Hoeven R, Eannetta NT and Tanksley SD to identify Tomato COS markers, we realized that a simple BLAST search of ESTs against the Arabidopsis genome cannot reveal the true set of COS with single hits to the target genome. See the details of our analysis here. The hidden problem is genomic sequences with a multidomain structure. Consider the situation depicted in Figure 1:

Figure 1

      "EST 1" has a single hit to "ORF A". "EST 2" has a single hit to "ORF B". However, they cannot be considered potential COS because "EST 3" has hits to both "ORF A" and "ORF B", which joins all of these sequences into one cluster with two Arabidopsis nodes. Clustering of the BLAST search results and graph analysis to eliminate cases like this are required for proper COS identification. We used the tcl_blast_parser_123.tcl script and the Graph9 program to identify EST-Arabidopsis clusters with a single Arabidopsis node (sequence). The results of the search and analysis are displayed in the CGP database as a COS Table and graphically.

      The general strategy to identify COS candidates was:
  • BLAST (tblastn) search of all predicted Arabidopsis ORFs against the whole EST set for each plant (NCBI EST database).
  • Selection of the subset of ESTs with the best match to each translated Arabidopsis ORF.
  • BLAST (blastx) search of the selected ESTs against the Arabidopsis ORFs.
  • Parsing the results of the previous search with the tcl_blast_parser_123.tcl script and selecting ESTs with a single hit to Arabidopsis.
  • Clustering analysis with the Graph9 program, removing from the potential COS set all EST-Arabidopsis clusters with multiple Arabidopsis nodes.
  • The final set of clusters, in which the Arabidopsis gene is represented as a single node, can be considered a true Conserved Ortholog Set (COS).
  • BLAST (tblastx) search of individual COS ESTs against the EST assembly to select those EST sequences that are unique within the given assembly.
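
      The parsing step in the list above (selecting ESTs with a single hit to Arabidopsis) can be illustrated with a minimal Python sketch. This is not the tcl_blast_parser_123.tcl script, whose input format is not described here. The sketch assumes BLAST results in the common 12-column tab-delimited layout (query in column 1, subject in column 2, e-value in column 11) and uses 1e-20 as an assumed cutoff. The function name, the sample identifiers and the sample values are hypothetical.

```python
import csv
import io
from collections import defaultdict


def single_hit_ests(tabular_text, max_evalue=1e-20):
    """Return {est_id: orf_id} for ESTs that hit exactly one ORF.

    tabular_text: BLAST results as 12-column tab-delimited text
                  (assumed layout; e-value is the 11th column).
    max_evalue:   hits with a larger e-value are ignored (assumed cutoff).
    """
    subjects = defaultdict(set)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue <= max_evalue:
            subjects[query].add(subject)
    # Several alignments to the same ORF still count as a single hit.
    return {
        est: next(iter(orfs))
        for est, orfs in subjects.items()
        if len(orfs) == 1
    }


# Hypothetical sample rows: est_1 hits one ORF twice, est_2 hits two
# different ORFs, est_3 has only a weak hit.
example = "\n".join([
    "est_1\tORF_A\t85.0\t120\t18\t0\t1\t360\t10\t129\t1e-45\t180",
    "est_1\tORF_A\t70.0\t60\t18\t0\t400\t580\t140\t199\t1e-22\t100",
    "est_2\tORF_B\t80.0\t110\t22\t0\t1\t330\t5\t114\t1e-40\t160",
    "est_2\tORF_C\t75.0\t100\t25\t0\t1\t300\t8\t107\t1e-30\t130",
    "est_3\tORF_D\t60.0\t50\t20\t0\t1\t150\t1\t50\t1e-5\t45",
])

print(single_hit_ests(example))
```

      With these sample rows only est_1 is kept: est_2 hits two different ORFs, and est_3 has no hit that passes the cutoff. Counting two alignments of est_1 to the same ORF as one hit is a design choice of this sketch.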

Figure 2 - Pipeline to find clusters with a single Arabidopsis node.
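
      The clustering idea behind this pipeline can be sketched as a connected-components search on a graph whose nodes are ESTs and Arabidopsis ORFs and whose edges are BLAST hits. The Python code below is only an illustration under that assumption; it is not the Graph9 program, and the function names and the extra "EST 4"/"ORF C" pair are hypothetical. The first four hits reproduce the situation of Figure 1.

```python
from collections import defaultdict


def cluster_hits(hits):
    """Group ESTs and ORFs into connected components of the hit graph.

    hits: iterable of (est_id, orf_id) pairs, one per BLAST hit.
    Returns a list of (est_set, orf_set) tuples, one per cluster.
    """
    # Build an undirected bipartite graph. Nodes are tagged with their
    # kind so an EST and an ORF sharing a name cannot be confused.
    graph = defaultdict(set)
    for est, orf in hits:
        graph[("est", est)].add(("orf", orf))
        graph[("orf", orf)].add(("est", est))

    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        # Iterative depth-first search collects one connected component.
        stack = [start]
        seen.add(start)
        ests, orfs = set(), set()
        while stack:
            node = stack.pop()
            kind, name = node
            (ests if kind == "est" else orfs).add(name)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append((ests, orfs))
    return clusters


def single_orf_clusters(hits):
    """Keep only clusters that contain exactly one Arabidopsis ORF."""
    return [(ests, orfs) for ests, orfs in cluster_hits(hits) if len(orfs) == 1]


# Hits of Figure 1, plus one unrelated, hypothetical EST/ORF pair.
figure1_hits = [
    ("EST 1", "ORF A"),
    ("EST 2", "ORF B"),
    ("EST 3", "ORF A"),
    ("EST 3", "ORF B"),
    ("EST 4", "ORF C"),
]

for ests, orfs in single_orf_clusters(figure1_hits):
    print(sorted(ests), sorted(orfs))
```

      Here "EST 3" links "ORF A" and "ORF B" into one cluster with two Arabidopsis nodes, so that cluster is dropped and only the "EST 4"/"ORF C" cluster remains.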

      So far, we have identified 1130 potential COS markers for Lettuce, 426 for Sunflower, 1860 for Tomato and 1413 for Corn, corresponding to a total of 2185 Arabidopsis sequences. These numbers correspond to EST sequences with a BLAST expectation value of 1e-20 or better, taken from the COS Table at CGPDB.

email: Alexander Kozik

Last modified December 02, 2002