CDS assembly and SNP discovery -
example for Cichorium intybus and Cichorium endivia ESTs
( by Alexander Kozik, August 2007 )

This web page describe a protocol, or step by step instructions, how to identify SNPs and InDels (insertion-deletions) in a set of ESTs of two different species or genotypes. Cichorium intybus and Cichorium endivia ESTs were chosen as an example.

The protocol, Python SDP (Python SNP Discovery Pipeline) has five major parts:

PART 1 - ESTs Download from NCBI GenBank Database
PART 2 - Selection of ESTs homologous to known proteins and CDS region extraction
PART 3 - CAP3 assembly
PART 4 - SNP and InDel discovery
PART 5 - Visualization with Python Contig Viewer

PART 1 - ESTs Download from NCBI GenBank Database
( here we will download ESTs for two species in GenBank format and convert them into FASTA format )

use NCBI GenBank Taxonomy Browser to search for species name Cichorium intybus

follow web link to access Nucleotide sequences

follow web link to select EST sequences

change 'Summary' to 'GenBank' format

select 'Send to File'

save file with default name '' on local hard disk
( do not forget to re-name file '' to '' after download !
it's a good idea to keep timestamp like 2007_08_10 on the end of filename )

repeat search and download Cichorium endivia ESTs

transfer files using sftp to Linux/Unix machine and check their content using less or more

so far, we should have two files with names:

using Unix grep and wc -l check each file for number of sequences:

$ grep "ORIGIN" | wc -l

$ grep "ORIGIN" | wc -l

then using Tcl script GenBank2Fasta_UniExtractor_126.tcl we can transform sequences from GenBank to FASTA format:

$ tclsh GenBank2Fasta_UniExtractor_126.tcl
Enter the SOURCE (GenBank) file name:
Enter the DESTINATION (Fasta) file name: Cich_inty_ESTs_2007_08_10.fasta

after you press Enter for destination file name, the script should run for some time and generate FASTA file:


repeat FASTA extraction from GenBank for and get:


use less or more to check file content and FASTA headers (below are examples of FASTA headers with species-specific prefixes:

>Cich_inty.EH710692 ( EH710692.1 GI:124612347 ) ...

>Cich_endi.EL372564 ( EL372564.1 GI:125358052 ) ...

Note, that each GenBank ID should have species specific nine-letter prefix in generated FASTA files. These prefixes will be used to distinguish different species in the downstream analysis.

Exceptions and alternative solutions:
What if your original dataset is not available in GenBank format? You have, for example, just plain FASTA file. No problem, you can easily modify FASTA header and add species-specific prefixes using perl /find/replace/ regular expressions. If you execute from command line:

$ perl -p -i -e 's/^\>/\>Cich_endi./' my_est_file.fasta

this simple operation will do the job. It will replace all '>' signs in fasta header with '>Cich_endi.' string. After that your modified FASTA file is ready for next steps of this pipeline.

you can find more about this wonderful perl trick doing Google search:

example manipulations on real file:

$ cp my_est_file.fasta my_est_file.fasta.back
$ perl -p -i -e 's/^\>/\>Cich_endi./' my_est_file.fasta

check my_est_file.fasta.back - before modification and my_est_file.fasta - with modified FASTA header.

From here we are ready to BLAST (blastx) our EST sequences against protein reference set, Arabidopsis genome, in our case.
Time to move to the PART 2:

PART 2 - BLAST (blastx) versus known proteins and extraction of CDS (coding) regions

first, we need to format A_thaliana_ATGC_2006_08_24.protein.fasta FASTA file with Arabidopsis predicted protein sequences using formatdb BLAST command:

$ formatdb -i A_thaliana_ATGC_2006_08_24.protein.fasta -p T -o T

then we will run BLAST search (blastx) of ESTs versus Arabidopsis proteins (two times, one for Cich_inty_ESTs and other for Cich_endi_ESTs):

$ blastall -p blastx -e 1e-20 -F F -b 3 -v 3 -d A_thaliana_ATGC_2006_08_24.protein.fasta -i Cich_inty_ESTs_2007_08_10.fasta -o Cich_inty_ESTs_2007_08_10.vs.Arabidopsis.BlastX &

$ blastall -p blastx -e 1e-20 -F F -b 3 -v 3 -d A_thaliana_ATGC_2006_08_24.protein.fasta -i Cich_endi_ESTs_2007_08_10.fasta -o Cich_endi_ESTs_2007_08_10.vs.Arabidopsis.BlastX &

where by options:
-e 1e-20 we want to get BLAST hits with expectation 1e-20 or better
-F F we do not want to use low-complexity filter
-b 3 -v 3 we restrict search to first three best hits only

This particular BLAST (blastx) will run for day or so. Take the rest and practice Google search.

When BLAST search is done (you can use top command to check BLAST status) run tcl_blast_parser_123_V041.tcl on BLAST output:

$ tclsh tcl_blast_parser_123_V041.tcl Cich_inty_ESTs_2007_08_10.vs.Arabidopsis.BlastX Cich_inty_ESTs_2007_08_10.vs.Arabidopsis.BlastX 20 40 100 MATRIX

$ tclsh tcl_blast_parser_123_V041.tcl Cich_endi_ESTs_2007_08_10.vs.Arabidopsis.BlastX Cich_endi_ESTs_2007_08_10.vs.Arabidopsis.BlastX 20 40 100 MATRIX

tcl_blast_parser_123_V041.tcl will generate several files (see details here). We are interested in:

these so called *.info2 files contain coordinates of BLAST alignments. These coordinates will be used to extract CDS region from ESTs.

Then we will run script to extract CDS (coding) region from original FASTA file based on parsed results of BLAST search:

$ python Cich_inty_ESTs_2007_08_10.fasta Cich_inty_ESTs_2007_08_10.vs.Arabidopsis.BlastX.info2 Cich_inty_ESTs_2007_08_10.x.CDS

$ python Cich_endi_ESTs_2007_08_10.fasta Cich_endi_ESTs_2007_08_10.vs.Arabidopsis.BlastX.info2 Cich_endi_ESTs_2007_08_10.x.CDS output files (in abc-order for Cich_inty_ESTs_2007_08_10 set):

Cich_inty_ESTs_2007_08_10.x.CDS.x_no_hits_found - subset of sequences without BLAST hits to the Reference Dataset (Arabidopsis proteins)
Cich_inty_ESTs_2007_08_10.x.CDS.x_seq3 - 3' region out from Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_FR_Region ( EST/DNA region adjacent to the right side of BLAST alignment, if available )
Cich_inty_ESTs_2007_08_10.x.CDS.x_seq5 - 5' region out from Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_FR_Region ( EST/DNA region adjacent to the left side of BLAST alignment, if available )
Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_FR_Region - extracted CDS regions based on BLAST alignments (it's so called 'clean' set, there is no any 'N' letters, only 'ATGC' set is allowed); this file (subset) will be used for the following CAP3 assembly and SNP discovery
Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_UP_All - regions corresponding to BLAST alignments are in uppercase; 'N' letters may occur within these regions
Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_UP_Clean - regions corresponding to BLAST alignments are in uppercase; 'N' letters are not allowed; uppercase regions of this subset correspond to Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_FR_Region file (number of sequences in both files should be the same); lowercase should correspond to sub-sequences in 3' and 5' files (see *.x_seq3 and *.x_seq5 files above)
Cich_inty_ESTs_2007_08_10.x.CDS.x_seqSummary - summary (info) about extracted regions; it contains numbers (coordinates) extracted from BLAST report

scheme of CDS extraction:

Copy *.x_seqCDS_FR_Region files under new names before you start CAP3 assembly. These files will be modified and we want to preserve original copies.

$ cp -i Cich_inty_ESTs_2007_08_10.x.CDS.x_seqCDS_FR_Region Cich_inty_ESTs_2007_08_10.CDS.fasta

$ cp -i Cich_endi_ESTs_2007_08_10.x.CDS.x_seqCDS_FR_Region Cich_endi_ESTs_2007_08_10.CDS.fasta

Exceptions and alternative solutions:
What if you don't want to extract CDS regions? Instead, you prefer to work with full-size DNA fragments. As usually, no problem. Skip PART 2, and jump directly to the CAP3 assembly after PART 1. CDS extraction is not required for CAP3 assembly and SNP discovery. Your assembly will be different. There are some advantages and disadvantages for both methods. However, if you decide to skip PART 2, CDS extraction, you need to check very carefully your EST dataset for remaining vector contaminations and poly-A tails. If EST sequences are not clean enough, additional step, masking and trimming, is required.

In other words, you can start CAP3 assembly as soon your dataset is clean from vector and poly-A, and you assigned species-specific prefixes for all EST IDs.

PART 3 - CAP3 assembly

It's a good idea to create new folder/directory _data_cap3_, for example, and place Cich_inty_ESTs_2007_08_10.CDS.fasta and Cich_endi_ESTs_2007_08_10.CDS.fasta files there.

CAP3 does not like sequence IDs longer than 20 characters. Our IDs look like Cich_inty.EH710688.CDS (22 characters). We want to make them shorter Cich_inty.EH710688.c, for example. We can use perl /find/replace/ regular expressions (see more about this perl trick here):

$ perl -p -i -e 's/\.CDS /\.c /' Cich_inty_ESTs_2007_08_10.CDS.fasta

$ perl -p -i -e 's/\.CDS /\.c /' Cich_endi_ESTs_2007_08_10.CDS.fasta

after this simple manipulation all '.CDS ' strings will be replaced with '.c ' and sequence IDs became shorter.

Then we need to combine (concatenate) two files Cich_endi_ESTs_2007_08_10.CDS.fasta and Cich_inty_ESTs_2007_08_10.CDS.fasta into one Cich_2spc_ESTs_2007_08_10.CDS.fasta

$ cat Cich_endi_ESTs_2007_08_10.CDS.fasta Cich_inty_ESTs_2007_08_10.CDS.fasta > Cich_2spc_ESTs_2007_08_10.CDS.fasta

this file Cich_2spc_ESTs_2007_08_10.CDS.fasta contains EST sequences for both species/genotypes and will be used as an input for CAP3 assembly:

$ cap3 Cich_2spc_ESTs_2007_08_10.CDS.fasta -o 80 -p 90 > Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90 &

CAP3 takes Cich_2spc_ESTs_2007_08_10.CDS.fasta as an input data and will try to assemble sequences in that file with options '-o 80' - overlap 80 nt, and '-p 90' - identity 90%;
Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90 is a primary output of this CAP3 run along with several other files. Other important files are:
Cich_2spc_ESTs_2007_08_10.CDS.fasta.cap.contigs and

It will take several hours to finish this particular CAP3 run (use Unix top commant to monitor the progress).
After CAP3 completion, we can combine (concatenate) contigs and singlets together into one assembly file:

$ cat Cich_2spc_ESTs_2007_08_10.CDS.fasta.cap.contigs Cich_2spc_ESTs_2007_08_10.CDS.fasta.cap.singlets > Cich_2spc_ESTs_2007_08_10.CDS.assembly

All contigs in the assembly file have generic ID pattern 'Contig'; we want to change it to project specific pattern 'CICH_2CDS.CSA1.', for example:

$ perl -p -i -e 's/\>Contig/\>CICH_2CDS.CSA1./' Cich_2spc_ESTs_2007_08_10.CDS.assembly

will do the job, our file with unigenes:
Cich_2spc_ESTs_2007_08_10.CDS.assembly has project specific contig IDs

we can use grep to get basic info about assembly:

$ grep "CICH_2CDS.CSA1" Cich_2spc_ESTs_2007_08_10.CDS.assembly | wc -l
9131 - number of contigs

$ grep "Cich_[e|i]" Cich_2spc_ESTs_2007_08_10.CDS.assembly | wc -l
9829 - total number of singlets

$ grep "Cich_e" Cich_2spc_ESTs_2007_08_10.CDS.assembly | wc -l
4607 - number of Cich_endi singlets

$ grep "Cich_i" Cich_2spc_ESTs_2007_08_10.CDS.assembly | wc -l
5222 - number of Cich_inty singlets

$ grep ">" Cich_2spc_ESTs_2007_08_10.CDS.assembly | wc -l
18960 - number of unigenes (contigs and singlets)

Exceptions and alternative solutions:
What if your dataset is represented by too many sequences, 200,000 ESTs or more? Unlikely, CAP3 could handle more than 200,000 individual sequences. There are several possible solutions to reduce the number of ESTs per assembly:

1. Create non-redundant set of ESTs (many highly expressed ESTs represented by multiple identical copies).

2. Select ESTs derived only from particular libraries or submissions.

3. Select randomly chosen subset from large set. Dirty trick - you can select ESTs with odd GenBank IDs as one set, and with even numbers as another.

4. Divide large EST set into smaller non-overlapping subsets based on BLAST search against Reference Protein database, e.g. all kinases may go into one group, all cytochromes into other and so on.

5. BLAST (blastn) EST sequences of one genotype/species against another. Select only these sequences that have common overlap between genotypes/species.

PART 4 - SNP and InDel discovery - Python script for post-processing of CAP3 output - Python script for SNP discovery


bash-2.03$ python
What type of output do you want? (1/2): 1
Enter the SOURCE file name: Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90
Enter the DESTINATION file name: Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.Info
Default contig file name prefix is Contig
Enter the contig file name prefix:CICH_2CDS.CSA1.


bash-2.03$ python
What type of output do you want? (1/2): 2
Enter the SOURCE file name: Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90
Enter the DESTINATION directory with alignments: Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.Align
Default contig file extension is aln
Enter the contig file extension :
Default contig file name prefix is Contig
Enter the contig file name prefix:CICH_2CDS.CSA1.


cd Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.Align
cat CICH_2CDS.CSA1.*.aln > ../Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.Align.txt
cd ../


Enter the SOURCE file name: Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.Align.txt
Enter the DESTINATION file name: Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.MM
Enter the CONSENSUS_ID pattern: CICH_2CDS.CSA1.

Cich_2spc_ESTs_2007_08_10.CDS.cap3.80.90.MM.mm_info.good - SNP info

PART 5 - Visualization with Python Contig Viewer - Python Contig Viewer

tcl_blast_parser_123_V025S.tcl - tcl BLAST parser V25S

Finally, here the list of all commands and operations in one plain text file:

00_CDS_Pipeline.HOWTO - list of all operations performed in the Python SNP discovery pipeline

last modified: November 07 2007