|
Exceptions and alternative solutions: What if your original dataset is not available in GenBank format and you have, for example, just a plain FASTA file? No problem: you can modify the FASTA headers and add species-specific prefixes with a Perl find/replace regular expression. From the command line, run:
$ perl -p -i -e 's/^\>/\>Cich_endi./' my_est_file.fasta
This replaces the '>' sign at the start of every FASTA header line with the string '>Cich_endi.'. The -i switch edits the file in place, so make a backup copy first. After that, your modified FASTA file is ready for the next steps of this pipeline. You can find more about this Perl trick with a Google search: http://www.google.com/search?q=perl+find+replace+in+line
Example on a real file:
$ cp my_est_file.fasta my_est_file.fasta.back
$ perl -p -i -e 's/^\>/\>Cich_endi./' my_est_file.fasta
Then compare my_est_file.fasta.back (before modification) with my_est_file.fasta (with modified FASTA headers). |
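If Perl is not at hand, the same header rewrite can be sketched in Python. This is only an illustrative alternative, not part of the original pipeline: the function names `add_species_prefix` and `prefix_fasta_file` are hypothetical, and the output file name in the usage note is an assumption. Unlike the in-place one-liner, this sketch writes to a separate file and skips headers that already carry the prefix, so running it twice does not double the prefix.

```python
def add_species_prefix(lines, prefix="Cich_endi."):
    """Yield FASTA lines with `prefix` inserted right after '>' in each header.

    Headers that already start with '>' + prefix are left unchanged.
    """
    marker = ">" + prefix
    for line in lines:
        if line.startswith(">") and not line.startswith(marker):
            yield marker + line[1:]
        else:
            # Sequence lines and already-prefixed headers pass through as-is.
            yield line


def prefix_fasta_file(src_path, dst_path, prefix="Cich_endi."):
    """Copy src_path to dst_path, adding the species prefix to every header."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(add_species_prefix(src, prefix))
```

For example, `prefix_fasta_file("my_est_file.fasta", "my_est_file.prefixed.fasta")` would leave the original file untouched and write the modified headers to the second (hypothetical) file.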
|
Exceptions and alternative solutions: What if you don't want to extract CDS regions and prefer to work with full-size DNA fragments instead? As usual, no problem. Skip PART 2 and go directly from PART 1 to the CAP3 assembly. CDS extraction is not required for CAP3 assembly or SNP discovery, although the resulting assembly will be different; both methods have advantages and disadvantages. However, if you decide to skip PART 2 (CDS extraction), you need to check your EST dataset very carefully for remaining vector contamination and poly-A tails. If the EST sequences are not clean enough, an additional masking and trimming step is required. In other words, you can start the CAP3 assembly as soon as your dataset is free of vector and poly-A sequence and you have assigned species-specific prefixes to all EST IDs. |
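A quick way to see whether poly-A tails remain is to scan the FASTA file for sequences that end in a long run of A. The sketch below is a rough check under stated assumptions, not a replacement for a proper masking and trimming step: the threshold of 15 bases is an arbitrary choice, a leading run of T is also flagged on the assumption that some ESTs were read from the 3' end, and vector contamination is not checked at all. The function names are hypothetical.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_poly_a_tail(seq, min_run=15):
    """True if seq ends with >= min_run A's or starts with >= min_run T's.

    min_run=15 is an assumed threshold; tune it for your data.
    """
    s = seq.upper()
    return bool(re.search(r"A{%d,}$" % min_run, s)
                or re.search(r"^T{%d,}" % min_run, s))


def flag_poly_a(lines, min_run=15):
    """Return the headers of all records that still look poly-A tailed."""
    return [h for h, s in read_fasta(lines) if has_poly_a_tail(s, min_run)]
```

Calling `flag_poly_a(open("my_est_file.fasta"))` returns the headers worth inspecting; an empty list only means that no tail of the assumed length was found.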
|
Exceptions and alternative solutions: What if your dataset contains too many sequences, 200,000 ESTs or more? CAP3 is unlikely to handle more than 200,000 individual sequences. There are several possible ways to reduce the number of ESTs per assembly:
1. Create a non-redundant set of ESTs (many highly expressed ESTs are represented by multiple identical copies).
2. Select only ESTs derived from particular libraries or submissions.
3. Select a random subset of the large set. A dirty trick: take the ESTs with odd GenBank IDs as one set and those with even IDs as another.
4. Divide the large EST set into smaller non-overlapping subsets based on a BLAST search against a reference protein database, e.g. all kinases go into one group, all cytochromes into another, and so on.
5. BLAST (blastn) the EST sequences of one genotype/species against another, and select only those sequences that overlap between genotypes/species.
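Options 1 and 3 above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the pipeline's own tooling: "non-redundant" is taken to mean exact, case-insensitive sequence identity; the odd/even split assumes the first word of each header ends in a numeric accession (optionally followed by a version suffix such as `.1`); and the function names are hypothetical. Records are plain `(header, sequence)` tuples.

```python
import re


def deduplicate(records):
    """Yield the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield header, seq


def split_odd_even(records):
    """Split records into (odd, even, unnumbered) lists by accession number.

    The number is the last digit run of the ID, ignoring a trailing
    version suffix like '.1' (an assumption about the ID layout).
    """
    odd, even, unnumbered = [], [], []
    for header, seq in records:
        words = header.split()
        seq_id = words[0] if words else ""
        match = re.search(r"(\d+)(?:\.\d+)?$", seq_id)
        if match is None:
            unnumbered.append((header, seq))
        elif int(match.group(1)) % 2:
            odd.append((header, seq))
        else:
            even.append((header, seq))
    return odd, even, unnumbered
```

Exact-duplicate removal does not collapse near-identical ESTs, so it will usually shrink the set less than a clustering step would; the odd/even split gives two subsets of roughly equal size only if accession numbers are spread evenly.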