Codon Usage Pipeline based on BLASTX search

Alexander Kozik and Richard Michelmore
University of California at Davis

This codon usage pipeline is based on analysis of results of ungapped BLASTX search. The pipeline is a set of Python/Tcl-Tk scripts which extracts open reading frames (ORFs) from BLAST alignments generated by BLASTX search and counts codon usage for query sequences as well as for subject sequences and generates report in the form of tab-delimited files. This approach is useful when we want to find and extract putative ORFs from a set of ESTs (query) by comparing (aligning) it with a set of sequences with known ORFs (subject in BLAST terminology). Dataflow of this pipeline is shown below:

Codon Usage Pipeline


Input files:

    1. "DNA" file which corresponds to open reading frames of the subject sequences
    2. "Protein" file which derived by translation of "DNA" file
    For example, Arabidopsis ORFs and their translated sequences were derived from NCBI GenBank files using Python GenBank parser:
    ath_NCBI_cds.fasta.nr.info (Arabidopsis CDS or "DNA" file)
    ath_NCBI_genes.fasta.nr.info (corresponding protein sequences or "protein" file)
    (Note, instead of Arabidopsis files user can use any other dataset as long as "protein" file is derived from "DNA" file)

    3. cos_soybean.fasta (example set of soybean ESTs or "Query" file for which we want to find ORFs)

Programs and scripts:

    1. NCBI Blast program (you need to download and install BLAST program from NCBI ftp site)
    2. tcl_blast_parser_123_V025.tcl (read more at: http://www.atgc.org/BlastParser/Blast_Parser_017.html)

    3. SeqsExtractorFromBlastX_024_V07d.py

Pipeline usage:

    1. run BLASTX search of cos_soybean.fasta (ESTs query set) against Arabidopsis protein database:
    > blastall -p blastx -F "m S" -g F -d ./ath_NCBI_genes.fasta.nr.info
    -i ./cos_soybean.fasta -o x-cos_soybean_vs_ath_24hits_no_gaps.out
    -e 1e-10 -v 24 -b 24

    (note options -F "m S" [soft filtering] and -g F [ungapped BLAST])
    after some time (1-2 hours on 1 GHz CPU)  x-cos_soybean_vs_ath_24hits_no_gaps.out will be generated

    2. run tcl_blast_parser on BLASTX output:
    > tclsh tcl_blast_parser_123_V025.tcl x-cos_soybean_vs_ath_24hits_no_gaps.out
    x-cos_soybean_vs_ath_24hits_no_gaps.out 20 40 100 MATRIX


    after parsing several files will be generated (see details at
http://www.atgc.org/BlastParser/Blast_Parser_017.html).
    We are interested in x-cos_soybean_vs_ath_24hits_no_gaps.out.info2 file only which contains following information:
    SeqsExtractorFromBlastX will use information from "*.info2" file to extract sub-sequences from query and subject

    3. run SeqsExtractorFromBlastX script as described below:
   
> python SeqsExtractorFromBlastX_024_V07d.py cos_soybean.fasta ath_NCBI_cds.fasta.nr.info
    x-cos_soybean_vs_ath_24hits_no_gaps.out.info2 x-cos_soybean.codons x-cos_soybean.align


   
SeqsExtractorFromBlastX_024_V07d.py script takes five arguments as input:

   
SeqsExtractorFromBlastX script will generate a new directory with DNA sequence alignments in FASTA format:
        x-cos_soybean.align
    Each alignment is extracted according to first best hit of BLASTX report and written into separate file. Also, each alignment corresponds to open reading frames only and does not contain stop codons within query or subject sequences. However alignments may contain unambiguous letters (N or X, for example).
    You can download compressed example output here: x-cos_soybean.align.tar.gz

    Other output files of the
SeqsExtractorFromBlastX script:
    Detailed output explanation:

    Column description of "*.triplets" and "*.x_triplets" files:
    (Note, that the total number of all analyzed triplets of query and the total number of all analyzed triplets of subject may be different in the file "*.triplets" and must be the same in "*.x_triplets")

    Column description of "*x_kska" file:
    (Note, that the sum of values in columns 3 and 7 must be equal to value of column 8 ([synonymous substitutions] plus [perfect codon matches] = [amino acid matches])



    Additional scripts to determine GC% content in a set of sequences:

    Finally, we can calculate GC% content for our extracted EST query ORFs subset of sequences using seqs_processor_017_u.py script:

  > python seqs_processor_017_u.py x-cos_soybean.codons.query_seq x-cos_soybean.codons.query_seq.proc DNA

    Script takes three arguments ([input], [output], ["DNA" option]) and generates three output files:
    We are interested in x-cos_soybean.codons.query_seq.proc.stat  file which contains information about GC% content for every sequence as well as summary (last line)

    (Note,
seqs_processor_017_u.py script will terminate if gene ID duplication takes place in the input file. seqs_processor_017_d.py should work with duplicated IDs)

    py_stat_graph_007.py script can process x-cos_soybean.codons.query_seq.proc.stat file and generate MS excel ready output to show distribution of DNA sequences proportional to their GC% content:

   
> python py_stat_graph_007.py x-cos_soybean.codons.query_seq.proc.stat
    x-cos_soybean.codons.query_seq.proc.stat.graph 1 100 8


    py_stat_graph_007.py takes five arguments ( [input] [output] [min_value] [max_value] [column_number] ), analyzes defined column of input file and generates statistical information: x-cos_soybean.codons.query_seq.proc.stat.graph.



click here to find out how to obtain multiple alignments of orthologs by combining two subsets generated by pipeline described above.



    email: akozik@atgc.org
    last modified: November 18 2003