Python and Tcl/Tk scripts and tools to process and analyze DNA sequences and related data

GenBank2Fasta_UniExtractor_124.tcl - GenBank to Fasta file converter; besides extracting sequences, this parser extracts additional useful information from the GenBank file and places it into the Fasta header.
GenBank2Fasta_UniExtractor_126.tcl - current version, minor bug fixes.

seqs_processor_and_translator_bin_V124_AGCT.py - DNA sequence processor and translator; it does translation in 6 frames in batch mode. Brief description is here
seqs_processor_and_translator_bin_V126_AGCT.py - current version; it has a new function: splitting sequences into multiple FASTA files.
seqs_processor_and_translator_bin_V128_AGCT.py - the same as above with an additional option to create fake quality scores for a FASTA file.
seqs_processor_and_translator_bin_V136_AGCT.py - the same as above with an option to convert fasta alignments into CAP3 style.
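The six-frame translation mentioned above can be illustrated with a minimal sketch. It is not the code from seqs_processor_and_translator; it assumes the standard genetic code, and the function names (reverse_complement, translate, six_frame_translation) are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return a dict of frame label (+1..+3, -1..-3) to protein string."""
    frames = {}
    rc = reverse_complement(seq)
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(frame, protein)
```

Stop codons are shown as '*'; batch mode would simply apply six_frame_translation to every record of a FASTA file.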

tcl_blast_parser_123_V038.tcl - NCBI BLAST parser. Detailed description is here
tcl_blast_parser_123_V039.tcl - current version
tcl_blast_parser_123_V041.tcl - current version - to find common query overlap
tcl_blast_parser_123_V043_SS_beta03.tcl - beta version to extract subgroups in FASTA format for downstream assembly programs (CAP3, DiAlign, etc.)
tcl_blast_parser_123_V047.tcl - February 19, 2009 version - fixed a bug for long query lengths (10,000 or longer); derived from V041

SeqsExtractorFromBlastX_V124.py - Extraction of an ORF (open reading frame) from a BLAST-X report. BLAST EST sequences against a protein reference database and extract the EST fragments that correspond to the BLAST-X alignment.
SeqsExtractorFromBlastX_V126.py - current version (with no_hits counter).

SeqsExtractorFromTclBlast_V001.py - extraction of a sub-region from a BLAST report (blast-x) if the hit ID matches the query ID.

seqs_subgroup_extr_001.py - sequence subgroup extractor (1)
seqs_subgroup_extr_003.py - sequence subgroup extractor (3)
Both extract a subset of sequences from a FASTA file based on a gene ID list:
version (1) - full-size sequence extraction
version (3) - extraction of a defined fragment
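As an illustration of ID-based subset extraction, here is a minimal sketch covering both modes (full sequence and defined fragment). It is not the code of the seqs_subgroup_extr scripts. The function names, the dict-based ID list, and the 1-based inclusive coordinates are assumptions made for the example.

```python
def read_fasta(lines):
    """Yield (seq_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Assumption: the ID is the first word of the header line.
            seq_id, chunks = (header[0] if header else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def extract_subset(fasta_lines, wanted):
    """Yield selected records.

    wanted maps an ID to None (full sequence) or to a (start, end) pair,
    assumed here to be 1-based and inclusive.
    """
    for seq_id, seq in read_fasta(fasta_lines):
        if seq_id not in wanted:
            continue
        region = wanted[seq_id]
        if region is None:
            yield seq_id, seq
        else:
            start, end = region
            yield f"{seq_id}_{start}_{end}", seq[start - 1:end]


if __name__ == "__main__":
    fasta = [">g1 demo", "ACGT", "ACGT", ">g2", "TTTT", ">g3", "GGGGCCCC"]
    for name, seq in extract_subset(fasta, {"g1": None, "g3": (3, 6)}):
        print(f">{name}\n{seq}")
```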

seqs_drobilka_003_mod.py - sequence splitter into overlapping fragments.
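Splitting a sequence into overlapping fragments can be sketched as a sliding window. This is an illustration only; the fragment size, overlap, and the function name split_overlapping are assumptions, not parameters taken from seqs_drobilka.

```python
def split_overlapping(seq, size, overlap):
    """Split seq into fragments of length `size` sharing `overlap` bases.

    Returns a list of (start, fragment) pairs with 1-based start positions.
    The last fragment may be shorter than `size`.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    fragments = []
    for start in range(0, len(seq), step):
        fragments.append((start + 1, seq[start:start + size]))
        if start + size >= len(seq):
            break  # this window already reaches the end of the sequence
    return fragments


if __name__ == "__main__":
    for pos, frag in split_overlapping("ACGTACGTAC", size=4, overlap=2):
        print(pos, frag)
```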

py_stat_graph_012.py - summary statistics per column for tab-delimited files; useful for downstream analysis in MS Excel
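A per-column summary of a tab-delimited file might look like the sketch below. It is not the py_stat_graph code: it assumes the first row is a header, reports only columns that contain numeric values, and the choice of statistics (n, min, max, mean, median) is illustrative.

```python
import statistics


def column_summary(lines, delimiter="\t"):
    """Return {column name: stats dict} for columns with numeric values.

    Assumption: the first non-empty line is a header row. Cells that are
    missing or cannot be parsed as numbers are skipped.
    """
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    if not rows:
        return {}
    header, data = rows[0], rows[1:]
    summary = {}
    for idx, name in enumerate(header):
        values = []
        for row in data:
            try:
                values.append(float(row[idx]))
            except (IndexError, ValueError):
                continue
        if values:
            summary[name] = {
                "n": len(values),
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
            }
    return summary


if __name__ == "__main__":
    table = ["id\tlen\tscore\n", "a\t10\t0.5\n", "b\t20\t1.5\n", "c\t30\tNA\n"]
    for column, stats in column_summary(table).items():
        print(column, stats)
```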

seqs_trimmer_2007_03_20.py - EST sequence trimmer. It's weird, use it at your own risk.
seqs_trimmer_2007_05_18.py - current version

seqs_processor_ultra_polyA_V009.py - sequence masking based on BLAST-N search against Vector_M_PolyAAA.fasta vector database.
Vector_DB_NCBI_2007_10_29_CGP.polyA.fasta - recent vector db
It's weird too (masking), use it at your own risk.
seqs_processor_ultra_polyA_V009_m.py - to work with tcl_blast_parser version 041

redundancy_elimination_005.py - redundancy elimination for sequences in FASTA file by Travis Kleeburg. Read more here
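The criteria used by redundancy_elimination_005.py are not described here, so the following is only a sketch of the simplest possible variant: dropping exact duplicate sequences and keeping the first record seen. The function name and the case-insensitive comparison are assumptions.

```python
def remove_exact_duplicates(records):
    """Keep the first record for each distinct sequence.

    records: iterable of (seq_id, sequence) pairs.
    Assumption: two sequences are redundant only if they are identical
    after upper-casing; the real script may use a different rule.
    """
    seen = set()
    unique = []
    for seq_id, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((seq_id, seq))
    return unique


if __name__ == "__main__":
    demo = [("s1", "ACGT"), ("s2", "acgt"), ("s3", "ACGTT"), ("s4", "ACGT")]
    for seq_id, seq in remove_exact_duplicates(demo):
        print(f">{seq_id}\n{seq}")
```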

qsep_002M.py - quality scores extractor from Phred output and trimmed sequences

Scripts to process CAP3 alignments:
Python_CAP3_ContigExtractor_Uni_2007_03_19.py
Python_CAP3_MM_Finder_Uni_2007_03_19.py
Python_CAP3_MM_Finder_Uni_2007_03_24f.py - current experimental version
Python_CAP3_MM_Finder_Uni_2007_03_24h.py - current experimental version
Python_CAP3_contig_poly_DIS_Uni_2007_03_19.py
Python_CAP3_ClipInfoExtractor_Uni_2007_03_19.py
Detailed description is here
recent versions of MM finder in CAP3 assembly:
Python_CAP3_MM_Finder_Uni_2007_08_14c.py - generic version; more details here
Python_CAP3_MM_Finder_Uni_2007_09_01a.py - for a particular CGP project; you are unlikely to need it
Python_CAP3_MM_Finder_Uni_2008_01_26c.py - state-of-the-art display of sequence coverage per nucleotide and related SNP/InDel info

Scripts to run CAP3 in batch mode with pre-defined groups
CAP3_Anchored_Batch_Run_2007_08_31.py
CAP3_Anchored_Batch_Run_2007_09_14.py
Python_CAP3_ContigExtractor_Oct_25_2005.py - works only with the two scripts above; it is buggy

Manipulation of CAP3 derivative files:
getcontig.py - post-processing of so-called CAP3 Info file after Python_CAP3_ContigExtractor_Uni_2007_03_19.py script
countContig.py - estimation of CAP3 contig complexity based on CAP3 Info file after Python_CAP3_ContigExtractor_Uni_2007_03_19.py script
read more here

SequenceTrimmer.py - to trim low-quality region from CAP3 alignment
detailed description is here

cap3_alignment2tab_03.py - to generate 'sequence_gap' file from CAP3 alignment

Scripts for Genetic Maps
addDuplMarker.py - add duplicated markers to non-redundant map
Instructions are here

MadMapper - current versions:
Python_MadMapper_V248_RECBIT_012NR.py - clustering
Python_MadMapper_V248_RECBIT_016NR.py - clustering, February 28 2008 update (reduced memory usage, *.all_pairs file is optional)
Python_MadMapper_V248_XDELTA_117.py - map construction
Python_MadMapper_V248_XDELTA_119.py - map construction (current version; variable column ID with pairwise data)
py_matrix_2D_V248_RECBIT.py - map visualization
MadMapper details here

MadMapper clustering based on numerical data
Python_UniCluster_V011.py - really 'beta' ...
Python_UniCluster_V014.py - the latest 'beta'; it generates pairwise matrix with values that can be used for the fine ordering/sorting using Python_MadMapper_V248_XDELTA_119.py script

Scripts to manipulate tab-delimited tables
tableRotation_2007_03_21.py
tableSort_2007_03_21.py
Read more here
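What the two table scripts do is not spelled out here. Assuming "rotation" means transposing rows and columns and "sort" means ordering rows by one column, a minimal sketch could look like this; the function names and parameters are hypothetical.

```python
from itertools import zip_longest


def rotate_table(rows):
    """Transpose a table (list of rows); short rows are padded with ''."""
    return [list(col) for col in zip_longest(*rows, fillvalue="")]


def sort_table(rows, column, numeric=False, has_header=True):
    """Sort rows by the given 0-based column, keeping the header on top."""
    header = rows[:1] if has_header else []
    body = rows[1:] if has_header else rows
    if numeric:
        key = lambda row: float(row[column])
    else:
        key = lambda row: row[column]
    return header + sorted(body, key=key)


if __name__ == "__main__":
    text = "id\tn\nx\t10\ny\t9\n"
    table = [line.split("\t") for line in text.splitlines()]
    for row in sort_table(table, column=1, numeric=True):
        print("\t".join(row))
    for row in rotate_table(table):
        print("\t".join(row))
```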

Pixelirator - graphical data display for tab delimited tables

Scripts for Affymetrix Chip design
seqs_processor_and_translator_bin_V027_AGCT_Affy_V05.py - to generate Affy submission
seqs_processor_and_translator_bin_V027_AGCT_N2A.py - to convert 'N' to 'A' in fasta file
AffyProbeSetSorter-006.py
TkLife_Search_07M_Affy_05_off1_100L_ContigViewerTest.tcl
TkLife_Search_12M_AffySuper_25_off1_300L_025_035_25M.tcl
z-xlog-run-affy-chip.txt

TkLife_Search_07M_LettuceAffy_04_off1_100L_ContigViewerTest.tcl
TkLife_Search_07M_PepperAffy_04_off1_100L_ContigViewerTest.tcl - to find multiple perfect matches of Affy probes within a reference set; the reference set should be provided as a tab-delimited file with forward and reverse sequences
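The search for multiple perfect matches can be sketched in Python as an exact substring scan over both strands. This is not a port of the TkLife_Search Tcl scripts. The column layout of the reference file (ID, forward sequence, reverse sequence) and all function names are assumptions.

```python
def find_all(text, pattern):
    """Return 1-based start positions of all (overlapping) exact matches."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = text.find(pattern, start + 1)
    return positions


def probe_matches(probes, reference_lines):
    """Map each probe name to a list of (ref_id, strand, position) hits.

    probes: dict of probe name -> probe sequence.
    reference_lines: tab-delimited lines, assumed to hold
    ID, forward sequence, reverse sequence in that order.
    """
    hits = {name: [] for name in probes}
    for line in reference_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        ref_id, forward, reverse = fields[0], fields[1].upper(), fields[2].upper()
        for name, probe in probes.items():
            probe = probe.upper()
            for strand, seq in (("+", forward), ("-", reverse)):
                for pos in find_all(seq, probe):
                    hits[name].append((ref_id, strand, pos))
    return hits


if __name__ == "__main__":
    refs = ["c1\tACGTACGT\tACGTACGT\n", "c2\tTTTTT\tAAAAA\n"]
    for name, found in probe_matches({"p1": "ACGT", "p2": "AAA"}, refs).items():
        print(name, len(found), found)
```

A probe with more than one hit in the result is a candidate for cross-hybridization within the reference set.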


email: akozik@atgc.org
last modified: September 24 2007