Hello Stacks community,
I have written a python (2,7) script to create a FASTA file that represents much of the information present in a set of stacks files in the fasta format. We used this as a way to generate input files for BLAST searches and when comparing Stacks results to other analyses.
This script contains a function that will take a file with a 'whitelist' of 'tag_IDs' and a set of stacks files (*.tags.tsv *snps.tsv, and *.allele.tsv) and output a FASTA file with one sequence entry for each observed haplotype present in the whitelisted loci.
It can be used on a catalog (output of cstacks) basename = "batch_3.catalog"
Or applied to an 'individual' (output of pstacks or ustacks), basename = 'sample_PHOOD11Xt94_0001'
You can open up the script for a little more direction , but the basic usage is as follows:
example command line:
python stacks_to_fasta.py stacks_path stacks_basename subset_file output_file
where:
stacks_path = the path to where the stacks files reside
stacks_basename = basic name shared by the group of stacks files
subset_file = whitelist file of tags to process
output_file = FASTA file to write
Fasta headers are created with tag_ID, Allele sequence, and Snp positions info present.
>[TAG#]_[ALLELE]_[SNP1_pos]-[SNP2_pos]-[SNP3_pos]-[...]
The script can also be imported and the function called directly.
The format of subset_file should be one tag_ID per line. (example attached)
This ID is matched to the third column in the tags file (also called 'stack ID')
I have also considered an option with IUPAC Nucleotide ambiguity codes instead of separate sequence entries, but this is yet to be implemented.
See the attached script and example whitelist file.
-Ryan