Python Script to create FASTA file from Stacks files

Waplesry

unread,

Nov 27, 2012, 1:24:48 PM11/27/12

to stacks...@googlegroups.com

Hello Stacks community,

I have written a python (2,7) script to create a FASTA file that represents much of the information present in a set of stacks files in the fasta format. We used this as a way to generate input files for BLAST searches and when comparing Stacks results to other analyses.

This script contains a function that will take a file with a 'whitelist' of 'tag_IDs' and a set of stacks files (*.tags.tsv *snps.tsv, and *.allele.tsv) and output a FASTA file with one sequence entry for each observed haplotype present in the whitelisted loci.

It can be used on a catalog (output of cstacks) basename = "batch_3.catalog"

Or applied to an 'individual' (output of pstacks or ustacks), basename = 'sample_PHOOD11Xt94_0001'

You can open up the script for a little more direction , but the basic usage is as follows:

example command line:

python stacks_to_fasta.py stacks_path stacks_basename subset_file output_file

where:

stacks_path = the path to where the stacks files reside

stacks_basename = basic name shared by the group of stacks files

subset_file = whitelist file of tags to process

output_file = FASTA file to write

Fasta headers are created with tag_ID, Allele sequence, and Snp positions info present.

>[TAG#]_[ALLELE]_[SNP1_pos]-[SNP2_pos]-[SNP3_pos]-[...]

The script can also be imported and the function called directly.

The format of subset_file should be one tag_ID per line. (example attached)

This ID is matched to the third column in the tags file (also called 'stack ID')

I have also considered an option with IUPAC Nucleotide ambiguity codes instead of separate sequence entries, but this is yet to be implemented.

See the attached script and example whitelist file.

-Ryan

subset.txt

stacks_to_fasta.py

Julia Canitz

unread,

Jun 19, 2020, 9:44:43 AM6/19/20

to Stacks

Dear Ryan,

thanks for providing this script. I tried it and unfortunately, it does not work with my data. I had a look into the script but I am not an expert in python. I identified that the columns in the script are not the same as in my "tags.tsv" file (used Stacks 2.4). In my case the consensus sequence is located in column 6 instead of 9. I change the column in the script (line 72) and run the script. It outputs the following error:

Traceback (most recent call last):
File "/home/work-jc/scripts/stacks_to_fasta.py", line 122, in <module>
    main(sys.argv[1:])
File "/home/work-jc/scripts/stacks_to_fasta.py", line 119, in main
    stacks_to_fasta(*argv)
File "/home/work-jc/scripts/stacks_to_fasta.py", line 73, in stacks_to_fasta
    seq = columns[6]
IndexError: list index out of range

My ID List is the same as the yours and the IDs also occur in the "tags.tsv" file. By accident I changed the column to 5 (instead of 6). In this column the SeqID of the originial fastq file is stored. The script ran without an error but the output file (fasta file) is empty.

Can you please help me as I have to run this as soon as possible.

My initial command is:

stacks_to_fasta.py ../1_stacks_denovo/stacks_output_files/ BrC03_CCO_final_R1_denovo.fastq Test_ID.txt BrC03_Test

BrC03_CCO_R1_denovo.fastq - is the identifier for the "tags", "snps" and "alleles" files.

I ran the python script with python 2.7.12.

It would be great, if we can fix that.

Cheers.

Julia

Julian Catchen

unread,

Jun 19, 2020, 10:29:37 AM6/19/20

to 'Julia Canitz' via Stacks

Julia,

Why don't you just export FASTA from the populations program? -julian

'Julia Canitz' via Stacks wrote on 6/19/20 8:44 AM:

> Dear Ryan,
>
> thanks for providing this script. I tried it and unfortunately, it does
> not work with my data. I had a look into the script but I am not an
> expert in python. I identified that the columns in the script are not
> the same as in my "tags.tsv" file (used Stacks 2.4). In my case the
> consensus sequence is located in column 6 instead of 9. I change the
> column in the script (line 72) and run the script. It outputs the
> following error:
>

> *Traceback (most recent call last):

> File "/home/work-jc/scripts/stacks_to_fasta.py", line 122, in <module>
>     main(sys.argv[1:])
> File "/home/work-jc/scripts/stacks_to_fasta.py", line 119, in main
>     stacks_to_fasta(*argv)
> File "/home/work-jc/scripts/stacks_to_fasta.py", line 73, in
> stacks_to_fasta
>     seq = columns[6]

> IndexError: list index out of range*
> *
> *

> My ID List is the same as the yours and the IDs also occur in the
> "tags.tsv" file. By accident I changed the column to 5 (instead of 6).
> In this column the SeqID of the originial fastq file is stored. The
> script ran without an error but the output file (fasta file) is empty.
>
> Can you please help me as I have to run this as soon as possible.
> My initial command is:

> stacks_to_fasta.py *../1_stacks_denovo/stacks_output_files/*
> BrC03_CCO_final_R1_denovo.fastq *Test_ID.txt* BrC03_Test

Julia Canitz

unread,

Jun 22, 2020, 2:03:38 AM6/22/20

to Stacks

Hey Julian,

the issue is that I have a filtered VCF file based on ddRAD data which were assembled de novo. I want to have the fasta files for each individual containing only the sequences according to the filtered VCF IDs.

The main idea was to get the sequences for these IDS, as almost all individuals cover all SNPs quite well and the tags are reliable. Based on these sequence I wanna do an ASTRAL phylogeny.

Do you have any Idea, how to get back from filtered VCF to fasta of the individuals?