SNAPP input formatting and memory limits

JPC

Sep 9, 2016, 12:40:59 PM
to beast-users
I am working with very large whole-genome sequences and was wondering whether SNAPP has a memory limit or a data size limit. All my files are >1 GB.

Second, I have been trying to convert my data to SNAPP format for some time. There are some nice tricks and workarounds on this forum, but my files are too large to open and edit manually. Do you know how to convert such large files into a binary nexus file for SNAPP? The closest I have come is with bcftools, where I can extract the genotypes from a VCF as plain text using bcftools query -H -f '[\t%GT]\n' filename.vcf -o outfile. I was hoping to use sed find-and-replace to turn every 0/0 into 0, 0/1 into 1, and 1/1 into 2. However, the file also contains 1/2, 2/2, and 0/2, and I am unsure how to code these for SNAPP.

Alternatively, I have fasta and fastq files for each individual that I could concatenate into an alignment for SNAPP; I would then need to write a Python script to code it in binary and convert it to nexus. I also have ped and bed formatted files, if there are any solutions using those.

Any advice is greatly appreciated!

Remco Bouckaert

Sep 18, 2016, 8:46:34 PM
to beast...@googlegroups.com
One option is to encode the alignment in the XML by pattern (= unique column) and to register each pattern's weight (the number of columns that show it) in the weights input. If there are not too many patterns, SNAPP should be able to handle the data, since computation time is proportional to the number of patterns (which is usually much smaller than the number of columns in the complete alignment). However, if the number of patterns is very large, SNAPP will struggle to calculate the likelihood and the analysis will probably not reach convergence.
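
For files too large to edit by hand, the pattern/weight idea above can be sketched as a single streaming pass in Python. This is only a rough illustration, not part of SNAPP: the function name count_patterns and the GT_CODE table are made up for this example, and it assumes input rows shaped like the output of the bcftools query command quoted earlier in the thread (one site per line, one genotype per individual, separated by whitespace). It also assumes that sites with missing data or a third allele (1/2, 2/2, 0/2, ./.) are simply skipped, which is a choice you may want to revisit.

```python
from collections import Counter
from typing import Iterable

# Assumed coding for biallelic diploid genotypes (unphased or phased):
# the code is the number of copies of the alternate allele.
GT_CODE = {
    "0/0": "0", "0/1": "1", "1/0": "1", "1/1": "2",
    "0|0": "0", "0|1": "1", "1|0": "1", "1|1": "2",
}


def count_patterns(lines: Iterable[str]) -> Counter:
    """Count how often each unique site pattern occurs.

    Each line is one site; the key of the returned Counter is a tuple with
    one code per individual, and the value is that pattern's weight.
    """
    patterns: Counter = Counter()
    for line in lines:
        if line.startswith("#"):      # header line written by `-H`
            continue
        fields = line.split()         # tolerates the leading tab
        if not fields:
            continue
        try:
            pattern = tuple(GT_CODE[gt] for gt in fields)
        except KeyError:
            # Missing or non-biallelic genotype somewhere in this site:
            # skipped here (an assumption of this sketch).
            continue
        patterns[pattern] += 1
    return patterns


rows = [
    "\t0/0\t0/1\t1/1\n",
    "\t0/0\t0/1\t1/1\n",
    "\t0/1\t1/2\t0/0\n",   # contains 1/2, so the site is skipped
    "\t1/1\t0/0\t0/0\n",
]
patterns = count_patterns(rows)
```

Because the function only iterates over lines, an open file handle can be passed in place of the list, so the whole file never has to fit in memory; only the unique patterns do.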

The format of a weighted alignment is as follows: use snap.WeightedData instead of snap.Data and add a line with <input name='weights'> containing a comma-separated list of weights, one per pattern. It may look something like this:

<data spec='snap.WeightedData' id='alignment' dataType='integerdata' statecount='3'>
    <input name='weights'>1,2,3,1,4,4,1,7,1,1,1,1</input>
    <sequence taxon='A' totalcount='3'>2,1,2,2,2,1,0,0,2,2,2,0</sequence>
    <sequence taxon='B' totalcount='3'>2,2,1,2,2,0,1,0,1,2,0,1</sequence>
    <sequence taxon='C' totalcount='3'>1,2,2,2,0,0,0,1,2,1,2,1</sequence>
    <sequence taxon='D' totalcount='3'>2,2,2,1,0,0,0,0,0,0,0,1</sequence>
</data>

Note that the patterns must be unique (this part is a bit fragile). So, it is probably easiest to set up the analysis in BEAUti using a very small subset, save the XML, and then replace the alignment with a weighted alignment.
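
If the patterns and weights are already in hand, the replacement element could be generated rather than typed. The sketch below is a hypothetical helper (weighted_data_xml is not a SNAPP or BEAST function) that mirrors the layout of the example above. The statecount and totalcount values are deliberately left as arguments: which values are right for your data is an assumption you should check against the XML that BEAUti writes for your small subset. Keeping the patterns as dictionary keys means they are unique by construction.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def weighted_data_xml(patterns, taxa, total_count, state_count,
                      data_id="alignment"):
    """Render a snap.WeightedData element as text.

    patterns: dict mapping a tuple of codes (one per taxon) to its weight.
    taxa:     taxon names, in the same order as the codes in each pattern.
    """
    items = sorted(patterns.items())   # fixed order shared by all rows
    for pattern, _ in items:
        if len(pattern) != len(taxa):
            raise ValueError("pattern length does not match number of taxa")
    weights = ",".join(str(weight) for _, weight in items)
    lines = [
        f"<data spec='snap.WeightedData' id={quoteattr(data_id)} "
        f"dataType='integerdata' statecount='{state_count}'>",
        f"    <input name='weights'>{weights}</input>",
    ]
    for i, taxon in enumerate(taxa):
        # Row i collects the i-th code of every pattern.
        row = ",".join(str(pattern[i]) for pattern, _ in items)
        lines.append(
            f"    <sequence taxon={quoteattr(taxon)} "
            f"totalcount='{total_count}'>{row}</sequence>"
        )
    lines.append("</data>")
    return "\n".join(lines)


# Toy input: two unique patterns over three individuals.
example_patterns = {("0", "1", "2"): 2, ("2", "0", "0"): 1}
xml_text = weighted_data_xml(example_patterns, ["A", "B", "C"],
                             total_count=2, state_count=3)
root = ET.fromstring(xml_text)   # raises ParseError if not well-formed
print(xml_text)
```

The printed element can then be pasted over the alignment in the XML saved from BEAUti, keeping the same id so that the rest of the file still refers to it.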

Hope this helps,

Remco


