SNAPP and missing (SNP) data

Brant Faircloth

Jul 16, 2014, 5:08:47 PM
to beast...@googlegroups.com
Hi Remco/David/Others,

SNAPP is really great - thanks for your hard work!  Remco - your BEAST blog posts are awesome! Relative to SNAPP, I have a quick question pertaining to missing data.  Currently, we've coded sites where individuals lack a SNP call using "?", as suggested on the mailing list, e.g.:

#NEXUS
begin data;
        dimensions ntax=28 nchar=4430;
        format datatype=standard missing=? gap=-;
matrix
103200 0000000000000200100001
144749 ??020???000???00?????0
67235  ?000000210002000020000
;
end;

This works fine in SNAPP, as expected given posts on the mailing list.  My question is whether "support" for missing data simply means that BEAUti/SNAPP will accept a NEXUS file with missing data coded as "?" and then exclude the sites having missing data prior to the analysis. That seems to be what is happening, given the output when I load the XML from the above file (in its entirety) with `beast -threads 12 TIPS.xml` and see the following in the log (to stdout):

> WARNING: removed 3037 sites becaues they have one or more branches without data.

So, this suggests I can enter sites having missing data no problem, and that such an analysis runs just fine (the results from the run above look sensible so far)... but the warning message also suggests the data going into the analysis are only those sites having SNP calls across 100% of individuals in the analysis (also seen in the number of sites/patterns that follow the warning above).  I'm using BEAST v2.1.3 and SNAPP v1.1.5 on Linux (via the CLI).

Thanks very much for your thoughts - we're looking forward to using SNAPP a lot more!

Sincerely,
Brant

paolo gratton

Jul 17, 2014, 4:57:33 AM
to beast-users
Hi Brant,

As far as I understand, the removed loci should be the ones for which some population has no sampled individual. E.g., if you have 20 individuals and 4 pops. with 5 individuals per pop., it is fine if, at a given locus, you have 3 inds for pop. 1, 2 inds for pop. 2, 4 inds for pop. 3 and 1 ind for pop. 4. However, if you have 5 inds for pop. 1, 5 for pop. 2, 5 for pop. 3 but 0 inds for pop. 4, then the locus will be removed. It should be the same as in *BEAST. You can have missing data, but you must always have at least one individual sampled at each locus for every pop.
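
The rule described above can be sketched in a few lines of Python. This only illustrates the rule as stated in this thread and is not SNAPP's actual code; the function name `site_is_kept`, the `pop_of` mapping, and the toy data are hypothetical, and only "?" is treated as missing.

```python
MISSING = "?"  # assumption: only "?" marks a missing call


def site_is_kept(column, pop_of):
    """Return True if every population has at least one individual with a call.

    column: dict mapping individual name -> character at this site
    pop_of: dict mapping individual name -> population label
    """
    pops_with_data = {
        pop_of[ind] for ind, state in column.items() if state != MISSING
    }
    return pops_with_data == set(pop_of.values())


# Toy example: 4 individuals in 2 populations (hypothetical names).
pop_of = {"a1": "popA", "a2": "popA", "b1": "popB", "b2": "popB"}

# Each population still has one call, so the site is kept.
print(site_is_kept({"a1": "0", "a2": "?", "b1": "?", "b2": "2"}, pop_of))

# popB has no call at all, so the site is removed.
print(site_is_kept({"a1": "0", "a2": "1", "b1": "?", "b2": "?"}, pop_of))
```

With one individual per population, the same rule reduces to requiring a call for every tip.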

I hope somebody will let us know if I got it wrong!

Best

Paolo


--
You received this message because you are subscribed to the Google Groups "beast-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beast-users...@googlegroups.com.
To post to this group, send email to beast...@googlegroups.com.
Visit this group at http://groups.google.com/group/beast-users.
For more options, visit https://groups.google.com/d/optout.

Remco Bouckaert

Jul 17, 2014, 3:04:46 PM
to beast...@googlegroups.com
Hi Paolo,

It is exactly as you say.

It looks like this is fairly easy to fix in the code so that the likelihood takes these sites into account, so we hope to get out a new version shortly that can handle these sites more elegantly.

Cheers,

Remco

Brant Faircloth

Jul 18, 2014, 12:02:18 PM
to beast...@googlegroups.com
Hi Paolo and Remco,

Thanks for the quick responses!  This makes sense and clarifies my earlier question - thank you.  The only remaining bit that is confusing is the following:

In one analysis, my tips are not grouped by population (for a variety of reasons) - so, there is effectively one OTU in each "population".  I would assume, based on Paolo's response (and also some earlier communication between Remco and me), that in this abnormal case (one individual per population), the algorithm effectively requires that all tips have SNP data present?

I ask because SNAPP logs to STDOUT that there are 1388 sites remaining after removal. However, if there were initially 4030 sites and 3037 were dropped, then shouldn't there be 4030 - 3037 = 993 remaining (versus 1388)?  Additionally, when I slice across my nexus file and tally the presence/absence of data on a site-by-site basis, I do not get (any of) these numbers (the "Counter" output below is a count of all patterns at each alignment column and sum(Counter) totals 4030, as expected):

total characters/sites: 4430
sites with missing (?): 3331
sites with no missing: 1099
Counter({
    '1 0 2 ?': 1660,
    '1 0 ?': 1478, 
    '1 0': 792,
    '1 0 2': 276,
    '0 2 ?': 102,
    '1 2 ?': 91,
    '1 2': 22,
    '0 2': 9
})
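
A tally like the one above could be produced along these lines. This is a guess at the approach, not the script that was actually used; `column_patterns` is a hypothetical helper, the three rows are the excerpt from the first post, and the key order within a pattern differs from the output above because the states are sorted. The reported counts sum to 4430, which matches `nchar` in the NEXUS header.

```python
from collections import Counter


def column_patterns(rows):
    """Count the set of states seen in each alignment column.

    rows: list of sequence strings, one per taxon.
    Returns a Counter keyed by the sorted, space-joined states of a column.
    """
    patterns = Counter()
    for column in zip(*rows):
        patterns[" ".join(sorted(set(column)))] += 1
    return patterns


# The three-row excerpt from the NEXUS matrix in the first post.
rows = [
    "0000000000000200100001",
    "??020???000???00?????0",
    "?000000210002000020000",
]
patterns = column_patterns(rows)
with_missing = sum(n for key, n in patterns.items() if "?" in key)
print(sum(patterns.values()), with_missing)

# Cross-check of the counts reported in the post.
reported = Counter({
    "1 0 2 ?": 1660, "1 0 ?": 1478, "1 0": 792, "1 0 2": 276,
    "0 2 ?": 102, "1 2 ?": 91, "1 2": 22, "0 2": 9,
})
reported_missing = sum(n for key, n in reported.items() if "?" in key)
print(sum(reported.values()), reported_missing)  # 4430 3331
```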

So, I guess my remaining question is whether or not SNAPP is tallying sites for inclusion in a more complicated way than simple counts of missing data by population?

Thanks very much, once again!

-b

Remco Bouckaert

Jul 18, 2014, 10:54:51 PM
to beast...@googlegroups.com
Hi Brant,

Looking at the code, for the situation you describe it should remove all sites containing any missing value.
If you can send me the XML I can have a closer look at why you get the numbers that you see.

Cheers,

Remco

Adam Miller

Jul 20, 2014, 11:17:05 PM
to beast...@googlegroups.com
Hi Paolo and Remco,
I have encountered the same problem as Brant. In my case I am dealing with diploid SNP datasets with 5000+ SNPs from pooled samples (10 individuals) from 15 locations. I want to generate a species tree, but I am effectively dealing with a single sequence per location/population. So in my case the analysis will dump a lot of loci, as it is uncommon to have data from all locations at a given locus. Can I get around this by using BEAUti to define the species/population status as being identical for all taxa (under the 'species' tab)? This way all loci would be retained (given data is present in most samples), and OTUs could still be distinguished by their unique sample code (location name)?
Cheers,
Adam

Remco Bouckaert

Jul 21, 2014, 3:00:52 PM
to beast...@googlegroups.com
Dear SNAPP users,

SNAPP v1.1.6 is released and the main change is that it does not remove sites any more based on missing data.
Also, the XML produced by BEAUti for a SNAPP analysis is now easier to use in combination with path sampling.

There is a bit more detail here: http://blog.beast2.org/2014/07/21/snapp-handling-missing-data-and-path-sampling-made-easier/

Cheers,

Remco

Brant Faircloth

Jul 21, 2014, 3:07:28 PM
to beast...@googlegroups.com
Hi Remco,

Sorry to be slow in responding.  I can get that XML file to you if you want to have a look.  That said, I just saw the email re: v1.1.6, so perhaps it's a moot point?

-b

Remco Bouckaert

Jul 21, 2014, 3:10:55 PM
to beast...@googlegroups.com
Hi Brant,

It would be good if I could have a look at it -- I cannot explain what you described and perhaps something more sinister is going wrong. Any examples are greatly appreciated.

Cheers,

Remco

Brant Faircloth

Jul 22, 2014, 12:16:58 PM
to beast...@googlegroups.com
Right on - sent you two files via email.

-b

Brant Faircloth

Jul 23, 2014, 1:14:20 AM
to beast...@googlegroups.com
Hi all,

Just a quick follow-up.  The differences in counts of sites that I saw relative to what SNAPP was reporting were an artifact of a change made to the NEXUS file while being formatted as XML for input to SNAPP.  I was not aware of this change, so the counts that I was making were against the original NEXUS file (which I assumed was identical to the content of the XML file, although it was not).  This is what caused the differences in counts of sites and site patterns that I saw rather than any problem with the counts of sites that SNAPP was making.

Remco noticed the differences between the two files as we were discussing what could be causing discrepancies between my numbers and his.  And, long story short, it was human error on my part.

Thanks very much to Remco for his excellent support, and I apologize for having him spend time on what was a silly error on my part.

-b

Tobias Hofmann

Mar 15, 2017, 11:48:33 AM
to beast-users
Hello everybody,

I hate to pick this thread back up, but the same problem that was described here for the very old SNAPP versions is now occurring for me in the newest SNAPP version (1.3.0) in BEAST 2.4.4.
I have a binary SNP alignment of 1574 sites for 113 taxa and I'm running into the problem that SNAPP discards all sites:

WARNING: removed 1574 patterns (1574 sites) becaues they have one or more branches without data.

There is missing data ('?') in every column and every sequence is treated as a separate species.
I also tried SNAPP version 1.2.5 in BEAST 2.3.2: same problem

Is there something I can do in setting up the xml to avoid this?

Best,
Tobias

Tobias Hofmann

Mar 15, 2017, 11:50:48 AM
to beast-users
By that I mean that across all taxa there is a '?' somewhere in every column in the alignment....

Remco Bouckaert

Mar 15, 2017, 2:40:01 PM
to beast...@googlegroups.com
Hi Tobias,

SNAPP needs to have at least one taxon with data for each species, otherwise it cannot calculate the tree likelihood for that site and data for all other species is ignored for that site. What you could possibly do to diminish this problem (apart from getting data covering all species) is remove species from the analysis such that more sites have all species covered.
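
One way to explore this suggestion is to count, for each species, the sites that are lost only because that one species has no data. The sketch below is a hypothetical helper and not part of SNAPP or BEAUti; the name `sites_rescued_by_dropping`, the data layout, and the toy alignment are assumptions made for illustration, and only "?" is treated as missing.

```python
from collections import Counter

MISSING = "?"  # assumption: only "?" marks a missing call


def sites_rescued_by_dropping(alignment, species_of):
    """For each species, count sites where it is the only species without data.

    alignment: dict mapping taxon name -> sequence string (equal lengths)
    species_of: dict mapping taxon name -> species label
    Dropping a species would rescue exactly the sites counted for it.
    """
    all_species = set(species_of.values())
    n_sites = len(next(iter(alignment.values())))
    rescued = Counter()
    for i in range(n_sites):
        covered = {
            species_of[taxon]
            for taxon, seq in alignment.items()
            if seq[i] != MISSING
        }
        uncovered = all_species - covered
        if len(uncovered) == 1:
            rescued[uncovered.pop()] += 1
    return rescued


# Toy alignment: three taxa, each treated as its own species.
alignment = {"t1": "0?10", "t2": "?2?0", "t3": "01?1"}
species_of = {"t1": "sp1", "t2": "sp2", "t3": "sp3"}
rescued = sites_rescued_by_dropping(alignment, species_of)
print(rescued.most_common())
```

Sites that lack data for two or more species are not counted here, since removing a single species would not rescue them.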

Cheers,

Remco

Tobias Hofmann

Mar 15, 2017, 3:42:19 PM
to beast-users
Dear Remco,

thank you for the quick reply.
I may have misunderstood the previous posts on this thread but I thought it was said that SNAPP has been updated to avoid removing columns with missing data for some of the taxa:


> Dear SNAPP users,
>
> SNAPP v1.1.6 is released and the main change is that it does not remove sites any more based on missing data.

Has this feature been removed from SNAPP since then or is it still possible to apply whatever model allowed for missing data? Or am I misunderstanding something here?
Best,
Tobias

Remco Bouckaert

Mar 15, 2017, 4:02:17 PM
to beast...@googlegroups.com
Hi Tobias,

The earlier versions required all taxa to have data, and removed sites if any taxa had missing data.

The current version can handle missing data, but under the condition that all species have at least some data. So, if a species X consists of lineages A, B and C and there is missing data for A and B, but C has data, then the site will not be removed. However, if all three lineages have data missing for a site, this site will be removed from the alignment, since SNAPP currently cannot calculate a tree likelihood for that situation.

In summary: SNAPP can handle some missing data, just not when there is a species with no data at all.

Hope this helps,

Remco