short seq reads from SNP dataset


rachel binks

Jun 15, 2020, 3:30:05 PM
to migrate-support

Hi there,

I'm keen to run migrate-n on my SNP dataset, and it's very easy for me to convert my VCF file to migrate format using vcfR2migrate in the vcfR package, but the more I read, the more obvious it becomes that using polymorphic SNPs as individual loci is not ideal.

So I would like to extract my short sequences (variant + invariant sites) and use those as my loci instead. But I'm not a whizz with coding. I can easily generate a fasta file that concatenates all the short read sequences for each individual, but I have no clue how to split them into separate fasta sequences of each read for each individual... and with 4000 loci, it's just not practical to do that manually.

I've seen the python script for fasta2migrate format but I need to "unconcatenate" my concatenated fasta format first to suit the fasta input required for that script... any help would be most appreciated! 

I'll keep trying to figure it out, but I suppose my question is: how rubbish will it be if I just use my polymorphic SNPs as input for migrate? Most published papers don't seem to mention whether they used their short reads or just the SNPs in their input files, so it's hard to gauge what people are actually doing.

Cheers,
Rachel




Peter Beerli

Jun 15, 2020, 3:37:39 PM
to migrate...@googlegroups.com
Rachel,

if you can generate a migrate file that contains all loci for an individual on one line and know the length of each locus, then you are set.
Migrate 4.x allows something like this (the manual talks about this, too)
 2 3   example with very very short sequences
(s4) (s2) (s3)
2 pop1
ind1       AAAACCGGG
ind2       ATAACCGGG
2 pop2
ind3       TAAAGCGGG
ind4       TA--CCGCT

The example has 3 unlinked loci, each given in its own pair of parentheses (); the s stands for sequence and the number is the length of the sequence. This also means that your individual loci must be aligned; for example, ind4 has a gap.

Peter
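
The layout described above can also be written out by a short script. What follows is a minimal Python sketch, not part of migrate or its manual: the function name build_infile and its arguments are hypothetical, and the spacing (a name padded to 10 characters, then a space) is an assumption copied from the example rather than a documented rule.

```python
def build_infile(title, locus_lengths, populations):
    """Return infile text in the layout of the example above.

    locus_lengths: list of per-locus lengths, e.g. [4, 2, 3]
    populations:   list of (pop_name, [(ind_name, concatenated_seq), ...])
    """
    total = sum(locus_lengths)
    # First line: number of populations, number of loci, free-text title.
    lines = [f" {len(populations)} {len(locus_lengths)}   {title}"]
    # Second line: one (sN) entry per unlinked sequence locus.
    lines.append(" ".join(f"(s{n})" for n in locus_lengths))
    for pop_name, individuals in populations:
        lines.append(f"{len(individuals)} {pop_name}")
        for ind_name, seq in individuals:
            # Every individual must carry all loci, already aligned.
            if len(seq) != total:
                raise ValueError(
                    f"{ind_name}: expected {total} sites, found {len(seq)}"
                )
            # Assumption: name padded to 10 characters, as in the example.
            lines.append(f"{ind_name[:10]:<10} {seq}")
    return "\n".join(lines) + "\n"


example = build_infile(
    "example with very very short sequences",
    [4, 2, 3],
    [
        ("pop1", [("ind1", "AAAACCGGG"), ("ind2", "ATAACCGGG")]),
        ("pop2", [("ind3", "TAAAGCGGG"), ("ind4", "TA--CCGCT")]),
    ],
)
print(example)
```

The length check is there because a sequence that is one site too short or too long would shift every locus boundary after it.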


--
You received this message because you are subscribed to the Google Groups "migrate-support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to migrate-suppo...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/migrate-support/4f348ec5-1ec9-40aa-825c-b56723b5f667n%40googlegroups.com.

rachel binks

Jun 16, 2020, 2:27:31 AM
to migrate-support

Ah! Brilliant. I don't know how I missed that in the manual, but that's most helpful, thank you.

I'm now reading through the section on p24/25 of the manual to see if I can simplify line two because all my short reads are the same length. 

So am I interpreting it correctly to write:
4 4469
[4469o69] (s308361)

And that will split the full concatenated sequence into the 4469 loci, each 69bp long?
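
The arithmetic behind that header can be checked directly: 4469 loci of 69 bp each give 4469 × 69 = 308361 sites. As a rough Python sketch of what "splitting into loci" means for one concatenated sequence (split_loci is a hypothetical helper, not something migrate provides):

```python
def split_loci(concatenated, n_loci, locus_len):
    """Cut one concatenated sequence into n_loci pieces of locus_len sites."""
    expected = n_loci * locus_len
    if len(concatenated) != expected:
        raise ValueError(f"expected {expected} sites, found {len(concatenated)}")
    return [concatenated[i:i + locus_len] for i in range(0, expected, locus_len)]


# The numbers from the post: 4469 loci x 69 bp.
assert 4469 * 69 == 308361

# Toy example with 3 loci of 4 bp each.
toy_loci = split_loci("AAAACCGGGTTT", 3, 4)
print(toy_loci)  # ['AAAA', 'CCGG', 'GTTT']
```

The same function could be used to "unconcatenate" a per-individual fasta sequence, provided every read really is the same length.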

Peter Beerli

Jun 16, 2020, 9:01:54 AM
to migrate...@googlegroups.com
This should work, but I suggest that you test this with a small set of, say, 5 loci to see that migrate is doing what I promise;
if there are ±1 shifts, then the scheme will fail.

Peter
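
One way to prepare such a test is sketched below in Python, under the assumption that the sequences are held as a dictionary of name to concatenated string. Both helper names are hypothetical. The first flags any sequence whose length is off (the kind of ±1 shift mentioned above), the second keeps only the first few loci; the locus count and locus-length line of the infile would still need to be edited by hand to match.

```python
def find_length_mismatches(sequences, n_loci, locus_len):
    """Return {name: actual_length} for sequences that are not n_loci * locus_len long."""
    expected = n_loci * locus_len
    return {
        name: len(seq) for name, seq in sequences.items() if len(seq) != expected
    }


def keep_first_loci(sequences, n_keep, locus_len):
    """Truncate every sequence to its first n_keep loci."""
    return {name: seq[: n_keep * locus_len] for name, seq in sequences.items()}


# Toy data: 3 loci of 4 bp; ind2 is one site short.
toy = {"ind1": "AAAACCCCGGGG", "ind2": "AAAACCCCGGG"}
print(find_length_mismatches(toy, 3, 4))          # {'ind2': 11}
print(keep_first_loci({"ind1": toy["ind1"]}, 2, 4))  # {'ind1': 'AAAACCCC'}
```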


Rachel Binks

Jun 17, 2020, 12:29:45 AM
to migrate...@googlegroups.com


Thanks Peter, that did work. 

I'm now trying to get my head around the best sampling design/models to run. I have two strongly divergent lineages of a tree species that I suspect are cryptic species. They each occur over large geographic regions separately but where their boundaries meet at a central contact zone, they remain entirely genetically distinct so I'm suspecting a reproductive barrier between them. So I have two questions:

1. pertaining to sampling design - I'm thinking that the majority of my populations are not useful here because they are allopatric and I'm interested in whether there is gene flow in sympatry. So I was thinking of taking 4 adjacent populations from one area of the contact zone, 2 from each lineage, and running migrate to see whether gene flow occurs between pops within lineages but not between lineages. And then repeating that for say 2 or 3 other groups of 4 populations along the contact zone. I suppose that is a population genetic type approach. Or I can group all the populations of each lineage so that I'm testing the whole dataset with a 2 population model representing each lineage, which is more of a phylogenetic approach. I'm not sure which is best.

2. pertaining to model testing - I would like to test a model of full migration, divergence (d) and divergence with migration (D) but am I correct in thinking that the divergence must be directional so *dd* is not correct? Instead I must test each direction, *d0* and *0d*, separately? Likewise for D. The number of possibilities adds up quickly!

Any advice would be most appreciated.

Cheers,
Rachel




--
If people sat outside and looked at the stars each night, I'll bet they'd live a lot differently. Bill Watterson.

Peter Beerli

Jun 17, 2020, 9:42:40 AM
to migrate...@googlegroups.com
Rachel,

> 1. pertaining to sampling design - I'm thinking that the majority of my populations are not useful here because they are allopatric and I'm interested in whether there is gene flow in sympatry. So I was thinking of taking 4 adjacent populations from one area of the contact zone, 2 from each lineage, and running migrate to see whether gene flow occurs between pops within lineages but not between lineages. And then repeating that for say 2 or 3 other groups of 4 populations along the contact zone. I suppose that is a population genetic type approach. Or I can group all the populations of each lineage so that I'm testing the whole dataset with a 2 population model representing each lineage, which is more of a phylogenetic approach. I'm not sure which is best.

I would do both; comparisons with few populations will run faster and with fewer problems. In principle you can directly compare your 4-pop problem with the 2-pop approach (here is an example of such a comparison for a cline: https://pubmed.ncbi.nlm.nih.gov/23125403/).

> 2. pertaining to model testing - I would like to test a model of full migration, divergence (d) and divergence with migration (D) but am I correct in thinking that the divergence must be directional so *dd* is not correct? Instead I must test each direction, *d0* and *0d*, separately? Likewise for D. The number of possibilities adds up quickly!

yes, with two populations I usually run: x0dx xd0x xxDx xDxx xx0x x0xx x
If you run them on a computer cluster you can send them all at the same time; on your own computer this will take forever.
With two populations you can also think about adding a 3rd (ancestral) population: in the infile, add an additional population (+1) on the first line and add this line at the end:
0 ancestor

Then, in the parmfile, use the number of populations including the ancestor,
so you then have a, b, anc.
Then x0d 0xd 00x would be a population split from the ancestor; this is different from IM because the times for anc->a and anc->b can be different.
I allow 't' instead of 'd' to sync the times, but this will not work with more than one split: x0t 0xt 00x
With a split and then migration: xxt xxt 00x
but read this to understand the problem with mixing migration and divergence: https://www.biorxiv.org/content/10.1101/587832v1


Peter
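
The infile change described above (population count +1 on the first line, plus a final "0 ancestor" line) is small enough to do by hand, but it can also be scripted. The Python sketch below is only an illustration: add_ancestor is a hypothetical helper, and it assumes the first number on the first line is the population count, as in the earlier example.

```python
import re


def add_ancestor(infile_text, name="ancestor"):
    """Bump the population count on line 1 and append an empty ancestral population."""
    lines = infile_text.rstrip("\n").split("\n")
    # Assumption: line 1 starts with optional whitespace, then the population count.
    match = re.match(r"(\s*)(\d+)(.*)", lines[0])
    if match is None:
        raise ValueError("first line does not start with a population count")
    indent, n_pop, rest = match.groups()
    lines[0] = f"{indent}{int(n_pop) + 1}{rest}"
    # The ancestor has no sampled individuals, hence the 0.
    lines.append(f"0 {name}")
    return "\n".join(lines) + "\n"


two_pop = (
    " 2 3   example with very very short sequences\n"
    "(s4) (s2) (s3)\n"
    "2 pop1\n"
    "ind1       AAAACCGGG\n"
    "ind2       ATAACCGGG\n"
    "2 pop2\n"
    "ind3       TAAAGCGGG\n"
    "ind4       TA--CCGCT\n"
)
three_pop = add_ancestor(two_pop)
print(three_pop)
```

The parmfile still has to be changed separately so that the number of populations and the model string (for example x0d 0xd 00x) include the ancestor.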


Rachel Binks

Jun 18, 2020, 7:49:39 AM
to migrate...@googlegroups.com

Hi Peter,

Thanks so much, I really appreciate the advice. I'm sure I'll have more questions along the way but you've given me plenty to read and think through while testing different models/datasets.

Cheers,
Rachel
