Hi Paul,
On 12/5/12 11:40 AM, Paul Richards wrote:
> I am still unsure as to how I end up with up to 4 SNPs
> at a locus if I am only allowing a distance of two between a maximum of two stacks?
Are you seeing 4 SNPs at a locus within an individual sample, or at a locus in
the catalog (which is the synthesis of the data across all samples)? If it's the
latter, then I assume you are setting -M 2 and -n 2. This allows two differences
within an individual (handled by ustacks), and then up to two fixed differences
between individuals (as determined by cstacks when making the catalog), which
together can account for the 4 SNPs you see at a catalog locus.
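As a toy illustration of how the two allowances add up (the sequences are invented and this is not Stacks code): each sample below has two alleles that differ at two sites, the kind of difference -M 2 would tolerate within an individual, and the two samples differ from each other at two further fixed sites, the kind of difference -n 2 would tolerate between individuals. Counting the variable columns across all four alleles gives four SNPs.

```python
def variable_sites(seqs):
    """Return the 0-based columns at which equal-length sequences differ."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

# Hypothetical 12-bp alleles, two per diploid sample.
sample_a = ["AAAAAAAAAAAA", "ACAAACAAAAAA"]  # alleles differ at columns 1 and 5
sample_b = ["AAAAAAAAGAAT", "ACAAACAAGAAT"]  # same two sites, plus fixed G and T

print(variable_sites(sample_a))             # within sample A
print(variable_sites(sample_b))             # within sample B
print(variable_sites(sample_a + sample_b))  # across both samples
```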
> Furthermore, would you mind briefly describing the exact process by which
> mismatching works when building the catalog?
Check out the original Stacks paper. We use the same k-mer matching algorithm in
both ustacks and cstacks.
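For a rough picture before reading the paper, here is a minimal sketch of the general idea behind k-mer based matching: index every k-mer of the reference sequences, use shared k-mers to shortlist candidates, then confirm each candidate by counting mismatches. It is a simplified illustration, not the Stacks implementation; the function names, the value of k, and the sequences are all made up for the example.

```python
from collections import defaultdict

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(seqs, k):
    """Map each k-mer to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in seqs.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def find_matches(query, seqs, index, k, max_mismatches):
    """Shortlist sequences sharing a k-mer with query, then verify each one."""
    candidates = set()
    for kmer in kmers(query, k):
        candidates |= index.get(kmer, set())
    return sorted(
        name for name in candidates
        if len(seqs[name]) == len(query)
        and mismatches(query, seqs[name]) <= max_mismatches
    )

# Hypothetical data: the query differs from locus1 at two positions.
K = 4
catalog = {"locus1": "ACGTACGTACGT", "locus2": "TTTTGGGGCCCC"}
index = build_index(catalog, K)
query = "ACGTACCTACGA"

print(find_matches(query, catalog, index, K, max_mismatches=2))
print(find_matches(query, catalog, index, K, max_mismatches=1))
```

For a filter like this not to miss true matches, k has to be short enough that a sequence with the allowed number of mismatches still contains at least one intact k-mer.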
> Would you also mind commenting on this original question I posted...What are the
> markers included in the structure and phylip output files? I have presumed they
> are RAD loci haplotypes and not individual SNPs that are variable between
> populations. Therefore in the case of the phylip files is just one 'token' SNP
> from each RAD loci output, rather than every SNP found?
>
The data in the Phylip file are individual SNPs that are fixed within each
individual sample but vary across samples (we need fixed differences to satisfy
phylogenetic models when building trees). These SNPs are concatenated together
to construct the Phylip file (a log file is also generated that specifies which
locus and nucleotide position each SNP is taken from).
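A minimal sketch of that selection rule, assuming genotype calls are already available for every sample with no missing data (the data structure, sample names, and function name are hypothetical, not Stacks output): keep a site only if every sample is homozygous there and at least two samples carry different nucleotides, then concatenate the kept sites per sample.

```python
def phylip_from_genotypes(sites):
    """sites: list of dicts mapping sample name -> (allele1, allele2)."""
    samples = sorted(sites[0])
    kept = []
    for site in sites:
        fixed_within = all(a == b for a, b in site.values())
        varies_across = len({site[s][0] for s in samples}) > 1
        if fixed_within and varies_across:
            kept.append(site)
    seqs = {s: "".join(site[s][0] for site in kept) for s in samples}
    # Header: number of taxa and number of characters; names padded to 10.
    lines = [f"{len(samples)} {len(kept)}"]
    lines += [f"{s:<10}{seqs[s]}" for s in samples]
    return "\n".join(lines)

# Hypothetical genotype calls at four sites in three samples.
sites = [
    {"s1": ("A", "A"), "s2": ("G", "G"), "s3": ("A", "A")},  # kept
    {"s1": ("C", "T"), "s2": ("C", "C"), "s3": ("T", "T")},  # s1 heterozygous
    {"s1": ("T", "T"), "s2": ("T", "T"), "s3": ("T", "T")},  # invariant
    {"s1": ("C", "C"), "s2": ("C", "C"), "s3": ("G", "G")},  # kept
]
print(phylip_from_genotypes(sites))
```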
The Structure file contains the first SNP from each catalog locus (since any
subsequent SNPs at the locus are linked to the first SNP, and STRUCTURE does
not want linked data). You could convert the batch_X.haplotypes.tsv file into
STRUCTURE format without too much difficulty if you want haplotype data.
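One possible shape for such a conversion, sketched on in-memory data rather than on the real file: each distinct haplotype at a locus gets an integer code, and each diploid sample is written as two rows. The input layout assumed here (a dict per locus, haplotypes separated by "/", "-" for missing) is for illustration only, so check it against your own batch_X.haplotypes.tsv before adapting it. The missing-data code of -9 is also an assumption and has to match the value set in your STRUCTURE parameters.

```python
MISSING = -9  # assumed missing-data code; must match your STRUCTURE settings

def haplotypes_to_structure(loci, samples):
    """loci: dict locus_id -> dict sample -> haplotype call such as 'AC/GT'."""
    locus_ids = sorted(loci)
    rows = {s: ([], []) for s in samples}  # two rows per diploid sample
    for locus in locus_ids:
        codes = {}  # haplotype string -> integer code, restarted per locus
        for s in samples:
            call = loci[locus].get(s, "-")
            parts = [] if call == "-" else call.split("/")
            if len(parts) == 1:
                alleles = parts * 2        # homozygous
            elif len(parts) == 2:
                alleles = parts            # heterozygous
            else:
                alleles = [None, None]     # missing, or more than two haplotypes
            for row, allele in zip(rows[s], alleles):
                if allele is None:
                    row.append(MISSING)
                else:
                    row.append(codes.setdefault(allele, len(codes) + 1))
    lines = ["\t".join(str(locus) for locus in locus_ids)]  # marker-name row
    for s in samples:
        for row in rows[s]:
            lines.append("\t".join([s] + [str(code) for code in row]))
    return "\n".join(lines)

# Hypothetical haplotype calls for two loci in two samples.
loci = {1: {"s1": "AC", "s2": "AC/GT"}, 2: {"s1": "T/G", "s2": "-"}}
print(haplotypes_to_structure(loci, ["s1", "s2"]))
```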
These issues are covered in detail in our upcoming Stacks 1.0 paper.
Best,
julian