Hi Miranda,
We will need more information to provide any advice. Can you give an example of the positions being different between the stacks output and the original 3RAD data? What are you referring to when you say the “original 3RAD data”?
julian
Hi Miranda,
I see, that makes sense.
In general, I think the chrom/pos should match between the two analyses, depending on how you made your ‘reference genome’ with BWA (which FASTA file did you feed into it?). Also, the VCF columns changed in Stacks v2.52, so I am assuming you are running on the latest version. You might also consider looking at the populations.sumstats.tsv files from the two runs to see how/if they match.
To say more than that, I would still need more information. What commands did you run to create your BWA reference and do your alignments, and, can you send the first 20 or 30 lines from your denovo/reference VCF files (as well as the populations command you used to create them)? If it is easier, you could just drop the files in a Dropbox folder.
I am curious why your de novo loci have base pair positions in the 200s – I assume you used paired-end reads? Did you again use the paired-end reads in your reference run?
If everything looks correct, then the path would be to look at how well the alignments went against your ‘reference genome’, what was the coverage (reported by gstacks), etc.
The loci you are using are, for the most part, 291bp in length (which explains how you can have SNP coordinates >200bp). The all have 10 Ns in the middle, which means that de novo stacks could not merge the two halves of your loci together (which is fine), so you used a size selection window in your molecular protocol greater than the single-end+paired-end read length. Some of your loci are significantly longer, which in principle should not be possible, but can happen when you have two paired-end cutsites nearby each other in the genome and some times you got one cutsite and other times you got another. Or, they can appear because of significant indels and repeats in that particular sequence which caused a lot of gaps to appear during reassembly of your raw, de novo reads. This is all fine, but you should be aware and be more skeptical of these longer loci.
For your refmap data, you should look at the gstacks output file, gstacks.log.distribs, which will tell you how many reads aligned for each sample and some characteristics of the alignments. It will also tell you coverage. You should have good numbers here for both.
On your refmap data, I would recommend re-running populations (latest version) without the --max-obs-het and --write-single-snp filters. Examine the sumstats output to see if your expected SNPs are present (even if you find additional SNPs that were not in your original de novo data).
--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
stacks-users...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/stacks-users/398bca98-59ec-4279-8d70-62408648eac3n%40googlegroups.com.
Just as a tip, and to save a lot of disk space, you can combine all your BWA and samtools loops into a single command:
for file in $samples
do
bwa mem -t 2 $bwa_db $src/${file}.1.fa.gz $src/${file}.2.fa.gz | \
samtools view -h -b | \
samtools sort --threads 2 > $src/aligned/${file}.bam
done
.