Hi all,
I am analyzing some GBS data (two-enzyme, similar to ddRAD) using Stacks in reference mode. Looking at the VCF file, I sometimes see multiple locus IDs sharing the same genomic coordinates. For example:
grep -v '^#' populations.snps.vcf | grep -w 2849996 | cut -f1-5
Chr01 2849996 2844:49:- T A
Chr01 2849996 2852:53:- T A
Chr01 2849996 2854:54:- T A
Chr01 2849996 2855:55:- A T
Chr01 2849996 2856:56:- A T
I'm not showing the genotypes here, but they are wildly different per sample! Looking at the reads mapped to this region in a genome browser, I don't see why Stacks treats them as different loci. Could soft clipping at the beginning of reads make it think they start at different positions?
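One way to test the soft-clipping idea would be to tabulate, for the reads in this region, where the aligned 5' end falls and how many bases are soft-clipped at that end. The sketch below reads SAM records on stdin and works only from the FLAG, POS and CIGAR columns. `read_starts` is a made-up helper name, the region in the example call is an arbitrary window around the SNP, and I am only guessing that the 5' position is what matters to Stacks here.

```shell
# For each mapped SAM record on stdin, print: read name, strand,
# reference position of the aligned 5' end, and soft-clipped bases at that end.
# Forward reads: 5' end is POS, clip is the leading S.
# Reverse reads: 5' end is POS + reference length - 1, clip is the trailing S.
# Assumption: clipping is soft (S); hard clips (H) are not counted as clipped bases.
read_starts() {
    awk -F'\t' '
        !/^@/ && $6 != "*" {
            cigar = $6
            lead = 0; trail = 0; reflen = 0
            if (match(cigar, /^[0-9]+S/)) lead  = substr(cigar, 1, RLENGTH - 1) + 0
            if (match(cigar, /[0-9]+S$/)) trail = substr(cigar, RSTART, RLENGTH - 1) + 0
            rest = cigar
            while (match(rest, /^[0-9]+[MIDNSHP=X]/)) {
                n  = substr(rest, 1, RLENGTH - 1) + 0
                op = substr(rest, RLENGTH, 1)
                if (op ~ /[MDN=X]/) reflen += n   # operations that consume reference
                rest = substr(rest, RLENGTH + 1)
            }
            if (int($2 / 16) % 2) {               # FLAG bit 0x10: reverse strand
                print $1 "\t-\t" ($4 + reflen - 1) "\t" trail
            } else {
                print $1 "\t+\t" $4 "\t" lead
            }
        }
    '
}

# Example call (needs an indexed BAM; the window is arbitrary):
# samtools index "$sample.psitt.bam"
# samtools view "$sample.psitt.bam" Chr01:2849800-2850200 | read_starts | cut -f2-4 | sort | uniq -c
```

If the reads behind the five loci pile up on a handful of 5' positions that differ only by their clip lengths, that would be consistent with the soft-clipping explanation; if they all share one 5' position with no clipping, something else is going on.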
Command for mapping:
bwa mem -t 12 $reference ${sample}.1.fq.gz ${sample}.2.fq.gz | samtools view -O BAM | samtools sort --threads 12 -o $sample.psitt.bam
Commands for Stacks:
gstacks -I $OUTDIR/bams -M $POPMAP_ALL -t 16 --details -O $GSTACKS_DIR
populations -P $GSTACKS_DIR --popmap $POPMAP_CROSS -t 16 -O $OUTDIR/populations_merged --vcf
Any help would be appreciated!
Thanks,
Ethan