6 chr6:152697706:C:T 0 152697706 0 0 C T 0 0 C T 0 0 C T 0 0 C T
7 chr7:4310365:C:T 0 4310365 0 0 T T 0 0 C T C T C T T T C T
7 chr7:13894276:A:G 0 13894276 A G 0 0 A G G G G G G G A G A G
7 chr7:137029838:T:C 0 137029838 T C T C T C T C 0 0 T C T C T C
8 chr8:28411072:T:C 0 28411072 T C T C T C T C T C T C C C T C
and here is an example from my .genome file:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 1 2 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.733333 0.9959 NA
1 1 3 3 UN NA 0.0000 0.9111 0.0889 0.5445 -1 0.784314 0.9990 NA
1 1 1 2 OT 0 0.0000 0.9183 0.0817 0.5409 -1 0.782609 0.9990 NA
1 1 1 3 PO 0.5 0.0000 0.7572 0.2428 0.6214 -1 0.820755 0.9994 NA
There, the 1 1 1 2 pair is the relationship between the two parents, and 1 1 1 3 is the relationship between a parent and a child.
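(As I understand it, plink computes PI_HAT as Z2 + Z1/2, so e.g. the 1 1 3 3 row works out to 0.0889 + 0.9111/2 = 0.5445.)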
This analysis uses just 92 SNPs. I have another analysis using 250k SNPs that gives similar results.
Thoughts? I'm not even sure what questions to ask, because as far as I can tell this all looks right up until the 0.0 values for Z0 and the 0.5 PI_HATs. Then again, this is my first plink project.
Thanks.
The next issue is likely that you need to provide reference minor allele frequencies with the --read-freq option, or merge these samples with a much larger dataset. Right now plink is inferring the population allele frequencies from the samples in your dataset. Since your dataset appears to be very small and is expected to contain relatives, inferring allele frequencies from it will give poor IBD estimates.
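For example, the merge route would be something like this (just a sketch; "your_data" and "ref_panel" are placeholder prefixes for your fileset and a fileset of unrelated reference samples):
plink --bfile your_data --bmerge ref_panel --make-bed --out merged_with_ref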
Jeff
~/tools/plink-1.9/plink --bfile corrected_ids --read-freq plink.frq --extract pakstis_snps --make-bed --out cleaned
My frequencies were generated on the same data set using:
~/tools/plink-1.9/plink --bfile corrected_ids --freq
Is that what you were referring to, or should I be calculating that from an orthogonal data set?
For the 250k SNP analysis, here is the equivalent set of lines from the genome file:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 1 2 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.762557 1.0000 NA
1 1 3 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.764591 1.0000 NA
1 1 1 2 OT 0 0.0000 1.0000 0.0000 0.5000 -1 0.733975 1.0000 NA
1 1 1 3 PO 0.5 0.0000 0.9056 0.0944 0.5472 -1 0.833411 1.0000 NA
and here are some example variant lines from that dataset:
1 chr1:3303446:T:C 0 3303446 T C C C T C C C T C C C T C C C
1 chr1:3306855:T:G 0 3306855 T G G G T G G G T G G G G G G G
1 chr1:3311212:C:G 0 3311212 C G G G C G G G C G G G G G G G
1 chr1:3313282:G:A 0 3313282 G A G A G A G A G A A A G A A A
1 chr1:3315359:C:T 0 3315359 C T C T C T C T C T T T C T T T
Yes, use an unrelated reference dataset of more than 50 individuals with a similar ancestral background to your samples. You can either merge the reference with your dataset, or run --freq on the reference and read the result in with --read-freq for your dataset. Also restrict the analysis to common, high-quality variants; I use --maf 0.1 and --geno 0.05.
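In plink terms, that second route looks roughly like this ("ref_panel" again being a placeholder for the reference fileset, whose variant IDs need to match yours):
plink --bfile ref_panel --freq --out ref_freq
plink --bfile your_data --read-freq ref_freq.frq --maf 0.1 --geno 0.05 --genome --out ibd_check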
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 1 2 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.863452 1.0000 NA
1 1 1 2 OT 0 0.0000 1.0000 0.0000 0.5000 -1 0.855480 1.0000 NA
1 1 6 1 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.856943 1.0000 NA
1 1 4 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.862032 1.0000 NA
1 1 1 3 PO 0.5 0.0000 0.8077 0.1923 0.5961 -1 0.925370 1.0000 NA
command = "gzip -dc %s/Variations/%s.genome.vcf.gz | ~/tools/gvcftools-0.16/bin/break_blocks --region-file /volume/Coord_for_GRM.bed --ref /volume/hg19_whole_genome.fa --exclude-off-target --include-variants" % (d, pid)
command += " | ~/tools/bcftools/bcftools norm -Ou -m -any"
command += " | ~/tools/bcftools/bcftools norm -Ou -f /volume/hg19_whole_genome.fa"
command += " | ~/tools/bcftools/bcftools annotate -Ob -x ID -I +"
command += '%CHROM:%POS:%REF:%ALT'
and create ped/map files with vcftools:
command += " | ~/tools/vcftools_0.1.13/bin/vcftools --bcf - --plink --out %s/Variations/%s"%(d, pid)