Hi all,
I am trying to put together a QC tool to be able to identify sample swaps or meta data misannotations of my samples.
My data consists of low coverage WGS bam files (2-4x coverage).
I came up with this small pipeline which I would like the VerifyBAMID authors to comment on:
- Run angsd on set of diverse bams, produce mafs.gz file:
angsd -nThreads 4 -b bam.filelist -out CEGX_Run497.bams.46.angsd -gl 1 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3
This produces a mafs.gz file where column 7 is 'number of individuals'.
In parallel, I run my variant caller on each of the bam files I want to query (currently using GATK Lite Unified Genotyper), obtaining the .vcf.gz files for each bam.
I then subset my vcf.gz files with the SNPs found by angsd using a number of individuals threshold, e.g. the median or mean or a certain number that is big enough (I don't have a good statistical measure to choose this ATM).
With my subset .vcf.gz files, I then run VerifyBAMID on each .vcf.gz/.bam pair combination, producing a semi-matrix of results:
$verifybamid --smID $pair_id --ignoreRG --maxDepth $maxdepth --noPhoneHome --vcf $vcffile --bam $inputfile --out $outpref 2>&1 | grep -v 'WARNING : Skipping'
From my test, when I do the .vcf.gz + .bam pairs of the same individuals (diagonal in the matrix), I am getting contamination values close to 50%. I am unsure why they are not close to 100%. Should they be? Should I change a parameter for this?
For the .vcf.gz / .bam combos outside the diagonal, I am getting values between 0.9% and 7%, although these values seem to be conditional to sequencing depth, where the higher values have higher coverage.
Any comments/feedback from my current pipeline?