Hi,
I just started using lobSTR for STR extraction from VCF files, and I am amazed by your work, guys!
Here is my modified pipe:
intersectBed -a "$VCF_FILE" -b "$HOME/Software/lobSTR-bin-Linux-x86_64-4.0.6/hg19_v3.0.2/lobSTR_codis_hg19.bed" -wa -wb \
| cut -f 1,2,10,14- \
| sed 's/:/\t/g' \
| cut -f 1,2,3,9,15- \
| sed 's/\//\t/g' \
| awk '{print $0 "\t" ($5+($7*$8))/$7 "\t" ($6+($7*$8))/$7}'
The columns I selected for the final table are: Chromosome, Position, Genotype, Genotype given as bp difference from reference, Repeat period, Reference allele, Repeat name.
I slightly corrected the sed command by adding `g` at the end, so that it splits both the Genotype column and the Genotype given as bp difference from reference column.
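To illustrate what the `g` flag changes, here is a small sketch on a toy input line (the values are only for illustration). It uses a space rather than a tab as the replacement, so it does not depend on how a given sed handles `\t`:

```shell
# Toy line with two slash-separated fields (genotype and bp difference).
line='1/1 11/11'

# Without g: only the first "/" on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/\// /')

# With g: every "/" on the line is replaced, so both fields are split.
all=$(printf '%s\n' "$line" | sed 's/\// /g')

echo "$first_only"   # 1 1 11/11
echo "$all"          # 1 1 11 11
```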
I then computed the number of repeats using the following formula: (Genotype given as bp difference from reference + (Repeat period * Reference allele)) / Repeat period.
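As a quick sanity check, the same calculation that the awk step performs can be run on single values. This is only a sketch: `repeat_count` is a hypothetical helper name, and its three arguments are the bp difference, the repeat period, and the reference allele (in repeat units).

```shell
# Hypothetical helper: convert a bp difference from the reference
# into a number of repeats.
# Usage: repeat_count <bp_diff> <period> <ref_copies>
repeat_count() {
  awk -v diff="$1" -v period="$2" -v ref="$3" \
    'BEGIN { print (diff + period * ref) / period }'
}

repeat_count 11 4 7    # 9.75
repeat_count -8 4 11   # 9
repeat_count 45 5 5    # 14
```

A fractional value such as 9.75 means the allele length is not a whole multiple of the repeat period.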
Here are the results I get:
chr11 2192318 1 1 11 11 4 7 TH01 9.75 9.75
chr13 82722160 1 0 -8 0 4 11 D13S317 9 11
chr15 97374245 1 2 5 45 5 5 PentaE 6 14
Cheers,
Anastassiya