VCF format to R/qtl format

789 views
Skip to first unread message

Marc G.

unread,
Jul 10, 2019, 11:52:24 AM7/10/19
to R/qtl2 discussion
Dear R/qtl2 users, 

I have a regular VCF file from a diploid population. I am trying to convert this VCF file into a genotype data file (such as the iron_geno.csv below):

id D1Mit18 D1Mit80 D1Mit17 D2Mit379 D2Mit75
1 BB SB SB SB SB
2 - - - BB BB
3 BB SB SB SB SB
4 SB SB BB SS SS
5 - - - SB SB
6 - - - SS SS
7 BB BB BB SB SB
8 - - - SB SB

Is there a relatively easy way to do that? I cannot get anywhere close to that....

Thank you
Marc

Karl Broman

unread,
Jul 10, 2019, 12:02:16 PM7/10/19
to rqtl2...@googlegroups.com
vcf files are generally a big tab-delimited file with some annotation information at the top.

I'll read it into R using data.table::fread() and then pull apart the map information from the genotypes.

You then need founder genotypes, which is maybe part of that data, in order to convert the nucleotide-based SNP calls to AA/AB/BB type genotypes. I use functions from the qtl2convert package: find_consensus_geno() and encode_geno().

Then write back to a CSV file with qtl2convert::write2csv().

karl

Joanna R.

unread,
Jul 10, 2019, 3:32:54 PM7/10/19
to R/qtl2 discussion
I often convert my vcfs to csvs via 012 format in vcftools using grotesque bash scripting. Example below. Enjoy!


012 to RQTL csv
The input file is a comma-delimited text file. A different field separator may be specified via the argument sep, which will be passed to the function read.table). For example, in Europe, it is common to use a comma in place of the decimal point in numbers and so a semi-colon in place of a comma as the field separator; such data may be read by using sep=";" and dec=",".

The first line should contain the phenotype names followed by the marker names. At least one phenotype must be included; for example, include a numerical index for each individual.

The second line should contain blanks in the phenotype columns, followed by chromosome identifiers for each marker in all other columns. If a chromosome has the identifier X or x, it is assumed to be the X chromosome; otherwise, it is assumed to be an autosome.

An optional third line should contain blanks in the phenotype columns, followed by marker positions, in cM.

Marker order is taken from the cM positions, if provided; otherwise, it is taken from the column order.

Subsequent lines should give the data, with one line for each individual, and with phenotypes followed by genotypes. If possible, phenotypes are made numeric; otherwise they are converted to factors.

The genotype codes must be the same across all markers. For example, you can't have one marker coded AA/AB/BB and another coded A/H/B. This includes genotypes for the X chromosome, for which hemizygous individuals should be coded as if they were homoyzogous.





#Convert to 012 format
vcftools --gzvcf Filtered_GQ20_Max0.3MissingTX_plus_parent_genotypes_1-17-2019.vcf.gz --012 --remove-indv sampleTTF --remove-indv sampleTTM --remove-indv sampleH10.TX83M --remove-indv sampleH9.TX75F  --out TX_GQ20_0.3_correct_ind_only

#Convert genotypes: 0 to A, 1 to H, 2 to B, -1 to -

cp TX_GQ20_0.3_correct_ind_only.012 CopyTX_GQ20_0.3_correct_ind_only.012


#sed -i 's/original/new/g' file.txt
sed -i 's/\s-1/ -/g' CopyTX_GQ20_0.3_correct_ind_only.012
sed -i 's/\s0/ A/g' CopyTX_GQ20_0.3_correct_ind_only.012
sed -i 's/\s1/ H/g' CopyTX_GQ20_0.3_correct_ind_only.012
sed -i 's/\s2/ B/g' CopyTX_GQ20_0.3_correct_ind_only.012
less CopyTX_GQ20_0.3_correct_ind_only.012
       
#Glue individuals back on
#paste -d' ' file1 file2
paste -d' ' TX_GQ20_0.3_correct_ind_only.012.indv CopyTX_GQ20_0.3_correct_ind_only.012 > IndvAddedCopyTX_GQ20_0.3_correct_ind_only.012 &

cp TX_GQ20_0.3_correct_ind_only.012.pos CopyTX_GQ20_0.3_correct_ind_only.012.pos
sed -i -e 's/\s/_/' CopyTX_GQ20_0.3_correct_ind_only.012.pos
sed -i -e 's/;/-/' CopyTX_GQ20_0.3_correct_ind_only.012.pos
sed -i -e 's/=/-/' CopyTX_GQ20_0.3_correct_ind_only.012.pos


#Transpose the positions
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
        max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j==NR?RS:FS)
              }
        }' CopyTX_GQ20_0.3_correct_ind_only.012.pos > TPTX_GQ20_0.3_correct_ind_only.012.pos

sed -i -e 's/^/  /' TPTX_GQ20_0.3_correct_ind_only.012.pos

cat TPTX_GQ20_0.3_correct_ind_only.012.pos IndvAddedCopyTX_GQ20_0.3_correct_ind_only.012 > PosIndvAddedCopyTX_GQ20_0.3_correct_ind_only.012
       
sed 's/\s/,/g' PosIndvAddedCopyTX_GQ20_0.3_correct_ind_only.012 > PosIndvAddedCopyTX_GQ20_0.3_correct_ind_only.csv

Marc G.

unread,
Jul 24, 2019, 8:59:05 AM7/24/19
to R/qtl2 discussion
Hello Karl (and Joanna)
Thank you for your answer. QTL and SNPs are quite a new adventure for me!
I am getting stuck at the first step:
I'll read it into R using data.table::fread() and then pull apart the map information from the genotypes.

So if I have my VCF file as such:

##fileformat=VCFv4.2
[... removed the extra info fields...]
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    100    18    4    55    63    64    91    L
contig1    53    .    A    G    .    PASS    NS=6;AF=0.083    GT:DP:AD:GQ:GL    0/0:8:8,0:33:-0.00,-2.70,-26.45    0/0:13:13,0:40:-0.00,-4.20,-42.47    ./.    0/0:12:12,0:40:-0.00,-3.90,-39.27    0/1:8:6,2:40:-3.71,-0.00,-17.35    0/0:14:14,0:40:-0.00,-4.50,-45.67    ./.    0/0:6:6,0:27:-0.00,-2.10,-20.05
contig3    19    .    C    A    .    PASS    NS=6;AF=0.333    GT:DP:AD:GQ:GL    0/1:10:6,4:40:-9.75,-0.00,-16.50    ./.    0/1:20:12,8:40:-19.19,0.00,-32.17    0/1:9:5,4:40:-10.05,-0.00,-13.69    ./.    0/0:7:7,0:24:-0.01,-1.81,-22.33    0/0:4:4,0:13:-0.05,-0.95,-13.03    0/1:6:3,3:40:-7.84,-0.00,-8.36
contig3    22    .    T    G    .    PASS    NS=6;AF=0.083    GT:DP:AD:GQ:GL    0/0:10:10,0:39:-0.00,-3.26,-21.44    ./.    0/1:20:11,8:40:-10.56,-0.00,-17.57    0/0:9:9,0:36:-0.00,-2.96,-19.38    ./.    0/0:7:7,0:30:-0.00,-2.37,-15.26    0/0:4:4,0:20:-0.01,-1.49,-9.08    0/0:6:6,0:26:-0.00,-2.07,-13.19
contig7    129    .    T    C    .    PASS    NS=6;AF=0.417    GT:DP:AD:GQ:GL    0/1:9:4,5:40:-12.94,-0.00,-10.02    0/1:4:1,3:31:-8.42,-0.00,-2.48    ./.    0/1:8:1,7:18:-19.30,-0.02,-1.29    ./.    0/1:9:2,7:40:-18.98,-0.00,-3.99    0/1:15:9,6:40:-14.16,-0.00,-23.30    0/0:7:7,0:21:-0.01,-1.55,-21.22
contig8    101    .    T    A    .    PASS    NS=6;AF=0.250    GT:DP:AD:GQ:GL    0/1:15:13,2:25:-1.97,-0.00,-37.81    ./.    0/0:18:18,0:40:-0.00,-5.34,-58.23    0/0:8:8,0:29:-0.00,-2.33,-26.25    0/0:7:7,0:26:-0.00,-2.03,-23.05   

Are these columns (CHROM, POS, ID, REF, ALT) what you call "map information" ? That would make my first table (=map information).
My genotypes for sample 100 would then be: 0/0, 0/1, etc. right?

For now, I find it easier to use the Variant Annotation package (Bioconductor):
vcf <- readVcf("variants.vcf")
snp.matrix <- genotypeToSnpMatrix(vcf, uncertain = FALSE)
snp.matrix.transposed <- t(as(snp.matrix$genotypes, "character"))
write.table(snp.matrix.transposed, "SNP_matrix_test.csv", sep="\t")

to obtain the following SNP matrix:
contig 100 18 4 55 63 64 91 L
contig1:53_A/G A/A A/A NA A/A A/B A/A NA A/A
contig3:19_C/A A/B NA A/B A/B NA A/A A/A A/B
contig3:22_T/G A/A NA A/B A/A NA A/A A/A A/A
contig7:129_T/C A/B A/B NA A/B NA A/B A/B A/A
contig8:101_T/A A/B NA A/A A/A A/A A/B A/B NA
contig10:52_A/T A/A NA A/A A/A A/A A/A A/A A/B

Karl Broman

unread,
Jul 24, 2019, 8:35:51 PM7/24/19
to rqtl2...@googlegroups.com
yes that’s right. You want one file with just marker genotypes, and another with just mark, chromosome, position.

karl
Reply all
Reply to author
Forward
0 new messages