Error in extract.gt(x) : ID column contains non-unique names

45 views
Skip to first unread message

Sundy Maurice

unread,
Dec 4, 2018, 6:02:28 AM12/4/18
to poppr


Hi,

I am working with a SNPs dataset (55 000 var.) obtained from STACKS (RAD data). When I transform to a genlight format using vcfR. I get this error msg: 

Error in extract.gt(x) : ID column contains non-unique names

Has anyone experienced this before, Is this related to the output from Stacks due to some masked characters in the vcf file? vcfR is able to read the file though, I can head it and check.

Thanks for any tips 

Sandy

Brian Knaus

unread,
Dec 4, 2018, 11:20:09 AM12/4/18
to sundym...@gmail.com, poppr
Hi Sandy,

Section 1.6.1 of the VCFv4.3 specification:


part 3 states that ID is used as unique identifiers. This suggests to me that your file may not be properly formed. This may be something to bring up with the developers of stacks (which I'm not very familiar with). But first you should make sure that you, or part of your team, has not introduced the issue.

This is important because vcfR::extract.gt() tries to use the information in the ID column as row names for the matrix of genotypes it returns. In R row names need to be unique. You should be able to circumvent this issue by setting IDtoRowNames = FALSE. But that doesn't address why you're having this problem in the first place.

Good luck!
Brian

--
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To post to this group, send email to po...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/poppr/eb607505-60a8-4766-bd01-f364e3a16d37%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

Sundy Maurice

unread,
Dec 14, 2018, 4:33:08 PM12/14/18
to poppr
Hi Brian 
thank you for this reply. 
Yep, in the meanwhile I changed the IDs to unique values in bcftools, prune the data based on LD in plink (end up with 30K Snps)... but same error message in poppR while extracting the gt... could it be linked to the delimitators or the fact that I have something like scaffolds1. 100X.pilon.10140_... as IDs name? 
Sandy

Brian Knaus

unread,
Dec 15, 2018, 11:57:21 AM12/15/18
to Sundy Maurice, poppr
Hi Sandy,

There is little we can do to help you without a minimal reproducible example (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). With this, we can try to reproduce the behavior you're reporting on our local machines. Below is an example I've made using example data from the package vcfR. You might want to use a subset of your own data (unless you feel it is sensitive, anything you share here is visible to the world). You can subset your VCF data using the square brackets. myVcf[1:10,] will subset to the first 10 variants. And you can manipulate the ID column with vcf@fix[,'ID'].

This example has reminded me that we have tried to help attain unique IDs while preserving any info in the ID column. So perhaps a first step would be to make sure you're using the most current version of vcfR (1.8.0 on CRAN now).

Good luck and let us know if this helps!
Brian

``` r
library(vcfR)
#>
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
# Use example data.
data("vcfR_test")
extract.gt(vcfR_test)
#>            NA00001 NA00002 NA00003
#> rs6054257  "0|0"   "1|0"   "1/1" 
#> 20_17330   "0|0"   "0|1"   "0/0" 
#> rs6040355  "1|2"   "2|1"   "2/2" 
#> 20_1230237 "0|0"   "0|0"   "0/0" 
#> microsat1  "0/1"   "0/2"   "1/1"
# Create data with non-unique IDs.
vcf <- rbind2(vcfR_test, vcfR_test)
getID(vcf)
#>  [1] "rs6054257" NA          "rs6040355" NA          "microsat1"
#>  [6] "rs6054257" NA          "rs6040355" NA          "microsat1"
extract.gt(vcf)
#>              NA00001 NA00002 NA00003
#> rs6054257_1  "0|0"   "1|0"   "1/1" 
#> 20_17330_2   "0|0"   "0|1"   "0/0" 
#> rs6040355_3  "1|2"   "2|1"   "2/2" 
#> 20_1230237_4 "0|0"   "0|0"   "0/0" 
#> microsat1_5  "0/1"   "0/2"   "1/1" 
#> rs6054257_6  "0|0"   "1|0"   "1/1" 
#> 20_17330_7   "0|0"   "0|1"   "0/0" 
#> rs6040355_8  "1|2"   "2|1"   "2/2" 
#> 20_1230237_9 "0|0"   "0|0"   "0/0" 
#> microsat1_10 "0/1"   "0/2"   "1/1"

vcfR2genlight(vcf)
#> Warning in vcfR2genlight(vcf): Found 4 loci with more than two alleles.
#> Objects of class genlight only support loci with two alleles.
#> 4 loci will be omitted from the genlight object.
#> Loading required namespace: adegenet
#>  /// GENLIGHT OBJECT /////////
#>
#>  // 3 genotypes,  6 binary SNPs, size: 7.7 Kb
#>  0 (0 %) missing data
#>
#>  // Basic content
#>    @gen: list of 3 SNPbin
#>
#>  // Optional content
#>    @ind.names:  3 individual labels
#>    @loc.names:  6 locus labels
#>    @chromosome: factor storing chromosomes of the SNPs
#>    @position: integer storing positions of the SNPs
#>    @other: a list containing: elements without names

sessionInfo()
#> R version 3.5.1 (2018-07-02)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 16.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/libblas/libblas.so.3.6.0
#> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
#>
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C             
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8   
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8  
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C           
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base    
#>
#> other attached packages:
#> [1] vcfR_1.8.0
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0        ape_5.2           lattice_0.20-38 
#>  [4] deldir_0.1-15     gtools_3.8.1      assertthat_0.2.0
#>  [7] rprojroot_1.3-2   digest_0.6.18     mime_0.6        
#> [10] R6_2.3.0          plyr_1.8.4        backports_1.1.2 
#> [13] evaluate_0.12     coda_0.19-2       ggplot2_3.1.0   
#> [16] pillar_1.3.0      rlang_0.3.0.1     lazyeval_0.2.1  
#> [19] spdep_0.8-1       adegenet_2.1.1    rstudioapi_0.8  
#> [22] gdata_2.18.0      vegan_2.5-3       gmodels_2.18.1  
#> [25] Matrix_1.2-15     rmarkdown_1.10    splines_3.5.1   
#> [28] pinfsc50_1.1.0    stringr_1.3.1     igraph_1.2.2    
#> [31] munsell_0.5.0     shiny_1.2.0       compiler_3.5.1  
#> [34] httpuv_1.4.5      pkgconfig_2.0.2   mgcv_1.8-26     
#> [37] htmltools_0.3.6   tidyselect_0.2.5  tibble_1.4.2    
#> [40] expm_0.999-3      permute_0.9-4     viridisLite_0.3.0
#> [43] crayon_1.3.4      dplyr_0.7.8       later_0.7.5     
#> [46] MASS_7.3-51.1     grid_3.5.1        nlme_3.1-137    
#> [49] spData_0.2.9.6    xtable_1.8-3      gtable_0.2.0    
#> [52] magrittr_1.5      scales_1.0.0      stringi_1.2.4   
#> [55] reshape2_1.4.3    LearnBayes_2.15.1 promises_1.0.1  
#> [58] sp_1.3-1          bindrcpp_0.2.2    seqinr_3.4-5    
#> [61] boot_1.3-20       tools_3.5.1       ade4_1.7-13     
#> [64] glue_1.3.0        purrr_0.2.5       parallel_3.5.1  
#> [67] yaml_2.2.0        colorspace_1.3-2  cluster_2.0.7-1 
#> [70] knitr_1.20        bindr_0.1.1
```

<sup>Created on 2018-12-15 by the [reprex package](https://reprex.tidyverse.org) (v0.2.1)</sup>


--
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To post to this group, send email to po...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages