pgenlibr: what is the fastest way to get the index of a variant by variant-ID?


N.

Jan 24, 2021, 3:37:55 PM
to plink2-dev
Hi, 

I have been using pgenlibr in R to load specific variants from .pgen files. However, to retrieve the index of a variant I have to load the entire .pvar (e.g. with fread), since it is a text file. This is the speed bottleneck, because the .pvar files are quite large.

Is there a more efficient way to know the index of a variant by name, without loading the entire pvar file? 

Thanks, 

Niek 


Christopher Chang

Jan 24, 2021, 4:14:14 PM
to plink2-dev
How are you reading the .pvar?  pgenlibr includes its own .pvar loader which should be pretty fast, especially if the file is BGZF- or Zstd-compressed, and extraneous columns (e.g. INFO) have been removed from the .pvar.

If that isn't enough, you'd need to write your own software for this.  It can be done by constructing a name-based index.
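For illustration, a name-based index could be sketched in base R as below. This is only a sketch under assumptions: the .pvar is uncompressed, tab-delimited, and has a "#CHROM ... ID ..." header line; build_id_index() and lookup_ids() are hypothetical helper names, not part of pgenlibr. The file is still read once to build the index, so the gain comes from reusing the index across many queries.

```r
# Build a hashed variant-ID -> index lookup from a .pvar file (sketch).
# Assumes an uncompressed, tab-delimited .pvar with a "#CHROM ... ID ..." header.
build_id_index <- function(pvar_path) {
  lines <- readLines(pvar_path)
  lines <- lines[!startsWith(lines, "##")]  # drop "##" meta-information lines
  header <- strsplit(sub("^#", "", lines[1]), "\t", fixed = TRUE)[[1]]
  id_col <- match("ID", header)
  ids <- vapply(strsplit(lines[-1], "\t", fixed = TRUE), `[`, character(1), id_col)
  # split() groups the 1-based variant indices by ID, so duplicated IDs
  # keep all of their positions; the environment gives hashed lookups.
  list2env(split(seq_along(ids), ids), hash = TRUE, parent = emptyenv())
}

# Look up several IDs at once; IDs that are absent map to integer(0).
lookup_ids <- function(index, ids) {
  mget(ids, envir = index, ifnotfound = list(integer(0)))
}

# Small demonstration on a temporary file (made-up data, "rs1" appears twice).
demo_path <- tempfile(fileext = ".pvar")
writeLines(c(
  "#CHROM\tPOS\tID\tREF\tALT",
  "1\t100\trs1\tA\tG",
  "1\t200\trs2\tC\tT",
  "1\t300\trs1\tA\tT"
), demo_path)

index <- build_id_index(demo_path)
hits <- lookup_ids(index, c("rs1", "rs2", "rs999"))
```

The resulting indices can then be passed to ReadList() in place of the which(... %in% ...) step.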

N.

Jan 24, 2021, 5:42:05 PM
to plink2-dev
That is great, could you post an example?

I am now using something like this: 

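library(data.table)  # for fread()
library(pgenlibr)    # for NewPvar(), NewPgen(), ReadList()

snps <- c("rs1234", "rs54123")
# Read only the 3rd (ID) column of the .pvar and find the matching row indices
i <- which(fread(f.pvar, select = 3)$ID %in% snps)
pvar <- NewPvar(f.pvar)
pgen <- NewPgen(f.pgen, pvar = pvar)  # optionally: sample_subset = c(1, 2, 3, 4)
ReadList(pgen, i, meanimpute = FALSE)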

Is there a way to use pgenlibr to query the index using the ID? 

Thanks so much

Christopher Chang

Jan 24, 2021, 9:08:27 PM
to plink2-dev
I've added a pgenlibr::GetVariantsById() function for this purpose (note that it returns a list, since there can be more than one index corresponding to a single ID).  It'll be slow the first time you call it (since that's when it constructs the ID string -> variant index lookup table), but subsequent queries are fast.
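A rough sketch of how the earlier snippet might look with this function follows. It assumes the call form GetVariantsById(pvar, id) with one ID per call; check ?GetVariantsById in your installed version. read_variants_by_id() is a hypothetical wrapper name, and the indices are sorted before reading as a precaution, so the returned columns follow file order rather than the order of snps.

```r
# Sketch: load genotypes for a set of variant IDs without fread()-ing the .pvar.
# Assumption: pgenlibr::GetVariantsById(pvar, id) takes a single ID and returns
# a list of 1-based variant indices (possibly more than one per ID).
read_variants_by_id <- function(f.pgen, f.pvar, snps) {
  pvar <- pgenlibr::NewPvar(f.pvar)
  pgen <- pgenlibr::NewPgen(f.pgen, pvar = pvar)
  # The first lookup builds the ID table (slow); later lookups reuse it.
  hits <- lapply(snps, function(id) pgenlibr::GetVariantsById(pvar, id))
  i <- sort(unique(as.integer(unlist(hits))))
  geno <- pgenlibr::ReadList(pgen, i, meanimpute = FALSE)
  pgenlibr::ClosePgen(pgen)
  pgenlibr::ClosePvar(pvar)
  geno
}
```

It would be called as read_variants_by_id(f.pgen, f.pvar, c("rs1234", "rs54123")), with f.pgen and f.pvar pointing at your own files.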

N.

Jan 25, 2021, 4:22:22 AM
to plink2-dev
Awesome, thank you. 