pgenlibr: what is the fastest way to get the index of a variant by variant-ID?

N.

Jan 24, 2021, 3:37:55 PM

to plink2-dev

Hi,

I have been using pgenlibr in R to load specific variants from pgen files. However, since the pvar is a text file, I have to load the entire file (e.g. with fread) just to retrieve the index of a variant. This is the speed bottleneck, because the pvar files are quite large.

Is there a more efficient way to know the index of a variant by name, without loading the entire pvar file?

Thanks,

Niek

Christopher Chang

Jan 24, 2021, 4:14:14 PM

to plink2-dev

How are you reading the .pvar? pgenlibr includes its own .pvar loader which should be pretty fast, especially if the file is BGZF- or Zstd-compressed, and extraneous columns (e.g. INFO) have been removed from the .pvar.

If that isn't enough, you'd need to write your own software for this. It can be done by constructing a name-based index.
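For illustration, here is a minimal base-R sketch of what a name-based index could look like. It assumes the variant IDs have already been read into a character vector. The toy `ids` vector and the helpers `build_id_index` and `lookup_ids` are hypothetical names used only for this example and are not part of pgenlibr.

```r
# Toy stand-in for the ID column of a .pvar; note the duplicated ID.
ids <- c("rs1", "rs2", "rs3", "rs2", "rs4")

# Build the index once: a named list mapping each ID to all of its
# 1-based row positions.
build_id_index <- function(ids) {
  split(seq_along(ids), ids)
}

# Look up several IDs at once; IDs that are absent from the index
# come back as integer(0).
lookup_ids <- function(index, query) {
  hits <- index[query]
  names(hits) <- query
  lapply(hits, function(x) if (is.null(x)) integer(0) else x)
}

index <- build_id_index(ids)
res <- lookup_ids(index, c("rs2", "rs4", "rs999"))
print(res)
```

Building the index still takes one pass over all IDs, so it only pays off when many lookups are made against the same file.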

N.

Jan 24, 2021, 5:42:05 PM

to plink2-dev

That is great, could you post an example?

I am now using something like this:

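library(data.table)  # provides fread()
snps <- c("rs1234", "rs54123")
# Read only the third column of the .pvar (ID) and find the rows matching the requested IDs
i <- which(fread(f.pvar, select = 3)$ID %in% snps)
pvar <- pgenlibr::NewPvar(f.pvar)
pgen <- pgenlibr::NewPgen(f.pgen, pvar = pvar)  # optionally: sample_subset = c(1, 2, 3, 4)
pgenlibr::ReadList(pgen, i, meanimpute = FALSE)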

Is there a way to use pgenlibr to query the index using the ID?

Thanks so much

Christopher Chang

Jan 24, 2021, 9:08:27 PM

to plink2-dev

I've added a pgenlibr::GetVariantsById() function for this purpose (note that it returns a list, since there can be more than one index corresponding to a single ID). It'll be slow the first time you call it (since that's when it constructs the ID string -> index lookup table), but subsequent queries are fast.
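The "slow first call, fast afterwards" behaviour described above is typical of a lazily built lookup table. The following base-R sketch illustrates that pattern only. It is not pgenlibr's implementation, and `make_lazy_lookup` and its `builds` counter are hypothetical names introduced for this example.

```r
# Hypothetical helper: returns a lookup function that builds its
# ID -> index table on the first call and reuses it afterwards.
make_lazy_lookup <- function(ids) {
  tbl <- NULL       # lookup table, not built yet
  n_builds <- 0L    # counts how often the table is constructed
  lookup <- function(id) {
    if (is.null(tbl)) {
      # First call: one pass over all IDs (the slow step).
      tbl <<- list2env(split(seq_along(ids), ids), hash = TRUE)
      n_builds <<- n_builds + 1L
    }
    # Later calls: a single hash lookup. A list is returned because
    # one ID may correspond to more than one index.
    hit <- tbl[[id]]
    as.list(if (is.null(hit)) integer(0) else hit)
  }
  list(lookup = lookup, builds = function() n_builds)
}

lk <- make_lazy_lookup(c("rs1", "rs2", "rs3", "rs2"))
first <- lk$lookup("rs2")   # builds the table, then looks up
second <- lk$lookup("rs3")  # reuses the existing table
```

The table is built at most once per lookup object, so the construction cost is spread over all subsequent queries.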

N.

Jan 25, 2021, 4:22:22 AM

to plink2-dev

Awesome, thank you.
