Dear all,

Lea Lemler

unread,

Oct 20, 2022, 4:29:24 AM10/20/22

to plink2-users

I was wondering if you could advice me on the following:

Previously I have converted bgen files to vcf files (I did this for 1-22 chromosomes).

I know need to know the allele frequency and potentially exclude genes with low allele frequency.

I saw the command --freq, but not sure how to access this now with my ready VCF fiile (e.g. chr20.vcf).

Also I need to look at the actual counts of the individuals in my dataset to see whether e.g. only one in many has a snp which might be very high. I might need to exclude this.

This is all very new to me, i hope it makes sense.

Thank you & Best Regards,

Lea

Christopher Chang

unread,

Oct 21, 2022, 12:54:10 PM10/21/22

to plink2-users

You should read and understand the plink 2.0 General usage documentation. (Since you are starting from VCF data, version 1.9 is not recommended; it is easy to scramble your REF/ALT alleles with it.)

You can use --vcf to import a VCF file for plink2's use, and --make-pgen to write the data out as a plink2-formatted fileset. E.g. "plink2 --vcf chr20.vcf --make-pgen --out chr20" will create a {chr20.pgen, chr20.psam, chr20.pvar} fileset that can then be efficiently loaded with "plink2 --pfile chr20".

From there:

* "plink2 --pfile chr20 --freq --out chr20" writes an allele frequency report to chr20.afreq . If you replace "--freq" with "--freq counts", an allele count report will be written to chr20.acount instead. The possible contents of these reports are summarized at https://www.cog-genomics.org/plink/2.0/formats#afreq .

* "plink2 --pfile chr20 --maf 0.01 --make-pgen --out chr20_maf01" filters out all variants with minor allele frequency less than 0.01, and then writes the filtered data to {chr20_maf01.pgen, chr20_maf01.psam, chr20_maf01.pvar}. To filter on minor allele count, use --mac. You can apply both filters at the same time if you want.

Lea Lemler

unread,

Oct 26, 2022, 10:16:59 AM10/26/22

to plink2-users

Dear Christopher,

Thank you very much.

I tried it with plink2 --vcf crcsurvival_chr20.vcf –freq and it seemed to work directly with my vcf file. Why do I need to write the data out as plink2-formatted fileset?

I then wanted to remove any ALT allele frequency below 0.02 and above 0.98 and get an updated VCF file. Again I thought it worked, but maybe I am making it too easy for myself:

plink2 --vcf crcsurvival_chr20.vcf --maf 0.02 --max-maf 0.98 --recode vcf

Thank you,

Lea

Christopher Chang

unread,

Oct 30, 2022, 11:54:46 PM10/30/22

to plink2-users

1. See the discussion at https://www.cog-genomics.org/plink/2.0/input#keep_autoconv . plink2 will *always* convert the VCF to a plink2-formatted fileset first. With your first command, it performs this VCF -> pgen conversion, then it deletes the pgen at the end of the run; then your second command wastefully repeats the conversion.

2. "--maf 0.02" is actually enough, you don't need "--max-maf 0.98". See https://www.cog-genomics.org/plink/2.0/filter#maf for how "minor allele frequency" is defined for the purpose of this filter (and how to use a different definition).

Lea Lemler

unread,

Oct 31, 2022, 5:34:56 AM10/31/22

to plink2-users

Dear Christopher,

Thank you very much for your answer.

Okay, you are saying I am doing an unnecessary step with my commands (two times converting from VCF to plink2-formatted fileset). But the commands itself (beside being a bit wasteful) aren't incorrect?

I actually don't need to run the first command for my pipeline.

So if I run plink2 --vcf crcsurvival_chr20.vcf --maf 0.02 --recode vcf it is correct?

Sry, this is all very new to me, I am trying to keep it as simple as possible for myself.

Also one other question:

Afterwards I am going to run Predict.py (from PrediXcan) with my VCF files & MASHR Whole Blood to predict the gene expression.

After doing above changes to my VCF files (alternative allele frequency <0.02 and >0.98 have been excluded) strangely predict.py gives me an error:

INFO - Loading samples
INFO - Loading model
INFO - Acquiring on-the-fly mapping
INFO - Preparing genotype dosages
INFO - Acquiring liftover conversion
INFO - Setting whitelist from available models
INFO - Processing genotypes
INFO - Preparing prediction
INFO - Couldn't import h5py_cache. Anyway, this dependency should be removed. It has been folded into h5py
Level 9 - Processing vcfs
Level 9 - Processing vcf data/untared/Ready_VCF_files/crcsurvival_chr20_filtered.vcf
Traceback (most recent call last):
File "MetaXcan/software/Predict.py", line 272, in <module>
run(args)
File "MetaXcan/software/Predict.py", line 178, in run
raise e
RuntimeError: Missing DS field when vcf mode is imputed

I have also contacted the Google Groups PrediXcan for this issue, but I haven't had a reply unfortunately, maybe you have an idea as well.

Thank you so much for your support, I really appreciate it.

Best Regards,

Lea

Lea Lemler

unread,

Oct 31, 2022, 9:28:49 AM10/31/22

to plink2-users

One additional comment:

I was able to run Predict.py when I added this: plink2 --vcf crcsurvival_chr20.vcf --maf 0.02 --export vcf vcf-dosage=DS-force.

So I guess even though i did before specify the vcf file (when I was converting from bgen to vcf) with

plink2 --bgen crcsurvival_chr20.bgen --oxford-single-chr 20 --sample crcsurvival_chr20.sample --export vcf vcf-dosage=DS-force

I need to do it again now after applying the filter?

Thanks!

Christopher Chang

unread,

Oct 31, 2022, 9:31:58 AM10/31/22

to plink2-users

1. Yes, repeated use of --vcf/"--recode vcf" with plink2 is still correct, just inefficient.

2. Since there are multiple ways of encoding dosage in VCF files, you need to use "--vcf <filename> dosage=DS" to tell plink2 to read dosage from the DS field. So you should actually be running

"plink2 --vcf crcsurvival_chr20.vcf dosage=DS --maf 0.02 --export vcf vcf-dosage=DS-force"

Lea Lemler

unread,

Oct 31, 2022, 11:11:03 AM10/31/22

to plink2-users

Thank you so much!!!
So just summarising a final time, this sequence makes sense now:

1)Convert bgen to vcf:

plink2 --bgen crcsurvival_chr20.bgen --oxford-single-chr 20 --sample crcsurvival_chr20.sample --export vcf vcf-dosage=DS-force

2)Filter on allele frequency and export VCF with DS:

plink2 --vcf crcsurvival_chr20.vcf dosage=DS --maf 0.02 --export vcf vcf-dosage=DS-force

Christopher Chang

unread,

Oct 31, 2022, 11:53:21 AM10/31/22

to plink2-users

Almost; you also need to include "--out crcsurvival_chr20" in the first command line if you want it to feed into the second, and you probably want to use --out in some fashion in the second command line.

You can also combine these operations into a single plink2 run.

Lea Lemler

unread,

Oct 31, 2022, 12:22:30 PM10/31/22

to plink2-users

Sry, for stupid questions, i am a total beginner with plink2. Why do i need to use --out crcsurvival_chr20?

I cant remember what happened when I run the first command, maybe I had to rename it manually and this way it would be already the correct naming crcsurvival_chr20.vcf?

Thanks again!!

Christopher Chang

unread,

Oct 31, 2022, 12:29:20 PM10/31/22

to plink2-users

The --out flag is explained in the "General usage" documentation that I previously asked you to read: https://www.cog-genomics.org/plink/2.0/general_usage#out

More generally, you can find documentation on any flag from the index: https://www.cog-genomics.org/plink/2.0/index

Lea Lemler

unread,

Nov 1, 2022, 4:11:48 AM11/1/22

to plink2-users

Thank you very much, I just wanted to double check I understood it correctly and it is only about the renaming (no other purpose). I read the documentation, but i am a bit confused.

So it is what I thought with "out" i do not need to manually rename it and I can already specify my desired name.

Thanks for all your help understanding plink2!!

Lea Lemler

unread,

Nov 1, 2022, 5:12:43 AM11/1/22

to plink2-users

I am really sorry to bother you one more time but I just saw that after I added the part in yellow that my ALT_FREQS filter with <0.02 and >0.98 removed seem not to work correctly (before I clearly saw how everything below 0.02 and above 0.98 was removed in the results):

plink2 --vcf crcsurvival_chr20.vcf dosage=DS --maf 0.02 --export vcf vcf-dosage=DS-force

I am also getting values below 0.02 and 0.98 in my results. I assume it has to do with the DS but I do not quite understand the connection here.

However, the filter is doing something (size reduced from 6634406 KB to 6056987 KB) ..

Christopher Chang

unread,

Nov 1, 2022, 11:07:08 AM11/1/22

to plink2-users

When dosages are available, plink2 will use them for allele frequency computations. So if you apply the "--maf 0.02" filter on your dosage data, the effect is slightly different than "--maf 0.02" on just the hard-called genotypes.

If you remember to include "dosage=DS" in your --freq command, you'll see that the dosage-aware allele frequencies are between 0.02 and 0.98 in the filtered VCF.

Lea Lemler

unread,

Nov 1, 2022, 11:19:02 AM11/1/22

to plink2-users

Thank you so much.

You are right, I didn't include "dosage-ds", I can see it now as expected. So also when I just look at data (not manipulating data, e.g. filter on MAF) I need to include it.

Christopher Chang

unread,

Nov 1, 2022, 11:20:40 AM11/1/22

to plink2-users

Note that this wouldn't be a problem if you used --make-pgen/--pfile instead. You're not supposed to keep importing/exporting dosages from VCFs like this.

Lea Lemler

unread,

Nov 1, 2022, 11:33:53 AM11/1/22

to plink2-users

Okay, noted. It is more complicated but doesn't change the results or?

I will try another time the way you suggested, just happy to get this somehow working.

Lea Lemler

unread,

Dec 23, 2022, 5:40:45 AM12/23/22

to plink2-users

Hi Chris,

For some reason i have the same problem again as solved by you above even though I include "Dosage=DS":

"When dosages are available, plink2 will use them for allele frequency computations. So if you apply the "--maf 0.02" filter on your dosage data, the effect is slightly different than "--maf 0.02" on just the hard-called genotypes.

If you remember to include "dosage=DS" in your --freq command, you'll see that the dosage-aware allele frequencies are between 0.02 and 0.98 in the filtered VCF."

My code to filter:

#plink2 --vcf crcsurvival_chr20_DS_Only.vcf dosage=DS --maf 0.02 --export vcf vcf-dosage=DS-force --out crcsurvival_chr20_DS_Only_filtered

My code to check results:
# plink2 --vcf crcsurvival_chr20_DS_Only_filtered.vcf dosage=DS --freq --out crcsurvival_chr20_filter_allefreq_frequency

Do you see the mistake?

Lea Lemler

unread,

Dec 23, 2022, 7:25:25 AM12/23/22

to plink2-users

Hi Chris,

Please ignore my previous email. I have solved it now.

I need holidays maybe..

Thank you for your support and merry Christmas,

Lea

Reply all

Reply to author

Forward