plink merging different study cohorts


Henry Lu

Feb 22, 2022, 2:24:34 PM
to plink2-users
Dear plink community

I am new to plink 1.9 and have a question about merging different study cohorts (say studies A, B and C, all based on hg38). Currently, I have .bed, .bim and .fam files for all of them.

When I merged them using --merge-list, I did not get an error, but I want to make sure I am doing it correctly.

1) I expect the SNPs from the studies should not overlap entirely. Is the result based on their union or intersection?

2) I used R to read the bim files and checked whether the chromosome number, SNP position, ALT and REF columns match exactly among the 3 studies. Only about 2/3 of them match. However, I did not get any error message after merging. Does plink merging handle flipping automatically? Or do I need to flip manually and remove ambiguous and non-matching SNPs?

I tried looking for solutions in the documentation/online but was not able to find recent relevant threads. The most relevant one was about 8 years ago: https://www.biostars.org/p/101191/
which I would expect to be out of date?

In addition, there was no .missnp output created when I merged... 

Thanks a lot!!!
 


Christopher Chang

Feb 24, 2022, 11:03:32 AM
to plink2-users
1. By default, the result is based on the union.  (When plink 2.0 --pmerge-list is finished, there will be an option to generate just the intersection instead.)

2. No, flipping is not handled automatically.  See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
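
As a rough illustration of the kind of check that has to happen before merging, here is a minimal Python sketch that compares the allele pair of one SNP across two .bim files and labels the relationship. The function name `classify` and the category labels are hypothetical, not part of plink, and the sketch only handles single-base A/C/G/T alleles. SNPs labelled as strand flips are the ones that plink's --flip is meant for; A/T and C/G SNPs cannot be resolved from the alleles alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify(alleles1, alleles2):
    """Compare the (A1, A2) allele pair of one SNP in two datasets.

    Hypothetical helper: the labels are illustrative, not plink output.
    """
    a = tuple(x.upper() for x in alleles1)
    b = tuple(x.upper() for x in alleles2)
    if any(x not in COMPLEMENT for x in a + b):
        return "unsupported"          # indels, missing allele codes, etc.
    if COMPLEMENT[a[0]] == a[1]:
        return "ambiguous"            # A/T or C/G: strand is not inferable
    if a == b:
        return "match"
    if a == b[::-1]:
        return "swapped"              # same alleles, opposite column order
    flipped = tuple(COMPLEMENT[x] for x in b)
    if a == flipped:
        return "strand_flip"
    if a == flipped[::-1]:
        return "strand_flip_swapped"
    return "mismatch"                 # e.g. a third allele is involved


for pair in [(("A", "G"), ("G", "A")), (("A", "G"), ("T", "C")),
             (("A", "T"), ("T", "A")), (("G", "A"), ("G", "T"))]:
    print(pair, classify(*pair))
```

Only the "mismatch" and "strand_flip" cases would be expected to cause trouble in a plink 1.9 merge; "swapped" pairs carry the same two alleles, so they are not a conflict by themselves.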

Henry Lu

Feb 28, 2022, 8:26:33 PM
to plink2-users

1. I used conda to install plink2, which does not have this --pmerge option. Do you expect plink2 in conda to be updated in the near future?
2. If my dataset1 has REF=A and ALT=G, and my dataset2 has REF=G and ALT=A for the same SNP, is that not handled either? (I suppose by flipping you meant strand flipping?) If it is not handled, can I simply change the bim file without changing the bed and fam? Or should I do something else?

Many thanks

Henry Lu

Feb 28, 2022, 8:45:05 PM
to plink2-users
More on Q2: What if I have REF=G, ALT=A in dataset1 and REF=G, ALT=T in dataset2?

Christopher Chang

Mar 2, 2022, 12:18:34 PM
to plink2-users
1. It will probably be updated in conda shortly after --pmerge is completed.

2. In plink 1.x, REF/ALT order is not preserved in general.  The genotypes will be merged correctly, but you'd need to use e.g. --a2-allele to restore REF/ALT order afterwards.
You MUST NOT change the bim file without changing the bed, since that will result in incorrect genotypes (e.g. an A/A genotype will become G/G if you change REF=A, ALT=G to REF=G, ALT=A in the bim without making the corresponding change to the bed).
plink 1.x merge will error out if you try to merge a variant with REF=G, ALT=A with another variant with REF=G, ALT=T; "triallelic" variants are not supported by the bed file format (but they are supported by plink2's pgen format).
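
To make the warning about editing the bim concrete, here is a small Python sketch of how PLINK 1 .bed genotype codes are interpreted. Each genotype is a 2-bit code that points at the two allele columns of the .bim line, so the same bytes decode to different genotypes if those columns are swapped by hand. The helper names `decode` and `mergeable` are hypothetical and for illustration only.

```python
def decode(code, a1, a2):
    """Translate one 2-bit .bed genotype code using the .bim allele columns."""
    if code == 0b00:
        return a1 + "/" + a1      # homozygous for the first .bim allele
    if code == 0b10:
        return a1 + "/" + a2      # heterozygous
    if code == 0b11:
        return a2 + "/" + a2      # homozygous for the second .bim allele
    return "missing"              # 0b01


def mergeable(alleles1, alleles2):
    """A .bed variant can hold at most two distinct alleles after merging."""
    return len(set(alleles1) | set(alleles2)) <= 2


codes = [0b00, 0b10, 0b11, 0b01]            # stored genotypes, unchanged
print([decode(c, "A", "G") for c in codes])  # original .bim allele columns
print([decode(c, "G", "A") for c in codes])  # .bim edited by hand, .bed untouched
print(mergeable(("G", "A"), ("G", "T")))     # three alleles in total
```

With the hand-edited allele columns, the first sample reads as G/G instead of A/A, which is the corruption described above. The last line shows why a G/A variant and a G/T variant at the same site cannot share one .bed record.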

Henry Lu

Mar 5, 2022, 6:26:49 PM
to plink2-users
For part 2: If I would like to change only the SNP names, is it OK to edit just the bim file?

Additionally, I could not find the formula for --score with the 'center' modifier in plink 1.9. Would you mind sharing that?

Thanks!!!

Christopher Chang

Mar 7, 2022, 11:20:26 AM
to plink2-users
1. Yes, it is safe to directly change the bim file if you are just updating SNP names.
2. https://www.cog-genomics.org/plink/1.9/score includes a concrete example of what 'center' does.  If that isn't enough, please spell out what you're confused by.
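
For the renaming case, a minimal Python sketch of rewriting only the ID column of a .bim line might look like the following. It assumes the standard six whitespace-separated .bim columns (chromosome, variant ID, cM position, bp position, allele 1, allele 2). The `chrom:pos:allele:allele` naming scheme and the function name `rename_bim_line` are assumptions chosen for illustration, not a plink convention. The number and order of lines stay the same, which is why the .bed does not need to change.

```python
def rename_bim_line(line):
    """Return a .bim line with its variant ID replaced by chrom:pos:alleles.

    Alleles are sorted inside the ID so that the same site gets the same
    name regardless of which allele column each dataset puts first.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns in a .bim line, got %d" % len(fields))
    chrom, _old_id, cm, pos, a1, a2 = fields
    new_id = ":".join([chrom, pos] + sorted([a1, a2]))
    # Only the ID changes; allele columns and their order are left alone.
    return "\t".join([chrom, new_id, cm, pos, a1, a2])


print(rename_bim_line("1\trs123\t0\t10583\tG\tA"))
print(rename_bim_line("1\tSNP_A-1\t0\t10583\tA\tG"))
```

This scheme does not account for strand differences, and it can produce duplicate IDs if a file contains more than one record at the same site with the same alleles, so the result is worth checking for duplicates before merging.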

Henry Lu

Apr 5, 2022, 12:13:59 PM
to plink2-users
2. Thanks. I would like to formulate the equation. In general, PRS for an individual i = sum of beta * G. I wonder how the center step changes this equation. To me, centring a variable usually means subtracting its mean and dividing by its standard deviation, which makes me wonder how this is related to MAF, as described in the documentation below:

  • Alternatively, you can use the 'center' modifier to shift all scores to mean zero. For example, if the minor allele is assigned score 4.5, and its loaded/imputed frequency is 0.2, 'center' makes minor allele observations contribute +3.6 instead of +4.5 to score, and major allele observations contribute -0.9 instead of 0.
That's why I wonder if it is possible to have the formula that calculates this centered score.

thanks!

Christopher Chang

Apr 8, 2022, 10:08:46 PM
to plink2-users
In the plink documentation, "center" refers to mean-centering only, while "standardize" refers to the combination of mean-centering and variance-standardization.
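
Read that way, the documentation example quoted above is consistent with a per-allele contribution of weight * (x - f), where x is 1 if the observed allele is the scored allele and 0 otherwise, and f is the scored allele's frequency. Summed over the two alleles of a genotype with dosage G, that becomes weight * (G - 2f). The Python sketch below only reproduces the arithmetic of that example; the function names are hypothetical, and how plink aggregates the contributions into the reported score (sum or average) depends on the other --score modifiers.

```python
import math


def allele_contribution(weight, is_scored_allele, freq, center=False):
    """Contribution of a single allele observation to the score."""
    x = 1.0 if is_scored_allele else 0.0
    if center:
        x -= freq                 # mean-centering only, no variance scaling
    return weight * x


def genotype_contribution(weight, dosage, freq):
    """Centered contribution of a genotype with 0, 1 or 2 scored alleles."""
    return weight * (dosage - 2.0 * freq)


weight, freq = 4.5, 0.2           # values from the documentation example
minor = allele_contribution(weight, True, freq, center=True)
major = allele_contribution(weight, False, freq, center=True)
print(round(minor, 6), round(major, 6))

# The expected centered contribution over alleles is zero.
expected = freq * minor + (1.0 - freq) * major
print(math.isclose(expected, 0.0, abs_tol=1e-12))
```

A heterozygous genotype contributes the sum of one minor-allele and one major-allele term, which matches `genotype_contribution(4.5, 1, 0.2)`.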