Interpreting --recodeA / --recode A .raw files


Tom

Mar 20, 2024, 10:49:01 AM
to plink2-users

I am a very new PLINK user, so please forgive the superficial question.

I am using the --recode A flag in conjunction with --bfile (i.e. .bed / .bim / .fam files), e.g.

plink2 --bfile myfile --recode A --out myrawfile

My understanding is that myrawfile.raw should contain allelic dosage information for each variant (from column 7 onwards, one column per variant). I have not provided a reference genome or used any additional flags, so I would expect the coding to be as follows: 0 = homozygous for major allele; 1 = heterozygous; 2 = homozygous for minor allele. Is this correct?

I ask because I am getting some unusual results when analysing my .raw file, e.g. more than 4x as many minor-allele homozygotes (2s) as heterozygotes (1s), which seems unrealistic given my population. Am I misunderstanding the output of this file? I am using the 850k-variant UK Biobank genotyping array, if that helps at all.

When using the toy dataset from the old PLINK tutorial (PLINK: Whole genome data analysis toolset (harvard.edu)), I note the suffixes *_0 and *_1 that appear on the variant names (e.g. below). Are these relevant to interpreting the values? Any help would be appreciated, thank you.

[Attached image: Screenshot 2024-03-20 140645.jpg]

Christopher Chang

Mar 20, 2024, 1:10:19 PM
to plink2-users
https://www.cog-genomics.org/plink/2.0/formats#raw

For the toy dataset, "_0" means that the column reports the number of copies of the "0" allele, etc.

With plink2, the counted allele is NOT necessarily the minor allele. For the last several years, it has actually defaulted to the REF allele (because that is the least problematic choice once multiallelic variants are in the picture), which is usually the major allele. Please update your plink2 build: you must be using a very old one, and there have been several significant bugfixes since.
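
To make the header convention concrete, here is a minimal Python sketch (not part of PLINK) that reads a .raw file, splits each variant column name into variant ID and counted allele, and tallies the values in that column. It assumes six leading columns (FID IID PAT MAT SEX PHENOTYPE), headers of the plain form <variant ID>_<counted allele>, and "NA" for missing calls; the sample data and the function name summarize_raw are made up for illustration.

```python
import io
from collections import Counter

# Hypothetical miniature .raw content; the values are made up for illustration.
SAMPLE_RAW = """\
FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G
F1 I1 0 0 1 -9 2 0
F2 I2 0 0 2 -9 2 1
F3 I3 0 0 1 -9 1 NA
F4 I4 0 0 2 -9 0 0
"""

def summarize_raw(handle, n_meta=6):
    """Map (variant ID, counted allele) to a tally of the values in that column.

    Assumes n_meta leading sample columns and one column per variant after them.
    """
    header = handle.readline().split()
    columns = []
    for name in header[n_meta:]:
        # Split on the LAST underscore, since variant IDs may contain underscores.
        variant_id, _, allele = name.rpartition("_")
        columns.append((variant_id, allele))
    counts = [Counter() for _ in columns]
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        for counter, value in zip(counts, fields[n_meta:]):
            counter[value] += 1
    return {col: dict(c) for col, c in zip(columns, counts)}

if __name__ == "__main__":
    for (variant_id, allele), tally in summarize_raw(io.StringIO(SAMPLE_RAW)).items():
        print(variant_id, "counts copies of", allele, tally)
```

A column in which 2s far outnumber 0s suggests that the counted allele is the common one in the dataset, which is one way to end up with many more 2s than 1s.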

Tom

Mar 22, 2024, 7:15:04 AM
to plink2-users
Dear Christopher,

Thank you for your answer. Is there any command/flag that I can use so that the minor allele is counted/reported for all genotypes? As I am still very new to PLINK there may be one, but I am struggling to interpret the documentation. Since I am really interested in calculating individual ratios of heterozygous sites to homozygous minor-allele sites, my colleague suggested that I might be better off using the "Linear scoring" function instead?

Tom

Mar 22, 2024, 7:16:39 AM
to plink2-users
counted/reported for all variants?*

Christopher Chang

Mar 22, 2024, 12:25:23 PM
to plink2-users
With PLINK 2.0, you can use --maj-ref + --make-bed/--make-pgen to save a dataset with all major* alleles set to REF.  Then, "--sample-counts cols=fid,homalt,het" on that dataset is an efficient way to get the counts you want.

But first, you probably need to update your plink2 build; it looks like you are using a build too old to have --sample-counts.

*: This will normally use allele frequencies observed in your dataset.  Use --read-freq to specify different allele frequencies.
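
For readers who only have a .raw file, the same idea can be sketched in plain Python: estimate each column's counted-allele frequency from the data, flip columns where the counted allele is the major one, then count heterozygous and homozygous-minor calls per sample. This is an illustration of the logic only, not PLINK's implementation. It assumes six leading sample columns and "NA" for missing calls, leaves exact 0.5 ties unflipped (an arbitrary choice), and ignores non-integer dosages; the sample data and the function name per_sample_minor_counts are hypothetical.

```python
import io

# Hypothetical miniature .raw content; the values are made up for illustration.
SAMPLE_RAW = """\
FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G
F1 I1 0 0 1 -9 2 0
F2 I2 0 0 2 -9 2 1
F3 I3 0 0 1 -9 1 NA
F4 I4 0 0 2 -9 0 0
"""

def per_sample_minor_counts(handle, n_meta=6):
    """Return {(FID, IID): (het_count, hom_minor_count)} from .raw text."""
    handle.readline()  # skip the header line
    ids, rows = [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        ids.append((fields[0], fields[1]))
        rows.append([None if v == "NA" else float(v) for v in fields[n_meta:]])

    n_variants = len(rows[0]) if rows else 0
    # Flip a column when its counted allele is the major allele in this dataset.
    flip = []
    for j in range(n_variants):
        observed = [row[j] for row in rows if row[j] is not None]
        freq = sum(observed) / (2 * len(observed)) if observed else 0.0
        flip.append(freq > 0.5)

    result = {}
    for sample, row in zip(ids, rows):
        het = hom_minor = 0
        for value, flipped in zip(row, flip):
            if value is None:
                continue  # missing call
            minor_copies = 2 - value if flipped else value
            if minor_copies == 1:
                het += 1
            elif minor_copies == 2:
                hom_minor += 1
        result[sample] = (het, hom_minor)
    return result

if __name__ == "__main__":
    for sample, (het, hom_minor) in per_sample_minor_counts(io.StringIO(SAMPLE_RAW)).items():
        print(sample, "het:", het, "hom minor:", hom_minor)
```

On a full-size array the --maj-ref and --sample-counts route described above will be far more efficient; the sketch is only meant to show what is being counted.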

Tom

Mar 26, 2024, 5:26:23 AM
to plink2-users
Dear Christopher,

I think I replied to this message privately--rather than publicly--by mistake. Apologies. I can re-post my message publicly if you think that it would be of interest to other forum users.

Thank you.
