remove duplicates in .fam file

fasssster101

Apr 5, 2019, 5:46:47 PM
to plink2-users
Hello, 
My apologies if this has already been answered... I am trying to update my *.bed, *.bim, and *.fam files by removing duplicate samples (not SNPs). These are individuals with the same FID and IID.

I don't see an option in plink to do this. If I use either the --keep or --remove option, how do I ensure that only one of the two duplicates is removed?

Thank you! 

Christopher Chang

Apr 5, 2019, 5:54:12 PM
to plink2-users
If you want to keep one sample in each group, you have to use your own script to modify the .fam file to assign unique IDs; otherwise plink has no way to know which sample in each duplicate group to keep.

(plink 2.0 allows for an optional "source ID" field to distinguish multiple samples from the same individual, which is respected by --keep/--remove/etc.  But the FID+IID+SID combination is still required to be unique there.)

fasssster101

Apr 5, 2019, 6:05:12 PM
to plink2-users
That sounds great. One more question.

After modifying the .fam file, will this affect the *.bim or *.bed files in any way?

My plan is to modify the *.fam file so that the second duplicate's ID ends in _1, and then remove the samples with _1.

Thank you! 

Christopher Chang

Apr 5, 2019, 6:55:06 PM
to plink2-users
As long as you don't reorder the .fam file or change the number of entries, this doesn't desynchronize from the .bed or .bim file, so your plan is fine.
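
The plan above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the "_1", "_2", ... suffix scheme are made up for this example and are not something plink provides. The sketch tags the second and later occurrences of each FID+IID pair and collects the tagged IDs for --remove, without reordering or dropping any .fam line.

```python
from collections import Counter


def tag_duplicate_ids(fam_lines):
    """Return (new_fam_lines, remove_ids) for the lines of a .fam file.

    The 2nd, 3rd, ... occurrence of an FID+IID pair gets "_1", "_2", ...
    appended to its IID (hypothetical naming scheme). Line order and line
    count are unchanged, so the result still lines up with the .bed file.
    """
    rows = [line.split() for line in fam_lines]
    original = {(r[0], r[1]) for r in rows}
    seen = Counter()
    remove_ids = []
    for r in rows:
        key = (r[0], r[1])
        if seen[key]:
            r[1] = f"{r[1]}_{seen[key]}"
            # Refuse to create an ID that already exists in the file.
            if (r[0], r[1]) in original:
                raise ValueError(f"suffix collides with existing ID: {r[0]} {r[1]}")
            remove_ids.append((r[0], r[1]))
        seen[key] += 1
    return [" ".join(r) for r in rows], remove_ids
```

The returned lines would be written back as the new .fam (after backing up the old one), and the returned FID/IID pairs written one per line to a text file that is then passed to --remove together with --make-bed.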

MAOMAO

Mar 19, 2020, 5:00:45 PM
to plink2-users
Hi Dr. Chang,

I am trying to change the family ID to 0 in a .ped file with

awk '{$1 = "0"; print}' merged_clean1.ped > change.ped

but it gives me an error:

awk: program limit exceeded: maximum number of fields size=32767
        FILENAME="merged_clean1.ped" FNR=1 NR=1

I am not sure how to fix it. However, it works when I use

awk '{$1 = "0"; print}' merged_clean1.fam > change.fam

My .ped file looks like below

1  2002034 0 0 1 0 A C C A T .......
2  2009036 0 0 1 0 A T A  T T ......
3  2000349 0 0 2 0 T A T C A ....
...
...

The reason I want to change the FID to 0 is that I have 3 plates of samples, and on each plate the FIDs were assigned as consecutive numbers from 1 to 96. When I run the duplicate check with --genome --rel-check, all PI_HAT values are less than 0.4, although there are supposed to be a couple of duplicates.

Would you give me some suggestions? 

Thanks in advance!


MAOMAO

Ying Zhao

Mar 20, 2020, 11:28:04 AM
to MAOMAO, plink2-users
Hi Christopher,

Would you help me out? Thanks. See below for details.

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/cfda01ed-813d-4e0b-babc-4cbba2b515b9%40googlegroups.com.

Christopher Chang

Mar 20, 2020, 11:49:50 AM
to plink2-users
What is there to help with?  You already found that .fam files work with awk, and there are many other reasons to use .bed+.bim+.fam over .ped+.map.
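
For readers who do need to edit a .ped file directly: the awk error above comes from awk splitting each line into every field, and a .ped line has two allele fields per variant on top of the six leading columns. A hedged Python sketch that only splits off the first field avoids that limit. The function names here are hypothetical, and editing the .fam of a .bed+.bim+.fam fileset, as recommended above, remains the simpler route.

```python
def set_first_field(line, value="0"):
    """Replace the first whitespace-delimited field of `line` with `value`.

    Only one split is performed, so the number of fields on the line does
    not matter. The separator after the first field becomes a single space;
    everything after it is kept as it was.
    """
    body = line.rstrip("\n")
    parts = body.split(None, 1)
    if not parts:
        return line  # blank line: nothing to change
    rest = parts[1] if len(parts) == 2 else ""
    newline = "\n" if line.endswith("\n") else ""
    return (f"{value} {rest}" if rest else value) + newline


def rewrite_first_field(src_path, dst_path, value="0"):
    """Stream src_path to dst_path line by line, rewriting the first field."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(set_first_field(line, value))
```

With the file names from the question, this would be called as rewrite_first_field("merged_clean1.ped", "change.ped").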


Ying Zhao

Mar 23, 2020, 9:03:24 AM
to plink2-users

Hi Christopher,

Thanks for your reply.

I still have some questions: 

1) Does the family ID really affect the IBD output? I am trying to identify the duplicates in my samples. However, no duplicates show up: all the PI_HAT values are between 0 and 0.4 (there were supposed to be four duplicates). I wonder whether the problem comes from the family IDs, because they were assigned as consecutive numbers (1, 2, …, 96) on each sample plate. After merging the plates, the family IDs all run from 1 to 96, so I am trying to change all family IDs to "0".

If it works with .fam files, how can I use --bfile, which includes the .bed file, to check for duplicates? Or is there an easier way to do this?

2) Does every study need to check heterozygosity? Also, when I prune SNPs, I want to exclude high-inversion regions. How can I generate a list of those?

3) I used plink 1.9 with the --check-sex option to check for sex discrepancies, and 4% of the samples are discrepant (16 of 400). Is there any way to check whether 4% is acceptable or within the expected error?


Thanks in advance.

 MAOMAO


Christopher Chang

Mar 23, 2020, 11:24:48 AM
to plink2-users
1. Family ID, on its own, normally doesn't affect the rest of --genome's output.  It affects your output because you added --rel-check; so the simplest solution to your problem is just to remove --rel-check.

It may still be useful to change all the family IDs to 0 for other reasons; as mentioned above, there's nothing wrong with using awk to do this.

As for --bfile, I have no idea what you are even asking.  Please post command lines and .log files showing what you're trying and failing to do.

2. Presumably you have, or can download, a list of high-inversion regions.  If this is formatted as a UCSC interval-BED file, plink 2.0 "--exclude bed0 <interval-BED filename>" can then be used to remove those regions.

3. From the --check-sex documentation: "We suggest running --check-sex once without parameters, eyeballing the distribution of F estimates (there should be a clear gap between a very tight male clump at the right side of the distribution and the females everywhere else), and then rerunning with parameters corresponding to the empirical gap."  If you do this, most or all of your 16 discrepancies should go away, unless you have very bad data.
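
As a rough aid for point 3, the "empirical gap" can be approximated by looking for the largest jump between adjacent sorted F estimates. This is only a sketch: it assumes the .sexcheck output has a header row with a column named F, and the largest-gap heuristic is an illustration chosen here, not plink's method. A histogram of the F values is still the more reliable way to eyeball the distribution, since a single outlier can mislead this heuristic.

```python
import math


def read_f_column(sexcheck_lines):
    """Parse the F column from the lines of a .sexcheck file (assumed layout:
    whitespace-delimited with a header row containing a column named "F")."""
    rows = (line.split() for line in sexcheck_lines)
    header = next(rows)
    f_idx = header.index("F")
    return [float(r[f_idx]) for r in rows if r]


def largest_f_gap(f_values):
    """Return (low, high): the two adjacent sorted F estimates that are
    farthest apart. NaN entries are ignored."""
    vals = sorted(v for v in f_values if not math.isnan(v))
    if len(vals) < 2:
        raise ValueError("need at least two finite F estimates")
    gaps = [(b - a, a, b) for a, b in zip(vals, vals[1:])]
    _, low, high = max(gaps)
    return low, high
```

Thresholds chosen inside the returned interval could then be used when rerunning --check-sex with parameters, as the documentation quoted above suggests.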

Ying Zhao

Mar 23, 2020, 1:27:05 PM
to plink2-users

Hi Christopher,

Thanks for your answers and help.

I have resolved two of them, but there is still one I have not figured out. I tried to identify the sample duplicates (there are supposed to be 4), but it did not work: the PI_HAT values are still all less than 0.4 (attached are the .log file and part of the plink.genome file). I have already changed all the family IDs to 0. All the paternal and maternal IDs are 0 as well. Is that the problem? How can I figure this out? Any suggestions?

Thanks a lot!

Maomao

Attachments: partplinkgenome.txt, plink_log.pdf

Christopher Chang

Mar 23, 2020, 1:33:20 PM
to plink2-users
What happens if you run "plink2 --bfile raw2 --make-king-table" instead?  (Duplicates should correspond to KINSHIP > 0.355; if you still don't have any, you probably made a mistake earlier on in your pipeline, since --make-king-table is *very* reliable for this purpose.)
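
To pull the likely duplicates out of the resulting table, something like the Python sketch below could be used. It assumes the table is whitespace-delimited with a header row whose first name starts with "#" and which contains a KINSHIP column plus IID1/IID2 (and optionally FID1/FID2) columns. The function name is hypothetical, and the 0.355 cutoff is the one given above.

```python
def duplicate_pairs(kin_lines, threshold=0.355):
    """Return the sample pairs whose KINSHIP exceeds `threshold`.

    Columns are located by header name, so the function works with or
    without FID columns. Each result is the ID fields followed by the
    kinship value as a float.
    """
    rows = (line.split() for line in kin_lines)
    header = [name.lstrip("#") for name in next(rows)]
    k_idx = header.index("KINSHIP")
    id_idx = [i for i, name in enumerate(header)
              if name in ("FID1", "IID1", "FID2", "IID2")]
    pairs = []
    for r in rows:
        if len(r) <= k_idx:
            continue  # skip blank or truncated lines
        kinship = float(r[k_idx])
        if kinship > threshold:
            pairs.append(tuple(r[i] for i in id_idx) + (kinship,))
    return pairs
```

One sample from each returned pair can then be written to a file for --remove.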

Ying Zhao

Mar 23, 2020, 2:01:24 PM
to plink2-users
Thanks, Christopher!

Is there a similar command in plink 1.9?

Christopher Chang

Mar 23, 2020, 2:05:09 PM
to plink2-users
No.

Ying Zhao

Mar 23, 2020, 2:54:55 PM
to plink2-users
Thanks so much, Christopher. It works. :)

Ying Zhao

Mar 24, 2020, 12:42:27 PM
to plink2-users

Hi Christopher,

When I try to remove the identified duplicates, I run

plink1.9 --bfile raw3 --remove duplicatesIDs.txt --make-bed --out sub_raw3

or

plink1.9 --file raw3 --remove duplicatesIDs.txt --make-bed --out sub_raw3

Neither of them removes the duplicates, and the output does not give me any error. The sample size is still the same. I am not sure where the problem is. Would you give me some suggestions? Thanks.

Here is duplicatesIDs.txt:

165   00003568M1
168   00003568M1_RE
174   00003568N1
178   00003568N1_RE

Best,

maomao

Christopher Chang

Mar 24, 2020, 12:50:33 PM
to plink2-users
Please post a .log file from a failed run.


Ying Zhao

Mar 24, 2020, 1:07:07 PM
to plink2-users
Hi Christopher,

Here is my log

Attachment: sub_raw3.log

Christopher Chang

Mar 24, 2020, 1:21:48 PM
to plink2-users
The .log file indicates that the IDs in the .fam/.ped files don't exactly match the IDs in duplicatesIDs.txt.  Look more carefully at the IDs in the .fam file.
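
A quick way to do that comparison is sketched below in Python. The function name is hypothetical, and the sketch only assumes that both files are whitespace-delimited with FID in the first column and IID in the second. For every entry in the remove list without an exact FID+IID match, it reports the FIDs under which that IID does appear in the .fam, which makes it easy to spot, for instance, an FID that was changed in one file but not in the other.

```python
def check_remove_list(fam_lines, remove_lines):
    """Return a list of (FID, IID, fam_fids) for remove-list entries that
    have no exact FID+IID match in the .fam.

    fam_fids is the sorted list of FIDs under which that IID appears in
    the .fam; it is empty if the IID is absent altogether.
    """
    fids_by_iid = {}
    for line in fam_lines:
        fields = line.split()
        if len(fields) >= 2:
            fids_by_iid.setdefault(fields[1], set()).add(fields[0])
    problems = []
    for line in remove_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        fid, iid = fields[0], fields[1]
        known_fids = fids_by_iid.get(iid, set())
        if fid not in known_fids:
            problems.append((fid, iid, sorted(known_fids)))
    return problems
```

An empty result means every FID+IID pair in the remove list is present in the .fam exactly as written.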


Ying Zhao

Mar 24, 2020, 4:32:22 PM
to plink2-users
Hi Christopher,

I checked my .fam file and the IDs match: the family IDs and sample IDs I want to remove are the same. The command did not give me an error, but it did not actually remove anything.

Christopher Chang

Mar 24, 2020, 4:39:36 PM
to plink2-users
Either check again, or post a dataset and .log file that allows me to replicate what you're seeing.  Without the ability to replicate, I can't do anything other than suggest that you check the IDs again, more carefully.


Krishi

Dec 8, 2022, 1:20:21 PM
to plink2-users
Hi Chang,

I am wondering how to remove duplicate genotypes. The family IDs and individual IDs are different in all these duplicate pairs, but PI_HAT is equal to 1.
Please let me know how to remove these individuals from the dataset.

I tried deleting them from the .fam file and using --recode to create .ped and .map files without the duplicates, but it gave me an error saying it can't open the .bed file.

Thank you in advance.

Cheers
Krishiiii
