Mismatched IDs


Edward Blake

Jul 15, 2024, 9:46:29 AM
to plink2-users
Hello,

I have been stuck on this problem for a couple of days.

I have 1000+ Manta VCF files that were merged with Jasmine. Note that these are structural variants, not SNPs.

I used a function to extract all the IIDs, then created a .psam file with the required columns and the corresponding data assigned to each IID:
#FID IID SEX Phenotype Age

I then did many checks to see if anything was mismatched (nothing is).

Afterwards, I created a keep file to analyze around 75% of the samples in the .psam file.

I then proceeded to run the following:
#!/bin/bash

# Hardcoded paths to the files
VCF_FILE="/SV_VCF/manta_results/manta_samples_merge/statistics/jasmine_merged_noreplicates_HLA_fixed_strands_50bp_sorted.vcf"
OUTPUT_DIR="/homes/eblak01/rtest/sv_pipeline/input/jas"
OUTPUT_PREFIX="${OUTPUT_DIR}/output"
PSAM_FILE="${OUTPUT_DIR}/all_SVs_VCF.psam"
KEEP_FILE="${OUTPUT_DIR}/all_SVs_VCF-CaCon.psam"

# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Convert VCF to PLINK2 format with sex information and handling pseudoautosomal regions
plink2 --vcf "$VCF_FILE" \
--make-pgen \
--out "$OUTPUT_PREFIX" \
--psam "$PSAM_FILE" \
--allow-extra-chr \
--vcf-half-call m \
--split-par hg38 \
--keep "$KEEP_FILE"


echo "Conversion complete. Output files are located at ${OUTPUT_PREFIX}"




Output:
./plink_file_setup.sh

PLINK v2.00a5.12LM AVX2 Intel (25 Jun 2024)    www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /homes/eblak01/rtest/sv_pipeline/input/jas/output.log.
Options in effect:
  --allow-extra-chr
  --keep /homes/eblak01/rtest/sv_pipeline/input/jas/all_SVs_VCF-CaCon.psam
  --make-pgen
  --out /homes/eblak01/rtest/sv_pipeline/input/jas/output
  --psam /homes/eblak01/rtest/sv_pipeline/input/jas/all_SVs_VCF.psam
  --split-par hg38
  --vcf /SV_VCF/manta_results/manta_samples_merge/statistics/jasmine_merged_noreplicates_HLA_fixed_strands_50bp_sorted.vcf
  --vcf-half-call m

Start time: Mon Jul 15 07:36:34 2024
772687 MiB RAM detected, ~568796 available; reserving 386343 MiB for main
workspace.
Using up to 80 threads (change this with --threads).
Error: Mismatched IDs between --vcf file and
/homes/eblak01/rtest/sv_pipeline/input/jas/all_SVs_VCF.psam.
End time: Mon Jul 15 07:36:34 2024
Conversion complete. Output files are located at /homes/eblak01/rtest/sv_pipeline/input/jas/output
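Note that the last line of the output above says "Conversion complete" even though plink2 stopped with an error, because the echo in the script runs unconditionally. A minimal sketch of a wrapper that reports success only when the command exits with status 0 follows; the function name run_conversion is made up for illustration and is not part of plink2.

```shell
#!/bin/bash
# run_conversion CMD [ARGS...]
# Hypothetical helper: run the given command and print the success message
# only if it exits with status 0; otherwise report the failure on stderr.
run_conversion() {
  if "$@"; then
    echo "Conversion complete."
  else
    status=$?   # exit status of the command that just failed
    echo "Conversion failed (exit status ${status}); check the .log file." >&2
    return "$status"
  fi
}
```

It would be called as run_conversion plink2 --vcf "$VCF_FILE" ... with the same options as in the script.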



This method works perfectly for my merged SNP VCF.

But for some reason, no matter what I try, there is always a mismatch for my Manta VCF file (merged with Jasmine). Please note that I have also tried this method with Manta VCF files merged with SURVIVOR, and a mismatch still occurs.

Could this problem be due to the fact that I am using structural variants instead of SNPs?

Is there an option that can give me a more detailed explanation of why the IDs are mismatched?
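One way to get more detail than the one-line error is to compare the two ID lists directly. The sketch below is not a plink2 feature; it is a plain shell helper, and the function names (vcf_ids, psam_ids, compare_ids) are made up for illustration. It assumes an uncompressed VCF and that IID is the second column of the .psam, as in the "#FID IID SEX Phenotype Age" header. It prints IDs that occur in only one of the two files and the first position where the order differs.

```shell
#!/bin/bash
# Print the sample IDs from the #CHROM header line of an uncompressed VCF,
# one per line (samples start at column 10).
vcf_ids() {
  awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1"
}

# Print the IID column of a .psam, skipping header and empty lines.
# Assumption: IID is column 2 (the file has a leading FID column).
psam_ids() {
  awk '!/^#/ && NF { print $2 }' "$1"
}

# compare_ids VCF PSAM
# Report IDs found in only one file, then the first line where the order differs.
compare_ids() {
  tmp=$(mktemp -d)
  vcf_ids "$1" > "$tmp/vcf"
  psam_ids "$2" > "$tmp/psam"
  sort "$tmp/vcf" > "$tmp/vcf.sorted"
  sort "$tmp/psam" > "$tmp/psam.sorted"

  # Set differences (comm needs sorted input).
  comm -23 "$tmp/vcf.sorted" "$tmp/psam.sorted" | sed 's/^/only_in_vcf /'
  comm -13 "$tmp/vcf.sorted" "$tmp/psam.sorted" | sed 's/^/only_in_psam /'

  # Position-by-position comparison in the original file order.
  paste "$tmp/vcf" "$tmp/psam" |
    awk -F'\t' '$1 != $2 { printf "first_order_mismatch %d vcf=%s psam=%s\n", NR, $1, $2; exit }'

  rm -rf "$tmp"
}
```

Running compare_ids "$VCF_FILE" "$PSAM_FILE" prints nothing when both files hold the same IIDs in the same order. It only looks at IIDs, so differences in the FID column would not show up here.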

 

Christopher Chang

Jul 15, 2024, 8:46:38 PM
to plink2-users
What are the first 10 sample IDs in the VCF file, and the first 11 lines of the .psam?

Edward Blake

Jul 16, 2024, 3:34 AM
to Christopher Chang, plink2-users
First 10 IDs from my VCF:
(stats) eblak01@targetid-01:/scripts$ bcftools query -l jasmine_merged_noreplicates_HLA_fixed_strands_50bp_sorted.vcf
D799
D354
D706
D773
D321
D474
D401
D868
D108
D933

Head of the .psam file (note it has been reordered by IID):
(stats) eblak01@targetid-01:/scripts$ head all_SVs_VCF.psam
#FID IID SEX Phenotype Age
5 D1 1 3 40
2 D10 1 3 43
35 D100 2 3 67
D1000 D1000 1 1 34
D1001 D1001 2 1 -9
D1002 D1002 1 1 -9
D1003 D1003 1 1 32
D1004 D1004 1 2 63
D1005 D1005 1 1 47
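The VCF lists its samples starting with D799, D354, D706, while this .psam is sorted by IID, so the two files are in different orders. Whether the ordering is what triggers the error is not confirmed in this thread. On the assumption that it matters, here is a minimal sketch that rewrites a .psam so its rows follow the order of an ID list such as the output of bcftools query -l. The function name reorder_psam and the file names in the usage line are hypothetical, and IID is assumed to be column 2.

```shell
#!/bin/bash
# reorder_psam IDS_FILE PSAM
# Print the .psam header line(s), then the data rows in the order given by
# IDS_FILE (one IID per line). IDs that are missing from the .psam are skipped.
# Assumption: IID is the second column of the .psam.
reorder_psam() {
  awk 'NR == FNR { if (/^#/) print; else row[$2] = $0; next }
       ($1 in row) { print row[$1] }' "$2" "$1"
}
```

For example: bcftools query -l merged.vcf > vcf_ids.txt, followed by reorder_psam vcf_ids.txt all_SVs_VCF.psam > all_SVs_VCF.vcf_order.psam.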

The 10 IDs from the VCF, located in the .psam:
(stats) eblak01@targetid-01:/scripts$  awk 'NR==1 || /D799|D354|D706|D773|D321|D474|D401|D868|D108|D933/' all_SVs_VCF.psam

#FID IID SEX Phenotype Age
53 D108 1 1 33
D1080 D1080 1 1 53
D1081 D1081 1 1 -9
D1084 D1084 1 1 66
D1085 D1085 1 1 52
D1086 D1086 1 1 43
D321 D321 2 1 28
D354 D354 1 1 65
D401 D401 1 2 72
D474 D474 1 2 61
D706 D706 1 2 36
68 D773 1 3 52
D799 D799 1 2 57
D868 D868 2 1 -9
D933 D933 2 1 23
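The awk pattern above matches substrings anywhere on the line, which is why rows such as D1080 and D1081 appear alongside D108. A minimal sketch of an exact lookup on the IID column is below; the function name lookup_iids is made up for illustration, the wanted IDs are read from a file with one ID per line, and IID is assumed to be column 2.

```shell
#!/bin/bash
# lookup_iids IDS_FILE PSAM
# Print the .psam header plus every row whose IID (column 2) is exactly
# equal to one of the IDs listed in IDS_FILE.
lookup_iids() {
  awk 'NR == FNR { want[$1]; next }   # first file: remember each wanted ID
       /^#/ || ($2 in want)           # second file: header or exact IID match
      ' "$1" "$2"
}
```

With an IDs file containing D108 and D799, this returns only the rows for D108 and D799 and no longer picks up D1080.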


