I have been stuck on this problem for a couple of days.
I have 1000+ Manta VCF files that were merged with Jasmine. Note that these are structural variants, not SNPs.
I used a function to extract all the IIDs, then created a .psam file with the required columns and the corresponding data assigned to each IID:
#FID IID SEX Phenotype Age
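For reference, the sample IDs in a VCF come from its `#CHROM` header line, where the first nine columns are fixed fields and the samples start at column 10. A minimal sketch of listing them, assuming an uncompressed VCF; the function name `vcf_sample_ids` and the file name `merged.vcf` are illustrative and not necessarily the function used above:

```shell
#!/bin/bash
# Print the sample IDs from a VCF header, one per line.
# Assumes an uncompressed, tab-delimited VCF.
vcf_sample_ids() {
    local vcf="$1"
    # Take the first #CHROM line, drop the 9 fixed columns, one ID per line.
    grep -m1 '^#CHROM' "$vcf" | cut -f10- | tr '\t' '\n'
}

# Hypothetical usage:
# vcf_sample_ids merged.vcf > vcf_ids.txt
```

The resulting list keeps the samples in the same order as they appear in the VCF, which is useful when building the .psam from it.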
I then ran many checks to see whether anything was mismatched (nothing was).
Afterwards, I created a keep file so that only around 75% of the samples in the .psam file are analysed.
Then I proceeded to run the following:
#!/bin/bash
# Hardcoded paths to the files
VCF_FILE="/SV_VCF/manta_results/manta_samples_merge/statistics/jasmine_merged_noreplicates_HLA_fixed_strands_50bp_sorted.vcf"
OUTPUT_DIR="/homes/eblak01/rtest/sv_pipeline/input/jas"
OUTPUT_PREFIX="${OUTPUT_DIR}/output"
PSAM_FILE="${OUTPUT_DIR}/all_SVs_VCF.psam"
KEEP_FILE="${OUTPUT_DIR}/all_SVs_VCF-CaCon.psam"
# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"
# Convert VCF to PLINK2 format with sex information and handling pseudoautosomal regions
plink2 --vcf "$VCF_FILE" \
  --make-pgen \
  --out "$OUTPUT_PREFIX" \
  --psam "$PSAM_FILE" \
  --allow-extra-chr \
  --vcf-half-call m \
  --split-par hg38 \
  --keep "$KEEP_FILE"
echo "Conversion complete. Output files are located at ${OUTPUT_PREFIX}"
Output:
./plink_file_setup.sh
PLINK v2.00a5.12LM AVX2 Intel (25 Jun 2024)   www.cog-genomics.org/plink/2.0/
(C) 2005-2024 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /homes/eblak01/rtest/sv_pipeline/input/jas/output.log.
Options in effect:
--allow-extra-chr
--keep /homes/eblak01/rtest/sv_pipeline/input/jas/all_SVs_VCF-CaCon.psam
--make-pgen
--out /homes/eblak01/rtest/sv_pipeline/input/jas/output
--psam /homes/eblak01/rtest/sv_pipeline/input/jas/all_SVs_VCF.psam
--split-par hg38
--vcf /SV_VCF/manta_results/manta_samples_merge/statistics/jasmine_merged_noreplicates_HLA_fixed_strands_50bp_sorted.vcf
--vcf-half-call m
Start time: Mon Jul 15 07:36:34 2024
772687 MiB RAM detected, ~568796 available; reserving 386343 MiB for main
workspace.
Using up to 80 threads (change this with --threads).
Error: Mismatched IDs between --vcf file and
/homes/eblak01/rtest/sv_pipeline/input/jas/all_SVs_VCF.psam.
End time: Mon Jul 15 07:36:34 2024
Conversion complete. Output files are located at /homes/eblak01/rtest/sv_pipeline/input/jas/output
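Note that the last line of the output says "Conversion complete" even though plink2 stopped with an error, because the script's `echo` runs unconditionally. A minimal sketch of reporting success only when the command exits with status 0; the helper name `run_step` is hypothetical:

```shell
#!/bin/bash
# Run a command and report success only if it exits with status 0.
# Usage: run_step LABEL COMMAND [ARGS...]
run_step() {
    local label="$1"
    shift
    if "$@"; then
        echo "${label} complete."
    else
        # $? still holds the exit status of the failed command here.
        local status=$?
        echo "${label} failed (exit status ${status})." >&2
        return "$status"
    fi
}

# Hypothetical usage with the variables from the script above:
# run_step "Conversion" plink2 --vcf "$VCF_FILE" --make-pgen --out "$OUTPUT_PREFIX"
```

With this wrapper the misleading success message would not be printed when plink2 reports mismatched IDs.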
This method works perfectly for my merged SNP VCF.
But for some reason, no matter what I try, there is always a mismatch for my Manta VCF file (merged with Jasmine). Please note that I have also tried this method with Manta VCF files merged with SURVIVOR, and the mismatch still occurs.
Could this problem be due to the fact that I am using Structural Variants instead of SNPs?
Is there some function that can give me a more detailed explanation of why the IDs are mismatched?
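The error message only says that the IDs differ, not where. One way to get more detail is to compare the VCF header against the IID column of the .psam position by position. The sketch below assumes an uncompressed VCF, a .psam with a single header line containing an `IID` column, and that both the IDs and their order need to agree; that last point is an assumption here rather than something confirmed from the PLINK 2 documentation. The function name `compare_sample_ids` is hypothetical.

```shell
#!/bin/bash
# Report where the sample IDs in a VCF header and the IID column of a .psam differ.
# Usage: compare_sample_ids FILE.vcf FILE.psam
compare_sample_ids() {
    local vcf="$1" psam="$2"
    local vcf_ids psam_ids
    vcf_ids=$(mktemp)
    psam_ids=$(mktemp)

    # Samples start at column 10 of the #CHROM line.
    grep -m1 '^#CHROM' "$vcf" | cut -f10- | tr '\t' '\n' > "$vcf_ids"

    # Locate the IID column in the .psam header, then print it for each data row.
    awk 'NR == 1 { sub(/^#/, "", $1)
                   for (i = 1; i <= NF; i++) if ($i == "IID") c = i
                   next }
         c { print $c }' "$psam" > "$psam_ids"

    echo "VCF samples:  $(wc -l < "$vcf_ids")"
    echo "psam samples: $(wc -l < "$psam_ids")"

    # Row-by-row comparison: catches reordered as well as renamed or missing IDs.
    paste "$vcf_ids" "$psam_ids" |
        awk -F'\t' '$1 != $2 { printf "row %d: vcf=%s psam=%s\n", NR, $1, $2 }'

    rm -f "$vcf_ids" "$psam_ids"
}

# Hypothetical usage with the variables from the script above:
# compare_sample_ids "$VCF_FILE" "$PSAM_FILE"
```

If the counts agree and no rows are printed, the IIDs match in order, and the cause is more likely to lie elsewhere, for example in how the FID column or hidden characters such as carriage returns are handled.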