Having an issue with nucleotide-base simulation

52 views
Skip to first unread message

Zuxi Cui

unread,
Aug 19, 2021, 12:23:46 PM8/19/21
to slim-discuss
Hi there,

I'm experiencing the following error even if I set the mutation rate to 0.

"more than one nucleotide-base mutation encountered at the same position in the same genome; the nucleotide cannot be called"

Attached is my slim code. I tried setting mmJukesCantor(0.0) but it didn't work.

I used to have no problem with it until I update my reference panel from the old GRCh37 1000G phase 3 data of 2504 samples (http://ftp.1000genomes.ebi.ac.uk//vol1/ftp/release/20130502/)

I used the sequencing reference FASTA file from phase2 and followed the slim guide section 18.12. (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz)

I wonder if I should do a lift-over as the two reference panel use different builds (GRCh37 vs GRCh38), or I should use a new FASTA file although I cannot find an updated one from the 1000G ftp server.

Can anyone help me with this, please? Any comments are welcome.

Best,
Terry
slim.txt

Ben Haller

unread,
Aug 19, 2021, 12:58:55 PM8/19/21
to slim-discuss
Hi Terry,

If you could provide download links for the final processed .fa and .vcf files that you are using in your model, that would be quite helpful.

In the meantime, I just posted a tweak to SLiM that might help you debug.  The "nucleotide cannot be called" error message will now print out the position that exhibits the problem.  Note that that position is as represented in SLiM, which is zero-based; since positions in VCF are one-based, you would look to see what is going on in your VCF file at the printed position plus one, I think.

The most likely thing is that the VCF file has more than one segregating mutation at the same site, and that there is an error in the VCF input file such that some genome is called as containing more than one of those mutations.  In other words, my guess is that the error condition that SLiM is objecting to was actually present in the input file at the start of the simulation.  It would be nice if SLiM detected that at load time and gave some kind of "VCF input error" message, but it does not presently do that check.

So, I'd recommend that (a) you get the current GitHub version of SLiM to get the improved error message, (b) you manually inspect the input VCF file, at the reported position (plus one) to see what's going on, (c) you use the Eidos console in SLiMgui to inspect genomes at the given position to see what's happening in SLiM, if necessary, and (d) if you are unable to figure out the problem, then put your input files up on a server somewhere and post the links to them here so that I can debug the problem.  I hope this helps!

Cheers,
-B.

Benjamin C. Haller
Messer Lab
Cornell University


Zuxi Cui

unread,
Aug 20, 2021, 9:32:56 PM8/20/21
to Ben Haller, slim-discuss
Hi all,

Ben helped me with the debug and we found the potential causes of the issue are tri-allelic positions.

Here's a brief summary.
In VCF file, I found that tri-allelic loci are usually coded in two formats as follows:

1. Some tri-allelic positions are coded in the same row using a comma to separate different alternatives.
REF=<ref type> ALT=<alt type1>,<alt type2>
For example
                                              3    12010    .    A    G,C    100    PASS  
This type of tri-allelic can be excluded by using a perl script 
perl -lane 'if(/^#/ or length("$F[3]$F[4]")==2){print}' <vcf_file>

2. Some tri-allelic positions are coded in different rows. (see an example below)
                                              3    82988805    rs544932678    T    A    100    PASS    .  
                                              3    82988805    rs544932678    T    G    100    PASS    . 
This type of tri-allelic can be exclude by using the following script
cat <vcf_file> | awk '!a[$2]++{print}' 

Because there are not many of them so personally I just exclude them before simulation.

Here's our debug process.
1. Because the error message said there were multiple mutations, I first tried setting the mutation rate to 0 and I found it didn't work.
So our first guess was that the input file might be problematic.
2. I had a brief view of the input file and noticed the first type of tri-allelic loci and exclude them. This time simulation ran smoothly
for most of the chromosomes except chr3 and chr8. I realized there might be other causes of the issue.
3. Although slim reported an error message and stopped for chr3 and chr8, it still outputted partial VCF. My guess was that it prints
the VCF of all loci right before the problematic position one. So Ben help me checking the input file again and found the second type
of position that caused the issue.


Thanks for reading and let us know if you have any comments on this.

Best,
Terry



--
SLiM forward genetic simulation: http://messerlab.org/slim/
---
You received this message because you are subscribed to the Google Groups "slim-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to slim-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/slim-discuss/7316a2b0-9c17-4f9f-a6cd-00ae7e2614can%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages