problem loading vcf from ipyrad into populations


Martin Schwentner

Feb 18, 2020, 8:36:06 AM
to Stacks
Dear STACKS community,

I have some ipyrad runs whose output files I would like to import and analyse with the populations program in Stacks. The ipyrad output is in VCF format. Although the overall format seems identical across the various VCF files that ipyrad produced, not all of them work in Stacks: some are read and processed as expected, while others terminate with an error. I have attached the VCF file from one of the runs that terminated with an error.

The most basic command that I used (populations version 2.1):

populations -V Neo_red.vcf -O TEST3_STACKS_popgen -t 16 --hwe


The error message was:

Aborted. (basic_string::compare: __pos (which is 18446744073709551614) > this->size() (which is 11))


The respective output file read:
Logging to 'TEST3_STACKS_popgen/Neo_red.p.log'.
Locus/sample distributions will be written to 'TEST3_STACKS_popgen/Neo_red.p.log.distribs'.
populations parameters selected:
  Input mode: VCF
  Percent samples limit per population: 0
  Locus Population limit: 1
  Log liklihood filtering: off; threshold: 0
  Minor allele frequency cutoff: 0
  Maximum observed heterozygosity cutoff: 1
  Applying Fst correction: none.
  Pi/Fis kernel smoothing: off
  Fstats kernel smoothing: off
  Bootstrap resampling: off

A population map was not specified, all samples will be read from '' as a single popultaion.
Opening the VCF file...




I am looking forward to any helpful ideas.
Best,
Martin
Neo_red.vcf

Martin Schwentner

Feb 21, 2020, 5:47:12 AM
to Stacks
Dear all,

I found the error. The problem was that the individual (sample) names in the VCF file were too long. Once these were shortened, Stacks ran perfectly.

Best,
Martin

Nicolas Rochette

Feb 21, 2020, 9:14:51 AM
to stacks...@googlegroups.com, Martin Schwentner

Hi Martin,

Thanks for the update, sorry we couldn't get back to you earlier.

What was the length of the initial header line? E.g., the output of this command:

zgrep '^#C' mygenotypes.vcf.gz | wc -c

Anyway, that's weird; there already is code to handle that case, but it's an edge case that may not have been extensively tested, so there might be a bug. If you could send me the error-causing VCF (or a subset with the header and a bunch of record lines), it would greatly help.
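
For reference, the header check above (the `zgrep ... | wc -c` command) can also be sketched in Python, which additionally lists the sample names so that unusually long ones stand out. This is only an illustration: `header_stats` is a made-up helper name, and it assumes the standard VCF layout in which the nine fixed columns of the `#CHROM` line are followed by the sample names.

```python
import gzip


def header_stats(vcf_path):
    """Return (length in bytes of the #CHROM line, list of sample names)."""
    # Plain and gzip-compressed files are both read as text.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed; the rest are sample names.
                samples = line.rstrip("\n").split("\t")[9:]
                # The byte count includes the trailing newline, as `wc -c` does.
                return len(line.encode()), samples
            if not line.startswith("#"):
                break  # records started without a #CHROM line
    raise ValueError("no #CHROM header line found")
```

Calling `header_stats("mygenotypes.vcf.gz")` returns the header length and the sample names; `max(samples, key=len)` then gives the longest name.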

Best,

Nicolas

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/

Stacks newbie

Apr 8, 2020, 2:55:00 PM
to Stacks
Hi Nicolas,
I am also having issues parsing my .vcf file from ipyrad through populations in STACKS.
Here is the error I got and populations command used:

Now processing...

RAD_0 populations: src/export_formats.cc:2632: virtual int VcfExport::write_site(const CSLocus*, const LocPopSum*, const Datum* const*, std::size_t, std::size_t): Assertion `d[s]->snpdata[index].gq != (255)' failed.

/export/software/slurm/spool/slurmd/job14086941/slurm_script: line 27: 63137 Aborted                 (core dumped) populations --in_vcf /lustre/mngeve/rapturedata/CATDATA/ipyrad/CBtestsbb_outfiles/CBtestsbb.vcf -M /lustre/mngeve/rapturedata/CATDATA/ipyrad/CBtests_popmap.txt -O /lustre/mngeve/rapturedata/CATDATA/ipyrad/stackspop -p 2 -r 0.7 -t 10 --vcf


It does not look like the same problem Martin had, but I even tried shortening the names as he did to see if that would work, and it didn't.
When I used the command you gave above to check the length of my header, the result was 3614.
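
The failed assertion refers to a per-genotype `gq` value inside `VcfExport::write_site`. That suggests the VCF export requested with `--vcf` expects a genotype quality that the ipyrad file may not supply, but this is an assumption and is not confirmed anywhere in this thread. A minimal sketch for checking which FORMAT keys the records actually carry; `format_key_counts` is a hypothetical helper, not part of Stacks or ipyrad:

```python
from collections import Counter


def format_key_counts(vcf_lines):
    """Count how often each FORMAT key (GT, DP, GQ, ...) appears across records."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the #CHROM header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8:
            # Column 9 (index 8) is FORMAT, a colon-separated list of keys.
            counts.update(fields[8].split(":"))
    return counts
```

For example, `with open("CBtestsbb.vcf") as handle: print(format_key_counts(handle))` shows at a glance whether `GQ` occurs in the file at all.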

Again, the error was not like Martin's... 
Please, could you advise me on how to proceed so that I can run populations?
My original .vcf file from ipyrad is attached here (CBtestsbb.vcf).
THANK YOU.
CBtestsbb.vcf

Alberto Abreu

Jan 9, 2023, 3:29:13 PM
to Stacks
Hello Nicolas

Has a fix for this error been found? I am also getting the message
Aborted. (basic_string::compare: __pos (which is 18446744073709551614) > this->size() (which is 11))

even though my individual sample names are very short (two letters), so the solution Martin provided is unlikely to solve my problem.

best

Alberto

A Dickinson

Sep 5, 2024, 4:27:03 PM
to Stacks
Hi all, 

Just to join the annual list of people running into this problem: I am getting the exact same error (Aborted. (basic_string::compare: __pos (which is 18446744073709551614) > this->size() (which is 11))) when I attempt to use my ipyrad VCF.

I adjusted the names to be much shorter, and I even tried shortening the POS field so that instead of reading “loc0_pos123” the entries are all coded as “0_p123”, but none of these fixes changed the outcome. I will attach the VCF file. Since it is from ipyrad, the format doesn’t have a header.

I’m hoping someone knows something at this point!
subdata.vcf

A Dickinson

Sep 5, 2024, 4:27:04 PM
to Stacks
Hello, 

I thought I already sent a reply on this conversation, so my apologies if I did and there is just a lag. I am getting this exact same error when I attempt to run an ipyrad VCF file in populations as well. I had "a time" getting Stacks to install on my M2 Mac and getting OpenMP to work through gcc instead of clang. After I finally had it installed and it looked like things were moving ahead... I am now stuck on this error.

Initially I read this post and tried updating my VCF to see if some field (sample names or locus IDs) was too long, but shortening the names did not work. Then I even tried removing all the metadata lines except the first row. No matter what I edited, I received the same error with that same “which is 11” value, so I don’t think anything I was changing was involved in the error.

Next I started searching for the error online, and it looks like this is related to C++ code that is comparing specific character strings? I know basically nothing about C++, so one of you will probably have a better idea of what this could mean. But apparently 18446744073709551614 is one less than the maximum unsigned 64-bit value (it is what -2 becomes when stored as an unsigned 64-bit integer), and a negative number in a field that should hold an “unsigned integer” will kick out this message. Just in case, I checked whether a negative sign shows up anywhere in the VCF file, but it does not. Anyway, I think it would be a negative value computed in the code and passed to the basic_string::compare function.
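
The arithmetic can be checked directly. The reported position is what -2 becomes when stored in a C++ `size_t` on a 64-bit build; it is also what results from subtracting 1 from `std::string::npos`, the value that `find`/`rfind` return when nothing is found. One further observation from this thread, offered as a hypothesis rather than a diagnosis: the reported size of 11 matches the length of the file names `Neo_red.vcf` and `subdata.vcf`, whereas the run that succeeded was given the path `./subdata.vcf`. A small Python sketch (the helper `as_size_t` is made up for illustration):

```python
SIZE_T_MODULUS = 2 ** 64  # assumes a 64-bit build, where size_t has 64 bits


def as_size_t(value):
    """Wrap a Python int the way an unsigned 64-bit size_t would store it."""
    return value % SIZE_T_MODULUS


REPORTED_POS = 18446744073709551614  # taken from the error message
MAX_SIZE_T = SIZE_T_MODULUS - 1      # equals std::string::npos on such a build

print(MAX_SIZE_T - REPORTED_POS)                  # distance from the maximum
print(as_size_t(-2) == REPORTED_POS)              # does -2 wrap to this value?
print(as_size_t(MAX_SIZE_T - 1) == REPORTED_POS)  # is it npos - 1?

# Compare the name lengths with "this->size() (which is 11)".
for name in ("Neo_red.vcf", "subdata.vcf", "./subdata.vcf"):
    print(name, len(name))
```

Both bare file names are 11 characters long and `./subdata.vcf` is 13, so passing the input with an explicit directory prefix (for example `-V ./subdata.vcf`) may be worth trying before editing the VCF contents any further.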

I am at a loss as to what else I can try on my end to tease out what the problem may be, and would be grateful for any other ideas or solutions!

I will attach my log and my VCF file. 

Versions I am using: 
Stacks 2.68
gcc 14.2.0 (homebrew)
ipyrad 0.9.93
VCF 4.0

Thanks!
Ashley 

subdata.p.log
subdata.vcf

Catchen, Julian

Sep 6, 2024, 4:59:36 PM
to stacks...@googlegroups.com

Hi Ashley,

 

This runs without error on my linux machine (see output below). What are you actually trying to accomplish analytically by mixing the two pipelines? For example, the --smooth option won’t do much with this particular dataset since your ‘chromosomes’ are so short.
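
To put a number on how short these ‘chromosomes’ are, the records per CHROM value can be tabulated. A minimal sketch, assuming (as the `RAD_0`, `RAD_2`, ... names in the output suggest) that ipyrad writes one CHROM name per RAD locus; `chrom_spans` is a hypothetical helper, not a Stacks or ipyrad function:

```python
def chrom_spans(vcf_lines):
    """Map each CHROM value to (number of records, smallest POS, largest POS)."""
    spans = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta lines, the #CHROM header and blank lines
        # Only the first two tab-separated columns (CHROM, POS) are needed.
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        count, lowest, highest = spans.get(chrom, (0, pos, pos))
        spans[chrom] = (count + 1, min(lowest, pos), max(highest, pos))
    return spans
```

If nearly every entry spans no more than a read length, there is very little for kernel smoothing to average over along each ‘chromosome’.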

 

Julian

% populations -V ./subdata.vcf -O ./ --smooth
Logging to './subdata.p.log'.
Locus/sample distributions will be written to './subdata.p.log.distribs'.
populations parameters selected:
  Input mode: VCF
  Percent samples limit per population: 0
  Locus Population limit: 1
  Percent samples overall: 0
  Minor allele frequency cutoff: 0
  Maximum observed heterozygosity cutoff: 1
  Applying Fst correction: none.
  Pi/Fis kernel smoothing: on
  F-stats kernel smoothing: on
  Bootstrap resampling: off

A population map was not specified, all samples will be read from '' as a single popultaion.
Opening the VCF file...
No population map specified, creating one from the VCF header...
Working on 24 samples.
Working on 1 population(s):
    defaultpop: 120H5, 124H5, 126H1, 130H5, 134H5, 137H1, 140H5, 141H5, 142H5, 143H5, 151H5, 154G1, 156G7, 157G19, 159H5, 162L1, 163L5, 169L1,
                170L5, 171L5, 173H1, 72G6, 74G8, 88G22
Working on 1 group(s) of populations:
    defaultgrp: defaultpop

Raw haplotypes will be written to './subdata.p.haplotypes.tsv'
Population-level summary statistics will be written to './subdata.p.sumstats.tsv'
Population-level haplotype summary statistics will be written to './subdata.p.hapstats.tsv'

Processing data in batches:
  * load a batch of catalog loci and apply filters
  * compute SNP- and haplotype-wise per-population statistics
    * smooth per-population statistics
  * compute F-statistics
    * smooth F-statistics
  * write the above statistics in the output files
  * export the genotypes/haplotypes in specified format(s)
More details in './subdata.p.log.distribs'.

Now processing...
RAD_0
RAD_2
RAD_1247
RAD_1250
RAD_1253

Found 1978 SNP records in file './subdata.vcf'. (Skipped 0 already filtered-out SNPs and 0 non-SNP records ; more with --verbose.)

Removed 0 loci that did not pass sample/population constraints from 1978 loci.
Kept 1906 loci, composed of 1906 sites; 0 of those sites were filtered, 1906 variant sites remained.
    1904 genomic sites, of which 2 were covered by multiple loci (0.1%).
Mean genotyped sites per locus: 1.00bp (stderr 0.00).

Population summary statistics (more detail in populations.sumstats_summary.tsv):
  defaultpop: 21.116 samples per locus; pi: 0.11727; all/variant/polymorphic sites: 1906/1906/1906; private alleles: 0
Populations is done.
