Error: catalog VCF and FASTA files are discordant, maybe trucated.

Robert Syme

Nov 13, 2019, 10:01:02 AM
to Stacks
Hi Stacks users

I'm running the gstacks and populations tools (v2.41) on pre-aligned ddRAD BAM inputs. 

Given a folder of bams "bams" and a popmap file "popmap.txt":

$ ls bams | head
OQ002.bam
OQ003.bam
OQ006.bam
OQ007.bam
OQ008.bam
OQ009.bam
OQ010.bam
OQ011.bam
OQ012.bam
OQ014.bam
$ head popmap.txt
OQ002	1
OQ003	1
OQ006	1
OQ007	1
OQ008	1
OQ009	1
OQ010	1
OQ011	1
OQ012	1
OQ014	1
$

Running gstacks completes without error. There is a warning about some records missing a pair, but that is expected, as the BAMs contain both post-trimming pairs and singletons.
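
As an aside: a quick way to get a rough tally of paired vs singleton records in one of the BAMs (assuming samtools is available) is to count on the READ1/READ2 flag bits:

$ samtools view -c -f 64 bams/OQ002.bam    # records flagged READ1
$ samtools view -c -f 128 bams/OQ002.bam   # records flagged READ2
$ samtools view -c -F 192 bams/OQ002.bam   # neither flag set, i.e. singletons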

$ mkdir -p out
$ gstacks -I bams -M popmap.txt -O out --rm-pcr-duplicates --threads 12
Logging to 'out/gstacks.log'.
Locus/sample distributions will be written to 'out/gstacks.log.distribs'.

Configuration for this run:
  Input mode: reference-based
  Population map: 'popmap.txt'
  Input files: 192, e.g. 'bams/OQ002.bam'
  Output to: 'out/'
  Model: marukilow (var_alpha: 0.01, gt_alpha: 0.05)
  Discarding unpaired reads.
  Removing PCR duplicates.

Reading BAM headers...
Processing all loci...
WARNING: Some records (e.g. '560_8_1117_27397_11126') are missing a READ1/READ2 flag.
1K...2K...5K...10K...20K...50K...100K...200K... done.

Read 1165353144 BAM records:
  kept 794653478 primary alignments (68.3%), of which 391967719 reverse reads
  skipped 324871409 primary alignments with insufficient mapping qualities (27.9%)
  skipped 8519905 excessively soft-clipped primary alignments (0.7%)
  skipped 36124820 unmapped reads (3.1%)
  skipped some suboptimal (secondary/supplementary) alignment records

  Per-sample stats (details in 'gstacks.log.distribs'):
    read 6069547.6 records/sample (1151044-17591418)
    kept 0.1%-83.3% of these

Built 365377 loci comprising 402685759 forward reads and 388594302 matching paired-end reads; mean insert length was 219.1 (sd: 61.1).
Removed 14091457 unpaired (forward) reads (3.5%); kept 388594302 read pairs in 274840 loci.
Removed 384741325 read pairs whose insert length had already been seen in the same sample as putative PCR duplicates (99.0%); kept 3852977 read pairs.

Genotyped 274840 loci:
  effective per-sample coverage: mean=1.4x, stdev=0.1x, min=1.0x, max=1.7x
  mean number of sites per locus: 187.3
  a consistent phasing was found for 78389 of out 79298 (98.9%) diploid loci needing phasing

gstacks is done.
$ ls out/
catalog.calls  catalog.fa.gz  gstacks.log  gstacks.log.distribs  populations.log  populations.log.distribs
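
As a rough concordance check on those two catalog files (my understanding of the format, which may be off: catalog.calls is a gzipped VCF whose first column carries the locus ID, matching the headers in catalog.fa.gz):

$ zcat out/catalog.fa.gz | grep -c '^>'                                # loci in the FASTA
$ zcat out/catalog.calls | grep -v '^#' | cut -f1 | sort -nu | wc -l   # loci in the VCF

If the two files are intact and in sync, the counts should agree.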

My difficulties come when running the "populations" tool:

$ mkdir -p foo
$ populations --in-path out --out-path foo -M popmap.txt --verbose
Logging to 'foo/populations.log'.
Locus/sample distributions will be written to 'foo/populations.log.distribs'.
populations parameters selected:
  Percent samples limit per population: 0
  Locus Population limit: 1
  Percent samples overall: 0
  Minor allele frequency cutoff: 0
  Maximum observed heterozygosity cutoff: 1
  Applying Fst correction: none.
  Pi/Fis kernel smoothing: off
  Fstats kernel smoothing: off
  Bootstrap resampling: off

Parsing population map...
The population map contained 192 samples, 1 population(s), 1 group(s).
Working on 192 samples.
Working on 1 population(s):
    1: OQ002, OQ003, OQ006, OQ007, OQ008, OQ009, OQ010, OQ011, OQ012, OQ014, OQ016, OQ017, OQ018, OQ019, OQ020, OQ021, OQ022, OQ023,
       OQ025, OQ026, OQ027, OQ028, OQ029, OQ030, OQ031, OQ033, OQ034, OQ035, OQ036, OQ037, OQ038, OQ039, OQ040, OQ042, OQ043, OQ044,
       OQ045, OQ047, OQ048, OQ049, OQ050, OQ051, OQ053, OQ054, OQ055, OQ056, OQ057, OQ058, OQ059, OQ060, OQ061, OQ062, OQ063, OQ064,
       OQ065, OQ066, OQ067, OQ068, OQ069, OQ070, OQ071, OQ072, OQ073, OQ074, OQ075, OQ076, OQ077, OQ078, OQ079, OQ080, OQ081, OQ082,
       OQ083, OQ084, OQ085, OQ086, OQ088, OQ089, OQ093, OQ094, OQ096, OQ097, OQ098, OQ099, OQ100, OQ101, OQ102, OQ103, OQ104, OQ105,
       OQ106, OQ109, OQ110, OQ112, OQ115, OQ116, OQ117, OQ119, OQ120, OQ121, OQ122, OQ123, OQ124, OQ127, OQ130, OQ131, OQ132, OQ133,
       OQ135, OQ138, OQ139, OQ140, OQ141, OQ143, OQ144, OQ145, OQ146, OQ147, OQ148, OQ149, OQ151, OQ152, OQ153, OQ156, OQ157, OQ158,
       OQ159, OQ160, OQ162, OQ163, OQ164, OQ165, OQ166, OQ167, OQ168, OQ169, OQ170, OQ171, OQ172, OQ173, OQ174, OQ176, OQ177, OQ178,
       OQ179, OQ180, OQ181, OQ184, OQ185, OQ188, OQ189, OQ193, OQ202, OQ203, OQ204, OQ205, OQ206, OQ207, OQ208, OQ209, OQ210, OQ211,
       OQ212, OQ213, OQ216, OQ217, OQ219, OQ220, OQ222, OQ224, OQ225, OQ227, OQ228, OQ229, OQ231, OQ233, OQ244, OQ245, OQ247, OQ248,
       OQ249, OQ251, OQ252, OQ253, OQ254, OQ256, OQ263, OQ267, OQ268, OQ269, OQ270, OQ271
Working on 1 group(s) of populations:
    defaultgrp: 1

Genotyping markers will be written to 'foo/populations.markers.tsv'
Raw Genotypes/Haplotypes will be written to 'foo/populations.haplotypes.tsv'
Population-level summary statistics will be written to 'foo/populations.sumstats.tsv'
Population-level haplotype summary statistics will be written to 'foo/populations.hapstats.tsv'

Processing data in batches:
  * load a batch of catalog loci and apply filters
  * compute SNP- and haplotype-wise per-population statistics
  * write the above statistics in the output files
  * export the genotypes/haplotypes in specified format(s)
More details in 'foo/populations.log.distribs'.
Now processing...
Error: catalog VCF and FASTA files are discordant, maybe trucated. rv: 0; cloc_id: 152
Aborted.

The error on the second-to-last line suggests the problem is with the VCF and FASTA files, but no VCF or FASTA files were provided as input. I've no doubt I'm doing something wrong in my invocation of gstacks or populations, but any pointers would be appreciated.

Thanks!

-Rob

Robert Syme

Nov 18, 2019, 12:48:44 PM
to Stacks
I think the problem stems from a possibly-truncated catalog.fa.gz produced by gstacks. When run on our cluster (Stacks compiled with gcc 4.9.3), the catalog.calls file is only 1.0M and catalog.fa.gz contains only 123 entries. When I compile Stacks on my laptop (Apple clang version 11.0.0), catalog.calls is 42M and catalog.fa.gz contains 60177 entries.
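
One way to get those entry counts, plus a quick integrity test: gzip -t will catch an outright truncated gzip stream (though a short-but-internally-valid stream would still pass, so the header count is the more direct check):

$ gzip -t out/catalog.fa.gz && echo "gzip stream intact"
$ zcat out/catalog.fa.gz | grep -c '^>'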

For the moment, I think I'll just run the pipeline on my laptop as a workaround until I can figure out why gstacks is producing the truncated output file on our cluster.

-Rob

Stacks newbie

Dec 6, 2019, 6:30:39 AM
to Stacks
Hi Robert,
Did you find a solution to this? I am running into the same problem. Any input from the Stacks community on this, please?

Rob Syme

Dec 6, 2019, 8:43:23 AM
to stacks...@googlegroups.com
Hi Stacks newbie, and Stacks team

I compiled the same source on two machines (one HPC facility running CentOS Linux release 7.6.1810 and one laptop running macOS Catalina). On the HPC facility, the binary would produce truncated output files; on my laptop, the compiled binaries produced correct output given the same input files. My suspicion is that the error was caused by threading difficulties, and the fault probably lies with me incorrectly setting up the dependencies.
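
If threading really is the culprit, one cheap experiment (I haven't tried it yet) would be to rerun gstacks single-threaded on the cluster and see whether the catalog files come out complete; out_1thread here is just a scratch directory for the test:

$ gstacks -I bams -M popmap.txt -O out_1thread --rm-pcr-duplicates --threads 1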

If other people are hitting the same error, it might be worth digging a little deeper to work out why the build completed without errors yet produced a faulty binary. For the moment, though: is it possible for you to compile and run Stacks on a different machine? What OS are you using at the moment? What version of gcc?

-Rob




Stacks newbie

Dec 6, 2019, 9:29:07 AM
to Stacks
I am running Stacks version 2.3d on the HPC cluster of my institution. I am not a techie at all, so I don't know the other details you asked about.
I cannot run it on my laptop because I have about 30GB of data, and my laptop would crash trying to handle that lot.

Nicolas Rochette

Dec 6, 2019, 1:36:30 PM
to Stacks Users Group

Hi Rob, magn...@gmail.com,

As far as I can tell, for Rob the problem is that gstacks possibly crashes on his cluster, or at least doesn't run properly. To know more, we would need the standard output and error of gstacks, as well as the return value of the program (e.g. 0 for success, in Unix) and whether it was terminated by a system signal (such as KILL or TERM).
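
Concretely, something along these lines in the job script would capture all three, assuming a POSIX shell; by convention, an exit status above 128 means the process died on signal number (status - 128), e.g. 137 for KILL:

$ gstacks -I bams -M popmap.txt -O out --threads 12 > gstacks.out 2> gstacks.err
$ echo "gstacks exit status: $?"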

It could also be an issue with system libraries (i.e. zlib/gzip, although I believe the newest version should complain if there's an issue there), as Rob mentioned, but that seems less likely at this point.

Best,

Nicolas


Rob Syme

Dec 9, 2019, 9:19:28 AM
to stacks...@googlegroups.com
Hi Nicolas.

I was running gstacks inside a Nextflow pipeline, which automatically includes "set -e" everywhere, and the gstacks processes completed, so the command must have had a zero exit status. Since moving to my laptop, I've discarded the unsuccessful runs. The library problem may well be the issue: managing software (even with well-organised modules) is a challenge. My primary suspect is pthreads rather than zlib/gzip, but I can't be sure right now. I'm running stacks-2.41, downloaded on November 18, which I think is the latest version.
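
Roughly, the task script that Nextflow generates behaves like this minimal sketch (assuming bash errexit semantics), which is why completion implies a zero exit status from every command:

#!/bin/bash
set -e    # abort the script at the first command that returns non-zero
gstacks -I bams -M popmap.txt -O out --rm-pcr-duplicates --threads 12
echo "reached only if gstacks exited 0"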

-Rob


Nicolas Rochette

Dec 9, 2019, 2:10:09 PM
to stacks...@googlegroups.com

Hi Rob,

Regarding the pipeline, do you mean there were other commands after gstacks that also ran?

Best,

Nicolas
