populations killed due to memory

Sébastien Renaut

Nov 28, 2019, 1:42:46 PM

to Stacks

Hi,

I'm running populations (2.41) and trying to export in --phylip format. I have 47 individuals, from which I define 47 distinct groups (one per individual). The dataset is not that big (catalog.calls is 79M and catalog.fa.gz is 6.9MB). However, populations dies after less than a minute, apparently because of a memory issue: I can see memory usage creep up to >14GB until it dies. It's odd that it would take that much memory, since the sample size and data are not that big. I think this has to do with me defining 47 populations out of 47 individuals, but why would that not be valid, or consume that much memory? (I want to get one sequence per individual.)

Thanks for your help,

sebastien

$ populations -P . -t 2 -M pop_labels --phylip

populations -P . -t 2 -M pop_labels --phylip
Logging to './populations.log'.
Locus/sample distributions will be written to './populations.log.distribs'.
populations parameters selected:
Percent samples limit per population: 0
Locus Population limit: 1
Percent samples overall: 0
Minor allele frequency cutoff: 0
Maximum observed heterozygosity cutoff: 1
Applying Fst correction: none.
Pi/Fis kernel smoothing: off
Fstats kernel smoothing: off
Bootstrap resampling: off

Parsing population map...
The population map contained 47 samples, 47 population(s), 1 group(s).
Working on 47 samples.
Working on 47 population(s):
    Alyssa: Alyssa_sorted
    B08: B08_sorted
    Bialobrzeskie: Bialobrzeskie_sorted
    C11: C11_sorted
    CAN100_01: CAN100_01_sorted
    CAN16_94: CAN16_94_sorted
    CAN17_95: CAN17_95_sorted
    CAN18_95: CAN18_95_sorted
    CAN19_87: CAN19_87_sorted
    CAN20_02: CAN20_02_sorted
    CAN22_88: CAN22_88_sorted
    CAN23_99: CAN23_99_sorted
    CAN24_89: CAN24_89_sorted
    CAN26_93: CAN26_93_sorted
    CAN28_01: CAN28_01_sorted
    CAN29_94: CAN29_94_sorted
    CAN37_97: CAN37_97_sorted
    CAN39_98: CAN39_98_sorted
    CAN40_99: CAN40_99_sorted
    Carmagnola: Carmagnola_sorted
    Carmen: Carmen_sorted
    Chameleon: Chameleon_sorted
    D12: D12_sorted
    Delores: Delores_sorted
    E11: E11_sorted
    F01: F01_sorted
    Fedora17: Fedora17_sorted
    Fedora19: Fedora19_sorted
    Fedrina: Fedrina_sorted
    Felina: Felina_sorted
    Ferimon: Ferimon_sorted
    Futura77: Futura77_sorted
    Jus: Jus_sorted
    K110: K110_sorted
    Kompolti: Kompolti_sorted
    LKCSD: LKCSD_sorted
    Novosadska: Novosadska_sorted
    Petera: Petera_sorted
    Silesia: Silesia_sorted
    Suditalien: Suditalien_sorted
    Tygra: Tygra_sorted
    Uniko: Uniko_sorted
    VIR541: VIR541_sorted
    VIR569: VIR569_sorted
    VIR575: VIR575_sorted
    VIR577: VIR577_sorted
    Zolotonsha15: Zolotonsha15_sorted
Working on 1 group(s) of populations:
    defaultgrp: Alyssa, B08, Bialobrzeskie, C11, CAN100_01, CAN16_94, CAN17_95, CAN18_95, CAN19_87, CAN20_02, CAN22_88, CAN23_99, CAN24_89, CAN26_93, CAN28_01, CAN29_94, CAN37_97, CAN39_98, CAN40_99, Carmagnola, Carmen, Chameleon, D12, Delores, E11, F01, Fedora17, Fedora19, Fedrina, Felina, Ferimon, Futura77, Jus, K110, Kompolti, LKCSD, Novosadska, Petera, Silesia, Suditalien, Tygra, Uniko, VIR541, VIR569, VIR575, VIR577, Zolotonsha15

Fixed difference sites in Phylip format will be written to './populations.fixed.phylip'
Genotyping markers will be written to './populations.markers.tsv'
Raw Genotypes/Haplotypes will be written to './populations.haplotypes.tsv'
Population-level summary statistics will be written to './populations.sumstats.tsv'
Population-level haplotype summary statistics will be written to './populations.hapstats.tsv'

Processing data in batches:
* load a batch of catalog loci and apply filters
* compute SNP- and haplotype-wise per-population statistics
* write the above statistics in the output files
* export the genotypes/haplotypes in specified format(s)
More details in './populations.log.distribs'.
Now processing...
Killed
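
A bare "Killed" with no error from the program usually means the process received SIGKILL, which on Linux is commonly the kernel's out-of-memory killer. One way to confirm how much memory the run actually reaches is to sample its resident set size while it runs. The following is a minimal sketch, not part of Stacks: `peak_rss_kb` is a hypothetical helper name, it assumes a Linux `/proc` filesystem, and because it polls once per second the reported peak is only approximate.

```shell
# Hypothetical helper (not part of Stacks): run a command in the background
# and report the largest resident set size (in kB) seen while it was running.
# Assumes Linux, where /proc/<pid>/status exposes a VmRSS line.
peak_rss_kb() {
    # Send the command's stdout to stderr so only the number lands on stdout.
    "$@" >&2 &
    pid=$!
    peak=0
    # Poll until the process is gone (finished normally or was killed).
    while kill -0 "$pid" 2>/dev/null; do
        rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null || true)
        if [ -n "$rss" ] && [ "$rss" -gt "$peak" ]; then
            peak=$rss
        fi
        sleep 1
    done
    wait "$pid" 2>/dev/null || true
    echo "$peak"
}

# Example, reusing the command line from this post:
#   peak_rss_kb populations -P . -t 2 -M pop_labels --phylip
```

Comparing the reported figure with the machine's total RAM (for example the `MemTotal` line in `/proc/meminfo`) shows whether the run is hitting the memory limit.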

Catchen, Julian

Dec 3, 2019, 3:38:08 PM

to stacks...@googlegroups.com, Sébastien Renaut

Hi Sebastien,

You are correct that increasing the number of groups increases memory
usage significantly. This is because there are a number of
group-specific data structures that are required to compute/store all of
the population-level characteristics. That said, 14 GB of memory is not
much, so how big a system are you running the software on?

julian

Sébastien Renaut

Dec 4, 2019, 3:34:46 PM

to Stacks

Hi Julian,

Thanks for your answer. I could run it on a bigger machine (the current one has 16GB of RAM), but I was surprised by the amount of RAM used for this dataset, and I'm not sure how it would scale with, say, 2X, 5X, or 10X the number of pops.

Essentially, I wanted the genotype info per individual in a .phylip format (hence the --phylip parameter), so it seems like it should be the same info as with --structure but formatted differently. There must be a better way to do this that I am not getting?
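
For reference, the one-population-per-individual setup described in this thread corresponds to a population map with two tab-separated columns, sample name and population name. The sketch below shows one way to generate such a map, plus a single-population variant that could be used to check whether memory use tracks the number of populations. Both function names are hypothetical, stripping `_sorted` simply mirrors the sample and population names in the log, and the thread does not confirm that either map resolves the memory problem.

```shell
# Hypothetical helper: read sample names (one per line) on stdin and print a
# population map with one population per sample. The "_sorted" suffix is
# stripped to form the population name, as in the log earlier in the thread.
make_popmap_per_sample() {
    awk 'NF { grp = $1; sub(/_sorted$/, "", grp); printf "%s\t%s\n", $1, grp }'
}

# Hypothetical variant: assign every sample to one population. The default
# population name "all" is an arbitrary choice for illustration.
make_popmap_single() {
    awk -v grp="${1:-all}" 'NF { printf "%s\t%s\n", $1, grp }'
}

# Example with three of the sample names from the thread.
printf '%s\n' Alyssa_sorted B08_sorted Bialobrzeskie_sorted | make_popmap_per_sample
```

The output of either function can be written to a file and passed to populations with -M, as in the original command.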