Out of memory boostrapping

Mathilde SALAMON

unread,

Apr 8, 2024, 12:56:00 PMApr 8

to dadi-user

Hello Ryan,

I am running out of memory (191GB) when I try to generate bootstraps datasets. I am wondering if this could be because my dataset was generated with a reference free SNP discovery method, so it consist mostly of SNPs on individual contigs (no scaffold, no chromosome), and sometimes a few SNPs on the same contigs. Here is a snippet of the input dataset:

REF ALT Allele1 L207 L278 Allele2 L207 L278 Gene Position
-A- -T- A 318 332 T 82 68 10000008 42
-A- -G- A 86 13 G 314 387 10000064 33
-T- -G- T 344 290 G 56 110 10000065 38
-A- -G- A 332 314 G 68 86 10000065 47
-G- -A- G 323 298 A 77 102 10000065 66
-A- -T- A 0 22 T 400 378 10000071 34
-T- -G- T 0 22 G 400 378 10000071 53
-G- -T- G 148 71 T 252 329 10000074 39
-C- -T- C 365 331 T 35 69 10000169 38

Below is the code for the bootstrapping.

Thank you for your help,

Mathilde Salamon

####### Generate and save the FS for bootstraped genomes #######
# Numpy is the numerical library dadi is built upon
import numpy as np
import dadi

# Import data and generate fs
dd = dadi.Misc.make_data_dict("dadi_input_filtercov_L207_L278")
pop_ids, ns = ['L207', 'L278'], [400, 400]

# Nboot is the number of bootstraped datasets we want
# chunk_size is the genome chunks we use to make the
# randomized genomes.
Nboot, chunk_size = 100, 1e5

# Break up the data dictionary into a list of
# dictionary entries for each chunk of the genome.
# Ex, the fist 10^5 base pairs will be element 0,
# the next 10^5 base pairs will be element 1, etc.
# to the last 10^5 base pairs will be the last element/element -1.
chunks = dadi.Misc.fragment_data_dict(dd, chunk_size)

# Get a list containin the SFS of the bootstraped genomes.
boots = dadi.Misc.bootstraps_from_dd_chunks(chunks, Nboot, pop_ids, ns)

# Save the bootstraps: needs to mask and fold the SFS like what was done during optimization
for i in range(len(boots)):
boots[i].mask[0:15] = True
boots[i].mask[:,0:15] = True
boots[i].fold().to_file('/home/msalamon/projects/rrg-barrett/msalamon/genomics/minutus/output/cape-race-latitudinal/dadi/L207-L278/minutuslatL207L278.subset.boot_{0}.fs'.format(str(i)))

Mathilde SALAMON

unread,

Apr 8, 2024, 1:43:56 PMApr 8

to dadi-user

NB: the dataset is actually not that large: 961,220 SNPs (36.6MB)

Ryan Gutenkunst

unread,

Apr 8, 2024, 6:53:35 PMApr 8

to dadi-user

Hello Mathilde,

I’m not sure what the problem is. Using the snippet you provided and the code below, everything looks fine on my machine. Do you know which line it’s crashing on? Or can you send the full data set off-list (since it’s small) and we can debug directly?

Best,
Ryan

> --
> You received this message because you are subscribed to the Google Groups "dadi-user" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to dadi-user+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/9f42e3c4-8922-4ec2-9dfc-3009b899de9cn%40googlegroups.com.

Ryan Gutenkunst

unread,

Apr 11, 2024, 4:47:40 PMApr 11

to dadi-user

Hello Mathilde,

Thanks for sending the data off-list. The issue is that your data consists of ~800,000 contigs. The bootstrapping method built into dadi is optimized for a different case of a few contigs that we’re breaking up into chunks based on length. To that end, it calculates the SFS for each chunk, then adds them up to do the bootstrapping, and 800,000 400x400 SFS is too much RAM. It’s pretty simple to switch out the code to just calculate the SFS at the end of each bootstrap however.

To do so, you can just replace the "boots = dadi.Misc.bootstraps_from_dd_chunks(chunks, Nboot, pop_ids, ns)” line with

boots = []
for ii in range(Nboot):
chosen = random.choices(chunks, k = len(chunks))
temp = {}
for _ in chosen:
temp.update(_)
boots.append(dadi.Spectrum.from_data_dict(temp, pop_ids, ns))

Best,
Ryan

Mathilde SALAMON

unread,

Apr 16, 2024, 12:15:45 PMApr 16

to dadi-user

Hello Ryan,

thank you for your answer and for the solution. In that case the bootstrapping is done with chunks of consecutive SNPs (e.g. 10^5 in my case) instead of chunks of 10^5 bp ? I tried to run the bootstrapping with the code you provided but it only generated one bootstrapped SFS. Did I miss something ?

Best wishes,

Mathilde Salamon

Mathilde SALAMON

unread,

Apr 16, 2024, 12:21:43 PMApr 16

to dadi-user

Here is the code for the bootstrapping with your modification:

####### Generate and save the FS for bootstraped genomes #######
# Numpy is the numerical library dadi is built upon
import numpy as np
import dadi

import random

# Import data and generate fs
dd = dadi.Misc.make_data_dict("dadi_input_filtercov_L207_L278")
pop_ids, ns = ['L207', 'L278'], [400, 400]

# Nboot is the number of bootstraped datasets we want
# chunk_size is the genome chunks we use to make the
# randomized genomes.
Nboot, chunk_size = 100, 1e5

# Break up the data dictionary into a list of
# dictionary entries for each chunk of the genome.
# Ex, the fist 10^5 base pairs will be element 0,
# the next 10^5 base pairs will be element 1, etc.
# to the last 10^5 base pairs will be the last element/element -1.
chunks = dadi.Misc.fragment_data_dict(dd, chunk_size)

# Get a list containing the SFS of the bootstraped genomes.
boots = []

for ii in range(Nboot):
    chosen = random.choices(chunks, k = len(chunks))
temp = {}
for _ in chosen:
    temp.update(_)
boots.append(dadi.Spectrum.from_data_dict(temp, pop_ids, ns))

# Save the bootstraps: needs to mask and fold the SFS like what was done during optimization
for i in range(len(boots)):
  boots[i].mask[0:15] = True
  boots[i].mask[:,0:15] = True 
  boots[i].fold().to_file('/home/msalamon/projects/rrg-barrett/msalamon/genomics/minutus/output/cape-race-latitudinal/dadi/L207-L278/minutuslatL207L278.subset.boot_{0}.fs'.format(str(i)))

Ryan Gutenkunst

unread,

Apr 16, 2024, 12:37:38 PMApr 16

to dadi-user

This is an indentation problem… You want the code below. I’m not sure where the error was introduced.

boots = []
for ii in range(Nboot):
chosen = random.choices(chunks, k = len(chunks))
temp = {}
for _ in chosen:
temp.update(_)
boots.append(dadi.Spectrum.from_data_dict(temp, pop_ids, ns))

And to answer your other question… In this case we’re just randomly choosing 10^5 SNPs. There’s no notion of a “chunk” since you’re assuming things are unlinked.

Best,
Ryan

> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/f7284052-6a82-4825-8983-1740589bf760n%40googlegroups.com.

Mathilde SALAMON

unread,

Apr 16, 2024, 1:00:59 PMApr 16

to dadi-user

Hi Ryan,

thank you for your fast answer and help.

I tried without any indentation as you suggested but now I am getting an error:

chosen = random.choices(chunks, k = len(chunks))

^
IndentationError: expected an indented block after 'for' statement on line 25

I had introduced indentations at: chosen = random.choices(chunks, k = len(chunks)), and temp.update(_) because these were flagged with indentation errors.

Best,

Mathilde

Ryan Gutenkunst

unread,

Apr 16, 2024, 3:53:22 PMApr 16

to dadi-user

Ahh… I see from the mailing list archives that someone how the indentation is being crushed out of my messages. Below -> indicates an indentation

boots = []
for ii in range(Nboot):

-> chosen = random.choices(chunks, k = len(chunks))
-> temp = {}
-> for _ in chosen:
-> -> temp.update(_)
-> boots.append(dadi.Spectrum.from_data_dict(temp, pop_ids, ns))

Give that a shot.

Best,
Ryan

> To view this discussion on the web visit https://groups.google.com/d/msgid/dadi-user/a75434c4-2e95-4e00-9b59-0608904ea88bn%40googlegroups.com.

Mathilde SALAMON

unread,

Apr 16, 2024, 5:40:29 PMApr 16

to dadi-user

Hi Ryan,

that's what I suspected. I was able to generate the bootstraps SFS with your modifications, thank you for all the help !