Efficient catalog creation v2


Christos Palaiokostas

Apr 5, 2018, 11:39:32 AM
to Stacks
Hi,

In the scenario of a reference-based analysis including a large number of animals, is there a way of constructing the catalog from only a subset of them?

In previous versions I was able to do this through pstacks and cstacks. However, in the new version it is not apparent to me how to achieve it through gstacks.

Cheers,
Christos

Julian Catchen

Apr 5, 2018, 12:08:45 PM
to stacks...@googlegroups.com, Christos Palaiokostas
Hi Christos,

You don't need to restrict the catalog in a reference-based analysis, because the reference itself acts as a filter, preventing lots of noise in the catalog.

Best,

julian

Christos Palaiokostas

Apr 6, 2018, 10:22:28 AM
to Stacks
Hi Julian,

Following up on my previous post: when I try to analyse more than 1,000 animals, gstacks breaks, while if I split the dataset into smaller pieces it runs.

Is there a way of joining those smaller datasets using the Stacks pipeline? I see that the populations program has an argument --in_vcf.

In previous versions I would simply create the catalog from a subset of animals. In my case (working with family data) the catalog was constructed only from the parents, and everything was joined through sstacks.

Cheers,
Christos

Nicolas Rochette

Apr 6, 2018, 10:53:41 AM
to stacks...@googlegroups.com

Hi Christos,

What do you mean by "gstacks breaks"?

Best,

Nicolas


Χρήστος Παλαιοκώστας

Apr 6, 2018, 11:10:55 AM
to stacks...@googlegroups.com
Hi Nicolas,

It just gives a general error message saying that it failed to open a file:

[E::hts_open_format] Failed to open file Offspring_bam/Off562.pp.bam
Error: Failed to open BAM file 'Offspring_bam/Off562.pp.bam'.
Aborted.

If I split the population map file, it runs without a problem.

In more detail, the entire dataset consists of 1,500 animals. gstacks works if I take, for example, only the first 1,018 animals, or only the last 500-600, etc.

Cheers,
Christos


Nicolas Rochette

Apr 6, 2018, 12:04:21 PM
to stacks...@googlegroups.com

I see—I think I know where it comes from, and I was expecting this problem to come up at some point.

TL;DR: This is because your UNIX configuration does not allow you to open more than (most likely) 1024 files at the same time. The solution is to increase that limit.

If you run in a shell:

ulimit -n

This will probably be 1024. But if you run

ulimit -Hn

you will most likely see a higher number. Then you can increase the effective (soft) limit by running e.g.

ulimit -n 4096

before calling stacks/gstacks, or in your ~/.profile (or ~/.bashrc). Then it should work.
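
For example, the whole thing can be as simple as the following sketch (the gstacks arguments are just placeholders for a reference-based run; use whatever input directory, population map and output directory you normally pass):

# raise the soft limit for this shell session, then run gstacks as usual
ulimit -n 4096
gstacks -I ./aligned_bams -M ./popmap.tsv -O ./gstacks_out -t 8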

The other option is to merge all the samples' matches.bam files into a single catalog.bam file first, and then run gstacks. But you're going to run into the same issue with samtools merge and will have to do it iteratively, so it's much easier to just increase the open-files limit.
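
If you do go the merging route anyway, a rough, untested sketch of a batched merge could look like this (the batch size of 500 is arbitrary; it just has to stay comfortably below your open-files limit, and the file names are placeholders):

# merge the matches.bam files in batches so no single samtools call
# opens more files than the limit allows (assumes no spaces in file names;
# depending on how the inputs are sorted you may need extra samtools flags)
ls -1 *.matches.bam | split -l 500 - batch_
for b in batch_*; do
    samtools merge "merged_${b}.bam" $(cat "$b")
done
samtools merge catalog.bam merged_batch_*.bam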

I'll see if the error message can be improved.

Best,

Nicolas


Χρήστος Παλαιοκώστας

Apr 6, 2018, 1:24:36 PM
to stacks...@googlegroups.com
Hi Nicolas,

Thanks, that makes sense :)

Cheers,
Christos

Stacks newbie

Dec 6, 2019, 11:53:35 AM
to Stacks
Hi Nicolas, 
I increased my ulimit to 4096 as suggested, but it did not work. I then tried to use "cat" to merge all my .matches.bam files into one large file and ran gstacks again. That too failed, with the error:

[E::hts_open_format] Failed to open file /lustre/mngeve/catdata/cleandataplus1/ustacks_M3_out/ACR1_2.matches.bam

Error: Failed to open BAM file '/lustre/mngeve/catdata/cleandataplus1/ustacks_M3_out/ACR1_2.matches.bam'.

Aborted.


I am running Stacks on a cluster, and I have a data set of about 1,700 individuals from about 20 populations of a plant species across North America.
Please, can you advise me on how I might proceed?

Best,
Magdalene


Nicolas Rochette

Dec 6, 2019, 1:33:58 PM
to Stacks Users Group

Hi Magdalene,

Could you confirm that the file actually exists at that path and can be opened with samtools?

Best,

Nicolas


David Dayan

May 18, 2020, 1:00:05 PM
to Stacks
Hi all,

I'm running into this same problem: Stacks crashes and directs me to this Google Groups thread, even though my limit on the number of open files exceeds the number of .bam files.

I have ~2,400 individuals and ulimit -n returns 4096, so I'm a bit confused about what is breaking here. Does Stacks need to open more than one file per individual? Will the same error be thrown if the issue is running out of memory rather than open files?


Thanks,
David

p.s. here's the error:

[E::hts_open_format] Failed to open file ../alignments/cl_f_130.bam

Error: Failed to open BAM file '../alignments/cl_f_130.bam'.

Error: You might need to increase your system's max-open-files limit, see https://groups.google.com/d/msg/stacks-users/GZqJM_WkMhI/m9Hreg4oBgAJ

Aborted.


After the error is thrown, the offending file is replaced with several files named "cl_f_130.bam.tmp.0001", "cl_f_130.bam.tmp.0002", etc.

Julian Catchen

May 18, 2020, 4:41:29 PM
to stacks...@googlegroups.com, David Dayan
Hi David,

Yes, it needs to open more than one file per sample. You can increase the limit as specified in Nicolas's message. It is not a permanent change to the system, and it is a limit that exists on all UNIX systems (the open-file limit is a security feature).
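
For instance, on a cluster the simplest approach is usually to raise the limit at the top of the job script itself, since a ulimit set in your login shell will not necessarily carry over into the batch job. A rough sketch (the SLURM header is hypothetical and the gstacks arguments are placeholders):

#!/bin/bash
#SBATCH --job-name=gstacks   # hypothetical scheduler header; adapt to your cluster
ulimit -n 8192               # must not exceed the hard limit shown by `ulimit -Hn`
gstacks -I ../alignments -M popmap.tsv -O gstacks_out -t 16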

julian



--
Julian Catchen, Ph.D.
Assistant Professor
Department of Evolution, Ecology, and Behavior
Carl R. Woese Institute for Genomic Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen

David Dayan

May 18, 2020, 4:49:07 PM
to Stacks
Thanks

Benjamin Goetz

Aug 21, 2020, 3:48:08 PM
to Stacks
Howdy,

I'm running into the same problem. I'm using a reference, so I'm trying to run gstacks with BAM files aligned to a pre-existing reference (Stickleback). I have 3,331 samples, so if processing every BAM file requires opening at least two files, setting ulimit -n 4096 won't save me (and I tried it).

I am using a cluster at the Texas Advanced Computing Center (TACC), so I have hundreds of nodes available to me. But I can't think of a way of splitting the files across nodes. Splitting by chromosome won't reduce the number of sample files. But I'm new to Stacks, so perhaps there's some clever splitting strategy that wouldn't occur to me? I only have 5-6 populations.

Any help would be greatly appreciated!

Benni Goetz
Bioinformatics Consulting Group
Center for Biomedical Research Support
University of Texas at Austin

Nicolas Rochette

Aug 21, 2020, 5:01:38 PM
to stacks...@googlegroups.com, Benjamin Goetz

Hi Benni,

I don't have a pre-made answer, but I would like to point out two things:

* gstacks doesn't actually detect a problem with file limits. It prints the Google Groups link whenever opening a BAM file fails while there are already more than 250 successfully opened BAM files. Do check your files individually (and I believe the name of the file causing the error is given).

* gstacks should open one file per sample (the *.matches.bam) plus a few more (log/output files, etc.), but unrelated, concurrently running programs may use up a few file descriptors as well; a quick way to check what a running gstacks process actually has open is sketched below.
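
For the record, on Linux something like the following shows how many descriptors a running gstacks process really has open, and which limit actually applies to it (assuming the process belongs to your user):

pid=$(pgrep -u "$USER" -n gstacks)          # newest gstacks process owned by you
ls /proc/"$pid"/fd | wc -l                  # file descriptors currently open
grep "Max open files" /proc/"$pid"/limits   # the limit actually applied to that process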

Best,

Nicolas


Benjamin Goetz

Aug 21, 2020, 7:44:06 PM
to Stacks
Thanks for the fast response!

I don't have any *.matches.bam files, just *.bam files. I assume that's because I'm using a reference, and just used BWA for the alignment. (And sorted, converted to BAM format with samtools.)

I'll try opening all the *.bam files to see if I get any errors, maybe use a loop to have samtools echo the first line to /dev/null or something.
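
For instance, something along these lines (assuming samtools is on the PATH; samtools quickcheck flags files that are truncated or unreadable, and the loop reports files that open fine but contain no alignments at all):

# list BAM files that samtools considers broken
samtools quickcheck -v *.bam > bad_bams.txt

# list BAM files that are valid but contain zero alignment records
for f in *.bam; do
    n=$(samtools view -c "$f")
    [ "$n" -eq 0 ] && echo "no alignments: $f"
done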

FWIW, here's the error message I got. It doesn't mention the name of a file causing an error, just what looks like a line in the code:

At src/BamI.cc:102 This should never happen.
Error: You might need to increase your system's max-open-files limit, see https://groups.google.com/d/msg/stacks-users/GZqJM_WkMhI/m9Hreg4oBgAJ
Aborted.

Benni


Benjamin Goetz

Aug 21, 2020, 7:55:24 PM
to Stacks
Hmmm, trying to look at just a few of the BAM files, I get no output from samtools view. The file sizes are >0, and when I run less on them I can see SAM headers, chromosome and scaffold names, and a bunch of gibberish from the binary format, but samtools view gives no output and no errors. If samtools view doesn't show anything, the problem is likely upstream of gstacks, I guess. I'll have to see if I can find what went wrong. Weird.

Thanks for the help,

Benni

Benjamin Goetz

Aug 21, 2020, 8:29:44 PM
to Stacks
Hello yet again,

The reason samtools view doesn't return anything for some of the BAM files is that the corresponding demultiplexed FASTQ files are empty, so there *are* no alignments!
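
In case it is useful to anyone else, a quick way to spot those samples directly from the demultiplexed output is something like this (assuming gzipped FASTQ files; the samples/ path and the .fq.gz extension are placeholders for whatever process_radtags produced):

# report demultiplexed samples containing zero reads (one read = 4 FASTQ lines)
for f in samples/*.fq.gz; do
    reads=$(( $(zcat "$f" | wc -l) / 4 ))
    [ "$reads" -eq 0 ] && echo "empty sample: $f"
done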

Is there any way of seeing which original FASTQ file a demultiplexed file came from? I'm curious whether one problematic FASTQ file from the sequencer accounts for all the empty demultiplexed files.

BTW, thanks for some great software. process_radtags is pretty cool just by itself!

Benni

Nicolas Rochette

Aug 23, 2020, 1:09:52 AM
to stacks...@googlegroups.com, Benjamin Goetz

Hi Benni,

Julian may have something to add, but you would have to ask the people who did the demultiplexing and/or designed the barcodes.

Best,

Nicolas

Lorenzo Bertola

Apr 14, 2021, 10:16:15 PM
to Stacks
Hello everyone,

I have encountered the same error, and the very useful error message led me here. 

ulimit -n 4096 seems to have fixed the problem.

I just wanted to note, for anyone encountering this error: to confirm whether the open-file limit is the cause, check which sample is the (open-file limit + 1)-th file when listing all the matches.bam files in your input directory. For instance, my open-file limit was 1024, and the sample that could not be loaded was number 1025 in the directory, which confirmed that the issue was the open-file limit.
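
Concretely, that check can be as simple as the sketch below (approximate, since a handful of descriptors go to logs and other files, and it assumes the BAM files are opened in the order they are listed):

limit=$(ulimit -n)                              # current soft open-file limit
ls -1 *.matches.bam | sed -n "$((limit + 1))p"  # the sample that would push you past it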

Then, as Julian and Nicolas pointed out, check that the offending file exists, can be read by e.g. samtools, etc.

paige...@ucsb.edu

Jan 22, 2022, 6:53:29 PM
to Stacks
Hi, I just wanted to follow up on this comment. I learned the hard way that you cannot always tell you are hitting the open-file limit just by checking the sample number you were on against the soft limit on open files. When running sstacks, each sample apparently requires multiple prior output files to be open at the same time, so I hit the soft limit on the remote server while trying to run 300 samples against a catalog I had prebuilt from a smaller core set of samples, even though the ulimit was 1040 files.