sfm option

Brian Simison

Jun 1, 2021, 7:11:39 PM
to elprep
I have a question about how the sfm option uses threads. If I ask for 16 threads, my "top" command indicates that elprep is only using 3-6 threads, averaging about 3.5. I see that it has broken my bam into 21 groups, but elprep seems to use only about four threads at a time and to update only one group vcf at a time. Each group is taking about 4 to 10 hours. I was expecting elprep to process many groups simultaneously rather than one at a time.

Is the sfm option supposed to process multiple bam groups in parallel and distribute the threads equally? Or is sfm more of a memory conservation option?

elprep is also not using much memory. We have 2 TB of memory and 256 threads, so we have heaps of space available for parallel analyses.

My command is:
elprep sfm  ../bams/MZ202499.bam MZ202499_HapCall.bam \
--nr-of-threads 16  \
--reference Nm_1.1.elfasta  \
--haplotypecaller MZ202499_HapCall.vcf.gz

Charlotte Herzeel (imec)

Jun 2, 2021, 11:31:34 AM
to Brian Simison, elprep
Hi,

I have a number of different answers to your question:

- The sfm mode is designed to reduce RAM use by splitting the input data into smaller chunks that are processed one by one (see the section Split and Merge tools in the README). It is possible to write your own split/merge tools that split the data into groups and process those groups in parallel on a cluster or on the same node, similar to what you are suggesting. We used to provide an example script for this using GNU parallel, but we dropped it because of the maintenance overhead. You can still find it on GitHub under the older releases (see for example version 3.0 under scripts).

- You may actually want to run the elprep filter mode. Just replace “sfm” in your command with “filter”. That mode does not split up the data at all. Whether it will run on your server depends on the size of your input bam, but 2 TB of memory seems large enough.

- It is normally not necessary to tell elPrep the specific number of threads to use. The runtime of Go (our implementation language) normally does an optimal allocation and management of runtime threads.

- elPrep is best used to execute pipelines that consist of multiple steps. The example command only uses the haplotype caller, but if your pipeline consists of multiple steps, it is best to combine them in a single command-line invocation of elPrep. elPrep internally merges and parallelizes the execution of multiple pipeline steps.

- There are multiple long-running phases in the haplotype caller algorithm, and some may use fewer threads than others. I would also look at the overall CPU usage after an entire run.
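To make the filter-mode and CPU-usage suggestions concrete, here is a minimal sketch. It reuses the file names from the original command, swaps sfm for filter, and leaves out --nr-of-threads so the Go runtime picks the thread count. The DRY_RUN variable is a hypothetical convenience, not an elPrep option, and the script assumes GNU time is installed at /usr/bin/time.

```shell
#!/bin/sh
set -eu

# Same inputs and outputs as the original sfm command; only the mode
# changes, and --nr-of-threads is omitted on purpose.
set -- elprep filter ../bams/MZ202499.bam MZ202499_HapCall.bam \
    --reference Nm_1.1.elfasta \
    --haplotypecaller MZ202499_HapCall.vcf.gz

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command so it can be checked before a long run.
    echo "$*"
else
    # GNU time -v reports elapsed time, "Percent of CPU this job got",
    # and peak memory for the whole run (assumes /usr/bin/time is GNU time).
    /usr/bin/time -v "$@"
fi
```

Run it with DRY_RUN=0 to actually start the job. A CPU percentage well above 100% in the time report means several cores were busy on average, even if "top" showed only a few threads during some phases.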

Thanks,
Charlotte
