Running processes in parallel


Matt Dyer

Feb 14, 2015, 2:32:40 PM2/14/15
to next...@googlegroups.com
I am new to Nextflow and just wondering if someone has any links to documentation or guidance about how to run processes in parallel. Take the following example.

#!/usr/bin/env nextflow


process A {
    output:
    file 'a'

    """
    echo 'Starting A'
    echo 'Hello' > a
    sleep 10
    echo 'Finished A'
    """
}


process B {
    output:
    file 'b'

    """
    echo 'Starting B'
    echo 'World' > b
    sleep 5
    echo 'Finished B'
    """
}

process C {
    input:
    file a
    file b

    output:
    file 'c'

    """
    echo 'Starting C'
    cat a b > c
    echo 'Finished C'
    """
}

Given A and B have no dependencies on each other I'd like them to run in parallel, but by default they only seem to be run sequentially (B then A).

Best,
Matt

Paolo Di Tommaso

Feb 14, 2015, 5:00:46 PM2/14/15
to nextflow
Hi Matt, 

You don't need extra code to run processes in parallel. Actually, in Nextflow *all* processes are executed in parallel, and each one simply waits for its input data before running.

I think you are being misled by the fact that a process's output is printed to the console at process termination, so you won't see the echo result as soon as it happens.

You can verify that by printing the current timestamp in your process script, for example:


process A {
    output:
    file 'a'
    
    """
    echo 'Starting A' `date +%H-%M-%S`
    echo 'Hello' > a
    sleep 10
    echo 'Finished A' `date +%H-%M-%S`
    """
}

 
and running it with the -process.echo true command line option.
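The same effect can also be set once in a nextflow.config file instead of on the command line (a sketch; the `process` scope here applies the `echo` directive to every process):

```groovy
// nextflow.config -- sketch, equivalent to passing -process.echo true
process {
    echo = true
}
```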


Best,
Paolo


--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.
Visit this group at http://groups.google.com/group/nextflow.
For more options, visit https://groups.google.com/d/optout.

Matt Dyer

Feb 15, 2015, 9:01:50 PM2/15/15
to next...@googlegroups.com

Thanks Paolo. It does look like they run serially and not in parallel. If they ran in parallel I would expect the start times to be the same, but you can see A does not start until B finishes. Is there some other code I need for A and B to start at the same time?

N E X T F L O W  ~  version 0.12.2
[warm up] executor > local
[da/7667a6] Submitted process > B (1)
[46/16174e] Submitted process > A (1)
Starting B 17-58-01
Finished B 17-58-06
Starting A 17-58-06
Finished A 17-58-16
[5e/f85e58] Submitted process > C (1)
Starting C
Finished C


Matt Dyer

Feb 15, 2015, 10:12:13 PM2/15/15
to next...@googlegroups.com
So it looks like it may be something with my OS (Ubuntu 14.04 LTS), as when the same file is run on OS X it works as you say.

N E X T F L O W  ~  version 0.12.2
[warm up] executor > local
[2b/27cad2] Submitted process > B (1)
[88/571715] Submitted process > A (1)
Starting B 21-22-51
Finished B 21-22-56
Starting A 21-22-51
Finished A 21-23-01
[9d/b137d4] Submitted process > C (1)
Starting C 21-23-01
Finished C 21-23-01

Matt Dyer

Feb 15, 2015, 10:46:12 PM2/15/15
to next...@googlegroups.com
OK, tracked down the problem. On my Ubuntu VM I had only two cores allocated; when I bumped this to three, it worked as expected.

N E X T F L O W  ~  version 0.12.2
[warm up] executor > local
[25/2ab4de] Submitted process > B (1)
[08/003850] Submitted process > A (1)
Starting B 19-35-00
Finished B 19-35-05
Starting A 19-35-00
Finished A 19-35-10

Can you share any info on how the number of CPUs available for usage is calculated?

Alex Rothberg

Feb 15, 2015, 10:52:58 PM2/15/15
to next...@googlegroups.com
It looks to be calculated here: https://github.com/nextflow-io/nextflow/blob/f5284e325055a0207c31616f568fa48bfbe0bb19/src/main/groovy/nextflow/Session.groovy#L152

The default value (cpus - 1) can be overridden in the config by setting poolSize.
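For reference, a minimal nextflow.config sketch overriding the pool size (`poolSize` is the top-level setting read by the Session class Alex links to; the value 8 is purely illustrative):

```groovy
// nextflow.config -- sketch: override the local executor's thread pool size
poolSize = 8
```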

Paolo Di Tommaso

Feb 16, 2015, 4:14:11 AM2/16/15
to nextflow
Yes, to be exact, the thread pool used to run processes is sized as MAX(2, cpus - 1).

However, the number of processes that can run in parallel is controlled by the size of an internal bounded queue, which by default should match the thread pool size. It turns out there's a small glitch here, because it is defined as cpus - 1 when it should be the same value as poolSize. See:


This explains why Matt was observing that unexpected behaviour with a two-core CPU.
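The sizing rule above can be sketched in plain Python (illustrative only, not the Nextflow source):

```python
# Sketch of the thread-pool sizing rule Paolo describes: MAX(2, cpus - 1).
# With 2 cores this still yields a pool of 2, so A and B can run together;
# the glitch was that the internal bounded queue used cpus - 1 instead,
# which on a 2-core machine allowed only one process at a time.

def pool_size(cpus):
    # thread pool used by the local executor
    return max(2, cpus - 1)

print(pool_size(2))  # 2
print(pool_size(4))  # 3
```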


Thanks guys for your useful feedback. 


Cheers,
Paolo




Leif Gong

Sep 19, 2018, 11:47:20 AM9/19/18
to Nextflow
Sorry, I am reposting my question below:
Very interesting discussion; I learned a lot. This thread is an example of parallelism. Is there any example of "scalable"? Sorry if this question is very basic.
Best,
Leif

Paolo Di Tommaso

Sep 20, 2018, 5:09:49 AM9/20/18
to nextflow
Not sure I understand your question. Can you elaborate?

p


Leif Gong

Sep 20, 2018, 6:02:56 AM9/20/18
to Nextflow
Hi Paolo,
Nextflow is a fluent DSL modelled around the UNIX pipe concept that simplifies writing parallel and scalable pipelines in a portable manner (ref). I found your nice slides about parallelization. What is the meaning of "scalable"? Does it mean that it is easy to move from local execution to a cluster or the cloud? Sorry for this basic question.
Best,
Leif 

Damian Loska

Sep 20, 2018, 6:16:07 AM9/20/18
to Nextflow
I guess "scalable" means that if you have a pipeline with processA, processB and processC, you can easily add processD and processF while keeping the results of processes A, B and C unmodified (this is my understanding of scaling up a pipeline).

Paolo Di Tommaso

Sep 20, 2018, 6:28:08 AM9/20/18
to nextflow
It means that you can scale your workflow from a single computer to a compute cluster easily, without changing the application code.
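As a concrete sketch of what "without changing the application code" means: only the configuration changes between environments. The executor name below is a standard Nextflow executor; the queue name is illustrative:

```groovy
// nextflow.config -- sketch: switch from local execution to a SLURM cluster
// by changing configuration only; the pipeline script stays untouched
process {
    executor = 'slurm'   // was 'local'
    queue    = 'long'    // illustrative queue name
}
```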


p

Leif Gong

Sep 20, 2018, 6:57:36 AM9/20/18
to Nextflow
Thank you Damian! Thank you Paolo!

Leif Gong

Sep 25, 2018, 4:17:09 AM9/25/18
to Nextflow
Hi Paolo,

In your slides, one of the Dataflow benefits listed is "no synchronization / lock / concurrent-access hell". Does that mean one file can be accessed simultaneously by many users without I/O performance issues? Thank you in advance.

Best,

Leif

Paolo Di Tommaso

Sep 25, 2018, 9:20:47 AM9/25/18
to nextflow
No. This means that the Dataflow model implicitly handles synchronisation for you. Moreover, being based on functional principles such as immutability, there's no need to lock shared resources, because you are not supposed to concurrently modify the same data structure.
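The idea can be sketched in plain Python (not Nextflow internals): each "process" communicates only through channels (blocking queues), so synchronisation is implicit in the channel reads and no locks on shared state are needed:

```python
import threading
import queue

# Dataflow sketch: three "processes" wired by channels, mirroring the
# A/B/C example earlier in the thread. C blocks until both upstream
# values arrive -- that blocking read IS the synchronisation.

def process_a(out_a):
    out_a.put('Hello')          # emit an immutable value

def process_b(out_b):
    out_b.put('World')

def process_c(in_a, in_b, out_c):
    # implicit sync: get() blocks until A and B have produced their outputs
    out_c.put(in_a.get() + ' ' + in_b.get())

a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=process_a, args=(a,)),
    threading.Thread(target=process_b, args=(b,)),
    threading.Thread(target=process_c, args=(a, b, c)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = c.get()
print(result)  # Hello World
```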


p

Leif Gong

Sep 25, 2018, 9:58:23 AM9/25/18
to Nextflow
thank you!