Approaches to debugging cache usage


Robert Syme

Nov 20, 2019, 8:52:10 AM
to Nextflow
Hi all

I don't think I have a good understanding of how the caching mechanism works. Take these two runs, for example:


[Attachment: Screen Shot 2019-11-20 at 8.46.29 AM.png]


The workflow takes two lanes of sequencing, demultiplexes them into 192 samples, and then runs Stacks over each sample. The only thing I changed between runs was the addition of an extra process at the end. For the StacksMapped:MapReadSingles process, Nextflow used the cached version for 191 of the 192 samples that it ran last time, but for one of the samples, it decided that the cache was no good and needed to re-run the process (and everything downstream). I've used the default cache strategies for all processes.


Is there a sensible debugging approach to working out why the cached version wasn't used in this instance?


Thanks!


-Rob

Paolo Di Tommaso

Nov 21, 2019, 1:49:05 AM
to nextflow
Hi Rob, 

Is the restarted process taking its inputs from two separate upstream tasks/channels that are expected to be paired? If yes, how is the input declared?

p



--
You received this message because you are subscribed to the Google Groups "Nextflow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nextflow+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nextflow/97b1c639-f50d-47fa-87ed-2141228d6f75%40googlegroups.com.

Robert Syme

Nov 21, 2019, 10:15:54 AM
to Nextflow
Hi Paolo

Thanks for responding. I was trying to keep the question a little vague in an attempt to solicit general advice rather than drawing you into solving this particular problem, but since you asked... ;) 

The processes that are being unnecessarily restarted are fed read pairs and read singletons from a demultiplexing step. Those processes have input "shape":

    input:
    tuple name, path(readPairs)
    path(genomeIndex)
    path(genome)

Upstream, the process "ProcessRadtags" emits many fastq.gz files. The compressed reads come in fours for each sample name. The fastq filenames have the form:
- <samplename>.1.fq.gz, 
- <samplename>.2.fq.gz, 
- <samplename>.rem.1.fq.gz, and 
- <samplename>.rem.2.fq.gz

Where the first two are properly paired reads, and the last two are the 'remainders' - singleton reads without a matching pair.

I want to turn this channel of individual fastq files into something that matches the input "shape" described above:
1) A "ReadsPaired" channel where elements in the channel are a tuple of [foo, ["foo.1.fq.gz", "foo.2.fq.gz"]], and
2) A "ReadsSingle" channel where elements in the channel are a tuple of [foo, ["foo.rem.1.fq.gz", "foo.rem.2.fq.gz"]]

I do this by regex on the filename, transforming the channel into tuples with three elements: [<samplename>, <path>, <is the path a singleton?>]

ProcessRadtags.out.reads.flatten() \
| map {
    def pattern = ~/([^(rem)]+)\.(rem)?\.?[12]/;
    def matcher = it.getBaseName() =~ pattern;
    matcher.find();
    [ matcher[0][1], it, matcher[0][2] ? true : false ]
} \
| branch {
    single: it[2]
    paired: true
} \
| set { reads }

This is *almost* what I want, but I'd like to remove the (now extraneous) last member of that tuple, so:

reads.paired \
| groupTuple() \
| map { name, paths, rems -> [name, paths] } \
| set { ReadsPaired }

reads.single \
| groupTuple() \
| map { name, paths, rems -> [name, paths] } \
| set { ReadsSingle }

It is these "ReadsPaired" and "ReadsSingle" channels (each with 192 of the [name, paths] tuples) which feed the MapReadPairs and MapReadSingles processes that are being unnecessarily restarted. What's unexpected to me is that of the 192 instantiations of "MapReadPairs" or "MapReadSingles", it's only ever one or two that need restarting and 190 or 191 happily pull from the cache.
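(Side note: I believe `branch` also accepts an optional return statement after each condition to reshape the emitted value, which would let me drop the trailing boolean inside the branch itself and skip the later map step. An untested sketch:)

```groovy
ProcessRadtags.out.reads.flatten() \
| map { /* same regex map as above */ } \
| branch {
    single: it[2]
        return [it[0], it[1]]   // emit [name, path], dropping the flag
    paired: true
        return [it[0], it[1]]
} \
| set { reads }
```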

-rob

Paolo Di Tommaso

Nov 21, 2019, 10:27:53 AM
to nextflow
Hi Rob, 

> The processes that are being unnecessarily restarted are fed read pairs and read singletons from a demultiplexing step. Those processes have input "shape":


Then it's not what I was suspecting. Try the `-dump-hashes` command-line option and compare the logs of two different executions. Be warned, it's not a friendly log to dig into; use a good text diff tool.
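For example, the comparison could look something like this (the log excerpts below are toy, fabricated stand-ins just to show the shape; with real runs you would grep the two `.nextflow.log` files instead):

```shell
# Toy stand-ins for the .nextflow.log of each run (fabricated task name
# and hashes, purely illustrative).
cat > run1.log <<'EOF'
[MyProc (sample1)] cache hash: aaaa1111; mode: STANDARD; entries:
  e1e1e1e1 [java.lang.String] MyProc
EOF
cat > run2.log <<'EOF'
[MyProc (sample1)] cache hash: bbbb2222; mode: STANDARD; entries:
  e1e1e1e1 [java.lang.String] MyProc
EOF
# Extract the hash entries for the task of interest and diff the two runs.
grep -A1 'MyProc (sample1)' run1.log > first.txt
grep -A1 'MyProc (sample1)' run2.log > second.txt
diff first.txt second.txt || true   # diff exits 1 when the runs disagree
```

Each line under "entries:" is one component of the cache key, so the first entry whose hash differs between the runs should point at the culprit.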

 
p


Robert Syme

Nov 21, 2019, 10:43:13 AM
to Nextflow
Ah, `-dump-hashes` is probably what I'm after. Thanks Paolo. If I find something interesting/unexpected, I'll drop a reply here.

-Rob



Paolo Di Tommaso

Nov 21, 2019, 11:03:06 AM
to nextflow
Let me know. Thanks.

p


Rob Syme

Nov 21, 2019, 3:02:00 PM
to Nextflow
Hmmm. 

I ran the pipeline twice with `-dump-hashes`, without changing anything about the input files, config, or pipeline script itself.

In the second instance, the process "StacksMapped:Bamstats (OQ093)" was re-run, so I've pulled the relevant section of .nextflow.log from each run.

In the first run:

Nov-21 13:24:06.441 [Actor Thread 38] INFO  nextflow.processor.TaskProcessor - [StacksMapped:Bamstats (OQ093)] cache hash: 4af25aa13d5fe58b08bc1d0bbdad4ec7; mode: STANDARD; entries:
  52ccec5ab302b8b39c409275e173cbbf [java.util.UUID] 48d9cd3d-44af-46db-a2b2-611d872f7c48
  1ea0e901e5c1dfab9559bab2b8779fb0 [java.lang.String] StacksMapped:Bamstats
  f0b1ad28ac218170982e27bd18123d00 [java.lang.String] """
mkdir out
samtools flagstat $inbam > out/${inbam.getBaseName(1)}.flagstat
samtools idxstats $inbam > out/${inbam.getBaseName(1)}.idxstats
    """
 
  dc34d57d7e4457e9f37bfff3f5ba1433 [java.lang.String] inbam
  c6b3ca14542f2a42bf4f39dcf9f28b8d [nextflow.util.ArrayBag] [FileHolder(sourceObj:/lustre03/project/6007512/robsyme/Kesanakurti.Quinoa.C3G-PK-425/work/cc/77226f6201a90ddb73b5c099c99814/OQ093.bam, storePath:/lustre03/project/6007512/robsyme/Kesanakurti.Quinoa.C3G-PK-425/work/cc/77226f6201a90ddb73b5c099c99814/OQ093.bam, stageName:OQ093.bam)]
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
  a457f14b87cdf8af85f707d410342e34 [java.lang.String] mugqic/samtools/1.9

Nov-21 13:24:06.447 [Actor Thread 38] INFO  nextflow.processor.TaskProcessor - [67/b04947] Cached process > StacksMapped:Bamstats (OQ093)
<SNIP>


And in the second run:

Nov-21 13:30:43.780 [Actor Thread 71] INFO  nextflow.processor.TaskProcessor - [StacksMapped:Bamstats (OQ093)] cache hash: 1d7cb5bbfbadb6fbba94f3d8d02fdaad; mode: STANDARD; entries:
  52ccec5ab302b8b39c409275e173cbbf [java.util.UUID] 48d9cd3d-44af-46db-a2b2-611d872f7c48
  1ea0e901e5c1dfab9559bab2b8779fb0 [java.lang.String] StacksMapped:Bamstats
  f0b1ad28ac218170982e27bd18123d00 [java.lang.String] """
mkdir out
samtools flagstat $inbam > out/${inbam.getBaseName(1)}.flagstat
samtools idxstats $inbam > out/${inbam.getBaseName(1)}.idxstats
    """
 
  dc34d57d7e4457e9f37bfff3f5ba1433 [java.lang.String] inbam
  c6b3ca14542f2a42bf4f39dcf9f28b8d [nextflow.util.ArrayBag] [FileHolder(sourceObj:/lustre03/project/6007512/robsyme/Kesanakurti.Quinoa.C3G-PK-425/work/cc/77226f6201a90ddb73b5c099c99814/OQ093.bam, storePath:/lustre03/project/6007512/robsyme/Kesanakurti.Quinoa.C3G-PK-425/work/cc/77226f6201a90ddb73b5c099c99814/OQ093.bam, stageName:OQ093.bam)]
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
  a457f14b87cdf8af85f707d410342e34 [java.lang.String] mugqic/samtools/1.9 
Nov-21 13:30:43.781 [Actor Thread 74] INFO  nextflow.processor.TaskProcessor - [StacksMapped:FastQC (OQ098)] cache hash: 83ef570959ce7cb163d4ce64541982e6; mode: STANDARD; entries: 
<SNIP>

Comparing the entries above, none of the component hashes change across runs, yet the final cache hash differs (4af25aa13d5fe58b08bc1d0bbdad4ec7 vs 1d7cb5bbfbadb6fbba94f3d8d02fdaad). Is there anything that goes into calculating the cache hash other than the entries listed?

-Rob



Jason Steen

Nov 21, 2019, 4:03:15 PM
to next...@googlegroups.com
What sort of file system are you on? I'm on Lustre and still can't, for the life of me, work out why Nextflow sometimes can't resume.

Paolo Di Tommaso

Nov 21, 2019, 4:41:21 PM
to nextflow
Umm, this is weird. Any chance you could create a test case that I can replicate?

p

Rob Syme

Nov 21, 2019, 4:41:41 PM
to Nextflow
It *is* lustre, ugh :(

I've had mercifully few complicated problems with Nextflow, but almost all of them wind me through a convoluted path of debugging and, inevitably, to some problem with the filesystem.

Could it be a timing issue? Could NF check for the .exitcode file, the filesystem take too long to respond, and Java report back that the file doesn't exist?

I appreciate that it might just be shared filesystems being flaky, in which case nothing can be done.

-Rob

Rob Syme

Nov 21, 2019, 5:20:16 PM
to Nextflow
I'll have a poke around and see how replicable the problem is on different systems on the same cluster and then across different facilities. If the problem persists, I'll dig in and put together a small example set.

-Rob

Rob Syme

Nov 21, 2019, 5:20:50 PM
to Nextflow
Correction: 
see how replicable the problem is on different filesystems on the same cluster

Paolo Di Tommaso

Nov 21, 2019, 5:26:24 PM
to nextflow
If it were a FS problem, there should have been a different hash for one of the file components. Instead, it's very strange that only the resulting hash changes. 🤔

Kemin Zhou

Dec 4, 2019, 5:06:36 PM
to next...@googlegroups.com
I was having the same problem; it got fixed by removing the publishDir "/absolute/path". Could be something else, though. I was fighting this problem for two days.



--
Kemin Zhou

Robert Syme

Dec 6, 2019, 8:51:12 AM
to Nextflow
I promise that I'll get you a replicable test. This seems to be one of those slippery bugs that moves about as you try to pin down the parameters.

In my initial project, I just had to finish, so I switched to "cache 'deep'" to sweep the problem under the rug. I've not forgotten about this, and as soon as I have a spare morning, I'll put some time into getting a replicable situation up on GitHub. Apologies for the delay!
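(For anyone landing on this thread later, the sweep-under-the-rug is just the `cache` directive. A sketch of the config form, using a process name from my pipeline: 'deep' hashes input file content instead of the default path/size/timestamp metadata, and 'lenient', which skips the timestamp, may also be worth a try on shared filesystems like Lustre.)

```groovy
// nextflow.config
process {
    withName: 'MapReadSingles' {
        cache = 'deep'        // hash input file content
        // cache = 'lenient'  // or: path and size only, ignoring timestamps
    }
}
```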

-Rob


