GC failing with "unable to get tag"

Giles Brown

Apr 16, 2016, 10:30:22 AM

to Disco-development

It seems I can no longer run garbage collection on my 0.5.4 Disco cluster.

I see this kind of thing in the error logs:

disco@disco-master-9635:/disco/log$ grep "GC: stopping" error.log.*

error.log.0:2016-04-15 10:27:00.886 [error] <0.4386.48>@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag <<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}

error.log.0:2016-04-15 13:01:02.066 [error] <0.11431.59>@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag <<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}

error.log.1:2016-04-14 16:47:26.320 [error] <0.3981.29>@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag <<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}

error.log.1:2016-04-14 21:04:51.049 [error] <0.8738.36>@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag <<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}

error.log.1:2016-04-14 22:40:18.085 [error] <0.4831.42>@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag <<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}

error.log.2:2016-04-13 19:11:44.951 [error] <0.29076.15>@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag <<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}

I was imagining that "unable to get tag" was a sign of a timeout when fetching that particular tag's data, but I can successfully "ddfs get" that tag from the command line. It seems significant that it is always the same tag that fails. That suggests something persistent (on disk) rather than a node being randomly slow.
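To check that it really is always the same tag, the grep output can be tallied per tag and error reason. This is a minimal sketch, not part of Disco: the regular expression is inferred only from the six log lines quoted above, and `count_gc_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not taken from the Disco source). Captures the tag name and error reason.
GC_FAILURE = re.compile(
    r'GC: stopping, unable to get tag <<"([^"]+)">>: \{error,(\w+)\}'
)


def count_gc_failures(lines):
    """Return a Counter keyed by (tag, reason) for each GC failure line."""
    counts = Counter()
    for line in lines:
        match = GC_FAILURE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Two of the lines from the grep output above, used as sample input.
sample = [
    'error.log.0:2016-04-15 10:27:00.886 [error] <0.4386.48>'
    '@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag '
    '<<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}',
    'error.log.1:2016-04-14 16:47:26.320 [error] <0.3981.29>'
    '@ddfs_gc_main:handle_cast:387 GC: stopping, unable to get tag '
    '<<"wexin:2015:11:20:07:16:51:bowie-node-6625">>: {error,timeout}',
]

for (tag, reason), n in count_gc_failures(sample).items():
    print(n, tag, reason)
```

A single (tag, reason) pair accounting for every failure points at that tag's stored data rather than at general slowness.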

Any suggestions on how to investigate this problem would be gratefully received.

Thanks,

Giles

Giles Brown

Apr 16, 2016, 10:45:16 AM

to Disco-development

I'll add some important extra context that I forgot to mention. Recently one of the cluster nodes died on me. The system identified the node as down and correctly re-replicated stuff, but I suspect that the garbage collection failures might be related to there being some references to this node, which is no longer part of the cluster.

I still don't have any idea how or why or what to do about it though, so suggestions are still welcome.

Thanks.

Giles Brown

Apr 19, 2016, 1:06:44 PM

to Disco-development

OK, so I tried manually removing the blob (disco://) URLs referencing the dead node, and this did indeed let the GC run.

Raised an issue here: https://github.com/discoproject/disco/issues/638
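For anyone wanting to script the manual removal, the filtering step might look roughly like the sketch below. It assumes a tag's URLs are held as a list of replica sets, each a list of disco:// URLs for the same blob. The function name, hostnames and paths are all hypothetical, and the code only filters an in-memory list; it does not talk to DDFS or write anything back.

```python
from urllib.parse import urlparse


def drop_dead_node_urls(replica_sets, dead_host):
    """Remove blob URLs hosted on dead_host from each replica set.

    Returns (cleaned, orphaned). cleaned keeps every replica set that still
    has at least one URL on a live host. orphaned lists replica sets whose
    only copies were on the dead host, so they can be reviewed by hand
    instead of being dropped silently.
    """
    dead = dead_host.lower()  # urlparse lowercases hostnames
    cleaned, orphaned = [], []
    for replicas in replica_sets:
        live = [url for url in replicas if urlparse(url).hostname != dead]
        if live:
            cleaned.append(live)
        else:
            orphaned.append(list(replicas))
    return cleaned, orphaned


# Hypothetical tag contents: two blobs, one of them with no surviving replica.
urls = [
    ["disco://node-a/ddfs/vol0/blob/00/part-0", "disco://node-dead/ddfs/vol0/blob/00/part-0"],
    ["disco://node-dead/ddfs/vol0/blob/01/part-1"],
]
cleaned, orphaned = drop_dead_node_urls(urls, "node-dead")
print(cleaned)
print(orphaned)
```

Reporting the orphaned sets separately matters because a blob whose only replicas were on the dead node is lost data, not just a stale reference.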

Giles Brown

Apr 24, 2016, 7:36:58 PM

to Disco-development

It seems from the docs (http://disco.readthedocs.org/en/latest/howto/administer.html#blacklisting-a-ddfs-node) that the correct procedure for handling a failed node might have been to blacklist it from DDFS and then wait for it to go green before removing it from the configuration. I didn't do that, so that may very well be why things went wrong for me.

Giles
