Thanks Anubhav.
I am running a 3,000-node workflow that matches one of our real workflows in number of nodes and number of links, but with every task replaced by a call to "time". I'll verify that it has the same problem and forward it to you.
I was trying to restart 500 jobs that had been killed by an out-of-disk failure. I fixed the space issue and wanted to rerun them.
The restart took so long (8+ hours) that my ssh connection dropped overnight.
Now I have inconsistencies that the detect_lostruns command seems unable to recover from (I am guessing the command was holding a WFLock when it was killed by the terminal disconnect):
lpad -l charris.yaml detect_lostruns --fizzle
successfully loaded your custom FW_config.yaml!
2016-04-29 11:39:27,584 DEBUG Detected 1 lost launches: [5395]
2016-04-29 11:39:27,584 INFO Detected 1 lost FWs: [2934]
2016-04-29 11:39:27,585 INFO Detected 231 inconsistent FWs: [8042, 8040, 8038, 8036, 8034, 8032, 8030, 8028, 8026, 8024, 8016, 8014, 8012, 8010, 8008, 8006, 8004, 8002, 8000, 7997, 7996, 7994, 7992, 7990, 7981, 7979, 7977, 7975, 7973, 7971, 7969, 7967, 7965, 7963, 7961, 7959, 7957, 7955, 7953, 7945, 7943, 7941, 7939, 7937, 7935, 7933, 7931, 7929, 7927, 7925, 7923, 7921, 7919, 7917, 7915, 7907, 7905, 7903, 7901, 7899, 7897, 7895, 7893, 7891, 7889, 7887, 7885, 7883, 7881, 7879, 7871, 7869, 7867, 7865, 7863, 7861, 7859, 7857, 7855, 7853, 7851, 7849, 7847, 7845, 7843, 7835, 7833, 7831, 7828, 7825, 7823, 7817, 7815, 7813, 7811, 7809, 7807, 7799, 7797, 7793, 7791, 7789, 7787, 7785, 7783, 7781, 7779, 7777, 7775, 7773, 7771, 7763, 7761, 7759, 7757, 7755, 7753, 7751, 7749, 7747, 7745, 7741, 7739, 7737, 7735, 7727, 7725, 7723, 7721, 7719, 7717, 7713, 7711, 7709, 7707, 7705, 7703, 7701, 7699, 7691, 7687, 7685, 7683, 7681, 7679, 7677, 7675, 7673, 7671, 7669, 7667, 7653, 7651, 7649, 7647, 7645, 7643, 7641, 7637, 7635, 7633, 7631, 7629, 7627, 7619, 7617, 7615, 7613, 7611, 7609, 7607, 7605, 7603, 7601, 7599, 7597, 7595, 7593, 7591, 7547, 7545, 7543, 7541, 7539, 7537, 7533, 7531, 7527, 7525, 7523, 7521, 7519, 7292, 7290, 7288, 7282, 7280, 7278, 7276, 7274, 7272, 7270, 7268, 7226, 7224, 6888, 6884, 6882, 8735, 6405, 6403, 7512, 7479, 7443, 7440, 7407, 7404, 7371, 7368, 7335, 7262, 7198, 7195, 7166, 7134, 7102, 7068, 7065, 7036, 6972, 6937]
You can fix inconsistent FWs using the --refresh argument to the detect_lostruns command
charris@s01 ~/code>lpad -l charris.yaml detect_lostruns --fizzle --refresh
...
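In case it is useful for scripting against the list above, here is a minimal sketch of how I pull the FW ids out of that log line. The function name parse_inconsistent_fws and the shortened sample line are my own, not part of FireWorks; the sketch only assumes the "Detected N inconsistent FWs: [...]" format shown in the output.

```python
import re
from typing import List


def parse_inconsistent_fws(log_line: str) -> List[int]:
    """Extract FW ids from a 'Detected N inconsistent FWs: [...]' log line.

    Returns an empty list if the line does not match that format.
    Raises ValueError if the reported count and the parsed count differ.
    """
    match = re.search(r"Detected (\d+) inconsistent FWs: \[([^\]]*)\]", log_line)
    if match is None:
        return []
    expected = int(match.group(1))
    ids = [int(tok) for tok in match.group(2).split(",") if tok.strip()]
    if len(ids) != expected:
        raise ValueError(f"log reports {expected} ids but {len(ids)} were parsed")
    return ids


# Hypothetical, shortened sample line in the same format as the real output.
sample = "2016-04-29 11:39:27,585 INFO Detected 3 inconsistent FWs: [8042, 8040, 8038]"
print(parse_inconsistent_fws(sample))
```

The count check is there so that a truncated copy-paste of the log is caught rather than silently producing a partial list.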
Do you have any advice on a manual MongoDB repair at this point, or on what my other options are?
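To test my guess about a stale lock, this is a read-only sketch of what I would run against the database. It rests on assumptions I have not confirmed: that the lock is recorded as a "locked" field on documents in the "workflows" collection, and that each such document lists its FW ids under "nodes". The helper names, host, port and database name are placeholders of mine.

```python
from typing import Any, Dict, List

# Assumption: a held workflow lock shows up as a "locked" field on the
# workflow document. This is my reading, not something I have confirmed.
LOCK_QUERY: Dict[str, Any] = {"locked": {"$exists": True}}


def find_locked_workflows(workflows: Any) -> List[List[int]]:
    """Return the 'nodes' list (FW ids) of every workflow doc carrying a lock flag.

    `workflows` is any object with a pymongo-style find(filter, projection).
    Nothing is modified; this only reads.
    """
    cursor = workflows.find(LOCK_QUERY, {"nodes": 1})
    return [doc.get("nodes", []) for doc in cursor]


def check_live_db(host: str, port: int, db_name: str) -> List[List[int]]:
    """Run the check against a live MongoDB (placeholder connection details)."""
    from pymongo import MongoClient  # imported lazily; only needed for live use

    client = MongoClient(host, port)
    return find_locked_workflows(client[db_name]["workflows"])
```

For example, check_live_db("localhost", 27017, "fireworks") would list the node ids of any workflow that still carries the flag. If that returns entries while nothing is running, it would support the stale-lock guess; I have not tried clearing the flag by hand.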