Hi everyone,
I figured out a handy way to do cross-validation in ducttape; I'm not sure if others have used this approach before.
The idea is to chain a sequence of tasks, each with a corresponding branch point, that repeatedly split the data in half. The branch points let you choose which half will end up containing the evaluation data for a given fold (or group of folds); everything else gets lumped together for training. For 8-fold cross-validation:
task xsplitA : pyutil < traintags=@to_tags > batch0 batch1 {
  # split $traintags in half into $batch0 and $batch1
}
task xsplitB : pyutil < inbatch=(FoldA: 0=$batch0@xsplitA 1=$batch1@xsplitA) otherbatch=(FoldA: 0=$batch1@xsplitA 1=$batch0@xsplitA) > batch0 batch1 rest {
  # split $inbatch in half into $batch0 and $batch1
  cp $otherbatch $rest
}
task xsplitC : pyutil < inbatch=(FoldB: 0=$batch0@xsplitB 1=$batch1@xsplitB) otherbatch=(FoldB: 0=$batch1@xsplitB 1=$batch0@xsplitB) inrest=$rest@xsplitB > batch0 batch1 rest {
  # split $inbatch in half into $batch0 and $batch1
  cat $otherbatch $inrest > $rest
}
task xsplitD : pyutil < inbatch=(FoldC: 0=$batch0@xsplitC 1=$batch1@xsplitC) otherbatch=(FoldC: 0=$batch1@xsplitC 1=$batch0@xsplitC) inrest=$rest@xsplitC > traintags devtags {
  # the selected eighth becomes dev; everything else becomes train
  cat $otherbatch $inrest > $traintags
  cp $inbatch $devtags
}
task learn : pytagger < traintags=@xsplitD devtags=@xsplitD > devpredictions {
  # train on $traintags and predict on $devtags
}
This is a bit ugly, but it (a) parallelizes easily, and (b) avoids nasty bash bookkeeping for the train vs. eval data within each fold. Plus, it does not require any branch points with more than 2 branches; currently, branch points with more than a few branches are a known performance bottleneck during workflow analysis.
Cheers,
Nathan