Hi everyone,
I figured out a handy way to do cross-validation in ducttape; I'm not sure if others have used this approach before.
The idea is to chain a sequence of tasks, each with a corresponding branch point, that repeatedly split the data in half. The branch points let you choose which half will end up containing the evaluation data for a given fold (or group of folds); everything else gets lumped together for training. For 8-fold cross-validation:
task xsplitA : pyutil < traintags=@to_tags > batch0 batch1 {
  # split $traintags in half into $batch0 and $batch1
}
task xsplitB : pyutil < inbatch=(FoldA: 0=$batch0@xsplitA 1=$batch1@xsplitA) otherbatch=(FoldA: 0=$batch1@xsplitA 1=$batch0@xsplitA) > batch0 batch1 rest {
  # split $inbatch in half into $batch0 and $batch1
  cp $otherbatch $rest
}
task xsplitC : pyutil < inbatch=(FoldB: 0=$batch0@xsplitB 1=$batch1@xsplitB) otherbatch=(FoldB: 0=$batch1@xsplitB 1=$batch0@xsplitB) inrest=$rest@xsplitB > batch0 batch1 rest {
  # split $inbatch in half into $batch0 and $batch1
  cat $otherbatch $inrest > $rest
}
task xsplitD : pyutil < inbatch=(FoldC: 0=$batch0@xsplitC 1=$batch1@xsplitC) otherbatch=(FoldC: 0=$batch1@xsplitC 1=$batch0@xsplitC) inrest=$rest@xsplitC > traintags devtags {
  # the selected eighth becomes dev; everything else becomes train
  cat $otherbatch $inrest > $traintags
  cp $inbatch $devtags
}
task learn : pytagger < traintags=@xsplitD devtags=@xsplitD > devpredictions {
  # train on $traintags and predict on $devtags
}
This is a bit ugly, but it (a) parallelizes easily, and (b) avoids nasty bash bookkeeping for the train vs. eval data within each fold. Plus, it does not require any branch points with more than 2 branches; currently, branch points with more than a few branches are a known performance bottleneck during workflow analysis.
Cheers,
Nathan