>>>> Hey,
>>>>
>>>> So Victor convinced me to try out ducttape for my NAACL experiments. :)
>>>> Right now I have a pretty simple workflow—train/test/eval, with options
>>>> along the way for: number of training iterations, and which of 2 evaluation
>>>> sets to use. It looks like it should make my life easier, but I have a few
>>>> basic questions. Here's my tape-file-hacked-together-from-bash-scripts:
>>>>
>>>> global {
>>>> ducttape_structure=flat
>>>> ducttape_experimental_submitters=enable
>>>> }
>>>>
>>>> 1) Does the structure option matter? From my skim of the tutorial it sounded
>>>> like it controlled the directory structure, but either one would work.
>>>
>>> Yes! In general I recommend that you use the "hyper" structure (i.e. equivalent
>>> to not mentioning structure in your config), which allows you to use multiple
>>> realizations (experimental configurations) in your workflow. This is one of
>>> the major features of ducttape in my opinion and "flat" likely only makes
>>> sense for the simplest of use cases (and perhaps the tutorial). I'll ping
>>> you later about a way to clarify this in the tutorial.
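>>> To make that concrete, here's a minimal sketch (hypothetical task and file
>>> names) that only makes sense under the default "hyper" structure: the
>>> branch point gives you one output directory per realization.

```
# With the default "hyper" structure, ducttape runs this task once per
# branch of WhichIters and keeps the two outputs in separate directories.
task train < in=data.txt > model :: iters=(WhichIters: 100 200) {
  echo "trained for $iters iterations" > $model  # stand-in for real training
}
```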
>>>
>>>>
>>>> task train <
>>>> feats=ar-data/wiki.unlabeled.bio.head100k3.cut-f1-26,31,34.stemfeats
>>>> dict=ar-data/wiki.unlabeled.lexsstTagDictStrict > sstmodel :: T=81
>>>> .submitter=sge .k=oe .walltime="24:00:00" .q=default .vmem=15g .d="." {
>>>>
>>>> java -Xmx16g -Dfile.encoding=UTF-8 -classpath "bin:lib/*" \
>>>> edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger \
>>>> --trainOrTest train \
>>>> --modelFile $sstmodel \
>>>> --unlabeledFeatureFile $feats \
>>>> --useTagDictionary --tagDictionaryFile $dict --tagDictionaryKeyFields 28,15 \
>>>> --numTags $T \
>>>> ...more java options...
>>>>
>>>> }
>>>>
>>>> 2) I'm using the (experimental?) sge submitter. Will this work for the PBS
>>>> system on the cab cluster?
>>>
>>> SGE and PBS are very similar, but this probably won't work. I've just checked in
>>> a pbs.tape that you can just copy into your builtins/ directory
>>> (https://github.com/jhclark/ducttape/blob/master/builtins/pbs.tape). Since
>>> scheduler configurations tend to vary greatly from cluster to cluster,
>>> I generally find that I have to copy these things into my own userspace
>>> or just define a "submitter" block in my workflow that's customized to
>>> fit my current cluster's configuration.
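>>> Concretely, after copying pbs.tape into builtins/, the only change on the
>>> task side should be the dot-parameters (a sketch; the parameter names must
>>> match whatever pbs.tape actually declares):

```
# Hypothetical dot-parameters -- check pbs.tape for the names it expects.
task train < feats=... dict=... > sstmodel :: T=81
  .submitter=pbs .walltime="24:00:00" .q=default .vmem=15g {
  # ...same java command as before...
}
```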
>>>
>>>>
>>>> 3) Will the working directory change such that I need to notify java (with a
>>>> modified classpath, output directory options, etc.)?
>>>
>>> Yes. Ducttape creates a brand-spankin-new directory for each task
>>> and each realization (experimental configuration) of that task so that
>>> experiments don't step on each other. In general, when you're specifying
>>> the location to your software (e.g. your Java program which has been compiled
>>> into bin/), you use ducttape's "package" mechanism, which will keep track of
>>> what version of the software you used at each point in the workflow. Bug fixes
>>> and adding new features to your software tends to mean that you'll use several
>>> versions over the course of an experiment. The basic idea is that you
>>> declare a "package" block for each piece of software you want to use
>>> (e.g. ark_pos_tagger), you tell ducttape where to find the git/svn repo
>>> for that software, ducttape checks it out and records the revision number
>>> and builds it using the build commands you specify in the package block,
>>> and then when you specify "task train : ark_pos_tagger" you get a variable
>>> "$ark_pos_tagger" that points to the current version of your package.
>>>
>>> See https://github.com/jhclark/ducttape/blob/master/examples/cdec_kftt.tape
>>> for an example of using packages.
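>>> As a rough sketch of the shape (the repo URL and build command below are
>>> placeholders, not your real ones):

```
# Hypothetical repo URL and build step -- substitute your own.
package ark_pos_tagger :: .versioner=git .repo="git://example.org/ark-pos-tagger.git" .ref=HEAD {
  ant   # build commands run inside the fresh checkout
}

task train : ark_pos_tagger < feats=... dict=... > sstmodel :: T=81 {
  java -classpath "$ark_pos_tagger/bin:$ark_pos_tagger/lib/*" ...
}
```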
>>>
>>>>
>>>> # PBS flags for test:
>>>> #PBS -l cput=05:00:00,nodes=1,mem=8g
>>>> #PBS -k oe
>>>> #PBS -d .
>>>>
>>>> task test < model=$sstmodel@train
>>>>   feats=(WhichAnn:
>>>>     annA=ar-data/wiki.dev.sst.annA.feats.sym.cut-f1-26,31,34,37.stemfeats
>>>>     annB=ar-data/wiki.dev.sst.annB.feats.sym.cut-f1-26,31,34,37.stemfeats)
>>>>   > pred tokgoldpred chkeval {
>>>>
>>>> set -eu
>>>> for MODELITERS in 4000 2000 1000 400 200 100; do
>>>>
>>>> # e.g. ar-data/wiki.dev.sst.annA.feats.sym.cut-f1-26,31,34,37
>>>>
>>>> EVALDATAFILE=`basename $feats`
>>>> EVALDATAPREFIX=`echo $EVALDATAFILE | egrep -o '^[^\.]+\.[^\.]+\.'`
>>>> # e.g. wiki.dev.
>>>>
>>>> java -Xmx8g -Dfile.encoding=UTF-8 -classpath "bin:lib/*" \
>>>> edu.cmu.cs.lti.ark.ssl.pos.SemiSupervisedPOSTagger \
>>>> --trainOrTest test \
>>>> --testFeatureFile <(sed 's/\S\+$/O/' $feats) \
>>>> --runOutput $pred \
>>>> --modelFile ${model}_$MODELITERS \
>>>> --useTagDictionary --tagDictionaryFile ar-data/${EVALDATAPREFIX}lexsstTagDict --tagDictionaryKeyFields 28,15 \
>>>> ...more java options...
>>>>
>>>> paste <(awk ' { print ( $1 "\t" $NF ) }' $feats) <(awk ' { print ( $NF ) }' $pred) > $tokgoldpred
>>>>
>>>> ./chunkeval.py BIO $tokgoldpred > $chkeval
>>>>
>>>> done
>>>> }
>>>>
>>>> 4) I got warnings for MODELITERS, EVALDATAFILE, and EVALDATAPREFIX. Is it
>>>> bad to define and use bash variables within the task?
>>>
>>> No. This is a bug. You shouldn't be getting a warning here.
>>>
>>>>
>>>> 5) Can MODELITERS instead be specified as a parameter that takes one of
>>>> several values, without having to name each of them?
>>>
>>> Yes! Bash for loops in ducttape are an anti-pattern. Use "branch points"
>>> to run multiple experimental configurations.
>>>
>>> See https://github.com/jhclark/ducttape/blob/master/tutorial/03-01-hello-hyper.tape.
>>>
>>
>> That example has
>>
>> in=(WhichSize: smaller=small.txt bigger=big.txt)
>>
>> Would I have to say iters=(WhichIters: onehundred=100, twohundred=200,
>> ...)? It's pretty clunky to have to specify a name for each numeric
>> value.
>
> No need to be so verbose. Ducttape also understands: iters=(WhichIters: 100 200)
> which is equivalent to iters=(WhichIters: 100=100 200=200)
>
>>
>>>> 6) The ${model}_$MODELITERS construction is due to the fact that the
>>>> training step creates a bunch of intermediate files with the iteration
>>>> number appended to the end of the filename. Is this the right way to refer
>>>> to them?
>>>
>>> Not quite, but branch points will fix this. Once you have branch points,
>>> you can just say $model: "train" will get run once for each branch of your
>>> ModelIters branch point, "test" will also get run once for each of them,
>>> and "model" will always refer to the correct one (though it will refer to
>>> a different directory each time).
>>
>> The problem is that I don't want to rerun the training for different
>> numbers of iterations—I want to run it once for the max number, and
>> then refer to the intermediate files in different test scenarios.
>
> Ah. In that case, let me propose this design pattern: Move the MODELITERS
> from a loop into a branch point defined on the test task:
>
> task test
> < model=$sstmodel@train
> :: model_iters=(ModelIters: 4000 2000 1000 400 200 100) {
> # ...
> --modelFile ${model}_${model_iters}
> #...
> }
>
>
>>
>>>>
>>>> 7) Should I keep set -eu from my bash script in the task?
>>>
>>> No need. Ducttape adds 'set -ueo pipefail' by default.
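>>> In case it's useful, here's a standalone two-liner (not ducttape-specific)
>>> showing what the pipefail part buys you:

```shell
# Without pipefail, a pipeline's status is the LAST command's status,
# so a failure early in the pipe is invisible. pipefail fixes that.
bash -c 'false | cat; echo "default exit: $?"'
bash -c 'set -o pipefail; false | cat; echo "pipefail exit: $?"'
```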
>>>
>>>>
>>>> 8) Originally I tried to separate the end of the above task as a separate
>>>> task, like so (because the previous part is slower and I wanted to be able
>>>> to redo just this last part in the event that it fails):
>>>>
>>>> task eval < feats=$feats@test pred=$pred@test > tokgoldpred chkeval {
>>>> paste <(awk ' { print ( $1 "\t" $NF ) }' $feats) <(awk ' { print ( $NF ) }' $pred) > $tokgoldpred
>>>>
>>>> ./chunkeval.py BIO $tokgoldpred > $chkeval
>>>> }
>>>>
>>>> Unfortunately I can't refer to the same input file as the previous task (for
>>>> $feats). Is there a way to handle this?
>>>
>>> Yes. This is a good idea and I encourage that design pattern. But I'm a
>>> little fuzzy on what you're asking: do you want the following branch point
>>> to be used in both "test" and "eval"?
>>>
>>> feats=(WhichAnn:
>>> annA=ar-data/wiki.dev.sst.annA.feats.sym.cut-f1-26,31,34,37.stemfeats
>>> annB=ar-data/wiki.dev.sst.annB.feats.sym.cut-f1-26,31,34,37.stemfeats)
>>>
>>> If so, you could define a global variable and use it in both tasks:
>>>
>>> global {
>>> ann_feats_branch_point=(WhichAnn:
>>> annA=ar-data/wiki.dev.sst.annA.feats.sym.cut-f1-26,31,34,37.stemfeats
>>> annB=ar-data/wiki.dev.sst.annB.feats.sym.cut-f1-26,31,34,37.stemfeats)
>>> }
>>
>> I want to ensure that the eval step uses the same dataset as the test
>> step. I don't want two branch points because that would mean 4
>> combinations, two of which are useless. If I use a global variable, is
>> that equivalent to having two branch points?
>
> There are a couple of issues here, and I'm not sure which you mean.
>
> First, branch points defined in different places with the same name are
> treated as the same branch point -- once a branch is chosen for a branch
> point, that realization (experimental configuration) will use only that
> branch every time the branch point is seen later in the workflow. So defining
> the branch point twice and using a global variable are equivalent.
Oh! I didn't realize that branch point names were global. So I guess
this means I can do:
global {
  anns=(WhichAnn:
    annA=/mal2/nschneid/locally-normalized-EM-sequence-model/ar-data/wiki.dev.sst.annA.feats.sym.cut-f1-26,31,34,37.stemfeats
    annB=/mal2/nschneid/locally-normalized-EM-sequence-model/ar-data/wiki.dev.sst.annB.feats.sym.cut-f1-26,31,34,37.stemfeats)
}

task test < feats=$anns ...
task eval < feats=$anns ...
> Second, just because there are two branch points with two branches each
> doesn't mean ducttape will definitely run all 4 combinations. By default,
> it will run the "one off" configurations, meaning 3 situations:
> https://github.com/jhclark/ducttape/blob/master/tutorial/03-02-one-off.tape.
> But using "plans" you can run any arbitrary subset of experimental
> configurations that you want (see
> https://github.com/jhclark/ducttape/blob/master/tutorial/03-04-realization-plans.tape)
I didn't understand this part of the tutorial. Do I need to define a
custom plan as follows to get the full cross-product?:
plan All {
reach eval via (WhichIters: *) * (WhichAnn: *)
}
Would it go in a separate config file?
>>>>>>>> It could if you like. That's where I usually put it. It could also go in the main workflow itself.
>>
>>> Also, remember that "./chunkeval.py" should become something that refers
>>> to a software package (e.g. "task eval : chunkeval" ...
>>> "$chunkeval/chunkeval.py")
>>
>> Or if I don't want to mess with creating a package for a single
>> script, can I just use an absolute path?
>
> Sure. I discourage this design pattern for long-term usage, but if you
> want a quick hack, it will technically work.
>
>>
>>> On a side note, I discourage you from using "<(awk ' { print ( $NF ) }' $pred)".
>>> It's fancy bash and cool, but the bash implementation of process
>>> substitution sucks: if awk fails or $pred doesn't exist, you won't
>>> actually get a non-zero exit code and your task may fail silently.
>>> I recommend just doing it synchronously.
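>>> A standalone one-liner demonstrating the silent failure:

```shell
# The failure inside <(...) never reaches paste's exit status,
# so even 'set -e' sails right past it.
bash -c 'set -e; paste <(false) > /dev/null; echo "no error detected"'
```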
>>
>> You mean, create intermediate files like this?:
>>
>> awk ' { print ( $1 "\t" $NF ) }' $feats > file1
>> awk ' { print ( $NF ) }' $pred > file2
>> paste file1 file2 > $tokgoldpred
>
> Yes. It's more verbose, but bash will properly catch the errors.
>
>>
>>>>
>>>> Cheers,
>>>> Nathan
>>>
>>> Hope this helps,
>>> Jon