I'm using Luigi to build out a workflow that ends in training a statistical model with a lot of free parameters on some processed data.
Schematically, the workflow is
RawData -> ProcessedData -> TrainedModel
where:
RawData is an external dependency that only has a single parameter that points to the data location
- data_name
ProcessedData is a Task that has a few parameters, e.g.
- smoothing_amount
- smoothing_type
TrainedModel is a Task that has a separate set of parameters for the statistical model, e.g.
- alpha
- beta
- gamma
All parameters besides "data_name" have useful defaults.
Ideally, I'd like to be able to run something like
python pipeline.py TrainedModel --alpha=0.5 --smoothing_amount=3 --data_name=/path/to/data.csv
Right now, I'm making TrainedModel a subclass of ProcessedData, which is a subclass of RawData, and then doing a little bit of argument stuffing using get_params() and get_param_values() during initialization.
Is there a Luigi-approved way to pass parameters up a dependency chain without explicitly shuttling them around?
    # TaskB depends on TaskA
    # I'm using inheritance to get around copy/pasting all the parameters.
    # For the kinds of models I'm training, there can be a lot of knobs
    class TaskB(TaskA):
        ...
        def requires(self):
            # Get all the arguments in common between TaskA and TaskB
            common_params = list(set.intersection(set(TaskA.get_params()), set(self.get_params())))
            common_kwargs = dict([(key, self.param_kwargs[key]) for key in dict(common_params).keys()])
            vals = dict(self.get_param_values(common_params, [], common_kwargs))
            return TaskA(**vals)
        ...
Here's one pattern that we use:

    class FooParamsMixin(object):
        param1 = luigi.Parameter()
        param2 = luigi.Parameter()
        ...

        def foo_params(self):
            return {'param1': self.param1, 'param2': self.param2, ...}

    class TaskA(FooParamsMixin, luigi.Task):
        def requires(self):
            return TaskB(**self.foo_params())  # plus any other params

    class TaskB(FooParamsMixin, luigi.Task):
        pass

I'd be interested to hear if anyone else has a more succinct way to do this.
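One possible way to shorten the mixin pattern is to build the foo_params() dict from whatever parameters the mixin declares, instead of listing them by hand. The sketch below shows only that idea, and it deliberately does not import Luigi: Param and Task are hypothetical stand-ins for luigi.Parameter and luigi.Task (they are not Luigi's API), and the parameter names and defaults are made up for illustration.

```python
class Param:
    """Hypothetical stand-in for luigi.Parameter: stores only a default."""

    def __init__(self, default=None):
        self.default = default


class Task:
    """Hypothetical stand-in for luigi.Task: turns Params into values."""

    def __init__(self, **kwargs):
        # Use the supplied value if present, otherwise the declared default.
        # Keyword arguments that match no declared Param are ignored.
        for name, param in self.declared_params().items():
            setattr(self, name, kwargs.get(name, param.default))

    @classmethod
    def declared_params(cls):
        found = {}
        # Walk the MRO from base to derived so subclasses can override.
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, Param):
                    found[name] = value
        return found


class FooParamsMixin:
    param1 = Param(default="a")
    param2 = Param(default="b")

    def foo_params(self):
        # Collect every Param declared on the mixin itself, so a new knob
        # added to the mixin is forwarded without editing this method.
        return {
            name: getattr(self, name)
            for name, value in vars(FooParamsMixin).items()
            if isinstance(value, Param)
        }


class TaskB(FooParamsMixin, Task):
    pass


class TaskA(FooParamsMixin, Task):
    extra = Param(default=0)  # a parameter that only TaskA has

    def requires(self):
        # Forward only the shared parameters to the upstream task.
        return TaskB(**self.foo_params())


task_a = TaskA(param1="x", extra=5)
task_b = task_a.requires()
```

With real Luigi classes the isinstance check would presumably be against luigi.Parameter, but that is an assumption to verify against your Luigi version. Luigi also ships luigi.util.inherits and luigi.util.requires decorators aimed at this kind of parameter forwarding, so it is worth checking those before rolling your own.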