Dynamic requirements' luigi.Parameters are converted to strings


Noah Maze

Nov 25, 2015, 1:18:55 PM
to Luigi
I solved the most confusing bug yesterday: 

I have a task hierarchy that passes around an instance of a python class as a parameter.  This is handy because it reduces the burden of having to share every parameter with every task in the hierarchy.  It works okay for the most part, but recently I kept experiencing an issue where the object would be correctly passed to all but one task: one task was receiving a string like '<__main__.ObjectName instance at 0x30e4fe0>'.  

I scrutinized the code for literally hours, stepping through it with pdb and everything, trying to suss out why my object was getting converted to a string.  I finally figured out that the bug wasn't in my code but in the behavior of worker.py (I'm using version 1.1.2, but the most recent version of the code still does it).  If a task is dynamically required (i.e. it is yielded in the run method of another task), its parameters are serialized and then parsed again, and by default that round trip turns a luigi.Parameter's value into a string.  I knew that a parsing step would occur between contexts (e.g. from the command line to Python), but I never expected that it could occur within Python.
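A minimal sketch of the effect, without Luigi itself. The serialize and parse functions below are stand-ins that assume the base parameter serializes with str() and parses by returning the string unchanged; ObjectName is just a placeholder class.

```python
class ObjectName(object):
    """Placeholder for the object that is passed around as a parameter."""

    def __init__(self, start_month):
        self.start_month = start_month


def serialize(value):
    # Assumption: the default parameter serializes a value with str().
    return str(value)


def parse(text):
    # Assumption: the default parameter returns the string unchanged.
    return text


original = ObjectName("2015-01")
round_tripped = parse(serialize(original))

# The object goes in, but only its string representation comes out.
print(type(original).__name__)
print(type(round_tripped).__name__)
print(round_tripped)
```

The last line prints the default representation of the instance, which is the kind of string the affected task was receiving.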

I'm mostly writing this post because nothing came up when I googled about luigi parameters being converted to strings.  Here's how I solved the problem:
  1. Create a __str__ method on the object.  The output must contain all of the arguments necessary to re-create it.
    class ExampleObject(object):    
      """ Class to contain all the input related logic so it's all in one place """
      def __init__(self, start_month, end_month, foo):
        self.start_month = start_month
        self.end_month = end_month
        self.foo = foo
      
      def __str__(self):
        return "ExampleObject(%s,%s,%s)"%(self.start_month, self.end_month, self.foo)

  2. Create a custom parameter that inherits from luigi.Parameter and overrides the "parse" method.
    class ExampleObjectParameter(luigi.Parameter):
      def parse(self, x):
        # Make sure the incoming string matches str(ExampleObject())'s output
        assert x.startswith("ExampleObject("), "Parse failed: string does not start with \"ExampleObject(\"."
        # Skip the class name and the parens (the prefix "ExampleObject(" is 14 characters), then split the rest into arguments
        arguments = x[14:-1].split(',')
        # Make sure there are exactly 3 args
        assert len(arguments)==3, "Parse failed: Expected 3 arguments, but got %s"%(len(arguments))
        # Return the ExampleObject instance (note that every argument comes back as a string)
        return ExampleObject(arguments[0], arguments[1], arguments[2])
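The two steps can be checked together without Luigi. This is only a sketch: parse_example_object is a hypothetical free function that mirrors the parse method, and it raises ValueError instead of asserting so the check still runs under python -O.

```python
class ExampleObject(object):
    """Class to contain all the input related logic so it's all in one place."""

    def __init__(self, start_month, end_month, foo):
        self.start_month = start_month
        self.end_month = end_month
        self.foo = foo

    def __str__(self):
        return "ExampleObject(%s,%s,%s)" % (self.start_month, self.end_month, self.foo)


def parse_example_object(x):
    # Hypothetical helper mirroring ExampleObjectParameter.parse.
    prefix = "ExampleObject("
    if not (x.startswith(prefix) and x.endswith(")")):
        raise ValueError("Parse failed: unexpected format %r" % (x,))
    arguments = x[len(prefix):-1].split(",")
    if len(arguments) != 3:
        raise ValueError("Parse failed: expected 3 arguments, got %d" % len(arguments))
    return ExampleObject(*arguments)


original = ExampleObject(1, 12, "bar")
restored = parse_example_object(str(original))

# The string form survives the round trip, but the integers come back as strings.
print(str(restored) == str(original))
print(repr(restored.start_month))
```

Two limits of this format are worth keeping in mind: every argument comes back as a string, and a value that itself contains a comma will fail the argument count check.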


Erik Bernhardsson

Nov 25, 2015, 1:49:50 PM
to Noah Maze, Luigi
This sounds right – parameters are assumed to be (de)serializable in a few places, in particular the dynamic requirements since those are communicated to the server back and forth.

I don't think we should change that – BUT we should probably make sure to throw an exception if a parameter can't be serialized and make the base class not serializable (only subclasses such as IntParameter etc)
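A rough sketch of what that could look like. The class names here (StrictParameter, StrictIntParameter) are hypothetical and are not Luigi's actual classes; the point is only that the base class refuses to serialize, while a typed subclass defines a real round trip.

```python
class StrictParameter(object):
    """Hypothetical base class that refuses to (de)serialize."""

    def serialize(self, value):
        raise NotImplementedError(
            "%s does not define serialize(); use a typed subclass"
            % type(self).__name__)

    def parse(self, text):
        raise NotImplementedError(
            "%s does not define parse(); use a typed subclass"
            % type(self).__name__)


class StrictIntParameter(StrictParameter):
    """Hypothetical typed subclass with a well-defined round trip."""

    def serialize(self, value):
        return str(value)

    def parse(self, text):
        return int(text)


int_param = StrictIntParameter()
print(int_param.parse(int_param.serialize(42)) == 42)

try:
    StrictParameter().serialize(object())
except NotImplementedError as error:
    # The failure is loud and immediate instead of a silent str() conversion.
    print("refused: %s" % error)
```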




Erik Bernhardsson

Nov 25, 2015, 1:50:13 PM
to Noah Maze, Luigi
Would love it if you could put together a quick pull request, if that's possible.

Ruhan Bidart

Nov 26, 2015, 10:30:52 AM
to Luigi, nbm0...@gmail.com
Maybe you could use a JSON object instead of parsing the string manually. I thought about something like:

    import json
    import luigi

    class ExampleObject(object):
        def __init__(self, start_month, end_month, foo):
            self.start_month = start_month
            self.end_month = end_month
            self.foo = foo

        def __str__(self):
            # Serialize the constructor arguments as a JSON object
            return json.dumps({'start_month': self.start_month, 'end_month': self.end_month, 'foo': self.foo})

    class ExampleObjectParameter(luigi.Parameter):
        def parse(self, x):
            # Rebuild the object from the JSON produced by __str__
            return ExampleObject(**json.loads(x))



If you want to keep the object name (ExampleObject) in the __str__ of the object you might use a little more complex JSON serialization.
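For example, something along these lines might work. It is only a sketch: the "class" and "args" keys and the parse_example_object helper are made-up names for illustration, and the helper stands in for the parse method so the snippet runs without Luigi.

```python
import json


class ExampleObject(object):
    def __init__(self, start_month, end_month, foo):
        self.start_month = start_month
        self.end_month = end_month
        self.foo = foo

    def __str__(self):
        # Wrap the constructor arguments together with the class name.
        payload = {
            "class": "ExampleObject",
            "args": {
                "start_month": self.start_month,
                "end_month": self.end_month,
                "foo": self.foo,
            },
        }
        return json.dumps(payload, sort_keys=True)


def parse_example_object(x):
    # Hypothetical helper standing in for ExampleObjectParameter.parse.
    payload = json.loads(x)
    if payload.get("class") != "ExampleObject":
        raise ValueError("Parse failed: not an ExampleObject payload")
    return ExampleObject(**payload["args"])


original = ExampleObject(1, 12, "bar")
restored = parse_example_object(str(original))

# Unlike the comma-separated format, JSON keeps the integers as integers.
print(restored.start_month == 1 and restored.end_month == 12)
```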


BTW, good solution, and thanks for sharing the problem.


Cheers,

