Trying to output Gzip, getting error


Mike Massey

Jan 3, 2016, 6:29:32 PM
to Luigi
I have a CSV file that is UTF-16 encoded (output from SSIS). Before I can load it into my database, I need to clean up the formatting of the data a bit. The idea is to read the CSV with csv.DictReader, process it (formats, names, etc.), then write it out as a gzip. When I run my code, I get this error: "TypeError: a bytes-like object is required, not 'str'"

To load the file properly I had to use open(file, 'r', encoding='utf16'). When I first tried to implement this logic completely in Luigi, I got an error:

class ProcessFile(luigi.Task):

    filename = luigi.Parameter()
    filedate = luigi.DateParameter(default=(datetime.date.today() - datetime.timedelta(1)))
    filetype = luigi.Parameter(default='.csv')

    def requires(self):
        return GetFileFromFTP(self.filename, self.filedate, self.filetype)

    def output(self):
        return luigi.LocalTarget('output.csv')

    def run(self):
        r = self.input().open('r')
        csvread = csv.DictReader(r, delimiter='|', quotechar='"')

        w = self.output().open('w')
        headers = ['Column1', 'Column2', 'Column3', 'Column4',
                   'Column5', 'Column6', 'Column7', 'Column8',
                   'Column9', 'Column10', 'Column11', 'Column12', 'Column13']

        writer = csv.DictWriter(w, headers, extrasaction='ignore', delimiter='|', quoting=csv.QUOTE_ALL)
        writer.writeheader()
        for row in csvread:
            # A bunch of per-row cleanup logic goes here
            writer.writerow(row)

Error: " UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte "
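
A byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (0xff 0xfe) at the start of the file. A minimal sketch of that failure follows, using a made-up sample payload rather than the real SSIS file; try_decode is a hypothetical helper, not part of Luigi or the csv module.

```python
import codecs

# Hypothetical sample: a UTF-16 little-endian payload with a byte order mark.
data = codecs.BOM_UTF16_LE + 'Column1|Column2\r\n'.encode('utf-16-le')


def try_decode(raw, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return str(exc)


# Decoding as UTF-8 fails on the first BOM byte; decoding as UTF-16
# consumes the BOM and returns the text.
utf8_result = try_decode(data, 'utf-8')
utf16_result = try_decode(data, 'utf-16')
print(utf8_result)
print(repr(utf16_result))
```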

It seems that this is due to the file being in UTF-16 encoding, so I re-wrote the code like so:

class ProcessFile(luigi.ExternalTask):

    filename = luigi.Parameter()
    filedate = luigi.DateParameter(default=(datetime.date.today() - datetime.timedelta(1)))
    filetype = luigi.Parameter(default='.csv')

    def requires(self):
        return GetFileFromFTP(self.filename, self.filedate, self.filetype)

    def output(self):
        return luigi.LocalTarget('output.gz', format=luigi.format.Gzip)

    def run(self):
        file = self.filename + self.filedate.strftime("%Y%m%d") + self.filetype
        with open(file, 'r', encoding='utf16') as readfile:
            csvread = csv.DictReader(readfile, delimiter='|', quotechar='"')

            w = self.output().open('w')
            headers = ['Column1', 'Column2', 'Column3', 'Column4',
                       'Column5', 'Column6', 'Column7', 'Column8',
                       'Column9', 'Column10', 'Column11', 'Column12', 'Column13']

            writer = csv.DictWriter(w, headers, extrasaction='ignore', delimiter='|', quoting=csv.QUOTE_ALL)
            writer.writeheader()

            for row in csvread:
                row['Column14'] = datetime.datetime.strptime(row['Column13'], '%Y-%m-%d %H:%M:%S').strftime("%Y-%m-%d")
                if len(row['Column12']) == 29:
                    row['Column15'] = datetime.datetime.strptime(row['Column12'][:26], '%Y-%m-%d %H:%M:%S.%f').strftime("%Y-%m-%d")
                elif len(row['Column12']) == 19:
                    row['Column15'] = datetime.datetime.strptime(row['Column12'], '%Y-%m-%d %H:%M:%S').strftime("%Y-%m-%d")
                else:
                    row['Column15'] = datetime.datetime.strptime(row['Column12'], '%Y-%m-%d').strftime("%Y-%m-%d")

                # Clean up unformatted decimals:
                # if the decimal does not have a number before it, add a leading zero
                for key, value in row.items():
                    if value[:1] == '.' and value.replace('.', '', 1).isdigit():
                        row[key] = '0' + value

                writer.writerow(row)
            w.close()


When I run this, I get the error: "TypeError: a bytes-like object is required, not 'str'". I could add another Task to gzip the file, but I know Luigi can gzip a file as the output. Does anyone have an idea of what I am doing incorrectly?

Thanks


Erik Bernhardsson

Jan 3, 2016, 6:49:19 PM
to Mike Massey, Luigi
Where do you get the error? writer.writerow(row)?


--
You received this message because you are subscribed to the Google Groups "Luigi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to luigi-user+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mike Massey

Jan 3, 2016, 6:59:49 PM
to Luigi, mikeb...@gmail.com
I get the error with this output:

    def output(self):
        return luigi.LocalTarget('output.gz', format=luigi.format.Gzip)

If I output a CSV, it works fine:

    def output(self):
        return luigi.LocalTarget('output.csv')

Erik Bernhardsson

Jan 3, 2016, 8:40:44 PM
to Mike Massey, Luigi
Can you provide a traceback? I'm still confused where the error occurs

Mike Massey

Jan 3, 2016, 9:17:31 PM
to Luigi, mikeb...@gmail.com
Here is the line of code:
return luigi.LocalTarget('output.gz', format=luigi.format.Gzip)

Here is the traceback:

Traceback (most recent call last):
  File "/Users/foo/Python/python3.5.1/lib/python3.5/site-packages/luigi/worker.py", line 162, in run
    new_deps = self._run_get_new_deps()
  File "/Users/foo/Python/python3.5.1/lib/python3.5/site-packages/luigi/worker.py", line 113, in _run_get_new_deps
    task_gen = self.task.run()
  File "run_luigi.py", line 181, in run
    writer.writeheader()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/csv.py", line 142, in writeheader
    self.writerow(header)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/csv.py", line 153, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/Users/foo/Python/python3.5.1/lib/python3.5/site-packages/luigi/format.py", line 183, in write
    self._process.stdin.write(*args, **kwargs)
TypeError: a bytes-like object is required, not 'str'



Brian Bloniarz

Jan 4, 2016, 2:35:25 PM
to Mike Massey, Luigi
The Luigi Gzip format in Python 3 expects its inputs and outputs to be bytes, but the csv library expects its output file to be opened in text mode, so csv ends up writing str to a bytes-only handle.

To make the types compatible, try w = io.TextIOWrapper(self.output().open('w')) as your writable handle.
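
Here is a rough, Luigi-free sketch of the same idea. It assumes gzip.open(path, 'wb') as a stand-in for the bytes-only handle that self.output().open('w') returns with the Gzip format; write_gzip_csv and the sample row are made up for illustration.

```python
import csv
import gzip
import io


def write_gzip_csv(path, headers, rows):
    """Write dict rows to a gzip-compressed, pipe-delimited CSV at path."""
    # gzip.open(..., 'wb') only accepts bytes, like the handle in the traceback.
    raw = gzip.open(path, 'wb')
    # TextIOWrapper encodes each str to bytes before passing it on;
    # newline='' leaves the line endings to the csv module.
    with io.TextIOWrapper(raw, encoding='utf-8', newline='') as text:
        writer = csv.DictWriter(text, headers, extrasaction='ignore',
                                delimiter='|', quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(rows)
    # Leaving the with block flushes the wrapper and closes the gzip stream.
```

Closing the wrapper also closes the handle underneath it, so a separate close on the underlying handle should not be needed afterwards.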

-Brian

Mike Massey

Jan 4, 2016, 3:34:48 PM
to Luigi, mikeb...@gmail.com
Thank you. This appears to work for my needs.

Also, I was able to rewrite the "read" part of the method like so:

readfile = io.TextIOWrapper(self.input().open('r'), encoding = 'utf16')
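
For anyone who wants to try the read side without Luigi, here is a small sketch of the same wrapping. io.BytesIO stands in for the bytes handle that self.input().open('r') gave me, and read_utf16_csv and the sample data are made up for illustration.

```python
import csv
import io


def read_utf16_csv(binary_handle):
    """Return the rows of a pipe-delimited UTF-16 CSV read from a bytes handle."""
    # The wrapper decodes UTF-16 (consuming the byte order mark) so that
    # csv.DictReader receives str lines instead of bytes.
    text = io.TextIOWrapper(binary_handle, encoding='utf-16', newline='')
    reader = csv.DictReader(text, delimiter='|', quotechar='"')
    return list(reader)


# Hypothetical sample payload; encoding with 'utf-16' prepends a byte order mark.
sample = io.BytesIO('Column1|Column2\r\n"a"|".5"\r\n'.encode('utf-16'))
rows = read_utf16_csv(sample)
print(rows)
```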

I appreciate the help.