Filtering records from a TSV file

Benjamin Bengfort

Nov 12, 2013, 1:56:03 PM

to dumbo...@googlegroups.com

Hello,

I have a task to filter a sample of records from a TSV data set. A random sample of IDs has been pre-calculated; if a record's ID is in the sample, the mapper should yield the record, otherwise the record is dropped. This works well enough, although I may have some questions about chaining the job that computes the random sample of IDs to the filter job, depending on how this conversation goes.

I am using the identity reducer, and I understand that I cannot simply yield a key, so I'd like to yield (row, None) and write it to output, but the output writer escapes the tab characters. Is there a way to force the output writer to write the unescaped string as the key (i.e. print str(row) rather than repr(row))?

Below are my mapper and reducer for reference:

class FilterMapper(object):

    def __init__(self):
        # Load the pre-computed sample of IDs to keep.
        with open(self.params['includes'], 'r') as includes:
            self.include = set(int(line.strip()) for line in includes)

    def __call__(self, key, value):
        parts = value.strip().split('\t')
        try:
            cid = int(parts[4])
        except (IndexError, ValueError):
            return  # skip short or malformed rows
        if cid in self.include:
            yield (cid, value)

class WriteReducer(object):

    def __call__(self, key, values):
        for val in values:
            yield (val, None)

Troubled output:

'9900025-001-XL\t1354074\tweb\tcustomer\t72220\t01MAY2012:00:00:00\t999\t1\tXL\t001\t3\t29.99'
'9900025-001-L\t1411713\tE4X\tE4X\t83722\t17JUL2012:00:00:00\t999\t2\tL\t001\t1\t41.99'
'9900025-001-S\t1581147\tweb\tcustomer\t11347\t31MAR2012:00:00:00\t999\t6\tS\t001\t1\t29.99'
'9900025-001-XL\t1826870\tE4X\tE4X\t83712\t13AUG2012:00:00:00\t999\t1\tXL\t001\t1\t29.99'
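
The escaped form in the output above can be reproduced in plain Python, independent of dumbo. This is a minimal sketch that only illustrates the difference between repr() and str() for a string containing tabs; the row is shortened from the sample data in this message, and the variable names are chosen for illustration.

```python
# A TSV row as it would arrive in the mapper (shortened from the sample data).
row = "9900025-001-XL\t1354074\tweb\tcustomer\t72220"

as_repr = repr(row)  # quoted, with each tab shown as the two characters \ and t
as_str = str(row)    # the raw text, with real tab characters

print(as_repr)
print(as_str)

# Splitting on a real tab only works on the unescaped form.
fields_from_str = as_str.split("\t")
fields_from_repr = as_repr.split("\t")
```

A downstream job that splits on tabs would see five fields in the str() form but a single field in the repr() form, which is why the escaped output is not usable as TSV input.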

Klaas Bosteels

Nov 13, 2013, 2:06:16 AM

to dumbo...@googlegroups.com

Hey Benjamin,

Is that output from a local run? You can run "dumbo cat" on that to get non-repr strings.

Also, if you want to get a TSV then you might want to do something like

yield parts[0], "\t".join(parts[1:])

and run with -outputformat text, since this output format always separates the key and value with a tab. (This assumes you skip the reducer, which doesn't seem to be necessary as far as I can tell.)

-K
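
The key/value split suggested above can be tried locally without Hadoop. The following is a rough sketch in plain Python, not dumbo code: filter_rows and to_text_line are hypothetical helper names, the rows are shortened from the sample data earlier in the thread, and to_text_line only imitates a writer that joins key and value with a tab.

```python
def filter_rows(lines, include, id_column=4):
    """Yield (key, value) pairs for rows whose ID column is in `include`."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        try:
            cid = int(parts[id_column])
        except (IndexError, ValueError):
            continue  # skip short or malformed rows
        if cid in include:
            # First field becomes the key, the remaining fields the value.
            yield parts[0], "\t".join(parts[1:])


def to_text_line(key, value):
    """Join key and value with a tab (assumed behaviour of a text writer)."""
    return key + "\t" + value


rows = [
    "9900025-001-XL\t1354074\tweb\tcustomer\t72220\t01MAY2012:00:00:00",
    "9900025-001-L\t1411713\tE4X\tE4X\t83722\t17JUL2012:00:00:00",
    "bad-row\twithout\tenough\tfields",
]
include = {72220}

kept = [to_text_line(k, v) for k, v in filter_rows(rows, include)]
```

If key and value are plain strings joined by a single tab, each kept line is identical to the original input row, which is the property the downstream TSV jobs need.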

--
You received this message because you are subscribed to the Google Groups "dumbo-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dumbo-user+...@googlegroups.com.
To post to this group, send email to dumbo...@googlegroups.com.
Visit this group at http://groups.google.com/group/dumbo-user.
For more options, visit https://groups.google.com/groups/opt_out.

Benjamin Bengfort

Nov 13, 2013, 7:39:19 AM

to dumbo...@googlegroups.com

Hi Klaas,

It's a run on the cluster. Unfortunately, I don't want to use an ancillary command (e.g. manually running dumbo cat and piping to another output file), since this is a preprocessing job for other jobs that expect TSV input. When the output of this job is sent to those jobs, the tabs are escaped as "\\t". I've updated the code as follows with your suggestions, but the repr output still remains:

9900025-001-XL '1354074\tweb\tcustomer\t72220\t01MAY2012:00:00:00\t999\t1\tXL\t001\t3\t29.99'
9900025-001-L '1411713\tE4X\tE4X\t83722\t17JUL2012:00:00:00\t999\t2\tL\t001\t1\t41.99'
9900025-001-S '1581147\tweb\tcustomer\t11347\t31MAR2012:00:00:00\t999\t6\tS\t001\t1\t29.99'
9900025-001-XL '1826870\tE4X\tE4X\t83712\t13AUG2012:00:00:00\t999\t1\tXL\t001\t1\t29.99'

class FilterMapper(object):

    def __init__(self):
        # Load the pre-computed sample of IDs to keep.
        with open(self.params['includes'], 'r') as includes:
            self.include = set(int(line.strip()) for line in includes)

    def __call__(self, key, value):
        parts = value.strip().split('\t')
        try:
            cid = int(parts[4])
        except (IndexError, ValueError):
            return  # skip short or malformed rows
        if cid in self.include:
            # First field is the key; the remaining fields form the value.
            yield (parts[0], '\t'.join(parts[1:]))

def runner(job):
    job.additer(FilterMapper)

def starter(program):
    includes = program.delopt("includes")
    if not includes:
        raise Error("Must specify an includes file with -includes")
    program.addopt("param", "includes=" + includes)
    program.addopt("outputformat", "text")
    program.addopt("numreducetasks", 0)

if __name__ == '__main__':
    main(runner, starter)

--

Best Regards,
Benjamin Bengfort

Klaas Bosteels

Nov 18, 2013, 2:38:07 AM

to dumbo...@googlegroups.com

Hmm, weird. How are you printing the contents of the output file(s)?

Benjamin Bengfort

Nov 18, 2013, 7:30:00 AM

to dumbo...@googlegroups.com

I'm just yielding out of the mapper/reducer. If you mean how I am viewing the files in HDFS, I'm using dumbo cat, but honestly, anything, including hadoop fs -cat, must work for this to be usable by other jobs.

It's no problem: this is an easy task to write in Java, it's just far faster and easier to write it in Python!

While I'm at it, here's a slightly different question on the same topic.

Suppose I have dumbo write out TypedBytes of Python objects. When you chain MapReduce jobs with job.additer, the subsequent jobs all appear to get the correct types from the output_pre* files. However, if you run a separate job on this output, you have to use literal_eval. I suspect I could just unpickle rather than using ast; is that true?

Ben
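
The two decoding routes mentioned in the question above can be compared in plain Python. This sketch does not establish what dumbo actually writes between jobs; it only shows that ast.literal_eval parses repr() text, while pickle.loads expects data produced by pickle.dumps, so the two are not interchangeable. The record tuple is made up for illustration.

```python
import ast
import pickle

# A hypothetical record, loosely modelled on the fields in the sample data.
record = ("9900025-001-XL", 72220, 29.99)

# Text route: repr() produces a Python literal, ast.literal_eval() parses it back.
as_text = repr(record)
from_text = ast.literal_eval(as_text)

# Pickle route: pickle.dumps() produces pickle bytes, pickle.loads() parses them back.
as_pickle = pickle.dumps(record)
from_pickle = pickle.loads(as_pickle)
```

Both routes recover the original tuple, but the serialized forms differ, so whether unpickling is an option depends on which form the earlier job wrote.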