I'm new to Disco and I'm struggling :(
I have an input file in JSON format which is basically a list of objects:
[
  {"id": 1,
   "value": 121},
  {"id": 2,
   "value": 32},
  {"id": 3,
   "value": 6656}
]
I am trying to write a custom map_reader function that yields single objects from the JSON file.
If I use the file directly as job input, SomeJobClass().run(input=["file://path"]), I am able to write a custom_json_reader() function that yields single objects when I pass it as map_reader=custom_json_reader.
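For reference, my custom_json_reader looks roughly like this (a sketch; I'm assuming the (stream, size, url, params) argument convention that the task_io readers use, and json_reader is just the name of my own module):

```python
import json

def custom_json_reader(stream, size, url, params):
    # Read the whole input, parse it as one JSON list,
    # then yield the objects one at a time.
    data = json.loads(stream.read())
    for obj in data:
        yield obj
```

With a file:// input this yields each {"id": ..., "value": ...} dict as a separate map entry.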
However, if I push chunked data into DDFS and use the tag as input, SomeJobClass().run(input=["tag://data:log"]), I can't get it to work.
As far as I understand the documentation, DDFS data is stored in binary chunks, so I have to use disco.worker.task_io.chain_reader.
So I have the following class:
from disco.core import Job
from disco.worker.task_io import chain_reader
from json_reader import custom_json_reader

class SomeJobClass(Job):
    map_input_stream = [chain_reader, custom_json_reader]

    @staticmethod
    def map(entry, params):
        pass

    @staticmethod
    def reduce(iter, out, params):
        pass
I cannot figure out how to get the data out of DDFS through chain_reader and into custom_json_reader so that it yields single objects from the JSON file. The custom_json_reader itself works fine when I use the file directly as input. How do I make it work when the data is in DDFS?