working around issue #50 (special prefix characters, and cascading.avro)


Mike Stanley

Apr 25, 2012, 4:56:47 PM
to cascalog-user
hey all,

I'm attempting to use Avro files as source and sink taps in some cascalog jobs.  I'm running up against the same issue described in this ticket https://github.com/nathanmarz/cascalog/issues/50 and this email thread http://groups.google.com/group/cascalog-user/browse_thread/thread/a0e308b286234975

Illegal initial character: ?id
  [Thrown class org.apache.avro.SchemaParseException]

Restarts:
 0: [QUIT] Quit to the SLIME top level
 1: [ABORT] ABORT to SLIME level 0

Backtrace:
  0: org.apache.avro.Schema.validateName(Schema.java:1079)
  1: org.apache.avro.Schema.access$200(Schema.java:77)
  2: org.apache.avro.Schema$Field.<init>(Schema.java:410)
  3: org.apache.avro.Schema$Field.<init>(Schema.java:406)
  4: com.bixolabs.cascading.avro.AvroScheme.generateSchema(AvroScheme.java:263)
  5: com.bixolabs.cascading.avro.AvroScheme.getSchema(AvroScheme.java:230)
  6: com.bixolabs.cascading.avro.AvroScheme.sinkInit(AvroScheme.java:118)
  7: cascading.tap.Tap.sinkInit(Tap.java:196)
  8: cascading.tap.Hfs.sinkInit(Hfs.java:332)
  9: cascading.flow.FlowStep.initFromSink(FlowStep.java:408)



Does anyone have any suggestions on how to work around this?

The only thing I can think of doing is patching AvroScheme to strip the Cascalog var prefixes (but that seems like a big hack, and I would prefer to work around it locally rather than patching a dependency if possible).
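For context, Avro requires names to start with a letter or underscore and to contain only letters, digits, and underscores, which is why a field called "?id" is rejected. A minimal sketch of the "strip the prefix" idea in plain Clojure follows; avro-name? and strip-var-prefix are hypothetical helper names for illustration, not part of Cascalog or cascading.avro.

```clojure
(require '[clojure.string :as str])

;; Avro names must match [A-Za-z_][A-Za-z0-9_]*, so "?id" fails validation.
(defn avro-name?
  "True when s is a legal Avro name."
  [s]
  (boolean (re-matches #"[A-Za-z_][A-Za-z0-9_]*" s)))

;; Hypothetical helper: drop the leading Cascalog var markers (?, !, !!).
(defn strip-var-prefix
  "Removes any leading ? or ! characters from a Cascalog var name."
  [s]
  (str/replace s #"^[?!]+" ""))
```

Something like (map strip-var-prefix ["?id" "?title"]) would then produce names that avro-name? accepts.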

Thanks in advance,
... Mike  

Mike Stanley

Apr 25, 2012, 10:53:23 PM
to cascalog-user
I worked around this by adding alias support to the cascading.avro AvroScheme class.  I essentially added another constructor that takes a List of avroFieldNames that will be mapped to tuple names.  Then I can do something like this in Cascalog:

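(let [fields (cascading.tuple.Fields. (into-array Comparable ["?id" "?title"]))
      types (into-array Class [Integer String])
      avro-fields ["id" "title"]
      ;; three-argument constructor from my patched cascading.avro, not the stock library
      scheme (com.bixolabs.cascading.avro.AvroScheme. fields types avro-fields)
      tap (lfs-tap scheme "output")]
  ;; tap can now be used as the sink for a query
  tap)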


I still have a little more debugging to work through, but this seems to be a reasonable enough approach right now.  I'll be pushing my changes to my cascading.avro fork and submitting a pull request to the cascading.avro team.  I'm still not sure it's the best solution to the problem, but it does seem like an OK way to go.

I may create a little Clojure/Cascalog DSL for working with Avro schemas, though (as the above is a bit clunky).

Cheers,
... Mike

Nathan Marz

Apr 26, 2012, 12:56:44 AM
to cascal...@googlegroups.com
That sounds like a good way to do it. You may want to take a look at making a "sink function". If you provide a function as a sink, Cascalog will call the function with the subquery being "sunk" and expects a pair of [tap new-subquery] back. So in this case, you could call get-out-fields on the subquery in your sink function and auto-set the aliases in the underlying Cascading Avro tap.
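A rough sketch of what such a sink function could look like, under stated assumptions: make-tap is a hypothetical function that builds the aliased Avro tap from the tuple field names and the Avro field names, and out-fields-fn stands in for get-out-fields so the snippet runs without Cascalog on the classpath.

```clojure
(require '[clojure.string :as str])

(defn avro-sink
  "Returns a sink function of one argument (the subquery being sunk).
   make-tap      - hypothetical fn [tuple-fields avro-fields] -> tap
   out-fields-fn - fn returning the subquery's output field names;
                   with Cascalog this would be get-out-fields (assumption)."
  [make-tap out-fields-fn]
  (fn [subquery]
    (let [fields  (vec (out-fields-fn subquery))
          ;; strip the leading ? / ! so the names are legal Avro names
          aliases (mapv #(str/replace % #"^[?!]+" "") fields)]
      ;; the pair [tap new-subquery]; the subquery is passed through unchanged
      [(make-tap fields aliases) subquery])))
```

The returned function would then be handed to Cascalog wherever a sink tap is expected, so the aliases never have to be spelled out by hand.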


--
Twitter: @nathanmarz
http://nathanmarz.com

Andrew Xue

Apr 28, 2012, 5:47:58 AM
to cascalog-user
hey Mike, check this out:

https://github.com/MaxPoint/cascading-avro

"RenamerScheme - used to coerce field names (mostly for using with Cascalog)"

Andrew Xue

Apr 28, 2012, 6:25:24 AM
to cascalog-user