Baffling error message


Pat Ferrel

May 23, 2013, 2:54:51 PM
to cascadi...@googlegroups.com
I am manipulating pairs of strings. I've created a PreferenceDatum type that encapsulates the pair of strings, and I use its fields pretty much all the way through the flow. But for some reason that baffles me, I get errors throughout. What are the rules for field names?

I get exceptions for the following:
PlannerException: could not build flow from assembly: [[formatting output text...][finderbots.utilities.MergeMinerDataWorkflow.createFlow(MergeMinerDataWorkflow.java:195)] unable to resolve output selector: [{2}:'PreferenceDatum-text user id', 'PreferenceDatum-text item id'], with incoming: [{2}:'PreferenceDatum-text user id', 'PreferenceDatum-text item id'] and declared: [{2}:'PreferenceDatum-text user id', 'PreferenceDatum-text item id']]
Huh?

Further down in the same exception trace I get:
Caused by: cascading.tuple.TupleException: field name already exists: PreferenceDatum-text user id

In every place that calls for a Fields instance, I'm passing in "MyDatum.FIELDS" (PreferenceDatum.FIELDS in the code below):

        RegexSplitter regexSplitter = new RegexSplitter( PreferenceDatum.FIELDS, "," );
        Pipe importPipe = new Each( "import preferences", new Fields( "line" ), regexSplitter, PreferenceDatum.FIELDS );

        Pipe switchIDsPipe = new Pipe( "switched to create followed-by relationship", importPipe );
        // error --> reported for the Each below
        switchIDsPipe = new Each(
            switchIDsPipe,
            PreferenceDatum.FIELDS,
            new SwitchIDsOperation( PreferenceDatum.FIELDS ),
            PreferenceDatum.FIELDS
        );

Clearly there are some rules about field names that I haven't absorbed. I'm treating them like a set of class names. Is there some rule that in certain cases they must be unique, like an instance name? Obviously passing MyDatum.FIELDS everywhere is wrong; maybe there must be new instances of the Fields class in some cases? Why is it bad for a field name to already exist?

Can't I just have one datum class and use it everywhere if there is no change in the types of data, or must I have new names in certain cases? If the latter, I have tried several methods of changing the names and the pattern escapes me.


Seth Rogers

May 23, 2013, 3:38:55 PM
to cascadi...@googlegroups.com
I would guess that Cascading is confused about whether you want to keep the "new" PreferenceDatum fields created by the function or the "old" PreferenceDatum fields passed into the function. You could use a different name for the new fields, or try Fields.SWAP to replace the old fields with the new ones. Also, I recommend using legal Java identifiers for your field names: no spaces or operators. You will need that for ExpressionFunction.

--Seth


From: Pat Ferrel <pat.f...@gmail.com>
To: "cascadi...@googlegroups.com" <cascadi...@googlegroups.com>
Sent: Thu, May 23, 2013 12:05:11 PM
Subject: Baffling error message
--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Ken Krugler

May 23, 2013, 3:40:26 PM
to cascadi...@googlegroups.com
Hi Pat,

Without digging into your specific issue, it looks like you've got field names with spaces in them, like 'PreferenceDatum-text user id'.

In general you want your field names to be valid Java identifiers. Additionally they shouldn't start with a capital letter, as that can throw off the Janino compiler used for ExpressionFunctions and ExpressionFilters.
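
A minimal sketch of the naming advice above, assuming only the two conventions stated in this reply: the name is a legal Java identifier, and it does not start with a capital letter. `FieldNameCheck` and `isSafeFieldName` are hypothetical names, not part of Cascading, and the check does not reject Java keywords.

```java
// Hypothetical helper, not part of Cascading: checks a field name against
// the two naming conventions suggested in this thread.
final class FieldNameCheck {
    private FieldNameCheck() {
    }

    /** True if the name is a legal Java identifier that does not start with an upper-case letter. */
    static boolean isSafeFieldName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        char first = name.charAt(0);
        // Convention from the thread: avoid a leading capital letter.
        if (Character.isUpperCase(first) || !Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Spaces, hyphens and other operators are not identifier characters.
        for (int i = 1; i < name.length(); i++) {
            if (!Character.isJavaIdentifierPart(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Under this check, names such as `userId` and `item_id` pass, while `PreferenceDatum-text user id` fails on several counts (leading capital, hyphen, spaces).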

-- Ken



--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Pat Ferrel

May 23, 2013, 4:14:49 PM
to cascadi...@googlegroups.com
OK on the Java-compatible names, thanks.

As for Fields.SWAP and the rest of the special field sets, I read the docs and am still scratching my head. I couldn't find a description of when field names are expected to be different. I haven't found a use for the names; they just seem to get in my way, failing at runtime. I guess I don't understand why field names matter if the incoming datum class is the same as the outgoing one. Pig, as I recall, defaults to using the same ids for incoming as outgoing unless you change them with 'AS', but I'm no Pig expert either.

It seems like you are saying that the incoming fields to a Pipe must have different names from the outgoing ones, and that the incoming must match the outgoing from the previous Pipe. I thought I tried that but will have another go.

It sounds like Fields.SWAP may be what I want, but do you give that as the incoming fields, the outgoing fields, or both?

Anyway thanks for the help.

Chris K Wensel

May 23, 2013, 4:32:19 PM
to cascadi...@googlegroups.com
You might play with the Impatient code to get up to speed.


But if you don't want to use field names, don't. They are not required; they are only there for humans.

But if you do use them, you cannot ever have two fields in the stream with the same name.
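
The uniqueness rule can be pictured with a toy model. The sketch below is plain Java and is not Cascading's `Fields` class; `UniqueFieldNames` is a hypothetical name, and the only thing borrowed from the thread is the wording of the error message. It also shows how a name is just a label for a position in the tuple.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy stand-in for a stream's field declaration; NOT Cascading's Fields class.
final class UniqueFieldNames {
    private final List<String> names = new ArrayList<>();

    /** Appends names in order, rejecting any name that is already present. */
    UniqueFieldNames append(String... more) {
        for (String name : more) {
            if (names.contains(name)) {
                throw new IllegalStateException("field name already exists: " + name);
            }
            names.add(name);
        }
        return this;
    }

    /** Returns the tuple position that a name stands for. */
    int positionOf(String name) {
        int position = names.indexOf(name);
        if (position < 0) {
            throw new IllegalArgumentException("no such field: " + name);
        }
        return position;
    }

    /** Read-only view of the declared names, in position order. */
    List<String> names() {
        return Collections.unmodifiableList(names);
    }
}
```

In this model, appending `userId` and `itemId` once succeeds, and appending `userId` a second time fails with the same kind of message Pat saw.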

ckw

Pat Ferrel

May 23, 2013, 4:39:08 PM
to cascadi...@googlegroups.com
I impatiently read the impatient code, docs, and blog posts but I missed that part (too impatient I guess). I'm using 1.2 and thought they were required--d'oh. Back to the drawing board, thanks.

Ken Krugler

May 23, 2013, 4:54:47 PM
to cascadi...@googlegroups.com
Hi Pat,

Meta-comment: you rarely (never?) need to pass explicit fields as the last parameter to Each().

By default Each() assumes Fields.RESULTS, which means whatever the operation used with Each declares as its output fields is what you'll get.

See below for more inline.

On May 23, 2013, at 11:54am, Pat Ferrel wrote:

> I am manipulating pairs of strings. I've created a PreferenceDatum type that encapsulates the pair of strings, and I use its fields pretty much all the way through the flow. But for some reason that baffles me, I get errors throughout. What are the rules for field names?
>
> I get exceptions for the following:
> PlannerException: could not build flow from assembly: [[formatting output text...][finderbots.utilities.MergeMinerDataWorkflow.createFlow(MergeMinerDataWorkflow.java:195)] unable to resolve output selector: [{2}:'PreferenceDatum-text user id', 'PreferenceDatum-text item id'], with incoming: [{2}:'PreferenceDatum-text user id', 'PreferenceDatum-text item id'] and declared: [{2}:'PreferenceDatum-text user id', 'PreferenceDatum-text item id']]
> Huh?
>
> Further down in the same exception trace I get:
> Caused by: cascading.tuple.TupleException: field name already exists: PreferenceDatum-text user id
>
> In every place that calls for a Fields instance, I'm passing in "MyDatum.FIELDS" (PreferenceDatum.FIELDS in the code below):
>
>         RegexSplitter regexSplitter = new RegexSplitter( PreferenceDatum.FIELDS, "," );
>         Pipe importPipe = new Each( "import preferences", new Fields( "line" ), regexSplitter, PreferenceDatum.FIELDS );

Here you don't need the final PreferenceDatum.FIELDS parameter to Each().

>         Pipe switchIDsPipe = new Pipe( "switched to create followed-by relationship", importPipe );
>         // error --> reported for the Each below
>         switchIDsPipe = new Each(
>             switchIDsPipe,
>             PreferenceDatum.FIELDS,

You don't need to pass PreferenceDatum.FIELDS, since every Tuple has these fields. You only need to use this optional parameter when you want to restrict what gets passed to the operation or filter.

>             new SwitchIDsOperation( PreferenceDatum.FIELDS ),

Your SwitchIDsOperation function should call super(PreferenceDatum.FIELDS) in its constructor (to declare what it emits).

And since (I assume) that's not dynamically configurable, there's no need to pass in PreferenceDatum.FIELDS as a constructor parameter.

>             PreferenceDatum.FIELDS
>         );

You don't need PreferenceDatum.FIELDS as the final parameter to Each().

> Clearly there are some rules about field names that I haven't absorbed.

It's not really about field names, but rather "field algebra", where you need to understand the interaction of the fields selector, what an operation declares as its resulting fields, and the special Fields constants you can use to alter the results.

-- Ken

> I'm treating them like a set of class names. Is there some rule that in certain cases they must be unique, like an instance name? Obviously passing MyDatum.FIELDS everywhere is wrong; maybe there must be new instances of the Fields class in some cases? Why is it bad for a field name to already exist?
>
> Can't I just have one datum class and use it everywhere if there is no change in the types of data, or must I have new names in certain cases? If the latter, I have tried several methods of changing the names and the pattern escapes me.




Pat Ferrel

May 23, 2013, 6:02:56 PM
to cascadi...@googlegroups.com
Whew, got it!

I picked the wrong version of the Each constructor to use. Also, using Fields.ALL in various places simplified things.

I still have to use names: I tried creating a Fields instance with no names and got complaints at runtime.

Thanks

Paul Baclace

May 23, 2013, 7:20:37 PM
to cascadi...@googlegroups.com
On 20130523 13:32, Chris K Wensel wrote:
> But if you don't want to use field names, don't. They are not required; they are only there for humans.
>
> But if you do use them, you cannot ever have two fields in the stream with the same name.

I think this should be highlighted more in the user guide, along with a discussion of how positional field addressing works with named fields. The diagrams used are nice, but they would be easier to grok if the syntactic correspondence for the diagram parts were explicit.

More user guide edit suggestions...

(User guide note: the term "result selector" is used once, but I think it was meant to be "output selector".)

The doc could also be clearer about the purpose of an output selector, ideally with a list of idioms; the description leaves open the possibility that fields could be renamed (or at least I have not resisted the temptation to try). I'm now thinking that the output selector should almost always be one of the predefined constants.

(User guide note: the scare quotes around "appended" in the description of the output selector really call out for more explanation. The source code for this is on the clever side.)

The baffling runtime error that had me looking at output selector source:  
    unable to resolve output selector: [{3}:'tmc', 'min', 'speed'], with incoming: [{3}:'tmc', 'min', 'speed'] and declared: [{3}:'tmc', 'min', 'speed']

with a deeper error that "tmc" was already in use, but that still did not make sense. Using the default output selector Fields.RESULTS made the error go away. I wonder if Fields.ALL would produce the same error.
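
One way to reason about this error is a simplified model of how an Each's output fields are derived from the incoming fields, the argument fields, and the fields the operation declares. The sketch below is plain Java, not Cascading's implementation: `OutputSelectorModel` is a hypothetical name, the `Selector` values only mirror the `Fields.RESULTS`, `Fields.ALL` and `Fields.SWAP` constants mentioned in the thread, and the rule that a shared name is a collision is taken from the "field name already exists" message. An explicit selector naming the same fields on both sides can be pictured as running into the same collision.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model only; NOT Cascading's planner code.
final class OutputSelectorModel {
    enum Selector { RESULTS, ALL, SWAP }

    static List<String> resolve(Selector selector, List<String> incoming,
                                List<String> arguments, List<String> declared) {
        switch (selector) {
            case RESULTS:
                // Only what the operation declares; incoming names cannot collide.
                return new ArrayList<>(declared);
            case ALL:
                // Incoming followed by declared; a shared name is a collision.
                return appendUnique(incoming, declared);
            case SWAP:
                // Drop the argument fields first, then append the declared ones.
                List<String> kept = new ArrayList<>(incoming);
                kept.removeAll(arguments);
                return appendUnique(kept, declared);
            default:
                throw new IllegalArgumentException("unknown selector: " + selector);
        }
    }

    private static List<String> appendUnique(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>(left);
        for (String name : right) {
            if (out.contains(name)) {
                throw new IllegalStateException("field name already exists: " + name);
            }
            out.add(name);
        }
        return out;
    }
}
```

With incoming, argument and declared fields all equal to `tmc, min, speed`, this model resolves `RESULTS` and `SWAP` cleanly and rejects `ALL`. That suggests, but does not prove, that `Fields.ALL` would hit the same error in the real planner; it would need to be tried against the Cascading version in use.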

Expanding the cookbook (idiom cheat sheet) could help best practices and maximize reusable code by limiting the named field coupling:
* example of best way to apply a function that adds a new field and is not hardwired to what field(s) it processes.
** currently have "Insert constant"
* example of best way to apply a function that replaces a field and renames it in the output without hardwiring the field names.
** when the meaning of a field changes, it is good to change the name in the same syntactic step
* for symmetry, example of best way to apply a function that modifies a field and leaves everything else alone (no field name changes, no new fields).
** for example, Rounding a number so that text output does not have 5.6666666667
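
As a rough illustration of the last idiom in the list (modify one field, leave names and positions alone), here is a plain-Java sketch. It assumes a tuple can be pictured as an ordered name-to-value map, which is a simplification made for this example and not a Cascading type; `RoundInPlace`, `round` and `roundField` are hypothetical names. The field to change is a parameter, so nothing is hardwired to a particular field name.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the ordered map stands in for a tuple with named fields.
final class RoundInPlace {
    private RoundInPlace() {
    }

    /** Rounds half-up to the given number of decimal places. */
    static double round(double value, int places) {
        return BigDecimal.valueOf(value).setScale(places, RoundingMode.HALF_UP).doubleValue();
    }

    /** Returns a copy of the tuple with one numeric field rounded; names and order are unchanged. */
    static Map<String, Object> roundField(Map<String, Object> tuple, String field, int places) {
        Object current = tuple.get(field);
        if (!(current instanceof Number)) {
            throw new IllegalArgumentException("not a numeric field: " + field);
        }
        Map<String, Object> copy = new LinkedHashMap<>(tuple);
        // put() on an existing key keeps that key's original position in a LinkedHashMap.
        copy.put(field, round(((Number) current).doubleValue(), places));
        return copy;
    }
}
```

Calling `RoundInPlace.round(5.6666666667, 2)` yields `5.67`, and `roundField` leaves every other field, and the field order, as it was.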

Another practical matter related to explaining field names and positions: using Rename() surprisingly changed the field position for me in my current project. The actual problem was that I normally define the field positions in my output taps, but when using TextLine() for output the ordering is implicit. (I am actually fixing up Cascading code written by someone else, a first! I have sped it up by 12x so far.)


Paul

Paul Baclace

May 31, 2013, 11:06:02 PM
to cascadi...@googlegroups.com
I did not see the userguide source in the github repo, so I don't see a
way to submit documentation changes, so I'm posting my suggestions here.

Remove red herring in userguide v2.1:

* sec. 3.3, page 13 "result selector" should be "output selector"

