Trying to use MarkovPairs but keep getting errors

Johan Gustavsson

Oct 5, 2012, 2:00:35 PM10/5/12
to dat...@googlegroups.com
Hi, I've been trying to use datafu.pig.stats.MarkovPairs for quite some time, but due to a lack of examples I have yet to have any success.
My code is something like:
    REGISTER /usr/lib/pig/datafu-0.0.4-cdh4.1.0.jar
    a = LOAD '/user/me/0' USING PigStorage('\t') AS (id, mess);
    b = FOREACH a GENERATE TOKENIZE(LOWER(mess), ' ') AS messgae;
    c = FOREACH b GENERATE datafu.pig.stats.MarkovPairs(mess);
    dump pairs;
any feedback on correct usage would be much appreciated.

Thanks
Johan

Matthew Hayes

Oct 5, 2012, 6:48:18 PM10/5/12
to dat...@googlegroups.com
Hi Johan,

TOKENIZE doesn't split on the empty string as far as I can tell; at least it didn't work when I tried it.  You can use STRSPLIT instead, but this returns tuples, and MarkovPairs wants bags.  You might think TOBAG would handle this, but it actually doesn't, because you have to explicitly list the fields.

Below is a working example.  It loads a dictionary of English words, counts the Markov pairs, and then lists the most common pairs beginning with 'q'.  As expected, among these the pair ('q','u') is the most prevalent.  To convert the tuple returned by STRSPLIT to a bag, I wrote a simple Python UDF.  It would actually be nice if MarkovPairs handled either tuples or bags as input to make this easier.  I'll take this as a TODO.


-- MarkovExample.pig
--
register 'udf.py' using jython as myfuncs;

-- default lookahead of 1
DEFINE markov datafu.pig.stats.MarkovPairs();

-- load a list of english words
words = LOAD 'english-words.txt' using PigStorage('\t') as (word);

-- convert each word to bag of characters
tokenized_words = FOREACH words GENERATE myfuncs.tobag(STRSPLIT(LOWER(word), '')) AS chars;

-- generate the markov pairs of characters
markov_pairs = FOREACH tokenized_words GENERATE markov(chars) as pairs;
markov_pairs = FOREACH markov_pairs GENERATE FLATTEN(pairs) as (c1, c2);

markov_pairs_grouped = GROUP markov_pairs BY (c1, c2);
pair_counts = FOREACH markov_pairs_grouped
  GENERATE group.c1 as c1, 
           group.c2 as c2,
           COUNT(markov_pairs) as cnt;

-- find most common characters following 'q'
q_pairs = FILTER pair_counts BY c1.c == 'q';
q_pairs = ORDER q_pairs BY cnt DESC;
q_pairs = LIMIT q_pairs 10;
DUMP q_pairs;

-- output:
-- ((q),(u),3538)
-- ((q),(a),22)
-- ((q),(i),16)
-- ((q),(e),5)
-- ((q),('),5)
-- ((q),(t),4)
-- ((q),(r),4)
-- ((q),(w),3)
-- ((q),(q),3)
-- ((q),(s),3)


-- udf.py
--
@outputSchema("c:bag{t:tuple(c:chararray)}")
def tobag(chars):
  return [(c,) for c in chars]
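
The pipeline above (split each word into characters, emit adjacent pairs, count them) can also be sketched in plain Python — a rough simulation of what the Pig script computes, not the DataFu implementation, and using a tiny made-up word list in place of english-words.txt:

```python
from collections import Counter

def markov_pairs(items):
    # pair each element with the one that follows it (lookahead of 1)
    return [(items[i], items[i + 1]) for i in range(len(items) - 1)]

# a tiny stand-in for the English word list
words = ["queen", "quick", "aqua"]

counts = Counter()
for word in words:
    counts.update(markov_pairs(list(word.lower())))

# most common pairs beginning with 'q'
q_pairs = sorted((p for p in counts if p[0] == "q"),
                 key=lambda p: -counts[p])
print(q_pairs[0], counts[q_pairs[0]])  # → ('q', 'u') 3
```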


Hope this helps,
Matt

Johan Gustavsson

Oct 6, 2012, 4:12:08 AM10/6/12
to dat...@googlegroups.com
Hi Matt,

Thanks for your quick feedback. While it helped me understand Pig better and also introduced me to Jython UDFs, it doesn't do what I want.
Maybe the MarkovPairs UDF isn't built for it, but from the name I was expecting it to be... I'm not trying to do this letter by letter but word by word.
For example, with the input "I have a dog", I wanted to generate the pairs {(i),(have)}{(have),(a)}{(a),(dog)}.
If this isn't possible with MarkovPairs, then I guess my only choice is to try to write my own UDF.

Thanks once again
Johan

Johan Gustavsson

Oct 6, 2012, 9:42:56 AM10/6/12
to dat...@googlegroups.com
Just thought I should add this in case someone else is looking for a solution in the future.


In case it gets deleted, the content is a custom UDF as follows:

package com.example;

import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SlidingTuple extends EvalFunc<DataBag> {

    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag inputBag = (DataBag) input.get(0);
            DataBag result = null;
            if (inputBag != null) {
                result = bagFactory.newDefaultBag();
                Iterator<Tuple> it = inputBag.iterator();
                // guard against an empty bag, which would otherwise
                // throw NoSuchElementException on the first next()
                if (it.hasNext()) {
                    Tuple previous = it.next();
                    while (it.hasNext()) {
                        Tuple current = it.next();
                        Tuple tuple = tupleFactory.newTuple(2);
                        tuple.set(0, previous.get(0));
                        tuple.set(1, current.get(0));
                        result.add(tuple);
                        previous = current;
                    }
                }
            }
            return result;
        }
        catch (Exception e) {
            throw new RuntimeException("SlidingTuple error", e);
        }
    }
}
and then using it in the following manner:
A = LOAD '/user/hive/warehouse/twitter_raw/$date' USING PigStorage('\t') 
      AS (id:chararray,  mess:chararray);

B = foreach A generate TOKENIZE(mess, ' ') as words;
C = foreach B generate com.example.SlidingTuple(words);
Hope it helps someone.
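
The sliding-pair logic of the Java UDF above can be sketched in a few lines of Python (illustrative only; the UDF's null-bag check corresponds to the empty-list guard here):

```python
def sliding_pairs(items):
    # pair each element with its predecessor, like the UDF's while loop
    if not items:
        return []
    return list(zip(items, items[1:]))

print(sliding_pairs("I have a dog".lower().split()))
# → [('i', 'have'), ('have', 'a'), ('a', 'dog')]
```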

Matthew Hayes

Oct 6, 2012, 2:08:55 PM10/6/12
to dat...@googlegroups.com
Your "I have a dog" example is possible with MarkovPairs.  I hadn't noticed you had a space in your TOKENIZE call.  MarkovPairs doesn't know anything about how you split up the string; it just expects a bag.  If you tokenize a string into a bag of words, then the Markov pairs will be of words.

Matthew Hayes

Oct 6, 2012, 2:21:10 PM10/6/12
to dat...@googlegroups.com
Here is a working example for computing Markov pairs of words.  My "dog-story.txt" has a single line, "I have a dog".  This seems to be exactly what you were trying originally.  What errors were you getting before?  I don't see why it wasn't working for you.

-- load a story about a dog
words = LOAD 'dog-story.txt' using PigStorage('\t') as (text);

-- split the text into a bag of words
tokenized_words = FOREACH words GENERATE TOKENIZE(LOWER(text)) AS words;

-- generate the markov pairs of words
markov_pairs = FOREACH tokenized_words GENERATE datafu.pig.stats.MarkovPairs(words) as pairs;

DUMP markov_pairs;

-- output:
-- ({((i),(have)),((have),(a)),((a),(dog))})

Johan Gustavsson

Nov 14, 2012, 7:11:31 AM11/14/12
to dat...@googlegroups.com
Sorry for the late reply.
I'm not sure why it wouldn't work; in fact, I haven't been able to replicate the error, so it seems it was something in the data set I was using.
I was trying to use it with crawled Twitter feeds, and I guess there was something in there that caused an error.
Anyway, I'm very thankful for all your help and persistence on this problem.
Best regards
Johan