TOKENIZE doesn't split on the empty string as far as I can tell; at least it didn't work when I tried it. You can use STRSPLIT instead, but it returns tuples, and MarkovPairs wants bags. You might expect TOBAG to handle the conversion, but it doesn't, because it requires you to explicitly list the fields.
Below is a working example. It loads a dictionary of English words, counts the Markov pairs of adjacent characters, and then lists the most common pairs beginning with 'q'. As expected, ('q','u') is by far the most prevalent. To convert the tuple returned by STRSPLIT into a bag I wrote a simple Python UDF. It would be nice if MarkovPairs accepted either tuples or bags as input to make this easier; I'll take that as a TODO.
register 'udf.py' using jython as myfuncs;
-- default lookahead of 1
DEFINE markov datafu.pig.stats.MarkovPairs();
-- load a list of english words
words = LOAD 'english-words.txt' using PigStorage('\t') as (word);
-- convert each word to bag of characters
tokenized_words = FOREACH words GENERATE myfuncs.tobag(STRSPLIT(LOWER(word), '')) AS chars;
-- generate the markov pairs of characters
markov_pairs = FOREACH tokenized_words GENERATE markov(chars) as pairs;
markov_pairs = FOREACH markov_pairs GENERATE FLATTEN(pairs) as (c1, c2);
markov_pairs_grouped = GROUP markov_pairs BY (c1, c2);
pair_counts = FOREACH markov_pairs_grouped
              GENERATE group.c1 as c1,
                       group.c2 as c2,
                       COUNT(markov_pairs) as cnt;
-- find most common characters following 'q'
q_pairs = FILTER pair_counts BY c1.c == 'q';
q_pairs = ORDER q_pairs BY cnt DESC;
q_pairs = LIMIT q_pairs 10;
DUMP q_pairs;
-- output:
-- ((q),(u),3538)
-- ((q),(a),22)
-- ((q),(i),16)
-- ((q),(e),5)
-- ((q),('),5)
-- ((q),(t),4)
-- ((q),(r),4)
-- ((q),(w),3)
-- ((q),(q),3)
-- ((q),(s),3)
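For intuition, the same computation can be sketched in plain Python, with no Pig involved. This is just an illustration: the word list below is a made-up stand-in for english-words.txt, and "Markov pairs with a lookahead of 1" here simply means all adjacent character pairs.

```python
from collections import Counter

# Stand-in word list; the real script reads english-words.txt.
words = ["quick", "quiet", "unique", "queen", "aqua", "iraq"]

# Markov pairs with lookahead 1: every pair of adjacent characters.
pair_counts = Counter(
    (c1, c2)
    for word in words
    for c1, c2 in zip(word.lower(), word.lower()[1:])
)

# Most common characters following 'q', analogous to the FILTER/ORDER above.
q_pairs = [(pair, n) for pair, n in pair_counts.most_common() if pair[0] == "q"]
print(q_pairs)  # every 'q' in this toy list is followed by 'u'
```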
@outputSchema("c:bag{t:tuple(c:chararray)}")
def tobag(chars):
    # Wrap each character in a one-field tuple so Pig sees a bag of tuples.
    return [(c,) for c in chars]
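Outside of Pig the @outputSchema decorator isn't available, but the conversion itself is easy to check. A quick sketch (here `chars` stands in for the tuple that STRSPLIT would produce):

```python
# Plain-Python version of the UDF body, minus the Pig decorator.
def tobag(chars):
    # Each character becomes a single-field tuple, i.e. one bag element.
    return [(c,) for c in chars]

print(tobag(("q", "u", "i", "t")))  # [('q',), ('u',), ('i',), ('t',)]
```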