Word co-occurrence

375 views
Skip to first unread message

Philip Johnson

unread,
Oct 10, 2013, 2:47:15 PM10/10/13
to cascadi...@googlegroups.com
Hi all,

New to Cascading...

What's the best approach for word co-occurrence logic?   In particular:  given a set of documents and 2 sets of keywords, if  document contains any word from set A and contains any word from set B, then record (count) all occurrences of such word pairs, and move on to next document.  Easily done in plain Java but need some suggestion for Cascading.

Thanks!

Ken Krugler

unread,
Oct 10, 2013, 5:33:06 PM10/10/13
to cascadi...@googlegroups.com
On Oct 10, 2013, at 11:47am, Philip Johnson wrote:

Hi all,

New to Cascading...

What's the best approach for word co-occurrence logic?   In particular:  given a set of documents and 2 sets of keywords, if  document contains any word from set A and contains any word from set B, then record (count) all occurrences of such word pairs,

If either word can occur anywhere in the document, then what does it mean to count "word pairs"?

An example of (fake) input and expected results is always useful.

-- Ken


and move on to next document.  Easily done in plain Java but need some suggestion for Cascading.

Thanks!

--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
For more options, visit https://groups.google.com/groups/opt_out.

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Philip Johnson

unread,
Oct 11, 2013, 1:25:57 PM10/11/13
to cascadi...@googlegroups.com
Here's what I mean:

Sample inputs:
Documents: 
     D1 \t The quick brown fox jumped over the lazy dog.
     D2 \t The boy ran to the market.
     D3 \t The girl walked over to the cat.

Word Set A:    fox chicken hen dog cat
Word set B:    ran jumped walked crawled

Intended output tuples:
      D1, fox, jumped
      D1, dog, jumped
      D3, cat, walked

There are 10's of millions of documents, 100K word set A, 1K word set B.
Thanks for your help on this!
-Phil

Oscar Boykin

unread,
Oct 11, 2013, 2:07:44 PM10/11/13
to cascadi...@googlegroups.com, Argyris Zymnis
The answer to all things: scalding.  This could be done as a Matrix as well, starting with the same source variable:

// load the (document, line) pair
val source: TypedPipe[(String, String)] = TypedPipe.from(TypedTsv[(String, String)]("doc-line.tsv"))
val matrix = Matrix2(source.flatMap { case (doc, line) =>
  line.split(" ").map { word => (doc, word, 1L) }
})
(matrix.transpose * matrix).write(TypedTsv[(String, String, Long)]("output"))

/////////////////////

Or you can manually implement matrix multiplication if you prefer to ignore the relationship to matrices:

val docWord = source.flatMap { case (document, line) =>
  val tokens = line.split(" ") // crappy tokenizer
  tokens.map { word => (document, word)
}

val joined: TypedPipe[(String, (String, String))] =
  docWord.group.join( docWord.group ).toTypedPipe //inner join
// group on pairs and count:

val cooc: TypedPipe[((String,String), Long)]joined.groupBy { case (doc, wordPair) => wordPair }.size

//write it out:

cooc.map {case ((w1, w2), count) => (w1,w2,count) }
  . write(TypedTsv[(String, String, Long)]("output"))

With type safety all the way through!  The compiler will tell you if you mess up.



Oscar Boykin :: @posco :: http://twitter.com/posco

Ken Krugler

unread,
Oct 11, 2013, 2:16:52 PM10/11/13
to cascadi...@googlegroups.com
Hi Phil,

On Oct 11, 2013, at 10:25am, Philip Johnson wrote:

Here's what I mean:

Sample inputs:
Documents: 
     D1 \t The quick brown fox jumped over the lazy dog.
     D2 \t The boy ran to the market.
     D3 \t The girl walked over to the cat.

Word Set A:    fox chicken hen dog cat
Word set B:    ran jumped walked crawled

Intended output tuples:
      D1, fox, jumped
      D1, dog, jumped
      D3, cat, walked

So if D1 had two occurrences of the word "fox", would there still only be a single D1, fox, jumped result?

-- Ken

Ken Krugler

unread,
Oct 11, 2013, 2:18:24 PM10/11/13
to cascadi...@googlegroups.com
Hi Oscar,

From what I can tell, Phil only cares about words from the input docs that are found in his two restrictive sets of keywords.

I would expect the Scalding code below to calculate all word co-occurences.

-- Ken

Oscar Boykin

unread,
Oct 11, 2013, 2:23:24 PM10/11/13
to cascadi...@googlegroups.com
He could filter with at the end. Keep a list of all the single words that are in the pairs, .filter first on that before the join, then use the pair filter after the join.

Philip Johnson

unread,
Oct 11, 2013, 3:00:20 PM10/11/13
to cascadi...@googlegroups.com
Oscar, Ken,
Thanks for this assistance!  I'll study up on Scalding.
Just thinking about the scale of that matrix -- seems like it'll be pretty big for a matrix multiply, but then I'm not sure what's practical in either Cascading or Scalding.  If there were a "contains" operator in Cascading I'd be in good shape, in think.
Thx,
-Phil

Ken Krugler

unread,
Oct 11, 2013, 5:37:29 PM10/11/13
to cascadi...@googlegroups.com
Hi Phil,

On Oct 11, 2013, at 12:00pm, Philip Johnson wrote:

Oscar, Ken,
Thanks for this assistance!  I'll study up on Scalding.
Just thinking about the scale of that matrix -- seems like it'll be pretty big for a matrix multiply, but then I'm not sure what's practical in either Cascading or Scalding.  If there were a "contains" operator in Cascading I'd be in good shape, in think.

A much more verbose version in Java, but one that avoids generating all possible combinations of words…

public class WordCooccurrence extends SubAssembly {

    private static class SetFilter extends BaseOperation<NullContext> implements Filter<NullContext> {

        private Set<String> _validWords;

        

        public SetFilter(Set<String> validWords) {
            _validWords = validWords;
        }

        

        @SuppressWarnings("rawtypes")
        @Override
        public boolean isRemove(FlowProcess flowProcess, FilterCall<NullContext> filterCall) {
            return !_validWords.contains(filterCall.getArguments().getTuple().getString(1));
        }

    }

    

    public WordCooccurrence(Pipe documentsPipe, String docIdFieldname, String docTextFieldname, Set<String> listA, Set<String> listB) {

        

        // Filter out everything but what we need.
        documentsPipe = new Each(documentsPipe, new Fields(docIdFieldname, docTextFieldname), new Identity());

        // Parse the text into separate words. We could do this more efficiently by implementing a custom function that
        // outputs <doc id> and <word> as two fields.
        //
        // Normally you'd do something better here for parsing, and you'd want to normalize text
        documentsPipe = new Each(documentsPipe, new Fields(docTextFieldname), new RegexSplitGenerator(new Fields("listAword"), "\\s+"), Fields.ALL);

        

        // Filter out the original docTextFieldname.
        documentsPipe = new Each(documentsPipe, new Fields(docIdFieldname, "listAword"), new Identity());

        // We only care about unique terms.
        documentsPipe = new Unique(documentsPipe, new Fields(docIdFieldname, "listAword"));

        

        // Split, and only leave around words that exist in either set
        Pipe listAPipe = new Pipe("list A words", documentsPipe);
        listAPipe = new Each(listAPipe, new SetFilter(listA));

        Pipe listBPipe = new Pipe("list B words", documentsPipe);
        listBPipe = new Each(listBPipe, new SetFilter(listB));

        // Rename this right side to avoid field name collisions in the CoGroup
        listBPipe = new Rename(listBPipe, new Fields(docIdFieldname, "listAword"), new Fields("listBdocID", "listBword"));

        // Do a CoGroup with an inner join to get all the word combinations for a given document.
        Pipe resultsPipe = new CoGroup( listAPipe, new Fields(docIdFieldname),
                                        listBPipe, new Fields("listBdocID"),
                                        null,
                                        new InnerJoin());

        // Get rid of the extra listBdocID field that we don't need
        resultsPipe = new Each(resultsPipe, new Fields(docIdFieldname, "listAword", "listBword"), new Identity());

        setTails(resultsPipe);
    }
}

And a snippet of code that tests it:

    public void test() throws Exception {
        Tap sourceTap = new InMemoryTap(new Fields("docid", "text"));
        TupleEntryCollector writer = sourceTap.openForWrite(new LocalFlowProcess());
        writer.add(new Tuple("D1", "The quick brown fox jumped over the lazy dog"));
        writer.add(new Tuple("D2", "The boy ran to the market"));
        writer.add(new Tuple("D3", "The girl walked over to the cat"));
        writer.close();

        

        Set<String> listA = new HashSet<String>();
        Collections.addAll(listA, "fox chicken hen dog cat".split(" "));

        

        Set<String> listB = new HashSet<String>();
        Collections.addAll(listB, "ran jumped walked crawled".split(" "));

        

        Pipe p = new Pipe("cooccurrence test");
        p = new WordCooccurrence(p, "docid", "text", listA, listB);
        p = new Each(p, new Debug(true));

        

        FlowDef fd = new FlowDef();
        fd.addSource(p, sourceTap);

        

        fd.addTailSink(p, new NullSinkTap());
        new LocalFlowConnector().connect(fd).complete();

-- Ken

Philip Johnson

unread,
Oct 11, 2013, 9:49:49 PM10/11/13
to cascadi...@googlegroups.com
Ken,
above & beyond!  this is exactly the kind of direction I was hoping for, and then some --  thanks!
-phil

Ted Dunning

unread,
Oct 12, 2013, 1:38:38 AM10/12/13
to cascadi...@googlegroups.com

There is no need to filter at the end.

Create two matrices.  One is doc x classA, then other is doc x classB.  Obviously, the row indices have to be consistent.  Call the first matrix A, the second B.  It is best to binarize these matrices to avoid cartesian product excitement.

Now compute A' B to get the cross occurrence between the two sets.


komal latif

unread,
Mar 21, 2017, 4:23:05 AM3/21/17
to cascading-user
how its easily done in plain java? plz tell me if u have a code of calculating co-occurrence in plain java. i need
Reply all
Reply to author
Forward
0 new messages