Implementing my Hierarchical Coreference Inference in Factorie


Shyam Upadhyay

Apr 12, 2016, 7:39:02 PM
to Factorie
Hi,

I have a hierarchical clustering problem with three levels of clustering -- clusters of within-document mentions (lowest level), cross-document links between these clusters (middle level), and document clustering based on the previous two levels (topmost). I want to implement this in an alternating sampling fashion, similar to what was described in this paper. (Hopefully this is doable in Factorie.)

I can run the code for HierCoref using the instructions here. The problem is that I am not too familiar with Factorie's primitives, and understanding the code is somewhat difficult (my Scala is scratchy at best).
In particular,

1. What exactly is a canopy? Is it akin to cluster id?
2. What is the difference between CanopyPairGenerator and DeterministicPairGenerator?
3. What is a cubbie?
4. Why do you have nonexistentEnts in the CanopyPairGenerator? This seems strange, as I cannot see when a node would be a nonexistent mention.
5. How to propagate features from child to parent?
6. How to learn the model params and then use them for inference?


Is there someone who is currently maintaining the coreference code who can guide me? I will be extremely grateful for your help.


Thanks 

Nicholas Monath

Apr 20, 2016, 8:42:08 AM
to Factorie
Hello Shyam, 

Thanks for your email. I'm terribly sorry for the delayed response. I hope this helps clarify your questions:

1. What exactly is a canopy? Is it akin to cluster id?

A canopy is similar to (but not the same as) "blocking" as described in the final paragraph of this section of the record linkage Wikipedia article (https://en.wikipedia.org/wiki/Record_linkage#Probabilistic_record_linkage). The difference is that with blocking each point is placed in exactly one of the non-overlapping blocks, while with canopies each point can be assigned to multiple overlapping canopies.
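
To make the overlap concrete, here is a minimal plain-Scala sketch (not Factorie code; the token-based canopy function is a made-up example) showing how one mention can land in several canopies at once:

```scala
// Hypothetical illustration: one canopy key per lowercased name token.
// Unlike blocking, a mention with two tokens lands in two canopies.
object CanopyExample {
  def canopies(mention: String): Set[String] =
    mention.toLowerCase.split("\\s+").toSet

  def main(args: Array[String]): Unit = {
    val mentions = Seq("John Doe", "John Smith", "Mr Doe")
    // Group mentions by canopy key; "John Doe" appears in both the
    // "john" and "doe" canopies, so the canopies overlap.
    val byCanopy: Map[String, Seq[String]] =
      mentions.flatMap(m => canopies(m).map(k => (k, m)))
        .groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    byCanopy.foreach { case (k, ms) => println(s"$k -> ${ms.mkString(", ")}") }
  }
}
```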

2. What is the difference between CanopyPairGenerator and DeterministicPairGenerator?

The DeterministicPairGenerator requires you to specify the pairs that will be sampled ahead of time in the mentionSequence field, whereas the CanopyPairGenerator will randomly sample pairs of nodes from the same canopy.
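
As a rough sketch of the two strategies (plain Scala, not Factorie's actual implementation -- the names here are illustrative only):

```scala
import scala.util.Random

object PairGenSketch {
  // "Deterministic" style: the caller fixes the sequence of pairs up front.
  val mentionSequence: Seq[(String, String)] = Seq(("m1", "m2"), ("m2", "m3"))

  // "Canopy" style: sample a random pair of nodes sharing a canopy key.
  def samplePair(byCanopy: Map[String, Seq[String]], rng: Random): (String, String) = {
    val eligible = byCanopy.values.filter(_.size >= 2).toSeq
    val canopy = eligible(rng.nextInt(eligible.size))
    val shuffled = rng.shuffle(canopy)
    (shuffled(0), shuffled(1))
  }
}
```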

3. What is a cubbie?

A cubbie is a serializable data structure in Factorie. Please see the comments here for more description (https://github.com/factorie/factorie/blob/master/src/main/scala/cc/factorie/util/Cubbie.scala). There is also code to read/write cubbies to/from MongoDB. Each "slot" in the cubbie is a definition for a field in the class. An example cubbie might be:

import cc.factorie.util.Cubbie
class Point extends Cubbie {
  val x = new IntSlot("x_coord")
  val y = new IntSlot("y_coord")
  val label = new StringSlot("label")

  def this(x: Int, y: Int, label: String) = {
    this()
    this.x.set(x)
    this.y.set(y)
    this.label.set(label)
  }
}

class PointsCollection extends Cubbie {
  // Note that the second argument expects a function with zero arguments that returns a new Point()
  // This uses the default constructor for the class, which does not set anything by default.
  val pts = new CubbieListSlot[Point]("points", () => new Point())
}



4. Why do you have nonexistentEnts in the CanopyPairGenerator? This seems strange, as I cannot see when a node will be nonexistent mention.

It is my understanding that nonexistentEnts keeps track of the internal (non-mention) nodes that have no parent or child pointers and removes them from the candidate nodes to sample from. Such nodes can occur during the sampling procedure (see the note at the end of my email).

5. How to propagate features from child to parent?

The example code that you refer to in your email, the HierCorefDemo, is a good source for a way to do this. You'll see that there is a class WikiCorefVars which extends NodeVariables. This class represents the statistics stored at each node in the tree. The ++= and --= methods define how these statistics are propagated from one node to another. That is:

class WikiCorefVars(val names:BagOfWordsVariable, val context:BagOfWordsVariable, val mentions:BagOfWordsVariable, val truth:BagOfWordsVariable) extends NodeVariables[WikiCorefVars] with Canopy with GroundTruth {
  ....
  def ++=(other: WikiCorefVars)(implicit d: DiffList) {
    this.names.add(other.names.members)(d) // adds all of the names from other's bag of words to this node's bag of words
    this.context.add(other.context.members)(d)
    this.mentions.add(other.mentions.members)(d)
    this.truth.add(other.truth.members)(d)
    d += new NoopDiff(this) // needed to trigger unrolling on SingleBagTemplates like Entropy
  }
  ....
}


This describes the mechanics; you may adapt it however you like in terms of modeling.

6. How to learn the model params and then use them for inference?

To use the parameters, you will pass them into the CorefModel object and the respective feature templates, as is done in HierCorefDemo. You'll want to take a look at how the feature templates are defined in the CorefModel object in that demo and select/change these as needed for your application. Learning will depend on your problem specifically; the other Factorie tutorials provide an overview of doing learning in general.


Also, you will need to make a change to the "MoveGenerator" so that proposals fit the three levels of the model you describe. The hierarchical coref code here in Factorie is an extension of the paper you reference that allows arbitrarily deep trees to be inferred (see MoveGenerator and the classes that extend it). You'll also want to change the CanopyPairGenerator to only propose pairs that are valid in your model.
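
As a hypothetical sketch of what "valid pairs" might mean with three fixed levels (plain Scala, not the actual MoveGenerator API -- the Node class and level numbering are assumptions for illustration):

```scala
// Levels: 0 = within-doc mention cluster, 1 = cross-doc link, 2 = document cluster.
case class Node(id: String, level: Int)

object LevelConstrainedProposals {
  // A pair is a merge candidate only if both nodes sit at the same level
  // and that level still has a parent level above it.
  def validPair(a: Node, b: Node, maxLevel: Int = 2): Boolean =
    a.level == b.level && a.level < maxLevel

  // Filter candidate pairs before proposing moves on them.
  def candidates(pairs: Seq[(Node, Node)]): Seq[(Node, Node)] =
    pairs.filter { case (a, b) => validPair(a, b) }
}
```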

Hope this helps! 

Best,
Nick

Shyam Upadhyay

Apr 22, 2016, 10:13:36 AM
to Factorie
A few follow up questions ...

1. I have features defined for my mentions as follows (all features follow the same structure):

private Feature EntityMatch(Mention p1, Mention p2) {
    // do something
    return new RealFeature("entitymatch", val);
}

I noticed that your features are defined in NodeTemplates.scala. I want to port my features to the Factorie way of defining features. Can you suggest what I need to do?

2. Does the truth field in WikiCorefVars contain a cluster id? E.g., "John Doe", "Mr Doe", "John", etc. all have the field value "ENT10001" or something similar?

3. I understand that my Mention class is equivalent to your HCorefCubbie (the leaves of your cluster tree), and that I need to write my own WikiCorefVars class which represents the internal nodes.

Nicholas Monath

May 4, 2016, 9:44:22 AM
to Factorie
Hi Shyam, 

My apologies for the delayed response. I hope this helps answer your questions.

1. I have features defined for my mentions (all features follow the same structure). I noticed that your features are defined in NodeTemplates.scala. I want to port my features to the Factorie way of defining features. Can you suggest what I need to do?

I think you'll want to define a Vars object for your problem, which contains your different features stored as Variable objects. In the Factorie code, these are bag-of-words variables, but you should be able to make them booleans, real values, etc. Then you can define feature templates, as in NodeTemplates.scala, that score your Vars object using each of its internal Variable objects.
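
A rough, untested sketch of what such a Vars class could look like, following the shape of the WikiCorefVars example quoted earlier in this thread (RealVariable is an assumption about Factorie's variable classes, and any other abstract members NodeVariables requires are omitted here):

```scala
// Sketch only: substitute whatever variable types your Factorie version provides.
class MyCorefVars(val names: BagOfWordsVariable,
                  val entityMatch: RealVariable) // your real-valued feature
  extends NodeVariables[MyCorefVars] with Canopy {

  def ++=(other: MyCorefVars)(implicit d: DiffList) {
    this.names.add(other.names.members)(d)
    // Combine real-valued statistics however your model requires, e.g. summing:
    // this.entityMatch.set(this.entityMatch.value + other.entityMatch.value)(d)
  }

  def --=(other: MyCorefVars)(implicit d: DiffList) {
    this.names.remove(other.names.members)(d)
  }
}
```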

2. Does the truth field in WikiCorefVars contain a cluster id? E.g., "John Doe", "Mr Doe", "John", etc. all have the field value "ENT10001" or something similar?

If I understand your question correctly, yes. The "truth" field is there to store the ground truth cluster id.

3. I understand that my Mention class is equivalent to your HCorefCubbie (the leaves of your cluster tree), and that I need to write my own WikiCorefVars class which represents the internal nodes.

You don't need to use cubbies. They are used to serialize the tree structures, not for the actual computation; for efficiency reasons you don't want your Vars class to be a cubbie. If you do not need to store the trees on disk for your application, you don't need to implement this.

Best,
Nick