Analog for Mallet's "unsupported feature trick"?


Steve Ash

Jan 20, 2015, 9:04:09 AM1/20/15
to dis...@factorie.cs.umass.edu
Does Factorie have an analog to Mallet's "unsupported features trick"?  I'm trying to track down why I see lower accuracy in Factorie vs Mallet on an identical linear-chain model for structured prediction.  I've checked the features I'm adding in both, and both are using LBFGS + L2 regularization with a Gaussian variance of 2.0.  Factorie gets 0.928 accuracy, whereas Mallet gets 0.986.  The biggest difference I see is that Factorie's model.parameters.length returns 122,250 whereas Mallet reports 71,230.  Consequently Factorie takes longer to train as well, and then gets worse accuracy.

I'm using a linear-chain setup similar to the ChainNERExample setup with DenseTensors.  I'm not sure why Factorie is reporting _more_ features.  I was expecting that one difference might be Mallet's "unsupported features trick", which adds weights for features that aren't observed in the training data.  It does this by running inference over the training examples and adding weights for any transition that gets more than 0.20 probability of occurring.  I was thinking that this would explain the performance difference -- but I don't know why Factorie is reporting more.  Factorie is not reporting the count of "dense" weights, as that is in the millions.  So I'm a little confused.  I'll keep digging, but any insight anyone might have would be helpful.  I am obviously new to Factorie.
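[Editor's note: for readers unfamiliar with the trick being described, here is a rough Python sketch of the idea. The function names, data shapes, and toy numbers are illustrative, not Mallet's actual API: start from the (label-pair, feature) weights observed in gold data, then also allocate weights wherever a label pair's posterior marginal under the initial model exceeds a threshold.]

```python
THRESHOLD = 0.20

def supported_weights(sequences):
    """Weight keys for (prev_label, label, feature) triples seen in gold data."""
    keys = set()
    for labels, feature_lists in sequences:
        for t in range(1, len(labels)):
            for f in feature_lists[t]:
                keys.add((labels[t - 1], labels[t], f))
    return keys

def add_unsupported(keys, pair_marginals, feature_lists):
    """pair_marginals[t-1][(a, b)] ~ p(y_{t-1}=a, y_t=b | x), as would come
    out of forward-backward under the initial model."""
    for t, marg in enumerate(pair_marginals, start=1):
        for (a, b), p in marg.items():
            if p > THRESHOLD:
                # Allocate weights even though (a, b) never co-occurred
                # with these features in the gold labels.
                for f in feature_lists[t]:
                    keys.add((a, b, f))
    return keys

# Toy example: one training sequence over labels {"A", "B"}.
seqs = [(["A", "B"], [["f0"], ["f1"]])]
keys = supported_weights(seqs)                   # only ("A", "B", "f1")
marg = [{("B", "B"): 0.35, ("B", "A"): 0.05}]   # hypothetical marginals
keys = add_unsupported(keys, marg, [["f0"], ["f1"]])
print(sorted(keys))  # [('A', 'B', 'f1'), ('B', 'B', 'f1')]
```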

Thanks,
Steve

Emma Strubell

Jan 20, 2015, 3:03:27 PM1/20/15
to dis...@factorie.cs.umass.edu
Hi Steve,

I've never worked with Mallet, but it does seem strange that Factorie is giving you so many more features for what should be the same model. As far as I know there's no equivalent to the "unsupported features" trick built into Factorie (although you could implement it :). I can say that the significant difference in feature counts is probably causing the difference in accuracy.

One thing that could possibly be happening is that Mallet is applying some kind of feature-count cutoff, reducing the features to those observed at least twice in the training data. The Factorie equivalent of this is calling trimBelowCount(cutoff) on the dimensionDomain of the features domain (after computing the features), and this is not done in ChainNERExample.
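[Editor's note: a minimal Python sketch of that kind of count cutoff, for readers without Factorie at hand. Everything here is illustrative, not Factorie's trimBelowCount implementation: count how often each feature fires across the training tokens, keep only features seen at least `cutoff` times, and re-index the survivors.]

```python
from collections import Counter

def trim_below_count(token_features, cutoff=2):
    """Drop features seen fewer than `cutoff` times; re-map the rest to a
    compact index space, as a feature domain would after trimming."""
    counts = Counter(f for feats in token_features for f in feats)
    kept = {f for f, c in counts.items() if c >= cutoff}
    index = {f: i for i, f in enumerate(sorted(kept))}
    trimmed = [[index[f] for f in feats if f in index] for feats in token_features]
    return trimmed, index

tokens = [["WORD=the", "SHAPE=x"], ["WORD=dog", "SHAPE=x"], ["WORD=the"]]
trimmed, index = trim_below_count(tokens, cutoff=2)
print(index)    # {'SHAPE=x': 0, 'WORD=the': 1}  -- 'WORD=dog' seen once, dropped
print(trimmed)  # [[1, 0], [0], [1]]
```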

Another way to get similar accuracy with Factorie's larger feature set would be to try different variances for the L2 regularization. Although 2.0 may have been a good setting for the 70k-feature set, it may be less good for the larger 122k set.

Hope this helps,

Emma

--
Factorie Discuss group.
To post, email: dis...@factorie.cs.umass.edu
To unsubscribe, email: discuss+u...@factorie.cs.umass.edu

Steve Ash

Jan 21, 2015, 12:34:54 AM1/21/15
to dis...@factorie.cs.umass.edu
Thanks for the tips, Emma.  I did a little more digging.  For the sake of anyone trying to move their models from Mallet to Factorie, I'm going to describe some of my findings.  Any comments would be helpful.

In Mallet I was using the CRF#addOrderNStates method, which creates states for each _pair_ of labels observed in the training data.  There are 125 total output labels in my model and about 1,541 pairs of labels observed in any output sequence in the training data.  Each observed feature on such a pair then becomes a weight.  And then if you use the "unsupported features trick" (on by default), you get additional weights for any features on any pair of labels with an expectation > 0.2 (after running forward-backward with just the initial model).  That created a model with 71,230 weights in my particular case.

In Factorie, just following the ChainNERExample, I ended up with bias terms for each output label (still 125), a DenseTensor2[Observed,Label], and a DenseTensor2[Label,Label].  I had 852 features.  Since I started with DenseTensors, that gave me 125 + 125 * 852 + 125 * 125 = 122,250 parameters total.
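[Editor's note: the arithmetic above checks out for dense tensors -- a bias per label, a full label-by-feature matrix, and a full label-by-label transition matrix:]

```python
# Parameter count for the dense linear-chain setup described above:
# bias per label + [Label x Feature] matrix + [Label x Label] transitions.
n_labels, n_features = 125, 852
factorie_params = n_labels + n_labels * n_features + n_labels * n_labels
print(factorie_params)  # 122250, matching model.parameters.length
```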

So all of that makes sense.  I guess I can emulate what Mallet is doing by creating the template over the pair of labels instead of individual labels, using sparse tensors instead of dense tensors, only including weights for observed label x feature combinations, and implementing the unsupported features trick.  But I think trying to combine the pair of labels in the graph is going to break all of the existing evaluation tools that check accuracy.  Can I instead model the pair of labels + observed features as a Factor3?  I guess not without losing the linear-chain CRF niceness...?

I need to dig a little deeper to reconcile my understanding of how Mallet is modelling these output labels in the sequence model.
If anyone has any thoughts or tips, that would be appreciated.

Steve

Luke Vilnis

Jan 21, 2015, 2:24:54 PM1/21/15
to dis...@factorie.cs.umass.edu
Hi Steve, you are right that Factorie's chain model does not currently support edge features, only node features. This probably explains the discrepancy. It would be fairly simple to add them (I did something similar this summer), so this is a very reasonable and useful feature request (in fact there is already a switch in ChainModel, currently unimplemented, called "obsMarkov" for exactly this purpose).

Steve Ash

Jan 21, 2015, 2:29:01 PM1/21/15
to dis...@factorie.cs.umass.edu

I saw that obsMarkov in the POS example and was confused about how it was doing anything. I don't completely understand the differences between using a Model that implements the factors() method vs using templates and implementing unroll. Is the approach you took to use a Factor3 with the two labels and the observed variable? I'd like to try to implement this. Do you have tips or a sketch of what you did this summer?

Steve Ash

Jan 23, 2015, 9:16:04 AM1/23/15
to dis...@factorie.cs.umass.edu
Is it something as simple as:
* (similar to useObsMarkov) add a Factor3 over prevLabel, Label, FeatureVector -- in addition to the existing [Label, Label] factor, the [Label] bias-term factor, and the [Label, FeatureVector] factor
* During inference:
** send messages from the bias-term factor to the label variable
** send messages from the [Label, FeatureVector] factor to the label variable (the other edge of this factor goes to the fixed observed-feature variable)
** send messages from the [Label, Label, FeatureVector] factor, along the factor -> prevLabel edge, to the label variable
** send messages from the [Label, Label, FeatureVector] factor, along the factor -> Label edge, to the label variable
** send forward along the chain
** send backward along the chain
** send messages from the [Label, Label, FeatureVector] factor's Label -> factor edge back to the factor
** send messages from the [Label, Label, FeatureVector] factor's prevLabel -> factor edge back to the factor
** send messages from the [Label, FeatureVector] factor's variable back to the factor
** send messages from the label variable to the bias-term factor

Doesn't seem difficult to implement at all -- am I missing something? I'll try this tonight.  If it works, I'll submit a PR.
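[Editor's note: the core of what these messages compute -- a chain whose transition potential can depend on the observed features, i.e. a [prevLabel, Label, FeatureVector] factor alongside the node potentials -- can be sketched in a few lines. This is an illustrative log-space forward pass in Python, not Factorie's BP code; the toy potentials are made up.]

```python
import math

def forward_logZ(node_scores, edge_scores):
    """node_scores[t][y] is the local potential at position t;
    edge_scores[t][(y_prev, y)] is the transition potential into position t,
    already including any observation-dependent edge features there."""
    labels = range(len(node_scores[0]))
    alpha = list(node_scores[0])  # forward messages at position 0
    for t in range(1, len(node_scores)):
        alpha = [
            node_scores[t][y] + math.log(sum(
                math.exp(alpha[yp] + edge_scores[t][(yp, y)]) for yp in labels))
            for y in labels
        ]
    return math.log(sum(math.exp(a) for a in alpha))

# Two positions, two labels; uniform potentials give logZ = log 4.
node = [[0.0, 0.0], [0.0, 0.0]]
edge = [None, {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}]
print(round(forward_logZ(node, edge), 6))  # 1.386294
```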

Steve

Luke Vilnis

Jan 23, 2015, 4:32:23 PM1/23/15
to dis...@factorie.cs.umass.edu
PRs are always welcome! The way I would do it would be to just modify the current fast chain inference code with an extra argument that lets you modify the edge potentials as well as the local potentials.

When I implemented my version, I also added some new classes to generate arbitrary potentials for inference rather than including the Factor3s explicitly in inference like BP does. I'm not sure how it maps onto yours -- it sounds like you're using the standard Factorie BP rather than the fast chain inference code? One other issue is that you have to be smart about storing the weights to prevent the tensor from being too big. I believe I made my weights totally sparse to handle this.
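[Editor's note: one way to keep [prevLabel, Label, Feature] weights from blowing up, sketched in Python. This is a hypothetical helper, not Factorie's sparse tensor API: store weights in a hash map keyed by the triple rather than a dense L x L x F tensor, so memory scales with the triples actually touched during training.]

```python
class SparseEdgeWeights:
    """Sparse storage for observation-dependent transition weights."""

    def __init__(self):
        self.w = {}  # (prev_label, label, feature) -> weight

    def score(self, prev_label, label, active_features):
        # Untouched triples implicitly score 0 and cost no memory.
        return sum(self.w.get((prev_label, label, f), 0.0)
                   for f in active_features)

    def update(self, prev_label, label, active_features, step):
        for f in active_features:
            key = (prev_label, label, f)
            self.w[key] = self.w.get(key, 0.0) + step

weights = SparseEdgeWeights()
weights.update("B-PER", "I-PER", ["WORD=smith", "CAP"], 0.5)
print(weights.score("B-PER", "I-PER", ["CAP"]))  # 0.5
print(len(weights.w))  # 2 -- only the triples we touched
```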

P.S. Let me see if I can reconstruct some of what I did and put it on GitHub -- I modified a lot of code, though, and I didn't end up using the ChainModel class, so I'm not sure how helpful it will be.


Luke Vilnis

Feb 5, 2015, 5:52:59 PM2/5/15
to dis...@factorie.cs.umass.edu
Hi Steve,
Sorry for the delay in circling back to this issue. I have a pull request that may or may not be merged into Factorie, which has a quick, bare-bones way of getting sparse, efficient edge features. Combined with the small change to DotFactors I just pushed, this code should let you do what you were asking. Let me know if you have any questions.
Luke


Steve Ash

Feb 5, 2015, 7:02:41 PM2/5/15
to dis...@factorie.cs.umass.edu
Great! I tried getting the sparse stuff working in BP with mixed success... after banging my head against the many += overloads, some of which have different semantic meanings (eek!).  I ended up just going back to Mallet and doing what I was trying to do with the ACRF class.  I got poor results, but I think that has more to do with my task than with Mallet or Factorie.  I'll come back to Factorie soon and give this a shot.