Hello,
We're using Factorie to create a CRF for query understanding, to both segment
words into search phrases and label them in various categories. So we have
something like the following:
sealed trait Label
case object Name extends Label
case object Company extends Label
case object Skill extends Label
case object Location extends Label
//etc..
sealed trait Position
case object Begin extends Position
case object Internal extends Position
The CRF's hidden states are each a (Label, Position) pair. The Label
is the category of a phrase, and the Position marks phrase boundaries:
a phrase boundary always precedes a Begin Position and nothing else.
With Factorie we model this as follows, pretty much just following the
linear-chain CRF examples in the tutorials:
class Query extends Chain[Query,Token]
class Token(val word: String, label: Label, pos: Position)
  extends FeatureVectorVariable[String] with ChainLink[Token,Query] {
  // boilerplate like the example CRF code
}
object CRFLabelDomain extends CategoricalDomain[(Label, Position)]
class CRFLabel(label: Label, pos: Position, val token: Token)
  extends LabeledCategoricalVariable((label, pos)) {
  // boilerplate...
}
class FactorieCRF(val tokenDomain: CategoricalVectorDomain[String])
  extends TemplateModel with Parameters {
  object transition extends DotTemplateWithStatistics2[CRFLabel, CRFLabel] {
    val weights = Weights(new la.DenseTensor2(CRFLabelDomain.size,
      CRFLabelDomain.size))
    def unroll1(label: CRFLabel): Iterable[Factor] =
      if (label.hasPrev) Factor(label.prev, label) else Nil
    def unroll2(label: CRFLabel): Iterable[Factor] =
      if (label.hasNext) Factor(label, label.next) else Nil
  }
  object evidence extends DotTemplateWithStatistics2[CRFLabel, Token] {
    val weights = Weights(new la.DenseTensor2(CRFLabelDomain.size,
      tokenDomain.dimensionSize))
    def unroll1(label: CRFLabel): Iterable[Factor] = Factor(label, label.token)
    def unroll2(token: Token): Iterable[Factor] =
      throw new Error("Token values shouldn't change")
  }
  this += evidence
  this += transition
}
There are two ways I want to amend this model.
1) Rule out illegal state transitions. A transition from (Name, Internal)
to (Skill, Internal) is illegal. A transition to any Internal state can only
come from a previous state with an identical Label.
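To make the rule concrete, here it is as a plain Scala predicate, independent of Factorie. This is only a sketch of the constraint: the names TransitionRules, State and isLegal are made up for illustration, and the label list is just the subset shown above.

```scala
sealed trait Label
case object Name extends Label
case object Company extends Label
case object Skill extends Label
case object Location extends Label

sealed trait Position
case object Begin extends Position
case object Internal extends Position

// Hypothetical helper, not part of Factorie.
object TransitionRules {
  type State = (Label, Position)

  // A Begin state may follow any state; an Internal state must continue
  // a phrase that carries the same Label as the previous state.
  def isLegal(prev: State, next: State): Boolean = next match {
    case (_, Begin)        => true
    case (label, Internal) => prev._1 == label
  }
}
```

So isLegal((Name, Begin), (Name, Internal)) holds, while isLegal((Name, Internal), (Skill, Internal)) does not.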
I tried setting the illegal transition weights to -∞, but that caused
learning to crash. It seems like those transitions just should not be
parameters, so I think I should be able to do something like the following
(with unroll2 elided and label.prev expressed as an Option for clarity):
object transition extends Template2[CRFLabel, CRFLabel] with ??? {
  val weights = // some appropriately sized tensor
  def unroll1(label: CRFLabel): Iterable[Factor] = (label.prev, label) match {
    case (None, _) => Nil
    case (Some(fst), snd@(_, Begin)) => Factor(weights(fst, snd))
    case (Some(fst@(l1, _)), snd@(l2, Internal)) =>
      if (l1 == l2) Factor(weights(fst, snd))
      else Factor(Double.NegativeInfinity)
  }
}
Is this possible? I'm not sure how to correctly express this with Factorie.
This definitely doesn't seem like the right use of Family2.Factor. Instead
it looks like I should provide a score definition directly somehow, but
something about how to proceed here eludes me.
2) Because of limitations of our training data, we really can't estimate
transitions between phrases within a query. We can really only estimate
approximate distributions of likely segment lengths. So it seems reasonable to
only have parameters for (Name, Begin) -> (Name, Internal) and
(Name, Internal) -> (Name, Internal) transitions and replace the second
case in my pseudo-Scala above with:
case (Some(fst), snd@(_, Begin)) => 0.0
and have a transition parameter tensor with only 2*|Labels| weights.
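As a sketch of the parameterization I have in mind, again in plain Scala rather than Factorie: the object SegmentLengthSketch, the slot layout, and the functions weightIndex and transitionScore are all hypothetical, and a flat Array[Double] stands in for whatever tensor Factorie would want. The -∞ score for illegal moves is the same value that crashed learning when used as a weight, so it may need to be handled some other way.

```scala
object SegmentLengthSketch {
  sealed trait Label
  case object Name extends Label
  case object Company extends Label
  case object Skill extends Label
  case object Location extends Label

  sealed trait Position
  case object Begin extends Position
  case object Internal extends Position

  type State = (Label, Position)

  // Hypothetical fixed ordering of the labels (only the subset shown above).
  val labels: Vector[Label] = Vector(Name, Company, Skill, Location)

  // Slot of the weight for a transition, or None if it has no parameter.
  // Assumed layout, with i = labels.indexOf(l):
  //   slot 2*i     for (l, Begin)    -> (l, Internal)
  //   slot 2*i + 1 for (l, Internal) -> (l, Internal)
  def weightIndex(prev: State, next: State): Option[Int] = (prev, next) match {
    case ((l1, pos), (l2, Internal)) if l1 == l2 =>
      val offset = pos match { case Begin => 0; case Internal => 1 }
      Some(2 * labels.indexOf(l1) + offset)
    case _ => None
  }

  // 0.0 for any move into a Begin state, the learned weight for a legal
  // continuation, and negative infinity for an illegal continuation.
  def transitionScore(weights: Array[Double], prev: State, next: State): Double =
    next match {
      case (_, Begin) => 0.0
      case _ =>
        weightIndex(prev, next).map(weights(_)).getOrElse(Double.NegativeInfinity)
    }
}
```

With the four labels listed, the weight array has 2 * 4 = 8 entries, and every transition into a Begin state scores 0.0 without touching it.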
Any guidance as to how I can do this?
Thank you!
adam