Featurizers


Omid

Aug 28, 2014, 5:29:28 PM
to scalanlp...@googlegroups.com
Hi again, and sorry for asking so many questions. I want to use my own features. I am following the POSTagger example; this is the default wordFeaturizer in the CRF model:

def myFeaturizer(counts: Counter2[String, String, Double]) = {
  val dsl = new WordFeaturizer.DSL[String](counts)
  import dsl._
  (
    unigrams(word + clss, 1)
      + bigrams(clss, 2)
      + bigrams(tagDict, 2)
      + suffixes()
      + prefixes()
      + props
  )
}

If I want to add my own features, which are a list of String features for each word (e.g. a dependency path or any other string), what exactly should I add to the function? Here is my Map from words to features:

val myFeatures: Map[String, List[String]] // for each word, a list of string features

David Hall

Aug 28, 2014, 6:03:01 PM
to scalanlp...@googlegroups.com
Questions are good. They help me figure out where documentation is lacking, if nothing else.

You need to make a WordFeaturizer; look at IdentityWordFeaturizer (which isn't quite identity because of some Unking--and it's kind of ugly so I should rewrite it) which is a reasonable example. Probably an even better example is the BrownClusterFeaturizer.

val myFeaturizer = ???

You can add featurizers together with +, and (if the dsl is imported as above) you can get Cartesian products/cross products with *.
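
To make the two combinators concrete, here is a minimal, self-contained sketch of what + and * amount to. The types are simplified stand-ins and not Epic's actual classes: `SketchFeaturizer`, `Named`, `Crossed`, `wordId`, and `shape` are hypothetical names. It assumes that + concatenates the features of both featurizers at a position and that * emits one conjoined feature per pair.

```scala
object CombinatorSketch {
  // Simplified stand-ins for illustration only; these are NOT Epic's classes.
  trait Feature
  case class Named(name: String) extends Feature
  case class Crossed(left: Feature, right: Feature) extends Feature

  // A featurizer here is just: (sentence, position) => features.
  trait SketchFeaturizer { self =>
    def featuresFor(words: IndexedSeq[String], pos: Int): Seq[Feature]

    // Sum: the features of both featurizers, side by side.
    def +(other: SketchFeaturizer): SketchFeaturizer = new SketchFeaturizer {
      def featuresFor(words: IndexedSeq[String], pos: Int): Seq[Feature] =
        self.featuresFor(words, pos) ++ other.featuresFor(words, pos)
    }

    // Cross product: one conjoined feature per (left, right) pair.
    def *(other: SketchFeaturizer): SketchFeaturizer = new SketchFeaturizer {
      def featuresFor(words: IndexedSeq[String], pos: Int): Seq[Feature] =
        for {
          a <- self.featuresFor(words, pos)
          b <- other.featuresFor(words, pos)
        } yield Crossed(a, b)
    }
  }

  // Two toy featurizers: word identity and a crude capitalization class.
  val wordId: SketchFeaturizer = new SketchFeaturizer {
    def featuresFor(words: IndexedSeq[String], pos: Int): Seq[Feature] =
      Seq(Named("word=" + words(pos)))
  }
  val shape: SketchFeaturizer = new SketchFeaturizer {
    def featuresFor(words: IndexedSeq[String], pos: Int): Seq[Feature] =
      Seq(Named(if (words(pos).headOption.exists(_.isUpper)) "shape=Cap" else "shape=lower"))
  }

  val sentence = IndexedSeq("Epic", "works")
  val summed   = (wordId + shape).featuresFor(sentence, 0) // two separate features
  val crossed  = (wordId * shape).featuresFor(sentence, 0) // one conjoined feature
}
```

In this sketch the sum fires a word feature and a shape feature independently, while the product fires a single feature that is only active when both hold together, which is the usual reason to prefer * when the interaction matters.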

-- David


Omid

Aug 29, 2014, 11:28:07 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
I looked at the code. As far as I can see, a Feature takes a string value, like:

case class BrownClusterFeature(f: String) extends Feature

And a featurizer sometimes takes "lengths: Array[Int]" or "wordCounts: Counter[W, Double]" or a Counter2 ... What exactly is the input of a featurizer, and what does it do? Could you explain it a bit more and give an example, such as a featurizer that takes a word paired with its POS tag and makes a feature out of it?


Thanks

David Hall

Aug 30, 2014, 10:27:12 PM
to scalanlp...@googlegroups.com
A WordFeaturizer is just this interface:

trait WordFeaturizer[W] {
  def anchor(words: IndexedSeq[W]): WordFeatureAnchoring[W]
}

and WordFeatureAnchoring is just this interface:

trait WordFeatureAnchoring[W] {
  def words: IndexedSeq[W]
  def featuresForWord(pos: Int): Array[Feature]
}

So you want something like:

case class MyFeature(f: String) extends Feature

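// Assumes Epic's Feature, WordFeaturizer, and WordFeatureAnchoring are in scope.
class MyFeaturizer(features: Map[String, Seq[String]]) extends WordFeaturizer[String] {
  def anchor(w: IndexedSeq[String]): WordFeatureAnchoring[String] = new WordFeatureAnchoring[String] {
    def words: IndexedSeq[String] = w

    // Out-of-range positions and words missing from the map get no features.
    def featuresForWord(pos: Int): Array[Feature] =
      if (pos < 0 || pos >= words.length) Array.empty[Feature]
      else features.getOrElse(words(pos), Seq.empty).map(s => MyFeature(s): Feature).toArray
  }
}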
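
For trying this out without Epic on the classpath, here is a hedged, self-contained variant. The three traits are stand-ins copied from the interfaces quoted in this message (Epic's real traits may have more members), and the `myFeatures` table with its dependency-style strings is made up for illustration. It also assumes that a word missing from the map should simply get no features.

```scala
object MyFeaturizerSketch {
  // Stand-ins mirroring the interfaces quoted in the thread, so the example
  // runs on its own; they are not Epic's actual definitions.
  trait Feature
  trait WordFeatureAnchoring[W] {
    def words: IndexedSeq[W]
    def featuresForWord(pos: Int): Array[Feature]
  }
  trait WordFeaturizer[W] {
    def anchor(words: IndexedSeq[W]): WordFeatureAnchoring[W]
  }

  case class MyFeature(f: String) extends Feature

  class MyFeaturizer(features: Map[String, Seq[String]]) extends WordFeaturizer[String] {
    def anchor(w: IndexedSeq[String]): WordFeatureAnchoring[String] =
      new WordFeatureAnchoring[String] {
        def words: IndexedSeq[String] = w

        // `words.length` is the sentence length; unknown words get no features.
        def featuresForWord(pos: Int): Array[Feature] =
          if (pos < 0 || pos >= words.length) Array.empty[Feature]
          else features.getOrElse(words(pos), Seq.empty).map(s => MyFeature(s): Feature).toArray
      }
  }

  // Hypothetical lookup table: word -> externally computed string features.
  val myFeatures: Map[String, Seq[String]] = Map(
    "dog"   -> Seq("dep=nsubj", "head=barks"),
    "barks" -> Seq("dep=root")
  )

  // Anchor once per sentence, then query per position.
  val anchored   = new MyFeaturizer(myFeatures).anchor(IndexedSeq("the", "dog", "barks"))
  val forDog     = anchored.featuresForWord(1).toList // the two features listed for "dog"
  val forUnknown = anchored.featuresForWord(0).toList // "the" is not in the map
  val outOfRange = anchored.featuresForWord(7).toList // past the end of the sentence
}
```

The split into `anchor` and `featuresForWord` means any per-sentence work can be done once in `anchor`, and the per-position lookup stays cheap.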

Omid

Aug 30, 2014, 11:10:36 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Thanks,

Is the length in pos >= length the length of features?

David Hall

Aug 30, 2014, 11:10:55 PM
to scalanlp...@googlegroups.com
The length of the sentence, i.e. words.length.

Omid

Aug 30, 2014, 11:30:39 PM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Then, when we build a featurizer out of these:

  unigrams(word + clss, 1)
    + bigrams(clss, 2)
    + bigrams(tagDict, 2)

what are "word", "clss", and "tagDict"?

David Hall

Aug 30, 2014, 11:44:46 PM
to scalanlp...@googlegroups.com
word is word identity. clss is WordClassFeaturizer. tagDict is the most common tag for each word. Look at WordFeaturizer.DSL.
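
As a rough illustration of "the most common tag for each word", the sketch below derives such a table from tag/word counts. It is not how WordFeaturizer.DSL is implemented: a plain Map stands in for the Counter2, the key order (tag first, word second) is an assumption, and `mostCommonTag` and the sample counts are hypothetical.

```scala
object TagDictSketch {
  // (tag, word) -> count, standing in for the Counter2[String, String, Double]
  // handed to the DSL; a plain Map keeps the example dependency-free.
  val counts: Map[(String, String), Double] = Map(
    ("NN", "run") -> 3.0,
    ("VB", "run") -> 7.0,
    ("DT", "the") -> 12.0
  )

  // For every word, keep the tag with the highest count.
  def mostCommonTag(counts: Map[(String, String), Double]): Map[String, String] =
    counts
      .groupBy { case ((_, word), _) => word }
      .map { case (word, entries) =>
        val ((bestTag, _), _) = entries.maxBy { case (_, count) => count }
        word -> bestTag
      }

  val tagDict: Map[String, String] = mostCommonTag(counts)
}
```

With these sample counts, "run" maps to VB because its VB count (7.0) beats its NN count (3.0).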

Omid

Aug 31, 2014, 12:01:14 AM
to scalanlp...@googlegroups.com, dl...@cs.berkeley.edu
Thanks a lot :)

David Hall

Aug 31, 2014, 12:05:10 AM
to scalanlp...@googlegroups.com
you're most welcome