CrfSuiteStringOutcomeDataWriter and feature encoding

jakob.s...@gmail.com

Nov 5, 2014, 5:28:39 AM
to cleart...@googlegroups.com

Hi,
Maybe this should be an issue, but I'll start by asking: is there a reason why NameNumber.value is ignored when CrfSuiteStringOutcomeDataWriter encodes features, even though the CRFSuite implementation should be able to handle real-valued features? See the following links:

http://fnl.es/tag/nlp.html#feature-modeling
http://python-crfsuite.readthedocs.org/en/latest/pycrfsuite.html#api-reference

  @Override
  public void writeEncoded(List<NameNumber> features, String outcome) {
    this.trainingDataWriter.print(outcome);
    for (NameNumber nameNumber : features) {
      this.trainingDataWriter.print(featureSeparator);
      // Only the feature name is written; nameNumber.value is never used.
      this.trainingDataWriter.print(nameNumber.name);
    }
    this.trainingDataWriter.println();
  }

With the encoding as it is, what would be the best way to encode a non-binary feature, e.g. the length of the covered text?

There is a similar question regarding the MalletCRFStringOutcomeDataWriter:
https://groups.google.com/forum/#!topic/cleartk-users/16YBkiQ_Sk4


Steven Bethard

Nov 5, 2014, 8:16:09 AM
to cleart...@googlegroups.com
Yeah, it looks like this is just a missing feature, very likely
because the person who implemented the CRFSuite wrapper started by
looking at the Mallet wrapper. It seems like a "scaling value" could
be used for numeric attributes:

http://www.chokkan.org/software/crfsuite/manual.html

The only workaround I can suggest for the moment is for you to
copy-paste the CrfSuiteStringOutcomeDataWriter and add the 2-3 lines
to check for a non-1 value and print it with a colon before it.
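
A minimal sketch of that workaround, not the actual ClearTK code. It builds the line as a String so it can be checked in isolation. The class name CrfSuiteValueSketch, the nested NameNumber stand-in, and the encodeLine and escape helpers are all hypothetical. The tab separator and the escaping of ':' and '\' in attribute names are assumptions based on the CRFSuite manual's description of the data format, so check them against the manual before relying on them.

```java
import java.util.List;

final class CrfSuiteValueSketch {

  /** Hypothetical stand-in for ClearTK's NameNumber: a feature name plus a numeric value. */
  static final class NameNumber {
    final String name;
    final Number value;

    NameNumber(String name, Number value) {
      this.name = name;
      this.value = value;
    }
  }

  // Assumption: items on a CRFSuite training line are tab-separated.
  static final String FEATURE_SEPARATOR = "\t";

  /** Escapes '\' and ':' so a colon inside a name is not read as the value separator. */
  static String escape(String name) {
    return name.replace("\\", "\\\\").replace(":", "\\:");
  }

  /** Builds one training line: the outcome, then each feature as name or name:value. */
  static String encodeLine(List<NameNumber> features, String outcome) {
    StringBuilder line = new StringBuilder(outcome);
    for (NameNumber nameNumber : features) {
      line.append(FEATURE_SEPARATOR).append(escape(nameNumber.name));
      // Only write the scaling value when it differs from the implicit default of 1.
      if (nameNumber.value.doubleValue() != 1.0) {
        line.append(':').append(nameNumber.value);
      }
    }
    return line.toString();
  }
}
```

With a binary feature word_John (value 1.0) and a numeric feature length (value 4.0), the line for outcome B-PER would carry word_John without a suffix and length followed by ":4.0".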

Please do file an issue for this. Unlike the Mallet issue, this one
looks like it has a clear solution.

Steve

Jakob Sahlström

Nov 5, 2014, 9:52:02 AM
to cleart...@googlegroups.com
Thanks for the suggestion! Will do that and file an issue right away.

Jakob Sahlström

Nov 5, 2014, 11:11:54 AM
to cleart...@googlegroups.com
Though it might get confusing if the value of a continuous feature actually is 1.0. According to CRFSuite's manual, leaving the scaling factor out is equivalent to setting it to 1.0, so another option is to just print the number after a ':' for all features.
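
A rough sketch of that second option, where every feature is written as name:value so a continuous feature that happens to equal 1.0 looks the same as any other. This is not the ClearTK implementation: the class AlwaysScaledWriterSketch and its nested NameNumber are hypothetical stand-ins, the tab separator is an assumption, and feature names are assumed to contain no ':' or '\' characters (those would need escaping first).

```java
import java.io.PrintWriter;
import java.util.List;

final class AlwaysScaledWriterSketch {

  /** Hypothetical stand-in for ClearTK's NameNumber: a feature name plus a numeric value. */
  static final class NameNumber {
    final String name;
    final Number value;

    NameNumber(String name, Number value) {
      this.name = name;
      this.value = value;
    }
  }

  private final PrintWriter trainingDataWriter;

  AlwaysScaledWriterSketch(PrintWriter trainingDataWriter) {
    this.trainingDataWriter = trainingDataWriter;
  }

  /** Writes the outcome followed by every feature as name:value, including values of 1.0. */
  void writeEncoded(List<NameNumber> features, String outcome) {
    this.trainingDataWriter.print(outcome);
    for (NameNumber nameNumber : features) {
      this.trainingDataWriter.print('\t'); // assumed separator
      this.trainingDataWriter.print(nameNumber.name);
      this.trainingDataWriter.print(':');
      this.trainingDataWriter.print(nameNumber.value.doubleValue());
    }
    this.trainingDataWriter.println();
  }
}
```

The trade-off is a somewhat larger training file, since binary features also carry an explicit ":1.0", in exchange for not special-casing any value.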