standard way of lower-casing token features


Tim Miller

May 21, 2013, 3:39:11 PM
to cleart...@googlegroups.com
It seems that cleartk token features will normally be case-sensitive,
but for some learning tasks having case-insensitive versions may
increase statistical strength (e.g., in clinical text the phrases "NO
TUMOR", "No tumor", and "no tumor" all should be equally certainly
negated). Is there any built-in mechanism or best practice for doing
this? I was thinking of just going through all the extracted features
and lower-casing them but that seems very hacky.
Tim

Lee Becker

May 21, 2013, 4:33:09 PM
to cleart...@googlegroups.com

So the common way to create a bag of words extractor would be to do something like the following:
this.extractor = new CleartkExtractor(Token.class, new CoveredTextExtractor(), new Covered());

Looking at the CleartkExtractor constructor signature, you can see it takes three parameters:
* annotationClass
* extractor
* contexts

This means the statement above creates an extractor that operates on Tokens (annotationClass) by running the CoveredTextExtractor (extractor) on the tokens covered by the focus annotation (contexts), as opposed to, say, the tokens to its left or right.

The simplest way to get lowercase features would be to write your own LowerCaseCoveredTextExtractor. This is as simple as copying CoveredTextExtractor and adding a toLowerCase() call to the appropriate part of the extract() method.
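
As a rough illustration of that one-line change, here is a self-contained sketch in plain Java. The Feature class and both methods are hypothetical stand-ins, not ClearTK classes; the only point is where the toLowerCase() call would go. Passing Locale.ROOT is an added assumption, so that the result does not depend on the JVM's default locale.

```java
import java.util.Locale;

// Illustration only: none of these types are the real ClearTK classes.
final class LowerCaseCoveredTextSketch {

    /** Minimal stand-in for a named feature with a string value. */
    static final class Feature {
        final String name;
        final String value;

        Feature(String name, String value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return name + "=" + value;
        }
    }

    /** Mimics a covered-text extractor: the feature value is the raw span text. */
    static Feature coveredText(String spanText) {
        return new Feature("CoveredText", spanText);
    }

    /** The copied extractor with one change: lower-case before building the feature. */
    static Feature lowerCaseCoveredText(String spanText) {
        // Locale.ROOT avoids locale-specific case mappings.
        return new Feature("LowerCaseCoveredText", spanText.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // The three spellings from the original question end up with one shared value.
        for (String phrase : new String[] {"NO TUMOR", "No tumor", "no tumor"}) {
            System.out.println(coveredText(phrase) + " -> " + lowerCaseCoveredText(phrase));
        }
    }
}
```

With the lower-casing variant, all three spellings produce the same feature value, so a classifier would see them as one feature instead of three.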

Alternatively you could try to implement your own Context (see Count in CleartkExtractor), but this is a bit less straightforward and I imagine less reusable for your purposes.

Let me know if you need any clarification,
Lee



Philip Ogren

May 21, 2013, 4:44:03 PM
to cleart...@googlegroups.com
I think the preferred way to get lower-cased token features would be to do something like this:

SimpleFeatureExtractor tokenFeatureExtractor = new FeatureFunctionExtractor(
        new CoveredTextExtractor(),
        new LowerCaseFeatureFunction());

and then use it like this:

List<Feature> tokenFeatures = new ArrayList<Feature>();
tokenFeatures.addAll(tokenFeatureExtractor.extract(jCas, token));

I copied this from org.cleartk.examples.pos.ExamplePOSAnnotator. The code there also shows how to combine it with CleartkAnnotator.
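
For readers without the ClearTK source at hand, the idea behind this composition can be sketched in plain Java: a base extractor produces features, and a feature function is applied to each one. Everything below (Feature, Extractor, withFunction, the "LowerCase_" name prefix) is a hypothetical stand-in written for this sketch, not the ClearTK API. In particular, it does not try to mirror whether the real FeatureFunctionExtractor also keeps the original-case features or how it names the derived ones.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustration only: stand-ins for the extractor + feature-function pattern.
final class FeatureFunctionSketch {

    /** Minimal stand-in for a named feature with a string value. */
    static final class Feature {
        final String name;
        final String value;

        Feature(String name, String value) {
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return name + "=" + value;
        }
    }

    /** Stand-in for an extractor: turns a token's text into a list of features. */
    interface Extractor {
        List<Feature> extract(String tokenText);
    }

    /** Base extractor: one feature holding the token text unchanged. */
    static final Extractor COVERED_TEXT =
            text -> Collections.singletonList(new Feature("CoveredText", text));

    /** Feature function: derives a lower-cased feature from an existing one. */
    static final Function<Feature, Feature> LOWER_CASE =
            f -> new Feature("LowerCase_" + f.name, f.value.toLowerCase(Locale.ROOT));

    /** Wraps a base extractor so every base feature is passed through the function. */
    static Extractor withFunction(Extractor base, Function<Feature, Feature> fn) {
        return text -> {
            List<Feature> out = new ArrayList<>();
            for (Feature f : base.extract(text)) {
                out.add(fn.apply(f));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Extractor tokenFeatureExtractor = withFunction(COVERED_TEXT, LOWER_CASE);
        List<Feature> tokenFeatures = new ArrayList<>();
        for (String token : new String[] {"NO", "TUMOR"}) {
            tokenFeatures.addAll(tokenFeatureExtractor.extract(token));
        }
        System.out.println(tokenFeatures);
    }
}
```

The appeal of this design is that the lower-casing step stays separate from the extractor, so the same function can be reused with any other base extractor without copying code.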

Lee Becker

May 21, 2013, 4:46:55 PM
to cleart...@googlegroups.com

I forgot about feature functions. Ignore what I said earlier. It works, but Philip's approach is much cleaner.

Tim Miller

May 21, 2013, 5:28:56 PM
to cleart...@googlegroups.com
Great, I got it working, thanks for the help.
Tim