feature selection and FeatureExtractor


fschilder

Dec 6, 2011, 11:16:30 AM
to cleartk-users
Hi,

I was wondering whether anybody working with ClearTK's feature
extractors has done any experiments with feature selection.

Would it be possible/sensible to incorporate feature selection
into the feature extractor class? I can see building simple feature
selections such as stop word lists into the FeatureExtractor, but
anything more sophisticated would need to be done after the feature
extraction has finished, I think.

So far I have used various feature selection methods provided in WEKA,
as well as Least Angle Regression (LARS) in R.

What would be a good way to incorporate the findings from the feature
selection into the ClearTK classifier? Could that be automated in any
way?

Looking forward to your comments/feedback/ideas.

Frank

Lee Becker

Dec 6, 2011, 11:29:43 PM
to cleartk-users


Hi Frank,

I haven't done anything with feature selection in ClearTK, mainly
because UIMA pipelines really only allow one pass through the
CAS. Consequently, even simple things like normalizing features
to have mean=0, stddev=1 have not been doable. Environments like
Weka and R can do feature selection because they already have
all of the rows of the feature (instance) matrix, which makes it
easier to do calculations like feature-feature and feature-outcome
correlation.

Steve, Philip and I have been discussing new flows that will
accommodate this kind of experimentation by letting you
manipulate the features and instances before sending them off for
training/classification. I have written some initial code for this
flow, and I hope to push it out in the next month. Basically, the
flow will let you tag features for some purpose, such as
normalization. During training, your instances (features+outcome) are
written to disk; another annotator then loads these instances to
compute any relevant statistics; finally, you run your annotator
(with feature extraction) again to fix up your features before the
dataWriter writes them out for training with your classifier
(liblinear, svmlight, mallet, etc). During classification, you can
load these statistics directly and modify your feature extraction
behavior accordingly. Hopefully, when it's finished, I will also have
examples such as normalization and TF*IDF calculations, which can
then be used as a starting point for more complex behavior like
feature selection.
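
To make the two-pass idea concrete, here is a minimal sketch of
mean=0, stddev=1 normalization in plain Java. It does not use any
ClearTK or UIMA types: the class and method names (ZScoreSketch,
computeStats, normalize) are made up for illustration, instances are
assumed to be simple name-to-value maps, and a feature missing from an
instance is assumed to count as 0 when computing the statistics.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Two-pass z-score normalization over sparse feature maps (hypothetical names, no ClearTK types). */
class ZScoreSketch {

    /** Per-feature mean and standard deviation gathered in the first pass. */
    static final class Stats {
        final Map<String, Double> mean = new HashMap<>();
        final Map<String, Double> stddev = new HashMap<>();
    }

    /** Pass 1: scan every stored instance and compute the statistics (missing feature = 0). */
    static Stats computeStats(List<Map<String, Double>> instances) {
        Map<String, Double> sum = new HashMap<>();
        Map<String, Double> sumSq = new HashMap<>();
        for (Map<String, Double> instance : instances) {
            for (Map.Entry<String, Double> e : instance.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
                sumSq.merge(e.getKey(), e.getValue() * e.getValue(), Double::sum);
            }
        }
        Stats stats = new Stats();
        int n = instances.size();
        for (String name : sum.keySet()) {
            double mean = sum.get(name) / n;
            // Population variance; clamp at 0 to absorb rounding error.
            double variance = Math.max(0.0, sumSq.get(name) / n - mean * mean);
            stats.mean.put(name, mean);
            stats.stddev.put(name, Math.sqrt(variance));
        }
        return stats;
    }

    /** Pass 2 (and classification time): rescale one instance with the stored statistics. */
    static Map<String, Double> normalize(Map<String, Double> instance, Stats stats) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : instance.entrySet()) {
            Double mean = stats.mean.get(e.getKey());
            Double sd = stats.stddev.get(e.getKey());
            if (mean == null) {
                continue; // feature never seen during training: drop it
            }
            out.put(e.getKey(), sd == 0.0 ? 0.0 : (e.getValue() - mean) / sd);
        }
        return out;
    }
}
```

The point of the split is that computeStats needs all instances at
once, while normalize only needs the saved statistics, so the same
statistics can be reloaded at classification time.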

Alternatively, in the near term, if you had a way to dump your
features and outcomes for Weka, you could then run Weka feature
selection, and use the findings from the Weka model directly in your
feature extraction flow.
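
As a rough illustration of what such a dump could look like, here is a
small plain-Java sketch that turns feature maps and outcomes into ARFF
text. Everything in it is an assumption for the example: the class
name ArffSketch, the attribute name "outcome", dense numeric
attributes with 0 for a missing feature, and feature names and
outcomes that contain no quotes, commas or other special characters.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hand-rolled ARFF dump of numeric features plus a nominal outcome (hypothetical helper). */
class ArffSketch {

    static String toArff(String relation,
                         List<Map<String, Double>> features,
                         List<String> outcomes) {
        // Collect every feature name and outcome label, sorted for a stable column order.
        TreeSet<String> names = new TreeSet<>();
        for (Map<String, Double> f : features) {
            names.addAll(f.keySet());
        }
        TreeSet<String> classes = new TreeSet<>(outcomes);

        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : names) {
            sb.append("@attribute '").append(name).append("' numeric\n");
        }
        sb.append("@attribute outcome {").append(String.join(",", classes)).append("}\n\n");
        sb.append("@data\n");
        // One dense row per instance; a feature absent from an instance is written as 0.0.
        for (int i = 0; i < features.size(); i++) {
            for (String name : names) {
                sb.append(features.get(i).getOrDefault(name, 0.0)).append(',');
            }
            sb.append(outcomes.get(i)).append('\n');
        }
        return sb.toString();
    }
}
```

For large, sparse feature sets the dense rows would get very wide, so
ARFF's sparse row syntax would probably be a better fit in practice.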

I apologize for potentially confusing you more than helping, so don't
hesitate to ask more questions and keep the discussion going.

Cheers,
Lee

Philip Ogren

Feb 12, 2012, 5:38:23 PM
to cleart...@googlegroups.com
Frank,

Apologies for the delayed response. It seems timely to respond to Lee's comments since we have recently addressed them in a couple of different ways. For starters, I have implemented a simple Weka data writer for ClearTK. It is in the project cleartk-ml-weka. You could use the data writer to create an ARFF file and then perform feature selection in the Weka environment. I'm not exactly sure how you would feed your feature selection analysis back into your analysis engine. The Weka wrapper is not complete; in particular, there is no classifier. I've filed several issues with the tag 'cleartk-ml-weka' that you might consult.
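
One simple possibility for the feedback step, sketched here in plain Java and not taken from ClearTK: export the names of the attributes Weka selected as a list of strings (for example, one name per line of a text file), and drop every extracted feature whose name is not on that list. The class SelectedFeatureFilter and its methods are hypothetical, and features are assumed to be a simple name-to-value map.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Keeps only the features whose names were chosen by an external selection step (hypothetical helper). */
class SelectedFeatureFilter {
    private final Set<String> selected = new HashSet<>();

    /** selectedNames: one feature name per entry, e.g. the lines of an exported text file. */
    SelectedFeatureFilter(List<String> selectedNames) {
        for (String line : selectedNames) {
            String name = line.trim();
            if (!name.isEmpty()) {
                selected.add(name);
            }
        }
    }

    /** Returns a copy of the input containing only the selected features, in their original order. */
    Map<String, Double> filter(Map<String, Double> features) {
        Map<String, Double> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : features.entrySet()) {
            if (selected.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

The same filter would need to run at both training and classification time so that the classifier always sees the same feature set.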

Lee and Steve also implemented infrastructure for TrainableExtractors. The example and documentation are still under construction, but the basic idea is there, and I think it will be a fairly powerful addition to ClearTK that can improve feature extraction and perhaps feature selection too. You can look at cleartk-examples and consult the example 'featuretransformation'. Note that this example is not done and will be moved (to 'documentclassification'), so email back if you can't find it or make sense of it.

Regards,
Philip

