I was wondering whether anybody working with ClearTK's feature
extractors has done any experiments with feature selection.
Would it be possible/sensible to incorporate the feature selection
into the feature extractor class? I can see simple feature selections
such as stop word lists being built into the FeatureExtractor, but anything more
sophisticated would need to be done after the feature extraction has
finished, I think.
I have used various feature selection methods provided in WEKA, as
well as Least Angle Regression (LARS) in R.
What would be a good way to incorporate the findings from the feature
selection into the ClearTK classifier? Could that be automated in any
way?
Looking forward to your comments/feedback/ideas.
Frank
Hi Frank,
I haven't done anything with feature selection in ClearTK mainly
because the UIMA pipelines really only allow one pass through the
CAS. Consequently, even doing simple things like normalizing features
to have mean=0, stddev=1 has not been doable. Environments like
Weka and R can do feature selection because they already have
all of the rows of the feature (instance) matrix, which makes it
easier to do calculations like feature-feature and feature-outcome
correlation.
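
To make the one-pass limitation concrete, here is a minimal Python sketch (plain Python, not ClearTK or UIMA code) of mean=0, stddev=1 normalization. The function names `column_stats` and `normalize` and the feature name `length` are made up for illustration, and each instance is assumed to be a dict of numeric features. The point is only that the statistics need every row before a single row can be rescaled.

```python
import math

def column_stats(rows):
    """First pass: mean and population std dev per feature name.

    A feature missing from a row is treated as 0 for that row (an assumption).
    """
    sums, sq_sums, n = {}, {}, len(rows)
    for row in rows:
        for name, value in row.items():
            sums[name] = sums.get(name, 0.0) + value
            sq_sums[name] = sq_sums.get(name, 0.0) + value * value
    stats = {}
    for name in sums:
        mean = sums[name] / n
        # Clamp at 0 to guard against tiny negative values from rounding.
        variance = max(sq_sums[name] / n - mean * mean, 0.0)
        stats[name] = (mean, math.sqrt(variance))
    return stats

def normalize(row, stats):
    """Second pass: rescale one row using the statistics from the first pass."""
    out = {}
    for name, value in row.items():
        mean, std = stats[name]
        # A constant feature carries no information; map it to 0.
        out[name] = (value - mean) / std if std > 0 else 0.0
    return out

rows = [{"length": 2.0}, {"length": 4.0}, {"length": 6.0}]
stats = column_stats(rows)                        # needs all rows
normalized = [normalize(r, stats) for r in rows]  # needs the stats
```

With only one pass over the data, `normalize` would have to run before `column_stats` has seen the later rows, which is the problem described above.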
Steve, Philip and I have been discussing new flows that will
accommodate this kind of experimentation and allow you to
manipulate the features and instances before sending them off for
training/classification. I have written some initial code for this
flow, and I hope to push it out in the next month. Basically, this
flow will allow you to tag features for some purpose like
normalization. During training, your instances (features+outcome) will
first get written to disk. Another annotator will then load these
instances to compute any relevant statistics. Finally, you will run
your annotator (with feature extraction) again to fix up your features
before they get written out by the dataWriter for training with your
classifier (liblinear, svmlight, mallet, etc.). During
classification, you can load these statistics directly and modify your
feature extraction behavior accordingly. Hopefully when it's
finished, I will also have examples such as normalization and TF*IDF
calculations, which can then be used as a starting point for more
complex behavior like feature selection.
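
The flow described above could look roughly like the following Python sketch, using TF*IDF as the statistic. This is not the ClearTK code mentioned in the message. Every name here (`extract`, `write_instances`, `compute_idf`, `extract_tfidf`, the file names) is hypothetical, and JSON files stand in for whatever on-disk format the real flow uses.

```python
import json
import math
import os
import tempfile

def extract(tokens):
    """Hypothetical feature extractor: raw term counts for one document."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def write_instances(docs, path):
    """Pass 1: write each instance (features + outcome) to disk."""
    with open(path, "w") as f:
        for tokens, outcome in docs:
            record = {"features": extract(tokens), "outcome": outcome}
            f.write(json.dumps(record) + "\n")

def compute_idf(instances_path, stats_path):
    """Between passes: reload the instances, compute IDF, save the statistics."""
    doc_freq, n = {}, 0
    with open(instances_path) as f:
        for line in f:
            n += 1
            for term in json.loads(line)["features"]:
                doc_freq[term] = doc_freq.get(term, 0) + 1
    idf = {term: math.log(n / count) for term, count in doc_freq.items()}
    with open(stats_path, "w") as f:
        json.dump(idf, f)

def extract_tfidf(tokens, idf):
    """Pass 2 and classification: rescale counts with the saved statistics.

    Terms never seen in training are dropped (an assumption of this sketch).
    """
    return {t: c * idf[t] for t, c in extract(tokens).items() if t in idf}

docs = [(["good", "movie"], "pos"), (["bad", "movie"], "neg")]
workdir = tempfile.mkdtemp()
instances_path = os.path.join(workdir, "instances.jsonl")
stats_path = os.path.join(workdir, "idf.json")

write_instances(docs, instances_path)       # pass 1
compute_idf(instances_path, stats_path)     # statistics step
with open(stats_path) as f:                 # also what classification would load
    idf = json.load(f)
training_rows = [(extract_tfidf(tokens, idf), outcome) for tokens, outcome in docs]
```

At classification time only the last two steps are needed: load the saved statistics and call `extract_tfidf` on the new document.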
Alternatively, in the near term, if you had a way to dump your
features and outcomes for Weka, you could then run Weka feature
selection, and use the findings from the Weka model directly in your
feature extraction flow.
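
As a rough sketch of that near-term route, the Python below renders instances as dense ARFF text (the file format Weka reads) and then filters features against a whitelist. The helper names `to_arff` and `keep_selected` are hypothetical, and the `selected` set is a made-up placeholder for whatever Weka's attribute selection reports. The sketch assumes numeric features with simple names (no spaces or quotes) and no feature called `outcome`.

```python
def to_arff(rows, outcomes, relation="instances"):
    """Render numeric feature dicts plus nominal outcomes as dense ARFF text."""
    names = sorted({name for row in rows for name in row})
    classes = sorted(set(outcomes))
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in names]
    lines.append("@attribute outcome {%s}" % ",".join(classes))
    lines += ["", "@data"]
    for row, outcome in zip(rows, outcomes):
        # Features absent from a row are written as 0.
        values = [str(row.get(name, 0)) for name in names]
        lines.append(",".join(values + [outcome]))
    return "\n".join(lines) + "\n"

def keep_selected(row, selected):
    """Drop every feature that the selection step did not keep."""
    return {name: value for name, value in row.items() if name in selected}

rows = [{"len": 3, "caps": 1}, {"len": 5}]
outcomes = ["pos", "neg"]
arff = to_arff(rows, outcomes)   # write this string to a .arff file for Weka

selected = {"len"}               # placeholder for the names Weka reports
filtered = [keep_selected(r, selected) for r in rows]
```

The same `keep_selected` filter would then be applied inside the feature extraction step at both training and classification time, so the classifier only ever sees the selected features.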
I apologize for potentially confusing you more than helping, so don't
hesitate to ask more questions and keep the discussion going.
Cheers,
Lee