How to create an Annotator class to train a model using a CSV file

pravin...@gmail.com

Mar 16, 2015, 2:26:20 PM

to cleart...@googlegroups.com

I am new to ClearTK and UIMA. I need help writing an annotator class for the scenario below.
I have a CSV file that contains two columns: the first is the outcome and the second is the description.
For example:
outcome        description
A                    apple is a fruit.
B                    boys are playing with ball.

I need to remove stop words and tokenize the string in the description column. For each token, I want to determine its part of speech and use this information to
train a model for the classifier.
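
The preprocessing described above (tokenizing and removing stop words) can be sketched in plain Java. This is only an illustration, not ClearTK code: the class name DescriptionTokenizer, the tiny stop-word list, and the rule of splitting on non-letter characters are all assumptions made for the example. Part-of-speech tagging needs a trained tagger and is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class DescriptionTokenizer {
    // Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "are", "with");

    /** Lowercases the text, splits it on non-letter characters, and drops stop words. */
    public static List<String> tokenize(String description) {
        List<String> tokens = new ArrayList<>();
        for (String token : description.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // split() can yield an empty first element when the text starts with punctuation
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("apple is a fruit."));           // [apple, fruit]
        System.out.println(tokenize("boys are playing with ball.")); // [boys, playing, ball]
    }
}
```

In a UIMA pipeline the same steps would normally be done by annotators that add token annotations to the CAS, so a standalone helper like this is mainly useful for checking what the classifier features should look like.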

I have the following questions:
- Which classifier should be used for this type of data? Is Maxent appropriate for this data?
- How can I create an Annotator class to train a model using the CSV file and classify an input sentence into the appropriate outcome?
Thanks,
Pravin Bhogan

Steven Bethard

Mar 17, 2015, 11:03:17 AM

to cleart...@googlegroups.com

On Mon, Mar 16, 2015 at 6:08 AM, <pravin...@gmail.com> wrote:
> I have the following questions:
> - Which classifier should be used for this type of data? Is Maxent
> appropriate for this data?

Maxent (a.k.a. logistic regression) would be fine. I'd recommend the
LIBLINEAR one.

> - How can I create an Annotator class to train a model using the CSV file and
> classify an input sentence into the appropriate outcome?

You'll have to write some UIMA code that loads the CSV file into the
UIMA CAS. I would recommend asking on the UIMA Users mailing list for
this part: https://uima.apache.org/mail-lists.html. You probably want
to aim for having only the "description" part as text in the CAS, and
the "outcome" part stored somehow in your type system.
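
As a rough illustration of the first half of that step, here is a minimal plain-Java sketch that splits each CSV line into an outcome and a description. It makes several assumptions that are not stated in this thread: the file is comma-separated with one header row, the outcome contains no commas, and the names LabeledCsvParser and Row are invented for the example. The UIMA half is left out; there, each description would presumably become the document text of one CAS, with the outcome stored in a feature of a custom type.

```java
import java.util.ArrayList;
import java.util.List;

public class LabeledCsvParser {
    /** One CSV row: the label to predict and the text to classify. */
    public static final class Row {
        public final String outcome;
        public final String description;

        Row(String outcome, String description) {
            this.outcome = outcome;
            this.description = description;
        }
    }

    /** Splits each line at its first comma, so commas inside the description survive. */
    public static List<Row> parse(List<String> lines, boolean hasHeader) {
        List<Row> rows = new ArrayList<>();
        for (int i = hasHeader ? 1 : 0; i < lines.size(); i++) {
            String line = lines.get(i);
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // skip blank or malformed lines
            }
            rows.add(new Row(line.substring(0, comma).trim(),
                             line.substring(comma + 1).trim()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "outcome,description",
                "A,apple is a fruit.",
                "B,boys are playing with ball.");
        for (Row row : parse(lines, true)) {
            System.out.println(row.outcome + " -> " + row.description);
        }
    }
}
```

The lines for a real file could come from java.nio.file.Files.readAllLines. Quoted fields and escaped commas are not handled here, so a proper CSV library would be a safer choice for anything beyond a simple two-column file.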

Once you have your outcomes and descriptions stored somehow in the
CAS, then you can train a model similarly to what is shown in the
chunking example:
https://code.google.com/p/cleartk/wiki/TutorialNamedEntityChunkingClassifier
Though in your case, you probably want just a CleartkAnnotator instead
of a CleartkSequenceAnnotator (since your outcomes are determined only
by the description, and not by their order in the CSV file).

Hope that helps,

Steve

pravin bhogan

Apr 6, 2015, 8:38:56 AM

to cleart...@googlegroups.com

Hi Steve,

Thanks for your help. I was able to create the Annotator class :-)

I need some information about the classification algorithm. My outcome type is not binary; it can be any one of more than 100 outcomes. Which library and which algorithm would be suitable for this scenario? [LibSVM, LibLinear, Mallet (MaxEnt, NaiveBayes, C4.5)]

Thanks,
Pravin

Steven Bethard

Apr 6, 2015, 2:36:01 PM

to cleart...@googlegroups.com

Pretty much any of those should be fine, and as usual, my default
suggestion is LibLinear.

Steve
