Add custom features to Token type

Sujan Perera

Dec 30, 2014, 8:58:44 PM

to cleart...@googlegroups.com

Hi,

I am a newbie to ClearTK. I started with the NER example.

My question is how to add a custom feature to the Token type.

Here is my example:

Suppose my training data is in the following format.

U.N.         NNP  I-NP  I-ORG  1
official     NN   I-NP  O      0 
Ekeus        NNP  I-NP  I-PER  0
heads        VBZ  I-VP  O      0
for          IN   I-PP  O      0
Baghdad      NNP  I-NP  I-LOC  1

This is exactly the CoNLL data format, plus an extra column indicating whether the term is in a dictionary.

I want to add the last column as a feature of the Token.

My collection reader reads this example, creates a Token for each term, and sets the POS using the setPOS method.

How can I add the feature value in the last column to the Token type?
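
For illustration, splitting one row of this format into its five columns might look like the sketch below. It uses only the Java standard library; the class name ConllRow and its field names are hypothetical and not part of ClearTK, and it assumes the columns are separated by runs of whitespace with the dictionary flag written as 1 or 0.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one row of the five-column format (not a ClearTK class). */
final class ConllRow {
    final String word;
    final String pos;
    final String chunk;
    final String ner;
    final boolean inDictionary;

    ConllRow(String word, String pos, String chunk, String ner, boolean inDictionary) {
        this.word = word;
        this.pos = pos;
        this.chunk = chunk;
        this.ner = ner;
        this.inDictionary = inDictionary;
    }

    /** Parses one non-blank line; assumes whitespace-separated columns and a 1/0 flag. */
    static ConllRow parse(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 5) {
            throw new IllegalArgumentException(
                "Expected 5 columns but found " + cols.length + ": " + line);
        }
        return new ConllRow(cols[0], cols[1], cols[2], cols[3], "1".equals(cols[4]));
    }

    /** Parses several lines, skipping blank lines (sentence separators in CoNLL files). */
    static List<ConllRow> parseAll(List<String> lines) {
        List<ConllRow> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parse(line));
            }
        }
        return rows;
    }
}
```

A collection reader could call something like ConllRow.parse on each line and then copy the fields onto the annotation it creates.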

I found that Token has a setStringValue method, but it takes an org.apache.uima.cas.Feature and a String as arguments. I could not figure out how to create an instance of org.apache.uima.cas.Feature.

Then how can I get this feature in my analysis engine? I know it is possible to use TypePathExtractor to get the POS tag.

Appreciate your help.

Lee Becker

Dec 31, 2014, 1:42:16 AM

to cleart...@googlegroups.com

One thing that takes some getting used to in UIMA is the notion of creating type definitions to define the types of annotations and feature structures you can stuff into the CAS. Most of ClearTK uses the type systems defined in the cleartk-type-system module. This is a set of XML files, which are then used to automatically generate the classes used in the code. If done correctly, you should never have to manually create instances of org.apache.uima.cas.Feature. Here is the type definition for Token in cleartk-type-system. If you poke around more in the pom.xml in that module, you can see which plugins get invoked to kick off the autogeneration of the type system classes.

Another thing to keep in mind is that UIMA has the notion of Features and Feature Structures, which borrows heavily from the usage in linguistics, where different chunks of data may have properties known as features. This is different from the machine learning sense of feature that the ClearTK feature extractors produce. You probably have the terms sorted out, but I like to point this out for clarification.

To get back to your original question, there are a few approaches you can take:

1) If you can dynamically determine whether the token of interest is in a dictionary, then you can write a simple feature extractor to return that data without needing to stuff it into the CoNLL-like input. This will save you from the hassle of creating/extending the type system, and it also saves you from having to store superfluous information in the CAS. Presumably whatever you're doing to add the column to the CoNLL file could be embedded into your own FeatureExtractor code.

2) A simple approach would be to create a new type named something like InDictionary. Then during collection reading, you would create and add to the CAS an InDictionary object for each token that had the value set to true. Then you would need to write a custom feature extractor that selects the InDictionary annotations within the bounds of the token of interest (for example, with uimaFIT's JCasUtil.selectCovered) and returns true if the number found is greater than 0.

3) If you feel like you need to have this information embedded in the CAS, a better approach would be to extend the ClearTK Token class in your own type system definition. It would look something like this:

<typeDescription>
  <name>some.namespace.type.DictionaryToken</name>
  <supertypeName>org.cleartk.token.type.Token</supertypeName>
  <features>
    <featureDescription>
      <name>inDictionary</name>
      <rangeTypeName>uima.cas.Boolean</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>

Then your collection reader would create instances of DictionaryToken instead of the regular Token. To get this value out during feature extraction, you could use the TypePathExtractor, or write your own extractor if you need to convert the boolean to a specific value (1/0, T/F, etc.).
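
As a rough sketch of approach 1, the dictionary check can be computed from the token text at extraction time instead of being read from the input file. The code below uses only the Java standard library so that it stands alone; DictionaryLookup, featureFor, and the feature name InDictionary are hypothetical, and in a real pipeline this logic would sit inside a ClearTK feature extractor that receives the Token annotation. It also assumes a case-insensitive match on the whole token, which may not suit every dictionary.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not a ClearTK class) that turns a dictionary lookup into a feature. */
final class DictionaryLookup {
    private final Set<String> entries = new HashSet<>();

    DictionaryLookup(Collection<String> terms) {
        for (String term : terms) {
            entries.add(normalize(term));
        }
    }

    // Assumption: matching ignores case and surrounding whitespace.
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns true if the token text matches a dictionary entry after normalization. */
    boolean contains(String tokenText) {
        return entries.contains(normalize(tokenText));
    }

    /** Returns a name/value pair, with the boolean converted to "1" or "0". */
    Map.Entry<String, String> featureFor(String tokenText) {
        return new SimpleImmutableEntry<>("InDictionary", contains(tokenText) ? "1" : "0");
    }
}
```

The same boolean-to-1/0 conversion would apply to approach 3 if the value were read from an inDictionary feature on the token rather than looked up on the fly.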

Cheers,

Lee

Sujan Perera

Dec 31, 2014, 12:46:32 PM

to cleart...@googlegroups.com

Thanks for the reply.

I am fairly familiar with the UIMA type system.

I solved my problem by adding a new type that extends the Token type, since I have multiple attributes to add to Tokens.
