TextClassification API v1.1

julien nioche

Sep 16, 2008, 8:50:34 AM
to DigitalPebble
We are pleased to announce the availability of a new version of our
TextClassification API.

List of changes:

* added a new Learner/Classifier: libLinear

* new Field-based representation of a document
Using named fields makes it possible to compute a relative frequency or
tf.idf based on the length of each field rather than of the whole
document. This should be useful for representing richer documents, e.g.
those coming from Nutch or SOLR.
In the future we will be able to define other types of fields, such as
numerical fields. Such fields won't be tokenized; instead, the
numerical value will be used directly in the Vector.

* removed deprecated classifier (JNISVMLight)

* separated GATE PR from core library

* bug fixes + code improvement + small optimisations
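
To illustrate the field-based representation mentioned in the list above, here is a minimal sketch of a per-field relative frequency. It is not taken from the TextClassification API: the class name FieldFrequencySketch, the method relativeFrequencies and the sample tokens are all hypothetical, and it assumes each field has already been tokenized.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the TextClassification API.
final class FieldFrequencySketch {

    /**
     * Relative frequency of each token, normalised by the length of
     * this field only (not by the length of the whole document).
     */
    static Map<String, Double> relativeFrequencies(String[] fieldTokens) {
        Map<String, Double> freq = new LinkedHashMap<>();
        if (fieldTokens.length == 0) {
            return freq; // an empty field has no frequencies
        }
        // Count raw occurrences of each token in the field.
        for (String token : fieldTokens) {
            freq.merge(token, 1.0, Double::sum);
        }
        // Divide each count by the number of tokens in the field.
        freq.replaceAll((token, count) -> count / fieldTokens.length);
        return freq;
    }

    public static void main(String[] args) {
        // Two made-up fields of the same document.
        String[] title = {"svm", "tutorial"};
        String[] body = {"this", "tutorial", "covers", "svm",
                         "and", "linear", "svm", "models"};
        System.out.println("title: " + relativeFrequencies(title));
        System.out.println("body:  " + relativeFrequencies(body));
    }
}
```

With these sample fields, "svm" gets a relative frequency of 0.5 in the two-token title and 0.25 in the eight-token body, so the same term is weighted differently depending on the field it occurs in.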

Feel free to contact us if you'd like to give it a try. The
TextClassification API is free for research and evaluation purposes; a
license is available for commercial applications.

Kind regards

Julien


baoquoc

Sep 17, 2008, 8:09:36 AM
to DigitalPebble
Dear Julien,

I have integrated your TextClassificationPR into GATE.
It loads correctly in the GATE GUI, but it cannot find the "Learner"
class (error messages below):

CREOLE plugin loaded: file:/home/baoquoc/GATE-4.0-b2974/plugins/TextClassificationGATE/
java.lang.NoClassDefFoundError: com/digitalpebble/classification/Learner
at com.digitalpebble.classification.gate.TextClassificationModelCreatorPR.init(TextClassificationModelCreatorPR.java:81)
at gate.Factory.createResource(Factory.java:337)
at gate.gui.NewResourceDialog$3.run(NewResourceDialog.java:195)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ClassNotFoundException: com.digitalpebble.classification.Learner
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at gate.util.GateClassLoader.loadClass(GateClassLoader.java:60)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

I think I need to install libSVM as the learner. Is that right?

Thank you in advance.

Bao Quoc Ho



DigitalPebble

Sep 17, 2008, 8:19:02 AM
to digita...@googlegroups.com
Hi,

Try adding       <JAR>lib/textclassification-1.1.jar</JAR>
to each resource in the CREOLE file. That jar already contains libSVM; the problem is simply that GATE has not been told to load it.

J.



--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com

Bao Quoc HO

Sep 17, 2008, 8:27:17 AM
to digita...@googlegroups.com
Hi,

Thank you very much.

Bao Quoc HO

julien nioche

Sep 17, 2008, 8:30:45 AM
to DigitalPebble
Thank you for reporting the problem; I've fixed the creole.xml
accordingly.

Bah

Oct 29, 2008, 9:29:55 AM
to DigitalPebble
Hey there. I got the API yesterday and have integrated it into GATE.
At first I had the same problem mentioned here, which I solved by adding
<JAR>lib/textclassification-1.1.jar</JAR> to each resource in the
CREOLE file.
But whenever I try to run the creator I get errors that I can't
figure out how to solve:

There are no annotations of type Sentence available in document!
gate.creole.ExecutionException: java.lang.Exception: There must be at least two different class values in the training corpus
at com.digitalpebble.classification.gate.TextClassificationModelCreatorPR.execute(TextClassificationModelCreatorPR.java:185)
at gate.creole.SerialController.runComponent(SerialController.java:177)
at gate.creole.SerialController.executeImpl(SerialController.java:136)
at gate.creole.SerialAnalyserController.executeImpl(SerialAnalyserController.java:67)
at gate.creole.AbstractController.execute(AbstractController.java:42)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1253)
at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.Exception: There must be at least two different class values in the training corpus
at com.digitalpebble.classification.Learner.generateVectorFile(Learner.java:104)
at com.digitalpebble.classification.Learner.learn(Learner.java:140)
at com.digitalpebble.classification.Learner.learn(Learner.java:91)
at com.digitalpebble.classification.gate.TextClassificationModelCreatorPR.execute(TextClassificationModelCreatorPR.java:183)
... 6 more
Caused by:
java.lang.Exception: There must be at least two different class values in the training corpus
at com.digitalpebble.classification.Learner.generateVectorFile(Learner.java:104)
at com.digitalpebble.classification.Learner.learn(Learner.java:140)
at com.digitalpebble.classification.Learner.learn(Learner.java:91)
at com.digitalpebble.classification.gate.TextClassificationModelCreatorPR.execute(TextClassificationModelCreatorPR.java:183)
at gate.creole.SerialController.runComponent(SerialController.java:177)
at gate.creole.SerialController.executeImpl(SerialController.java:136)
at gate.creole.SerialAnalyserController.executeImpl(SerialAnalyserController.java:67)
at gate.creole.AbstractController.execute(AbstractController.java:42)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1253)
at java.lang.Thread.run(Thread.java:619)


I have tried changing the parameters, but still no success. What could
be wrong?

julien nioche

Oct 29, 2008, 11:36:56 AM
to DigitalPebble
Hi,

You need to change the value of the parameter textAnnotationType
to the annotation type used in your training corpus, and make sure that
the value of textAnnotationValue corresponds to the feature you want to
learn. http://www.digitalpebble.com/TextClassificationAPI.pdf contains a
description of the parameters used by the PRs.

Julien

Bah

Oct 29, 2008, 8:49:41 PM
to DigitalPebble
Hey there. Thank you very much for your help. I realized what I was
doing wrong for that problem. I have read that PDF over and over again,
but I still have another problem: I don't understand what this error
means:
gate.creole.ExecutionException: java.lang.Exception: There must be at least two different class values in the training corpus

I mean, how should I set textAnnotationValue correctly in order to make
things work?

Once again, thank you.


julien nioche

Oct 30, 2008, 3:59:38 AM
to DigitalPebble
As I said yesterday, textAnnotationValue must point to the name of the
feature that holds the labels in the annotations you want to learn
from, e.g. 'category' in the XML below:

<Message category="sport" >
bla bla bla
</Message>

Obviously, you need more than one distinct value of category in your
corpus; otherwise there is nothing to learn from.
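
The requirement above can be checked before training. The sketch below is only an illustration under assumptions: LabelCheckSketch, distinctLabels and isTrainable are hypothetical names, not part of the TextClassification API or GATE, and the labels are assumed to have been read already from the feature named by textAnnotationValue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; not part of the TextClassification API or GATE.
final class LabelCheckSketch {

    /** Returns the distinct, non-blank labels found in the training examples. */
    static Set<String> distinctLabels(List<String> labels) {
        Set<String> distinct = new TreeSet<>();
        for (String label : labels) {
            // Skip annotations where the feature is missing or empty.
            if (label != null && !label.trim().isEmpty()) {
                distinct.add(label.trim());
            }
        }
        return distinct;
    }

    /** A corpus is only usable for training if it has at least two classes. */
    static boolean isTrainable(List<String> labels) {
        return distinctLabels(labels).size() >= 2;
    }

    public static void main(String[] args) {
        // Made-up label lists standing in for the values of 'category'.
        List<String> oneClass = Arrays.asList("sport", "sport", "sport");
        List<String> twoClasses = Arrays.asList("sport", "arts", "sport");
        System.out.println(distinctLabels(oneClass) + " trainable: " + isTrainable(oneClass));
        System.out.println(distinctLabels(twoClasses) + " trainable: " + isTrainable(twoClasses));
    }
}
```

A corpus in which the feature is missing, or always has the same value, fails this check, which is the situation the "at least two different class values" error reports.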

J.

Bah

Oct 30, 2008, 12:58:25 PM
to DigitalPebble
So you're saying I should have something within each document that
tells very explicitly what the text is about?
In case I have some html files, and I have processed them with some
GATE tools before using the classifier, my "pre-processing stage"
should return to me, besides the part of the text in the HTML file
that I want to classify, another feature which would be the category,
is that it?

julien nioche

Oct 30, 2008, 1:29:08 PM
to DigitalPebble

> So you're saying I should have something within each document that
> tells very explicitly what the text is about?

No, the category feature was just an example of a feature one might
learn from. You can replace it with anything you want.

> In case I have some html files, and I have processed them with some
> GATE tools before using the classifier, my "pre-processing stage"
> should return to me, besides the part of the text in the HTML file
> that I want to classify, another feature which would be the category,
> is that it?

If you want to classify the whole document, you will need an
annotation covering the whole text. If your docs are HTML, there
should be an html annotation in the Original markups annotation set,
which you can use as the value for textAnnotationType. You will then
need to add a feature to it corresponding to what you want the TC to
learn.

Can you describe what you are trying to classify and what your
categories for the classification are?

Barbara

Oct 30, 2008, 1:41:58 PM
to digita...@googlegroups.com
I have a set of html pages and I want to classify them. For instance, I have got some news and articles from the web and I want to sort them in categories like "health", "science", "arts" and so on.



lists.dig...@gmail.com

Oct 30, 2008, 1:50:48 PM
to DigitalPebble
Do you already know the category of each document and, if so, in which
annotation / feature do you store it? If not, then you obviously need
to have that information somewhere in order to train a model with it.
Once you have a model, you can use it to classify new documents for
which the category is unknown.


Barbara

Oct 30, 2008, 2:43:39 PM
to digita...@googlegroups.com
I have put the category in the title of my HTML document, so that I have, for example,
<title>arts</title>
This is shown in my Original markups as title.
So I should use title for my textAnnotationValue. Is that right?


De: "lists.dig...@gmail.com" <lists.dig...@gmail.com>
Para: DigitalPebble <digita...@googlegroups.com>
Enviadas: Quinta-feira, 30 de Outubro de 2008 15:50:48
Assunto: Re: Res: TextClassification API v1.1

lists.dig...@gmail.com

Oct 30, 2008, 3:40:07 PM
to DigitalPebble
No, you should put that as a feature value on an annotation covering
the whole document, or at least the section you want to use for
training. You can use a JAPE rule in GATE to copy the content of title
into a feature of your html annotation, e.g.
<html category="arts">
Then use Token as componentAnnotationType.

The GATE documentation and mailing list should provide you with all
the help you need for that.

J.

Barbara

Oct 30, 2008, 9:10:33 PM
to digita...@googlegroups.com
Thank you very much for your help!
I think I finally got it working.


De: "lists.dig...@gmail.com" <lists.dig...@gmail.com>
Para: DigitalPebble <digita...@googlegroups.com>
Enviadas: Quinta-feira, 30 de Outubro de 2008 17:40:07
Assunto: Re: Res: Res: TextClassification API v1.1
Reply all
Reply to author
Forward
0 new messages