Data Format and Compatibility with the libSVM tools

Hugo da silva Correa Pinto

Mar 30, 2011, 10:25:30 AM

to DigitalPebble

Hello,

I was about to do a GATE port of libLinear and possibly some libSVM
tools when I remembered DP's text classification plugin.

Using them in GATE was a breeze to generate features, but soon I
realized that, unlike the GATE Batch ML plugin, it entailed a 3-step
process:
1) Generate lexicon + features with the TraininCorpusCreator
2) Make the model via command line
3) Load model with the Classifier plugin

Step 2 is the bit that seems confusing to me. The raw format is very
similar to the libSVM one, but has two preceding fields. Is there a
straightforward way to dump the results in a libSVM-compatible
format (even though a simple sed command can solve it), so that we can
use the libSVM scaling and grid search (to find gamma and C) tools,
etc.?

I also did not see a reference to scaling. Are the generated features
already scaled? How can we specify the desired scale ranges when
classifying?

Thanks for open sourcing the classification API and providing the
plugin - it works pretty fast compared to other ML options available
to GATE!

Best,

DigitalPebble

Mar 30, 2011, 10:51:46 AM

to digita...@googlegroups.com, Hugo da silva Correa Pinto

Hi Hugo

> I was about to do a GATE port of libLinear and possibly some libSVM
> tools when I remembered DP's text classification plugin.

> Using them in GATE was a breeze to generate features, but soon I
> realized that, unlike the GATE Batch ML plugin, it entailed a 3-step
> process:
> 1) Generate lexicon + features with the TraininCorpusCreator
> 2) Make the model via command line
> 3) Load model with the Classifier plugin

> Step 2 is the bit that seems confusing to me. The raw format is very
> similar to the libSVM one, but has two preceding fields. Is there a
> straightforward way to dump the results in a libSVM-compatible
> format (even though a simple sed command can solve it), so that we can
> use the libSVM scaling and grid search (to find gamma and C) tools,
> etc.?

You forgot an intermediate step above: generating a vector file from the raw file and the lexicon.

As explained (briefly, I know) on http://code.google.com/p/textclassification/wiki/HOWTO, you must call

java -cp textclassification-1.7.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini

to generate the vector file, which is what libSVM uses as input, so you should be able to use the scaling tools on it.
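
For readers who want to check that a generated vector file really is libSVM-compatible before running the scaling or grid search tools on it, here is a minimal sketch. It assumes the standard libSVM text layout (`label index:value ...` with ascending, 1-based indices); the function name `parse_libsvm_line` is made up for illustration and is not part of the textclassification API.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    if not parts:
        raise ValueError("empty line")
    label = float(parts[0])
    features = {}
    previous_index = 0  # assumption: feature indices are 1-based
    for token in parts[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        # libSVM expects feature indices in ascending order.
        if index <= previous_index:
            raise ValueError(f"indices not ascending at {token!r}")
        features[index] = float(value_text)
        previous_index = index
    return label, features


label, features = parse_libsvm_line("1 3:0.5 7:0.25 12:1")
print(label, features)
```

Running this over every line of the vector file is a cheap sanity check; a line that raises `ValueError` would likely also trip up the libSVM tools.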

The training plugin initially did the whole thing, from feature extraction to model generation, in a single step. In practice this is not very useful, as in most cases we end up playing with various parameters and strategies before generating the model, which is why I split it into: raw -> vector -> model

> I also did not see a reference to scaling. Are the generated features
> already scaled? How can we specify the desired scale ranges when
> classifying?

There is no scaling as such but a very simple normalization which is on by default (see http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/Lexicon.java).

It is a simple L2 norm - see
http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/SimpleDocument.java#213

for the details.
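
As an illustration of what an L2 norm does to a document vector, here is a minimal sketch. It is not the code from SimpleDocument.java; the helper names `l2_normalize` and `to_libsvm_line` are hypothetical, and the sparse vector is assumed to be a plain `{index: weight}` mapping.

```python
import math


def l2_normalize(features):
    """Scale a sparse vector {index: weight} to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in features.values()))
    if norm == 0.0:
        # An all-zero (or empty) vector cannot be normalized; return a copy.
        return dict(features)
    return {index: value / norm for index, value in features.items()}


def to_libsvm_line(label, features):
    """Render a vector as a libSVM-style line with ascending indices."""
    tokens = [f"{index}:{value:.6g}" for index, value in sorted(features.items())]
    return " ".join([str(label)] + tokens)


vector = l2_normalize({1: 3.0, 4: 4.0})
print(to_libsvm_line(1, vector))  # prints "1 1:0.6 4:0.8"
```

After normalization every document has length 1, so long and short documents become comparable, but individual attributes are not brought onto a common range the way per-feature scaling would.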

Scaling is currently not handled, but it would be nice to add that to the classifier somehow.

> Thanks for open sourcing the classification API and providing the
> plugin - it works pretty fast compared to other ML options available
> to GATE!

You are welcome. I started it ages ago when the GATE ML did not do text classification but only token classification for NE learning and really wanted it to be as simple as possible but at the same time offer useful functionalities.

You might find the MultiFieldDocument approach interesting - most libraries handle documents as a single dimension whereas intuitively we know that a title or a list of keywords do not have the same weight as the main text.
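
To make the idea of per-field weights concrete, here is a rough sketch. It is not how MultiFieldDocument is implemented (the thread does not show that); it simply assumes each field contributes its term counts multiplied by a field weight. The function name and the example weights are invented for illustration.

```python
from collections import Counter


def weighted_term_counts(fields, field_weights):
    """Combine per-field token lists into one weighted term-count mapping."""
    totals = Counter()
    for name, tokens in fields.items():
        # Assumption: fields without an explicit weight count once.
        weight = field_weights.get(name, 1.0)
        for token in tokens:
            totals[token] += weight
    return dict(totals)


document = {
    "title": ["svm", "tools"],
    "body": ["svm", "format", "format"],
}
weights = {"title": 3.0, "body": 1.0}  # hypothetical weights
print(weighted_term_counts(document, weights))
```

With these made-up weights, a term in the title counts three times as much as the same term in the body, which is the intuition described above.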

The ModelUtils is also particularly helpful with libLinear as it gives some info about the weight of the attributes in the model - great for debugging and refining the attributes.

Regarding the GATE plugin which wraps the TC API - have a look at the ngram PR, it works well in combination with the training and learning

Thanks for your comments

Julien

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com
http://www.digitalpebble.com

Hugo Pinto

Mar 30, 2011, 11:49:07 AM

to DigitalPebble, digita...@googlegroups.com

Hello Julien,

> You forgot an intermediate step above: generating a vector file from the raw file and the lexicon.

> As explained (briefly, I know) on http://code.google.com/p/textclassification/wiki/HOWTO, you must call

> java -cp textclassification-1.7.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini

> to generate the vector file, which is what libSVM uses as input, so you should be able to use the scaling tools on it.

Oops! Thanks for the quick reply! Indeed I forgot this step, which was well documented in the wiki. I relied on the PDF + README and missed it. I can replicate this info in a README or similar file in the textclassification project if you think it is OK.

> The training plugin initially did the whole thing, from feature extraction to model generation, in a single step. In practice this is not very useful, as in most cases we end up playing with various parameters and strategies before generating the model, which is why I split it into: raw -> vector -> model

I see, splitting feature + lexicon generation from model training is a good strategy. I would find it more intuitive, though, to provide the vector in that same step, ending with "raw", "raw lexicon", "vector_lexicon" and "vector". Then again, I can see it as just another plugin that, given a location with a raw file + lexicon + parameters, would create the vector for us, so we could experiment with frequency and tf-idf weighting in a shorter timespan when dealing with huge datasets (which again is super easy to do outside GATE with CorpusUtils, as you pointed out). I can do either, especially the second if you would rather keep the original plugin totally clean.

> > I also did not see a reference to scaling. Are the generated features
> > already scaled? How can we specify the desired scale ranges when
> > classifying?

> There is no scaling as such but a very simple normalization which is on by default (see http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/Lexicon.java).

> It is a simple L2 norm - see
> http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/SimpleDocument.java#213

> for the details.

> Scaling is currently not handled, but it would be nice to add that to the classifier somehow.

Hmm... normalizing is enough to get good results in most applications, excellent. I will experiment with scaling on a couple of datasets of mine, and if it gives considerably better results than just normalization, I will submit a patch to add scaling support to the API, OK?
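
A minimal sketch of such a scaling experiment, for readers following along. It does simple per-feature min-max scaling and reuses the ranges learned on the training vectors at classification time, which is one way to address the "desired scale ranges when classifying" question. The names `fit_ranges` and `scale` are hypothetical and belong to neither libSVM nor the textclassification API, and the sketch ignores the implicit zeros of sparse vectors, which a real tool would need to take into account.

```python
def fit_ranges(vectors):
    """Record (min, max) per feature index over the training vectors."""
    ranges = {}
    for features in vectors:
        for index, value in features.items():
            low, high = ranges.get(index, (value, value))
            ranges[index] = (min(low, value), max(high, value))
    return ranges


def scale(features, ranges, lower=0.0, upper=1.0):
    """Map each feature into [lower, upper] using the training ranges."""
    scaled = {}
    for index, value in features.items():
        if index not in ranges:
            continue  # feature never seen in training: drop it
        low, high = ranges[index]
        if high == low:
            continue  # constant feature: carries no information
        scaled[index] = lower + (upper - lower) * (value - low) / (high - low)
    return scaled


training = [{1: 2.0, 2: 10.0}, {1: 4.0, 2: 30.0}]
ranges = fit_ranges(training)
print(scale({1: 3.0, 2: 30.0}, ranges))  # prints {1: 0.5, 2: 1.0}
```

The important design point is that `ranges` is computed once on the training data and stored alongside the model, so that documents classified later are scaled in exactly the same way.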

> > Thanks for open sourcing the classification API and providing the
> > plugin - it works pretty fast compared to other ML options available
> > to GATE!

> You are welcome. I started it ages ago when the GATE ML did not do text classification but only token classification for NE learning and really wanted it to be as simple as possible but at the same time offer useful functionalities.

> You might find the MultiFieldDocument approach interesting - most libraries handle documents as a single dimension whereas intuitively we know that a title or a list of keywords do not have the same weight as the main text.

Awesome, this is a distinctive feature of this API, and quite handy for news articles! Right now I am dealing mostly with microtext, which lacks titles, etc. However, I can see this feature being used to put different emphasis on hashtags, which in a sense carry far more info than the other tokens in the message, on average (also, I guess nowadays the Twitter API already comes with hashtags, entities et al. already separated out in the message payload, which makes experimenting with it quite easy).

> The ModelUtils is also particularly helpful with libLinear as it gives some info about the weight of the attributes in the model - great for debugging and refining the attributes.

> Regarding the GATE plugin which wraps the TC API - have a look at the ngram PR, it works well in combination with the training and learning

Thanks, I will experiment with them and report my experiences here.

Regards,
Hugo Pinto,
Artificial Intelligence and Natural Language Processing
http://www.hugopinto.net

Hugo Pinto

Apr 13, 2011, 5:45:11 PM

to digita...@googlegroups.com

Hi Julien,

I just submitted a pull request with a version of the plugin that allows generating the vector with all the optional parameters. For users of the old version nothing changes, as all parameters are optional and the vector will be generated with what was hardcoded in the original version using the CML tool.

In the process I ended up taking out what seemed to be a legacy parameter in the textclassification API, as the diff shows.

Hope you and other users find it useful.

BTW - you were right, I just fell in love with GitHub! :)

Best,
--
Hugo Pinto
Computational Linguistics & Artificial Intelligence
http://www.hugopinto.net

2011/3/30 DigitalPebble <jul...@digitalpebble.com>

> > Oops! Thanks for the quick reply! Indeed I forgot this step, which was well documented in the wiki. I relied on the PDF + README and missed it. I can replicate this info in a README or similar file in the textclassification project if you think it is OK.

> Are you talking about the PDF on DigitalPebble's website? It is a bit dated :-) You could add it to the README or, better, add a reference to the wiki in the README.

> > > The training plugin initially did the whole thing, from feature extraction to model generation, in a single step. In practice this is not very useful, as in most cases we end up playing with various parameters and strategies before generating the model, which is why I split it into: raw -> vector -> model

> > I see, splitting feature + lexicon generation from model training is a good strategy. I would find it more intuitive, though, to provide the vector in that same step, ending with "raw", "raw lexicon", "vector_lexicon" and "vector"

> Well, if you want to generate the vector at the same time, there is no need for a separate lexicon file, as its content does not change. The only reason why we specify a different lexicon file when using the command line and the ini params is that the latter can be used to filter some of the attributes based on their document frequency. We could of course do the filtering of the attributes at the same time (which I think the old version of the PR used to do), but I'd rather keep it simple. If people want to experiment with various thresholds, then this will be done using the command line and the params file.
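
To illustrate the kind of document-frequency filtering described here, a minimal sketch follows. It is not the CorpusUtils implementation, and the thread does not show the option names used in params.ini, so the function names and the `min_df` threshold below are purely hypothetical.

```python
from collections import Counter


def document_frequencies(documents):
    """Count in how many documents each term occurs at least once."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def filter_lexicon(documents, min_df=2):
    """Keep terms seen in at least min_df documents; assign 1-based indices."""
    df = document_frequencies(documents)
    kept = sorted(term for term, count in df.items() if count >= min_df)
    return {term: index for index, term in enumerate(kept, start=1)}


corpus = [
    ["svm", "gate"],
    ["svm", "plugin"],
    ["gate", "svm", "svm"],
]
print(filter_lexicon(corpus, min_df=2))  # prints {'gate': 1, 'svm': 2}
```

Because the filtered lexicon has fewer attributes and different indices than the unfiltered one, it makes sense that the vector generation step writes out its own lexicon file, as explained above.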

> > - though I can see it as just another plugin that, given a location with a raw file + lexicon + parameters, would create the vector for us, so we could experiment with frequency and tf-idf weighting in a shorter timespan when dealing with huge datasets (which again is super easy to do outside GATE with CorpusUtils, as you pointed out). I can do either, especially the second if you would rather keep the original plugin totally clean.

> I think the command line should be used when experimenting, in the same way as we'd use the command line to test different params with the SVM implementation itself.

> We could add a new boolean param 'generateVectorFile' for the TraininCorpusCreator PR.

> > > > I also did not see a reference to scaling. Are the generated features
> > > > already scaled? How can we specify the desired scale ranges when
> > > > classifying?

> > > There is no scaling as such but a very simple normalization which is on by default (see http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/Lexicon.java).

> > > It is a simple L2 norm - see
> > > http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/SimpleDocument.java#213

> > > for the details.

> > > Scaling is currently not handled, but it would be nice to add that to the classifier somehow.

> > Hmm... normalizing is enough to get good results in most applications, excellent. I will experiment with scaling on a couple of datasets of mine, and if it gives considerably better results than just normalization, I will submit a patch to add scaling support to the API, OK?

> Sure, that would be great.

> > > > Thanks for open sourcing the classification API and providing the
> > > > plugin - it works pretty fast compared to other ML options available
> > > > to GATE!

> > > You are welcome. I started it ages ago when the GATE ML did not do text classification but only token classification for NE learning and really wanted it to be as simple as possible but at the same time offer useful functionalities.

> > > You might find the MultiFieldDocument approach interesting - most libraries handle documents as a single dimension whereas intuitively we know that a title or a list of keywords do not have the same weight as the main text.

> > Awesome, this is a distinctive feature of this API, and quite handy for news articles!

> or any sort of web pages :-)

> [...]

> FYI, I've recently added a module for Mahout in Behemoth. This could be handy when dealing with very large datasets for which running the textclassification API on a single machine is not an option. It could be useful, e.g., to reduce the training set to the most important documents and then use those for generating the model with the TC API.

> Julien