Data Format and Compatibility with the libSVM tools
From: Hugo da silva Correa Pinto <hsspi...@gmail.com>
Date: Wed, 30 Mar 2011 07:25:30 -0700 (PDT)
Local: Wed, Mar 30 2011 10:25 am
Subject: Data Format and Compatibility with the libSVM tools
Hello,

I was about to do a GATE port of libLinear and possibly some libSVM
tools when I remembered DP's text classification plugin.

Using them in GATE was a breeze to generate features, but soon I
realized that, unlike the GATE Batch ML plugin, it entailed a 3-step
process:
1) Generate lexicon + features with the TraininCorpusCreator
2) Make the model via command line
3) Load model with the Classifier plugin

Step 2 is the bit that seems confusing to me. The raw format is very
similar to the libSVM one, but has two preceding fields. Is there a
straightforward way to dump the results in a libSVM-compatible
format (even though a simple sed command can solve it), so that we can
use the libSVM scaling, grid search (to find gamma and C), and other
tools?

I also did not see a reference to scaling. Are the generated features
already scaled? How can we specify the desired scale ranges when
classifying?

Thanks for open-sourcing the classification API and providing the
plugin; it works pretty fast compared to other ML options available
to GATE!

Best,


 
From: DigitalPebble <jul...@digitalpebble.com>
Date: Wed, 30 Mar 2011 15:51:46 +0100
Local: Wed, Mar 30 2011 10:51 am
Subject: Re: Data Format and Compatibility with the libSVM tools

Hi Hugo

> I was about to do a GATE port of libLinear and possibly some libSVM
> tools when I remembered DP's text classification plugin.

> Using them in GATE was a breeze to generate features, but soon I
> realized that, unlike the GATE Batch ML plugin, it entailed a 3-step
> process:
> 1) Generate lexicon + features with the TraininCorpusCreator
> 2) Make the model via command line
> 3) Load model with the Classifier plugin
> 2 is the bit that seems confusing to me. the raw format is very
> similar to the libSVM one, but has two preceding fields. Is there a
> straightforward way to dump the results in a libSVM compatible
> format(even though a simple SED command can solve it), so that we can
> use the libsvm scaling and  grid search(to find gamma and C),etc
> tools?

You forgot an intermediate step above: generating a vector file from the
raw file and the lexicon.

As explained (briefly, I know) on
http://code.google.com/p/textclassification/wiki/HOWTO you must call

    java -cp textclassification-1.7.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini

to generate the vector file which is what libsvm uses as input, so you
should be able to use the scaling on it
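
To make the scaling step concrete, here is a minimal Python sketch of what linear per-feature scaling of libSVM-format vectors amounts to. It is not the code of the libSVM scaling tool: the function names are hypothetical, and as a simplification it computes each feature's minimum and maximum only over the values actually present in the sparse vectors, dropping features that never vary.

```python
from typing import Dict, List, Tuple

Vector = Dict[int, float]  # sparse vector: feature index -> value


def parse_libsvm_line(line: str) -> Tuple[str, Vector]:
    """Parse 'label idx:value idx:value ...' into (label, sparse vector)."""
    label, *pairs = line.split()
    features: Vector = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return label, features


def scale_features(vectors: List[Vector],
                   lower: float = -1.0,
                   upper: float = 1.0) -> List[Vector]:
    """Linearly rescale every feature to [lower, upper].

    Assumption: min/max are taken over observed values only.
    """
    minimum: Dict[int, float] = {}
    maximum: Dict[int, float] = {}
    for vec in vectors:
        for idx, val in vec.items():
            minimum[idx] = min(minimum.get(idx, val), val)
            maximum[idx] = max(maximum.get(idx, val), val)

    scaled: List[Vector] = []
    for vec in vectors:
        out: Vector = {}
        for idx, val in vec.items():
            lo, hi = minimum[idx], maximum[idx]
            if hi == lo:
                continue  # constant feature: nothing to rescale, drop it
            out[idx] = lower + (upper - lower) * (val - lo) / (hi - lo)
        scaled.append(out)
    return scaled
```

The same minimum and maximum found on the training data would have to be reused when classifying new documents, which is why scaling needs support on the classifier side as well.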

The training plugin initially did the whole thing, from feature extraction to
model generation, in a single step. In practice this is not very useful,
because in most cases we end up playing with various parameters and
strategies before generating the model, which is why I split it into:
raw -> vector -> model

> I also did not see a reference to scaling. Does the bits generated are
> already scaled? How may we inform the desired scale ranges when
> classifying?

There is no scaling as such but a very simple normalization which is on by
default (see
http://code.google.com/p/textclassification/source/browse/trunk/src/j...).

It is a simple L2 norm - see
http://code.google.com/p/textclassification/source/browse/trunk/src/j...

for the details.
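
For reference, L2 normalization of a sparse feature vector can be sketched in Python as below. This is an illustration of the technique only, not the library's Java code; the function name is hypothetical.

```python
import math
from typing import Dict


def l2_normalize(vector: Dict[int, float]) -> Dict[int, float]:
    """Divide every value by the Euclidean length of the vector."""
    norm = math.sqrt(sum(value * value for value in vector.values()))
    if norm == 0.0:
        # Empty or all-zero vector: return a copy unchanged
        return dict(vector)
    return {index: value / norm for index, value in vector.items()}
```

After this step every non-zero document vector has length 1, so long and short documents become comparable, but individual features are not brought onto a common range the way scaling would do.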

Scaling is currently not handled but it would be nice to add that to the
classifier somehow

> Thanks for open soucing the classification API and providing the
> plugin - it works pretty fast compared to other ML options available
> to GATE!

You are welcome. I started it ages ago when the GATE ML did not do text
classification but only token classification for NE learning and really
wanted it to be as simple as possible but at the same time offer useful
functionalities.

You might find the MultiFieldDocument approach interesting - most libraries
handle documents as a single dimension whereas intuitively we know that a
title or a list of keywords do not have the same weight as the main text.
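
The idea of giving fields different weights can be sketched roughly as follows in Python. This is not how MultiFieldDocument is implemented: the function name, the field-prefixed feature keys, and the weight values are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def weighted_term_counts(fields: Dict[str, List[str]],
                         field_weights: Dict[str, float]) -> Dict[str, float]:
    """Build one feature per (field, token) pair, scaled by the field weight.

    Fields without an explicit weight default to 1.0 (an assumption).
    """
    scores: Dict[str, float] = {}
    for name, tokens in fields.items():
        weight = field_weights.get(name, 1.0)
        for token, count in Counter(tokens).items():
            # Prefixing with the field name keeps 'svm' in the title
            # distinct from 'svm' in the body.
            scores[f"{name}_{token}"] = weight * count
    return scores


document = {"title": ["svm", "svm"], "body": ["svm", "text"]}
features = weighted_term_counts(document, {"title": 3.0})
```

Here a token in the title counts three times as much as the same token in the body.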

The ModelUtils is also particularly helpful with libLinear as it gives some
info about the weight of the attributes in the model - great for debugging
and refining the attributes.
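
As an illustration of that kind of inspection, here is a small Python sketch that ranks the attributes of a linear model by the magnitude of their weights. It does not reproduce ModelUtils or its output; the function name and the in-memory layout of the weights and lexicon are assumptions.

```python
from typing import Dict, List, Tuple


def top_attributes(weights: List[float],
                   lexicon: Dict[int, str],
                   k: int = 10) -> List[Tuple[str, float]]:
    """Return the k attributes with the largest absolute weight.

    Assumption: weights[i] belongs to feature index i + 1, since
    libSVM-style feature indices start at 1.
    """
    named = [(lexicon.get(i + 1, f"#{i + 1}"), w) for i, w in enumerate(weights)]
    named.sort(key=lambda item: abs(item[1]), reverse=True)
    return named[:k]
```

Large positive or negative weights point at the attributes that drive the decision, which makes it easier to spot noisy or leaking features.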

Regarding the GATE plugin which wraps the TC API - have a look at the ngram
PR, it works well in combination with the training and learning

Thanks for your comments

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com
http://www.digitalpebble.com


 
From: Hugo Pinto <hsspi...@gmail.com>
Date: Wed, 30 Mar 2011 12:49:07 -0300
Local: Wed, Mar 30 2011 11:49 am
Subject: Re: Data Format and Compatibility with the libSVM tools

Hello Julien,

> you forgot an intermediate step above which is the generation of a vector
> file from the raw+lexicon

> as explained (briefly I know) on
> http://code.google.com/p/textclassification/wiki/HOWTO you must call

> *java -cp textclassification-1.7.jar
> com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw
> output/lexicon output/params.ini*

> to generate the vector file which is what libsvm uses as input, so you
> should be able to use the scaling on it

Ooops! Thanks for the quick reply! Indeed I forgot this step, which was
well documented in the wiki. I relied on the PDF + README and missed it. I
can replicate this info in a README or similar file in the
textclassification project if you think it is OK.

> The training plugin initially did the whole thing : feature extraction to
> module generation in a single step but in practice this is not very useful
> as in most cases we end up playing with various parameters and strategies
> before generating the model which is why I split it in : raw -> vector ->
> model

I see, splitting feature + lexicon generation from model training is a good
strategy. I would find it more intuitive, though, to produce the vector in
that same step, ending with "raw", "raw lexicon", "vector_lexicon" and
"vector". Alternatively, I can see it as just another plugin that, given a
location with a raw file + lexicon + parameters, would create the vector for
us, so we could experiment with frequency or tf-idf more quickly when dealing
with huge datasets (which again is super easy to do outside GATE with
CorpusUtils, as you pointed out). I can do either, especially the second if
you would rather keep the original plugin totally clean.

Hmm... normalizing is enough to get good results in most applications,
excellent. I will experiment with scaling on a couple of my datasets, and
if it gives considerably better results than just normalization, I will
submit a patch to add scaling support to the API, OK?

Awesome, MultiFieldDocument is a distinctive feature of this API, and quite
handy for news articles! Right now I am dealing mostly with microtext, which
lacks titles, etc. However, I can see this feature being used to put
different emphasis on hashtags, which in a sense carry far more information
on average than the other tokens in the message (also, I guess the Twitter
API nowadays already delivers hashtags, entities, etc. separated out in the
message payload itself, which makes experimenting with it quite easy).

> The ModelUtils is also particularly helpful with libLinear as it gives some
> info about the weight of the attributes in the model - great for debugging
> and refining the attributes.

> Regarding the GATE plugin which wraps the TC API - have a look at the ngram
> PR, it works well in combination with the training and learning

Thanks, I will experiment with them and report my experiences here.

Regards,
Hugo Pinto,
Artificial Intelligence and Natural Language Processing
http://www.hugopinto.net

 
From: Hugo Pinto <hsspi...@gmail.com>
Date: Wed, 13 Apr 2011 18:45:11 -0300
Local: Wed, Apr 13 2011 5:45 pm
Subject: Re: Data Format and Compatibility with the libSVM tools

Hi Julien,

I just submitted a pull request with a version of the plugin that can
generate the vector with all the optional parameters. For users of the old
version nothing changes: all parameters are optional, and by default the
vector will be generated with what was hardcoded in the original version
using the command-line tool.

In the process I ended up taking out what seemed to be a legacy parameter in
the textclassification API, as the diff shows.

Hope you and other users find it useful.

BTW, you were right, I just fell in love with GitHub! :)

Best,
--
Hugo Pinto
Computational Linguistics & Artificial Intelligence
http://www.hugopinto.net

Attachment: diffAIE.txt (2K)

 