Hello Julien,
> you forgot an intermediate step above which is the generation of a vector
> file from the raw+lexicon
> as explained (briefly I know) on
> http://code.google.com/p/textclassification/wiki/HOWTO you must call
> java -cp textclassification-1.7.jar
> com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw
> output/lexicon output/params.ini
> to generate the vector file which is what libsvm uses as input, so you
> should be able to use the scaling on it
Oops! Thanks for the quick reply! Indeed, I forgot this step, which was
well documented in the wiki. I relied on the PDF + README and missed it. I
can replicate this info in a README or similar file in the
textclassification project if you think that is ok.
> The training plugin initially did the whole thing: feature extraction to
> model generation in a single step, but in practice this is not very useful
> as in most cases we end up playing with various parameters and strategies
> before generating the model, which is why I split it into: raw -> vector ->
> model
I see, splitting feature + lexicon generation from model training is a good
strategy. I would find it more intuitive, though, to produce the vector in
that same step, ending with "raw", "raw lexicon", "vector_lexicon" and
"vector". Alternatively, I can see it as just another plugin that, given a
location with a raw file + lexicon + parameters, would create the vector for
us, so we could experiment with frequency and tfidf in a shorter timespan
when dealing with huge datasets (which, again, is super easy to do outside
GATE with CorpusUtils, as you pointed out). I can do either, especially the
second if you would rather keep the original plugin totally clean.
Hmm... normalizing is enough to get good results in most applications;
excellent. I will experiment with scaling on a couple of datasets of mine,
and if it gives considerably better results than normalization alone, I will
submit a patch to add scaling support to the API, ok?
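To make the comparison concrete, here is a minimal sketch of the two options
as I understand them: normalization works on one vector at a time (dividing
by its L2 norm), while scaling works per feature across the whole corpus
(here, dividing by the feature's maximum, assuming non-negative values). The
class and method names are made up for illustration and are not part of the
textclassification API, and this is not necessarily how the API normalizes
internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only, NOT part of the textclassification API.
public class ScalingSketch {

    // Normalization: divide every value of ONE sparse vector by that vector's L2 norm.
    static Map<Integer, Double> l2Normalize(Map<Integer, Double> vector) {
        double sumOfSquares = 0.0;
        for (double value : vector.values()) {
            sumOfSquares += value * value;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<Integer, Double> result = new TreeMap<>();
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            result.put(entry.getKey(), norm == 0.0 ? 0.0 : entry.getValue() / norm);
        }
        return result;
    }

    // Scaling, step 1: find the largest value of each feature over the whole corpus.
    static Map<Integer, Double> featureMaxima(List<Map<Integer, Double>> corpus) {
        Map<Integer, Double> maxima = new TreeMap<>();
        for (Map<Integer, Double> vector : corpus) {
            for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
                maxima.merge(entry.getKey(), entry.getValue(), Double::max);
            }
        }
        return maxima;
    }

    // Scaling, step 2: divide each value by its feature's maximum.
    // Assumption: values are non-negative, so results land in [0, 1]
    // and absent (zero) features stay absent.
    static Map<Integer, Double> scale(Map<Integer, Double> vector,
                                      Map<Integer, Double> maxima) {
        Map<Integer, Double> result = new TreeMap<>();
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            double max = maxima.getOrDefault(entry.getKey(), 0.0);
            result.put(entry.getKey(), max == 0.0 ? 0.0 : entry.getValue() / max);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> first = new TreeMap<>();
        first.put(1, 3.0);
        first.put(2, 4.0);
        Map<Integer, Double> second = new TreeMap<>();
        second.put(1, 6.0);

        List<Map<Integer, Double>> corpus = new ArrayList<>();
        corpus.add(first);
        corpus.add(second);

        System.out.println("normalized: " + l2Normalize(first));
        System.out.println("scaled:     " + scale(first, featureMaxima(corpus)));
    }
}
```

The maxima would have to be computed on the training vectors only and then
reused unchanged at classification time, which is the part that would need
support in the API.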
>> Thanks for open-sourcing the classification API and providing the
>> plugin - it works pretty fast compared to other ML options available
>> to GATE!
> You are welcome. I started it ages ago when the GATE ML did not do text
> classification but only token classification for NE learning and really
> wanted it to be as simple as possible but at the same time offer useful
> functionalities.
> You might find the MultiFieldDocument approach interesting - most libraries
> handle documents as a single dimension whereas intuitively we know that a
> title or a list of keywords do not have the same weight as the main text.
Awesome, this is a distinctive feature of this API, and quite handy for news
articles! Right now I am dealing mostly with microtext, which lacks titles,
etc. However, I can see this feature being used to put a different emphasis
on hashtags, which in a sense carry far more info on average than the other
tokens in the message. (Also, I guess the Twitter API nowadays already
delivers hashtags, entities, etc. separated out in the message payload
itself, which makes experimenting with this quite easy.)
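Roughly what I have in mind for hashtags, as a standalone sketch: treat
hashtags as a separate "field" and let their occurrences count more than
plain tokens. This deliberately does not use MultiFieldDocument (I have not
tried its API yet); the class name, the method, the whitespace tokenization
and the weights are all assumptions for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch for illustration only, NOT the MultiFieldDocument API.
public class HashtagWeightSketch {

    // Counts tokens of a message, adding hashtagWeight per hashtag occurrence
    // and textWeight per ordinary token. Tokenization is a naive whitespace
    // split; hashtags keep their leading '#' so the two fields stay distinct.
    static Map<String, Double> weightedCounts(String message,
                                              double textWeight,
                                              double hashtagWeight) {
        Map<String, Double> counts = new TreeMap<>();
        for (String token : message.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            double weight = token.startsWith("#") ? hashtagWeight : textWeight;
            counts.merge(token, weight, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hashtags weighted three times as much as ordinary tokens (arbitrary choice).
        System.out.println(weightedCounts("Trying #GATE with libsvm #gate", 1.0, 3.0));
    }
}
```

The right weights would of course have to be found by experiment on real
data.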
> The ModelUtils is also particularly helpful with libLinear as it gives some
> info about the weight of the attributes in the model - great for debugging
> and refining the attributes.
> Regarding the GATE plugin which wraps the TC API - have a look at the ngram
> PR, it works well in combination with the training and learning
Thanks, I will experiment with them and report my experiences back here.
Regards,
Hugo Pinto,
Artificial Intelligence and Natural Language Processing
http://www.hugopinto.net