I was about to do a GATE port of libLinear and possibly some libSVM
tools when I remembered DP's text classification plugin.
Using it in GATE to generate features was a breeze, but I soon
realized that, unlike the GATE Batch ML plugin, it entails a 3-step
process:
1) Generate lexicon + features with the TraininCorpusCreator
2) Make the model via command line
3) Load model with the Classifier plugin
Step 2 is the bit that confuses me. The raw format is very
similar to the libSVM one, but has two preceding fields. Is there a
straightforward way to dump the results in a libSVM-compatible
format (even though a simple sed command can solve it), so that we can
use the libSVM tools for scaling, grid search (to find gamma and C),
etc.?
I also did not see a reference to scaling. Are the generated values
already scaled? How can we specify the desired scale ranges when
classifying?
Thanks for open-sourcing the classification API and providing the
plugin - it works pretty fast compared to other ML options available
to GATE!
You forgot an intermediate step above: generating a vector file from the raw file and the lexicon.
As explained (briefly, I know) on http://code.google.com/p/textclassification/wiki/HOWTO, you must call
java -cp textclassification-1.7.jar com.digitalpebble.classification.util.CorpusUtils -generateVector output/raw output/lexicon output/params.ini
to generate the vector file, which is what libsvm uses as input, so you should be able to use the scaling on it.
The training plugin initially did the whole thing, from feature extraction to model generation, in a single step, but in practice this is not very useful: in most cases we end up playing with various parameters and strategies before generating the model, which is why I split it into raw -> vector -> model.
I also did not see a reference to scaling. Are the generated values
already scaled? How can we specify the desired scale ranges when
classifying?
There is no scaling as such but a very simple normalization which is on by default (see http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/Lexicon.java).
It is a simple L2 norm - see
http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/SimpleDocument.java#213
for the details.
Scaling is currently not handled but it would be nice to add that to the classifier somehow
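For readers unfamiliar with the term, L2 normalization divides every weight in a document vector by the vector's Euclidean length, so each document ends up with length 1. The snippet below is only a minimal sketch of that idea, not the code in SimpleDocument.java; the class name, the method name and the sparse-map representation are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: not part of the textclassification API.
class L2NormSketch {

    // Returns a copy of a sparse vector (feature index -> weight)
    // rescaled so that its Euclidean (L2) length is 1.
    static Map<Integer, Double> l2Normalize(Map<Integer, Double> vector) {
        double sumOfSquares = 0.0;
        for (double value : vector.values()) {
            sumOfSquares += value * value;
        }
        double norm = Math.sqrt(sumOfSquares);

        Map<Integer, Double> normalized = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> entry : vector.entrySet()) {
            // An all-zero vector is copied unchanged to avoid dividing by zero.
            double value = (norm == 0.0) ? entry.getValue() : entry.getValue() / norm;
            normalized.put(entry.getKey(), value);
        }
        return normalized;
    }

    public static void main(String[] args) {
        Map<Integer, Double> doc = new LinkedHashMap<>();
        doc.put(1, 3.0);
        doc.put(7, 4.0);
        System.out.println(l2Normalize(doc));
    }
}
```

Unlike per-feature scaling, this operates on one document at a time and needs no statistics from the training set, which is why it can be applied identically at training and classification time.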
Thanks for open-sourcing the classification API and providing the
plugin - it works pretty fast compared to other ML options available
to GATE!
You are welcome. I started it ages ago, when GATE ML did not do text classification but only token classification for NE learning, and I really wanted it to be as simple as possible while still offering useful functionality.
You might find the MultiFieldDocument approach interesting - most libraries handle documents as a single dimension whereas intuitively we know that a title or a list of keywords do not have the same weight as the main text.
The ModelUtils is also particularly helpful with libLinear as it gives some info about the weight of the attributes in the model - great for debugging and refining the attributes.
Regarding the GATE plugin which wraps the TC API - have a look at the ngram PR, it works well in combination with the training and learning
Ooops! Thanks for the quick reply! Indeed, I forgot this step, which was well documented in the wiki. I relied on the PDF + README and missed it. I can replicate this info in a README or similar file in the textclassification project if you think that is OK.
Are you talking about the PDF on DigitalPebble's website? It is a bit dated :-) It could be added to the README or, better, the README could reference the wiki.
The training plugin initially did the whole thing, from feature extraction to model generation, in a single step, but in practice this is not very useful: in most cases we end up playing with various parameters and strategies before generating the model, which is why I split it into raw -> vector -> model.
I see, splitting feature + lexicon generation from model training is a good strategy. I would find it more intuitive though to provide the vector in that same step, ending with "raw", "raw lexicon", "vector_lexicon" and "vector"
Well, if you want to generate the vector at the same time, there is no need for a separate lexicon file, as its content does not change. The only reason why we specify a different lexicon file when using the command line and the ini params is that the latter can be used to filter some of the attributes based on their document frequency. We could of course do the filtering of the attributes at the same time (which I think the old version of the PR used to do), but I'd rather keep it simple. If people want to experiment with various thresholds, then this will be done using the command line and the params file.
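To make the document-frequency filtering mentioned above concrete, here is a minimal sketch of the general technique: attributes that appear in too few documents are dropped before the vectors are written. It does not reproduce how CorpusUtils or the params file implement it; the class name, the method name and the `minDocFreq` threshold are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: not part of the textclassification API.
class DocFreqFilterSketch {

    // Keeps only the attributes that occur in at least minDocFreq documents.
    // docFreq maps an attribute (e.g. a token) to the number of documents containing it.
    static Map<String, Integer> filterByDocFreq(Map<String, Integer> docFreq, int minDocFreq) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            if (entry.getValue() >= minDocFreq) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("classification", 120);
        docFreq.put("gate", 45);
        docFreq.put("typo-only-once", 1);
        // With a threshold of 2, the attribute seen in a single document is dropped.
        System.out.println(filterByDocFreq(docFreq, 2).keySet());
    }
}
```

Because the filter changes which attributes survive, the filtered lexicon differs from the original one, which is consistent with writing it to a separate lexicon file.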
- though I can see it as just another plugin that, given a location with a raw file + lexicon + parameters, would create the vector for us, so we could experiment with frequency and tf-idf in a shorter timespan when dealing with huge datasets (which again is super easy to do outside GATE with CorpusUtils, as you pointed out). I can do either, especially the second if you would rather keep the original plugin totally clean.
I think the command line should be used when experimenting, in the same way as we'd use the command line to test different params with the SVM implementation itself. We could add a new boolean param 'generateVectorFile' to the TraininCorpusCreator PR.
I also did not see a reference to scaling. Are the generated values
already scaled? How can we specify the desired scale ranges when
classifying?
There is no scaling as such but a very simple normalization which is on by default (see http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/Lexicon.java).
It is a simple L2 norm - see
http://code.google.com/p/textclassification/source/browse/trunk/src/java/com/digitalpebble/classification/SimpleDocument.java#213
for the details.
Scaling is currently not handled but it would be nice to add that to the classifier somehow
Hmm... normalizing is enough to get good results in most applications, excellent. I will experiment with scaling on a couple of datasets of mine, and if it gives considerably better results than just normalization, I will submit a patch to add scaling support to the API, OK?
Sure, that would be great.
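As a point of reference for what such scaling support might involve, the sketch below shows plain per-feature min-max scaling: the range of each feature is learned from the training vectors and then reused unchanged when classifying, which is how the desired ranges get carried over to classification time. This is an assumption-laden illustration, not a proposed patch; the class and method names are hypothetical, dense arrays are used for brevity, and mapping constant features to the lower bound is a choice made here.

```java
import java.util.Arrays;

// Illustrative sketch only: not part of the textclassification API.
class MinMaxScalerSketch {
    private final double lower;
    private final double upper;
    private final double[] min;
    private final double[] max;

    // Learns the per-feature minimum and maximum from the (non-empty) training vectors.
    MinMaxScalerSketch(double[][] training, double lower, double upper) {
        this.lower = lower;
        this.upper = upper;
        int featureCount = training[0].length;
        min = new double[featureCount];
        max = new double[featureCount];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : training) {
            for (int j = 0; j < featureCount; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }
    }

    // Maps each feature linearly from [min, max] (seen in training) to [lower, upper].
    // Values outside the training range fall outside [lower, upper].
    double[] scale(double[] row) {
        double[] scaled = new double[row.length];
        for (int j = 0; j < row.length; j++) {
            if (max[j] == min[j]) {
                scaled[j] = lower; // constant feature: no range to stretch (assumption)
            } else {
                scaled[j] = lower + (upper - lower) * (row[j] - min[j]) / (max[j] - min[j]);
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[][] training = { {0, 10}, {5, 20}, {10, 30} };
        MinMaxScalerSketch scaler = new MinMaxScalerSketch(training, 0.0, 1.0);
        System.out.println(Arrays.toString(scaler.scale(new double[] {5, 20})));
    }
}
```

The important design point is that the ranges come from the training data only; a document classified later must be scaled with those stored ranges rather than with ranges recomputed from the new data.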
Thanks for open-sourcing the classification API and providing the
plugin - it works pretty fast compared to other ML options available
to GATE!
You are welcome. I started it ages ago, when GATE ML did not do text classification but only token classification for NE learning, and I really wanted it to be as simple as possible while still offering useful functionality.
You might find the MultiFieldDocument approach interesting - most libraries handle documents as a single dimension whereas intuitively we know that a title or a list of keywords do not have the same weight as the main text.
Awesome, this is a distinctive feature of this API, and quite handy for news articles!
or any sort of web pages :-)
[...] FYI, I've recently added a module for Mahout in Behemoth. This could be handy when dealing with very large datasets for which running the textclassification API on a single machine is not an option. It could be useful, e.g., to reduce the training set to the most important documents and then use those for generating the model with the TC API.
Julien