I'm playing with the tutorials, etc. and having trouble getting the classifier to work. Training seems to work fine, but running the trained classifier on new data does not; perhaps I have the command-line args wrong.
I'm using the factorie-1.0-SNAPSHOT-jar-with-dependencies.jar that I generated by building the git repo after checking out the "factorie-1.0.0.RC1" branch/tag.
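In case the build matters, this is roughly how I produced the jar (I'm reconstructing the exact Maven invocation from memory, so treat it as approximate):
# in my clone of the factorie repo
git checkout factorie-1.0.0.RC1
mvn package -Dmaven.test.skip=true   # should leave factorie-1.0-SNAPSHOT-jar-with-dependencies.jar under target/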
Here's the command I used to train the classifier (what the bin/fac script runs), reformatted for easier reading:
java -Xmx3g -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server \
-classpath ./src/main/resources:./target/classes:./lib/factorie-1.0-SNAPSHOT-jar-with-dependencies.jar \
cc.factorie.app.classify.Classify \
--write-vocabulary ./data/out/classifier/vocab \
--write-classifier ./data/out/classifier/enron_email \
--read-text-encoding ISO-8859-1 \
--training-portion 0.8 --validation-portion 0.1 \
--trainer "new cc.factorie.app.classify.backend.SVMMulticlassTrainer" \
--print-infogain true \
--read-text-dirs "./data/unclassified/spam ./data/unclassified/ham"
The classpath uses the built jar, which sits in a "lib" directory in my own project folder.
By the way, the default trainer class, "SVMMulticlassClassifierTrainer", doesn't actually exist, so I specified the one shown above. I also found I had to give the fully-qualified class name for it to be found.
This seems to run fine:
Read 33702 documents in 2 directories.
# of distinct class labels: 2
Class labels: spam, ham
Vocabulary size: 142539
Top 20 features with highest information gain:
enron 0.27314369477422284
cc 0.16737156485694638
pm 0.13206432605710616
ect 0.10402359163042996
hou 0.09120112077507947
forwarded 0.08690984306893668
vince 0.08562956389895193
http 0.06498487642128159
kaminski 0.06283999653158767
attached 0.062459899672930974
houston 0.0598921529702795
questions 0.057590148411805764
louise 0.05605109324219437
corp 0.05353748187049234
gas 0.051973143341611294
meeting 0.0502119064533999
money 0.04034957095817926
hpl 0.03643519767067327
original 0.035747773311982645
call 0.03475383262228593
- label = 0: iter = 999, nSV = 2233
- label = 1: iter = 999, nSV = 2233
Classifier trained in 3.615 seconds.
== Training Evaluation ==
OVERALL: accuracy=1.000000
spam f1=1.000000 p=1.000000 r=1.000000 (tp=13737 fp=0 fn=0 true=13737 pred=13737)
ham f1=1.000000 p=1.000000 r=1.000000 (tp=13224 fp=0 fn=0 true=13224 pred=13224)
== Testing Evaluation ==
OVERALL: accuracy=0.978048
spam f1=0.978426 p=0.975015 r=0.981861 (tp=1678 fp=43 fn=31 true=1709 pred=1721)
ham f1=0.977657 p=0.981212 r=0.974128 (tp=1619 fp=31 fn=43 true=1662 pred=1650)
== Validation Evaluation ==
OVERALL: accuracy=0.976855
spam f1=0.977273 p=0.974433 r=0.980129 (tp=1677 fp=44 fn=34 true=1711 pred=1721)
ham f1=0.976421 p=0.979381 r=0.973478 (tp=1615 fp=34 fn=44 true=1659 pred=1649)
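(For what it's worth, the evaluation numbers are internally consistent; e.g., for the spam row under Testing, p = tp/(tp+fp) = 1678/1721 ≈ 0.975, r = tp/(tp+fn) = 1678/1709 ≈ 0.982, and f1 = 2pr/(p+r) ≈ 0.978.)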
Now, here's the command I used to classify some new emails:
java -Xmx3g -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server \
-classpath ./src/main/resources:./target/classes:./lib/factorie-1.0-SNAPSHOT-jar-with-dependencies.jar \
cc.factorie.app.classify.Classify \
--read-vocabulary ./data/out/classifier/vocab \
--read-classifier ./data/out/classifier/enron_email \
--read-text-encoding ISO-8859-1 \
--write-classifications ./data/out/classifier/results.txt \
--read-text-dirs "./data/unclassified/spam ./data/unclassified/ham"
A few things:
1. --write-classifications does nothing; the output is written to stdout.
2. It insists that the directories passed to --read-text-dirs have the same names (i.e., the same labels) as the training data, even though DETERMINING SPAM/HAM IS THE WHOLE POINT ;)
3. Worse, it simply labels each email based on the directory it is in: I cheated and switched the directory names, and every email was then mislabeled.
So, what's the correct way to run the classifier?
Thanks in advance,
Dean