Trouble getting classification to work

Dean Wampler

Mar 28, 2014, 5:30:58 PM
to dis...@factorie.cs.umass.edu
I'm working through the tutorials and having trouble getting the classifier to work. Training seems to work fine, but running the trained classifier on new data does not. Perhaps I have the command-line arguments wrong.

I'm using the factorie-1.0-SNAPSHOT-jar-with-dependencies.jar that I built from the git repo after checking out the "factorie-1.0.0.RC1" branch/tag.

Here's the command I used to train the classifier (what the bin/fac script runs), reformatted for easier reading:

java -Xmx3g -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server \
  -classpath ./src/main/resources:./target/classes:./lib/factorie-1.0-SNAPSHOT-jar-with-dependencies.jar \
  cc.factorie.app.classify.Classify \
  --write-vocabulary ./data/out/classifier/vocab \
  --write-classifier ./data/out/classifier/enron_email \
  --read-text-encoding ISO-8859-1 \
  --training-portion 0.8 --validation-portion 0.1 \
  --trainer "new cc.factorie.app.classify.backend.SVMMulticlassTrainer" \
  --print-infogain true \
  --read-text-dirs "./data/unclassified/spam ./data/unclassified/ham"

I'm using a pre-classified SPAM/HAM data set taken from the Enron email data set, available here: http://www.aueb.gr/users/ion/data/enron-spam/.

The classpath uses the built jar, located in a "lib" directory in my own project folder.

By the way, the default trainer class, "SVMMulticlassClassifierTrainer", doesn't actually exist, so I specified the one shown. I also had to give its fully qualified class name for it to be found.

This seems to run fine:

Read 33702 documents in 2 directories.
# of distinct class labels: 2
Class labels: spam, ham
Vocabulary size: 142539
Top 20 features with highest information gain:
enron 0.27314369477422284
cc 0.16737156485694638
pm 0.13206432605710616
ect 0.10402359163042996
hou 0.09120112077507947
forwarded 0.08690984306893668
vince 0.08562956389895193
http 0.06498487642128159
kaminski 0.06283999653158767
attached 0.062459899672930974
houston 0.0598921529702795
questions 0.057590148411805764
louise 0.05605109324219437
corp 0.05353748187049234
gas 0.051973143341611294
meeting 0.0502119064533999
money 0.04034957095817926
hpl 0.03643519767067327
original 0.035747773311982645
call 0.03475383262228593
- label = 0: iter = 999, nSV = 2233
- label = 1: iter = 999, nSV = 2233
Classifier trained in 3.615 seconds.
== Training Evaluation ==
OVERALL: accuracy=1.000000
spam     f1=1.000000 p=1.000000 r=1.000000 (tp=13737 fp=0 fn=0 true=13737 pred=13737)
ham      f1=1.000000 p=1.000000 r=1.000000 (tp=13224 fp=0 fn=0 true=13224 pred=13224)
== Testing Evaluation ==
OVERALL: accuracy=0.978048
spam     f1=0.978426 p=0.975015 r=0.981861 (tp=1678 fp=43 fn=31 true=1709 pred=1721)
ham      f1=0.977657 p=0.981212 r=0.974128 (tp=1619 fp=31 fn=43 true=1662 pred=1650)
== Validation Evaluation ==
OVERALL: accuracy=0.976855
spam     f1=0.977273 p=0.974433 r=0.980129 (tp=1677 fp=44 fn=34 true=1711 pred=1721)
ham      f1=0.976421 p=0.979381 r=0.973478 (tp=1615 fp=34 fn=44 true=1659 pred=1649)
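
As a sanity check on the log above, the per-class scores can be recomputed from the tp/fp/fn counts it prints. This is only a minimal sketch: it assumes the standard definitions of precision, recall, F1, and accuracy (the log does not say which definitions Factorie uses), and `metrics` and `accuracy` are hypothetical helper names, not part of Factorie.

```shell
#!/bin/sh
# Hypothetical helpers for re-deriving the scores in the evaluation log.
# Assumes the textbook definitions; not taken from Factorie's source.

# metrics TP FP FN: print F1, precision, and recall for one class.
metrics() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    p  = tp / (tp + fp)          # precision: correct among predicted
    r  = tp / (tp + fn)          # recall: correct among actual
    f1 = 2 * p * r / (p + r)     # harmonic mean of precision and recall
    printf "f1=%.6f p=%.6f r=%.6f\n", f1, p, r
  }'
}

# accuracy TP_A TP_B TOTAL: overall accuracy for a two-class problem.
accuracy() {
  awk -v a="$1" -v b="$2" -v n="$3" 'BEGIN {
    printf "accuracy=%.6f\n", (a + b) / n
  }'
}

# Counts copied from the "Testing Evaluation" rows of the log.
metrics 1678 43 31                   # spam
metrics 1619 31 43                   # ham
accuracy 1678 1619 $((1709 + 1662))  # total = true spam + true ham
```

If the definitions match, the output should line up with the f1, p, r, and accuracy values in the Testing Evaluation section; the same calls can be repeated with the Validation Evaluation counts.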


Now, here's the command I used to classify some new emails:

java -Xmx3g -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server \
  -classpath ./src/main/resources:./target/classes:./lib/factorie-1.0-SNAPSHOT-jar-with-dependencies.jar \
  cc.factorie.app.classify.Classify \
  --read-vocabulary ./data/out/classifier/vocab \
  --read-classifier ./data/out/classifier/enron_email \
  --read-text-encoding ISO-8859-1 \
  --write-classifications ./data/out/classifier/results.txt \
  --read-text-dirs "./data/unclassified/spam ./data/unclassified/ham"

A few things:

1. --write-classifications does nothing; the output is written to stdout.
2. It insists that the --read-text-dirs have the same dir names (labels) as the training data, even though DETERMINING SPAM/HAM IS THE WHOLE POINT ;)
3. Worse, it simply labels the emails based on the directory they are in. I cheated and switched the directory names, causing every email to be mislabeled.

So, what's the correct way to run the classifier?

Thanks in advance,

Dean

Ali Shirvani

Jan 10, 2015, 1:41:26 AM
to dis...@factorie.cs.umass.edu
Hi Dean,

I have the same problem.
Have you found a solution to this problem?

Thanks,
Ali

Dean Wampler

Jan 10, 2015, 8:54:02 AM
to dis...@factorie.cs.umass.edu
No one ever replied and I moved on to other things. Sorry.

Dean

Pallika Kanani

Jan 12, 2015, 1:25:01 PM
to dis...@factorie.cs.umass.edu
I've been using Factorie classifiers programmatically without any issues, in case that helps. 

Best,
Pallika.

Ali Shirvani

Jan 12, 2015, 1:36:41 PM
to dis...@factorie.cs.umass.edu
Hi Pallika,

Thanks for your comment.
Could you please be more specific about the Factorie version, the directory structure of your data, and the command you run?

Thanks,
Ali

Pallika Kanani

Jan 12, 2015, 1:40:44 PM
to dis...@factorie.cs.umass.edu
I've been able to use most versions of Factorie, up to 1.0. I use the cc.factorie.app.classify.backend package from my own Scala programs and prepare the data myself. That's not the same as using the classify package from the command line, but I wanted to let people know that the core classification works.

Best,
Pallika. 

Ali Shirvani

Jan 12, 2015, 1:54:25 PM
to dis...@factorie.cs.umass.edu
I also tried to run the DocumentClassifier1 example.
Unfortunately, I couldn't get correct results when using a separate test directory instead of shuffling and splitting.
After classification, every document in the test directory is labeled `test`, and I couldn't figure out why.
Emma kindly tried to help me with the problem, but I haven't resolved the issue yet.
Would you please share your code?

Regards,
Ali
 

James Sullivan

Jun 13, 2016, 12:36:59 AM
to Factorie, deanw...@gmail.com
I just put in a pull request, https://github.com/factorie/factorie/pull/371, that should fix this issue if it is accepted.

Emma Strubell

Jun 13, 2016, 2:10:39 PM
to Factorie, deanw...@gmail.com
Merged, thanks!
