I'm playing with the tutorials, etc. and having trouble getting the classifier to work. Training seems to work fine, but running the trained classifier on new data does not; perhaps I have the command-line args wrong.
I'm using the factorie-1.0-SNAPSHOT-jar-with-dependencies.jar that I generated by building the git repo after checking out the "factorie-1.0.0.RC1" branch/tag.
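In case the build matters, this is roughly how I produced the jar (I'm reconstructing the exact Maven invocation from memory, so treat it as approximate):
# in my clone of the factorie repo
git checkout factorie-1.0.0.RC1
mvn package -Dmaven.test.skip=true   # should leave factorie-1.0-SNAPSHOT-jar-with-dependencies.jar under target/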
Here's the command I used to train the classifier (what the bin/fac script runs), reformatted for easier reading:
java -Xmx3g -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server \
-classpath ./src/main/resources:./target/classes:./lib/factorie-1.0-SNAPSHOT-jar-with-dependencies.jar \
cc.factorie.app.classify.Classify \
--write-vocabulary ./data/out/classifier/vocab \
--write-classifier ./data/out/classifier/enron_email \
--read-text-encoding ISO-8859-1 \
--training-portion 0.8 --validation-portion 0.1 \
--trainer "new cc.factorie.app.classify.backend.SVMMulticlassTrainer" \
--print-infogain true \
--read-text-dirs "./data/unclassified/spam ./data/unclassified/ham"
The classpath uses the built jar, which sits in a "lib" directory in my own project folder.
By the way, the default trainer class, "SVMMulticlassClassifierTrainer", doesn't actually exist, so I specified the one shown above. I also found I had to give the fully-qualified class name for it to be found.
This seems to run fine:
Read 33702 documents in 2 directories.
# of distinct class labels: 2
Class labels: spam, ham
Vocabulary size: 142539
Top 20 features with highest information gain:
enron 0.27314369477422284
cc 0.16737156485694638
pm 0.13206432605710616
ect 0.10402359163042996
hou 0.09120112077507947
forwarded 0.08690984306893668
vince 0.08562956389895193
http 0.06498487642128159
kaminski 0.06283999653158767
attached 0.062459899672930974
houston 0.0598921529702795
questions 0.057590148411805764
louise 0.05605109324219437
corp 0.05353748187049234
gas 0.051973143341611294
meeting 0.0502119064533999
money 0.04034957095817926
hpl 0.03643519767067327
original 0.035747773311982645
call 0.03475383262228593
- label = 0: iter = 999, nSV = 2233
- label = 1: iter = 999, nSV = 2233
Classifier trained in 3.615 seconds.
== Training Evaluation ==
OVERALL: accuracy=1.000000
spam f1=1.000000 p=1.000000 r=1.000000 (tp=13737 fp=0 fn=0 true=13737 pred=13737)
ham f1=1.000000 p=1.000000 r=1.000000 (tp=13224 fp=0 fn=0 true=13224 pred=13224)
== Testing Evaluation ==
OVERALL: accuracy=0.978048
spam f1=0.978426 p=0.975015 r=0.981861 (tp=1678 fp=43 fn=31 true=1709 pred=1721)
ham f1=0.977657 p=0.981212 r=0.974128 (tp=1619 fp=31 fn=43 true=1662 pred=1650)
== Validation Evaluation ==
OVERALL: accuracy=0.976855
spam f1=0.977273 p=0.974433 r=0.980129 (tp=1677 fp=44 fn=34 true=1711 pred=1721)
ham f1=0.976421 p=0.979381 r=0.973478 (tp=1615 fp=34 fn=44 true=1659 pred=1649)
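(For what it's worth, the evaluation numbers are internally consistent; e.g., for the spam row under Testing, p = tp/(tp+fp) = 1678/1721 ≈ 0.975, r = tp/(tp+fn) = 1678/1709 ≈ 0.982, and f1 = 2pr/(p+r) ≈ 0.978.)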
Now, here's the command I used to classify some new emails:
java -Xmx3g -ea -Djava.awt.headless=true -Dfile.encoding=UTF-8 -server \
-classpath ./src/main/resources:./target/classes:./lib/factorie-1.0-SNAPSHOT-jar-with-dependencies.jar \
cc.factorie.app.classify.Classify \
--read-vocabulary ./data/out/classifier/vocab \
--read-classifier ./data/out/classifier/enron_email \
--read-text-encoding ISO-8859-1 \
--write-classifications ./data/out/classifier/results.txt \
--read-text-dirs "./data/unclassified/spam ./data/unclassified/ham"
A few things:
1. --write-classifications does nothing; the output is written to stdout.
2. It insists that the directories passed to --read-text-dirs have the same names (i.e., the same labels) as the training data, even though DETERMINING SPAM/HAM IS THE WHOLE POINT ;)
3. Worse, it simply labels each email based on the directory it is in: I cheated and switched the directory names, and every email was then mislabeled.
So, what's the correct way to run the classifier?
Thanks in advance,
Dean