Problem in classification of new data


Ali Shirvani

Jan 10, 2015, 2:26:56 AM
to dis...@factorie.cs.umass.edu
Hi everyone,

I just started using Factorie for document classification. Training seems to work fine, but running the classifier on new data throws an exception.
Here is the command I used to train the classifier:
read_dirs="economy,sport,tourism"
bin/fac classify --read-text-dirs "$read_dirs" --write-classifier test.factorie --write-vocabulary test.vocab

And here is the command I used to test the classifier on new data:

test_dirs="news"
bin/fac --read-text-dirs "$test_dirs" --read-classifier test.factorie --read-vocabulary test.vocab

Here is the stack trace of the exception:
Exception in thread "main" java.lang.Error: Initial category not in domain: news
    at cc.factorie.variable.CategoricalVariable.<init>(CategoricalVariable.scala:61)
    at cc.factorie.variable.LabeledCategoricalVariable.<init>(LabeledVariable.scala:171)
    at cc.factorie.app.classify.Label.<init>(Classify.scala:45)
    at cc.factorie.app.classify.Features$class.$init$(Classify.scala:55)
    at cc.factorie.app.classify.BinaryFeatures.<init>(Classify.scala:58)
    at cc.factorie.app.classify.Classify$$anonfun$main$3$$anonfun$apply$2.apply(Classify.scala:229)
    at cc.factorie.app.classify.Classify$$anonfun$main$3$$anonfun$apply$2.apply(Classify.scala:224)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:778)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:777)
    at cc.factorie.app.classify.Classify$$anonfun$main$3.apply(Classify.scala:224)
    at cc.factorie.app.classify.Classify$$anonfun$main$3.apply(Classify.scala:220)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at cc.factorie.app.classify.Classify$.main(Classify.scala:220)
    at cc.factorie.app.classify.Classify.main(Classify.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:483)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)


I would appreciate any ideas.

Thanks,
Ali

John Sullivan

Jan 10, 2015, 5:59:07 PM
to dis...@factorie.cs.umass.edu
Ali,

I'm not too familiar with the command-line interface, but it looks like the issue has to do with the labels of your data. Specifically, when reading in data, the `--read-text-dirs` option takes a comma-separated list of directories, and every document in a directory is labeled with that directory's name. That means that when you give 'news' as the test dir, the classifier treats it as a collection of documents with the class label 'news'. Since none of your training data had that class, the exception you see is thrown. Try sorting the test documents into directories in the same way you've sorted your training documents.
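
To make the failure mode concrete, here is a toy Scala sketch. It is not Factorie's implementation: `ToyLabelDomain`, `index`, and `DirectoryLabelSketch` are made-up names, and the sketch only assumes that the set of known labels is fixed to whatever the training directories were called. Under that assumption, a directory name never seen in training has nowhere to map to.

```scala
// Toy model only; these are NOT Factorie's classes.
// Assumption: the label domain is fixed to the training directory names.
final class ToyLabelDomain(trainedLabels: Seq[String]) {
  private val categories: Vector[String] = trainedLabels.distinct.toVector

  // Returns the index of a known category, or fails for an unseen one.
  def index(category: String): Int = {
    val i = categories.indexOf(category)
    if (i < 0)
      throw new IllegalArgumentException(s"Initial category not in domain: $category")
    i
  }
}

object DirectoryLabelSketch {
  def main(args: Array[String]): Unit = {
    // Labels derived from the training directory names.
    val domain = new ToyLabelDomain(Seq("economy", "sport", "tourism"))

    // A test directory named like a training class resolves to an index.
    println(domain.index("sport"))

    // A test directory named "news" was never seen in training, so lookup fails.
    println(scala.util.Try(domain.index("news")).isFailure)
  }
}
```

The message string is copied from your stack trace purely for illustration; where Factorie actually performs this check is whatever the trace says, not this sketch.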

Let me know if that helps or you still have questions.

Thanks,
Jack

--
--
Factorie Discuss group.
To post, email: dis...@factorie.cs.umass.edu
To unsubscribe, email: discuss+u...@factorie.cs.umass.edu


Ali Shirvani

Jan 10, 2015, 11:09:11 PM
to dis...@factorie.cs.umass.edu
Hi John,

Thanks for your reply. I also tested the method you described, but unfortunately the classifier assigns every new document to the class named after the folder.
For example, if I rename `news` to `sport` and then run the classifier, it assigns the label `sport` to all new documents. Likewise, if I rename `news` to `tourism`, all assigned labels are `tourism`.

I have just started working with `DocumentClassifier1.scala` in the `tutorial` package. It works fine.
But I have some questions about the labels of `testVariables`.
How should I set the labels of `testVariables` when I am seeing the documents for the first time and don't know their labels?
Also, would you please explain the difference between `CategoricalVariable` and `LabeledCategoricalVariable`? I couldn't find any good resource on it.

Thank you again,
Ali



Emma Strubell

Jan 11, 2015, 5:33:56 PM
to dis...@factorie.cs.umass.edu
Hi Ali,

You're correct that `CategoricalVariable` is the type of variable to use if you don't know the labels, and `LabeledCategoricalVariable` is for labeled data where you know a "target" label value. You don't even have to make a Label variable for scoring; Label variables are just convenient containers for evaluation and downstream processing.

For example, given a File "testfile" containing the document you wish to classify, you could just do the following to get the string value of the label assigned to that document by the classifier in the DocumentClassifier1 example:

val testDoc = new Document(testfile)
val assignedLabel = LabelDomain.category(classifier.classification(testDoc.value).bestLabelIndex)
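
In case the lookup above is unclear, here is a rough, self-contained sketch of the idea using plain Scala collections rather than Factorie. Everything in it is a stand-in: `BestLabelSketch`, the `categories` vector, and the `scores` values are invented for illustration, and it only assumes that the classifier produces one score per class and that the best label is the highest-scoring one. Note that no gold label is needed anywhere.

```scala
object BestLabelSketch {
  // Index of the highest score; assumes `scores` is non-empty.
  // Ties keep the earliest index.
  def bestLabelIndex(scores: IndexedSeq[Double]): Int =
    scores.indices.reduceLeft((best, i) => if (scores(i) > scores(best)) i else best)

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the label domain built during training.
    val categories = Vector("economy", "sport", "tourism")

    // Hypothetical per-class scores for one unlabeled document.
    val scores = Vector(0.2, 1.7, -0.4)

    // Map the winning index back to its category string.
    val assignedLabel = categories(bestLabelIndex(scores))
    println(assignedLabel)
  }
}
```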

Hope this helps,

Emma

Ali Shirvani

Jan 12, 2015, 12:16:00 AM
to dis...@factorie.cs.umass.edu
Hi Emma,

Thanks for your reply and helpful comment.

I created a separate `test` directory that contains all the test documents.
I also modified `cc.factorie.tutorial.DocumentClassifier1`: instead of using `testVariables`, I used `testDocs` directly, as you suggested.
But unfortunately every assigned label is `test`.


Here are the modifications:
// val (trainVariables, testVariables) = docLabels.shuffle.split(0.5)
// (trainVariables ++ testVariables).foreach(_.setRandomly)
val trainVariables = docLabels
trainVariables.foreach(_.setRandomly)

for (test <- testDocs) {
  val label = LabelDomain.category(classifier.classification(test.value).bestLabelIndex)
  println(LabelDomain) // output label is: test !!!!!!
}

I couldn't figure out why all assigned labels are `test`.
Please kindly let me know your ideas.

Thanks,
Ali