Different instance sizes greatly influence accuracy


Peter Klügl

Sep 5, 2014, 9:45:03 AM
to dis...@factorie.cs.umass.edu
Hi,

I am trying to reproduce/extend in Factorie some experiments I originally implemented with Mallet/GRMM, and I observe quite large differences in accuracy depending on how an instance is defined. I had this problem with Mallet and GRMM as well, but I wanted to ask in this group for some advice or an explanation for this disconnect.

The task is the usual segmentation of references with a linear-chain CRF. If an instance/sentence corresponds to one reference, I get a lower accuracy than if an instance/sentence corresponds to all references of a reference section (separated by a dummy token with an O label). Both experiments rely on the same data/features; the second dataset is generated from the first one simply by concatenating the instances.
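
Roughly, the second dataset is built from the first like this (a minimal sketch in plain Scala; Token and the feature/label strings are made-up stand-ins, not the actual Factorie types):

object InstanceLayouts {
  case class Token(features: Seq[String], label: String)
  type Reference = Seq[Token]

  // Layout 1: instance/sentence = one reference, i.e. one chain per reference.
  def referenceInstances(sectionRefs: Seq[Reference]): Seq[Seq[Token]] =
    sectionRefs

  // Layout 2: instance/sentence = one section, i.e. all references of the
  // section concatenated, separated by a dummy token with an O label
  // (assumes at least one reference per section).
  def sectionInstance(sectionRefs: Seq[Reference]): Seq[Token] = {
    val separator = Token(Seq("SEP"), "O")
    sectionRefs.reduceLeft((left, right) => (left :+ separator) ++ right)
  }
}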

I can imagine many reasons why the results are not exactly identical, but not a difference as large as an accuracy increase of approx. 3 points, i.e. an error reduction of about 50% (in the logs below, the final test accuracy goes from about 0.934 to 0.969, so the error rate drops from roughly 6.6% to 3.1%).

Any help/explanation is greatly appreciated.

Best,

Peter

Here's some logging info for a linear chain CRF (uses the example in the tutorial package):

Instance/sentence = Reference
Loaded 444 sentences with 16497 words total from ...
Loaded 122 sentences with 4887 words total from ...
Using 31731 observable features.
Iteration 1
Train accuracy = 0.9406558768260896
Test accuracy = 0.8817270308983016
Iteration 2
Train accuracy = 0.9680547978420319
Test accuracy = 0.9183548189073051
Iteration 3
Train accuracy = 0.9949687822028248
Test accuracy = 0.9318600368324125
Iteration 4
Train accuracy = 0.9990301266897011
Test accuracy = 0.9347247800286475
Iteration 5
Train accuracy = 0.9996362975086379
Test accuracy = 0.9349294045426643
Final Test  accuracy = 0.9339062819725803
Finished in 25.379 seconds
MaxBP Test accuracy = 0.9339062819725803
SumBP Test accuracy = 0.9359525271127481
Gibbs Test accuracy = 0.9349294045426643

Instance/sentence = Section
Loaded 17 sentences with 16923 words total from ...
Loaded 4 sentences with 5004 words total from ...
Using 31731 observable features.
Iteration 1
Train accuracy = 0.936676450384063
Test accuracy = 0.927956502038967
Iteration 2
Train accuracy = 0.9916942484231562
Test accuracy = 0.9546896239238786
Iteration 3
Train accuracy = 0.9979391744207831
Test accuracy = 0.9642048028998641
Iteration 4
Train accuracy = 0.9997502029600949
Test accuracy = 0.9676030811055731
Iteration 5
Train accuracy = 1.0
Test accuracy = 0.9680561848663344
Final Test  accuracy = 0.9685092886270956
Finished in 24.962 seconds
MaxBP Test accuracy = 0.9685092886270956
SumBP Test accuracy = 0.9651110104213865
Gibbs Test accuracy = 0.9644313547802447

Emma Strubell

Sep 5, 2014, 1:31:02 PM
to dis...@factorie.cs.umass.edu
Hi Peter,

One guess I have is that the significant difference in accuracy could be due to sorting of examples (or lack thereof). If your data is ordered in any way and you split up your examples into larger chunks, then the model could be learning patterns in the ordering of examples rather than in the actual examples themselves, causing falsely high accuracy.

Something you could try is shuffling your examples (i.e. references?) before chunking them into sections and seeing if you then get an accuracy closer to what you get with the smaller chunks (which I think are implicitly being shuffled by the optimizer). Let me know if this makes sense.
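
Something like the following, just as a sketch (refsPerSection and the seed are made up; the references can be whatever type you already use):

import scala.util.Random

// Sketch only: shuffle the references with a fixed seed, then regroup them
// into section-sized chunks, so the model cannot exploit their original order.
object ShuffleThenChunk {
  def chunk[Ref](references: Seq[Ref], refsPerSection: Int, seed: Long = 42L): Seq[Seq[Ref]] =
    new Random(seed).shuffle(references).grouped(refsPerSection).toSeq
}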

Best,

Emma


Peter Klügl

Sep 8, 2014, 3:58:47 AM
to dis...@factorie.cs.umass.edu
Hi,


On Friday, September 5, 2014 at 7:31:02 PM UTC+2, Emma Strubell wrote:
Hi Peter,

One guess I have is that the significant difference in accuracy could be due to sorting of examples (or lack thereof). If your data is ordered in any way and you split up your examples into larger chunks, then the model could be learning patterns in the ordering of examples rather than in the actual examples themselves, causing falsely high accuracy.


The examples of both datasets have exactly the same ordering. Sorry for my limited knowledge of graphical models, but how can the model learn such patterns? It's a Markov order-1, state-based linear-chain CRF. The classification is mainly driven by the discriminative impact of a feature on a label assignment and by the likelihood of a label transition. Forward-backward gives different results than Viterbi, but in this example it was Gibbs sampling. I don't see how inference during learning could cause such different results. It's still the exact same data, only connected by a dummy token with an O label. Shouldn't that prevent this situation?
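
Just so we are talking about the same thing, this is the scoring I have in mind (a toy sketch in plain Scala, not Factorie code; the unnormalized score of a label sequence is the sum of (feature, label) weights plus (previous label, label) transition weights):

// Toy sketch of an order-1 linear-chain score: emission weights for
// (feature, label) pairs plus transition weights for (previousLabel, label)
// pairs. Probabilities would additionally need a normalizer computed over
// the whole chain.
object ChainScore {
  case class Token(features: Seq[String], label: String)

  def score(chain: Seq[Token],
            emission: Map[(String, String), Double],
            transition: Map[(String, String), Double],
            startLabel: String = "O"): Double =
    chain.zipWithIndex.map { case (tok, i) =>
      val prev = if (i == 0) startLabel else chain(i - 1).label
      tok.features.map(f => emission.getOrElse((f, tok.label), 0.0)).sum +
        transition.getOrElse((prev, tok.label), 0.0)
    }.sum
}

As far as I can tell, the only coupling between two references joined by an O token are the two transitions touching the separator, plus the normalization over the whole chain, which is why I would expect very similar results for both layouts.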

The ordering of the examples is the natural one, as they occur in the real world. If the model learns patterns from it, then I think that is a real improvement, since the test data is not seen during training (it's the first fold of a 5-fold cross-validation). Rather than the good accuracy for complete sections, what concerns me is why the single references do not reach the same accuracy.

 
Something you could try is shuffling your examples (i.e. references?) before chunking them into sections and seeing if you then get an accuracy closer to what you get with the smaller chunks (which I think are implicitly being shuffled by the optimizer). Let me know if this makes sense.

I did some experiments with shuffling the data, but not thoroughly yet. I will try it. However, as I mentioned above, the ordering of the examples is how I receive the data. For actual experiments, I would not like to change it, especially not in order to get worse results.


I was thinking that maybe the implicit start state influences the performance, so I added additional tokens (label O) to the start and end of each single reference. It improved the accuracy, but only by about a third of the difference.
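
Concretely, the padding looks like this (again only a sketch with made-up types):

// Sketch of the padding experiment: prepend and append an O-labelled
// boundary token to every single reference so that reference starts/ends
// look the same in both instance layouts.
object PadReferences {
  case class Token(features: Seq[String], label: String)

  def pad(reference: Seq[Token]): Seq[Token] = {
    val boundary = Token(Seq("BOUNDARY"), "O")
    (boundary +: reference) :+ boundary
  }
}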

The easiest explanation would be that I did something wrong. However, I have now seen this disconnect in three different setups with three different implementations. If it is really true that you can get much better results just by changing the definition of an instance, then, in my opinion, this has a major impact on publications along the lines of "I improve reference segmentation with CRFs by ...".

(If someone is interested, I could set up a reproducible example somewhere)

Best,

Peter

Steve Ash

Jan 20, 2015, 12:14:10 AM
to dis...@factorie.cs.umass.edu
Did you ever figure anything else out about this? I am observing a fairly large delta in accuracy in my task (0.92 vs 0.98), and Factorie also takes twice as long to train compared to Mallet. I haven't dug too deep beyond validating that my features are the same and that I'm using the same optimizer + regularization. One thing in Mallet that I've been curious about (but don't know much about) is the "unsupported feature trick" they talk about in the CRF.java source code. I'm not entirely sure what it does beyond adding features that aren't present in the training data. Maybe adding weights for combinations of the X features against the C classes that weren't observed together? Maybe Factorie doesn't do this yet?
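
To illustrate what I mean (purely my guess at the idea, not Mallet's actual CRF.java code): with only "supported" weights you allocate parameters just for the (feature, label) pairs seen in training, while a dense allocation also covers combinations that were never observed:

// Conceptual sketch of my guess only, not Mallet's implementation.
object WeightAllocation {
  // "Supported" weights: only (feature, label) pairs that actually co-occur
  // in the training data get a parameter.
  def supportedPairs(training: Seq[(Set[String], String)]): Set[(String, String)] =
    training.flatMap { case (feats, label) => feats.map(f => (f, label)) }.toSet

  // Dense weights: every feature gets a parameter for every class, including
  // combinations never observed, so the optimizer can push them negative.
  def densePairs(features: Set[String], labels: Set[String]): Set[(String, String)] =
    for { f <- features; y <- labels } yield (f, y)
}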

Steve

Peter Klügl

Jan 20, 2015, 7:06:49 AM
to dis...@factorie.cs.umass.edu
Just to clarify: the differences can be observed whether the model is implemented with Mallet or with Factorie. It's not a difference between two models where one uses Mallet and the other one is built with Factorie.

I have not investigated the differences in detail, but only tested them in more configurations. In a five-fold cross-validation, the average is roughly the same for both instance sizes, which means that in other folds the accuracy is sometimes considerably reduced. If I had to guess, I would say that the delta is caused by several smaller factors, like the distribution of the data, the initial states, the weight updates per instance, and what Emma mentioned. I have not had time to dig deeper into the problem.

Peter