Improper data labelling


Rose Perrone

Dec 4, 2014, 2:25:24 PM
to caffe...@googlegroups.com
While trying to train a few models, I ran into what look like the symptoms of improper data labelling: the loss changes, but the accuracy barely changes, if at all. Here's the output from training one model:

```
I1130 13:40:19.760501 2041447168 solver.cpp:160] Solving CaffeNet
I1130 13:40:19.760577 2041447168 solver.cpp:247] Iteration 0, Testing net (#0)
I1130 14:18:50.135545 2041447168 solver.cpp:298]     Test net output #0: accuracy = 0.343721
I1130 14:18:50.135606 2041447168 solver.cpp:298]     Test net output #1: loss = 0.831318 (* 1 = 0.831318 loss)
...
I1130 19:30:21.624950 2041447168 solver.cpp:247] Iteration 2000, Testing net (#0)
I1130 20:08:46.390735 2041447168 solver.cpp:298]     Test net output #0: accuracy = 0.343721
I1130 20:08:46.390794 2041447168 solver.cpp:298]     Test net output #1: loss = 3.80082 (* 1 = 3.80082 loss)
```

The score predictions are also hugely skewed. I use only two labels (0 and 1), but label 1 nearly always gets a score of about 0.999 when I use the BVLC reference CaffeNet, and label 0 gets a score of 0 for every image when I use the NIN model. The same symptoms occur whether I train from scratch or finetune the models, and the problem remains if I use an IMAGE_DATA layer rather than a DATA layer that depends on LMDB.
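
One possible reading of these symptoms, offered as a sketch rather than a diagnosis: if a network outputs the same class for every input, its accuracy is pinned to that class's share of the test set, while the loss still moves with how confident the outputs are. The snippet below illustrates this with made-up data; the 35/65 label split, the probabilities, and the function name `accuracy_and_loss` are assumptions for illustration, not values taken from the run above.

```python
import math

def accuracy_and_loss(labels, p_one):
    """Score a classifier that outputs P(label=1) = p_one for every input."""
    predicted = 1 if p_one >= 0.5 else 0
    accuracy = sum(1 for y in labels if y == predicted) / len(labels)
    # Mean cross-entropy (log loss) over the whole dataset.
    loss = -sum(math.log(p_one if y == 1 else 1.0 - p_one) for y in labels) / len(labels)
    return accuracy, loss

# Hypothetical test set: 35 positives and 65 negatives.
labels = [1] * 35 + [0] * 65

for p_one in (0.6, 0.9, 0.999):
    acc, loss = accuracy_and_loss(labels, p_one)
    print("P(1)=%.3f  accuracy=%.2f  loss=%.3f" % (p_one, acc, loss))
```

In this toy setup the accuracy is identical for all three confidence levels, while the loss grows as the constant prediction becomes more confident. An accuracy that stays frozen to several decimal places is therefore worth checking against the label proportions of the test set.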

Yet my train.txt and test.txt appear to be formatted correctly:

Here’s my train.txt: 

01eggs.533span_0.jpg 1 
01eggs.533span_1.jpg 1 
01eggs.533span_10.jpg 1 
01eggs.533span_11.jpg 1 
01eggs.533span_12.jpg 1 
… 
n02093056_2211045646_a3df4790b8.jpg 0 
n02093056_298841044_552ffd4061.jpg 0 
n02093428_1851150959_e32c79c88a.jpg 0 
n02093754_161646818_c922da8140.jpg 0 

and my test.txt: 

3417053458_e45d068b20_0.jpg 1 
3422082793_72bdb5b2a2_0.jpg 1 
3422082793_72bdb5b2a2_1.jpg 1 
3422082793_72bdb5b2a2_2.jpg 1 
3423114055_c5294b1832_0.jpg 1 
3423114055_c5294b1832_1.jpg 1 
3423114055_c5294b1832_2.jpg 1 
... 
n02251067_5119056.jpg 0 
n02251067_k10880-1i.jpg 0 
n02251233_5119056.jpg 0 
n02251593_mealybug-1.jpg 0 

I make sure that the data is shuffled during LMDB creation. 
I used absolute filenames when using an IMAGE_DATA layer instead of a DATA layer, but I still had the same results. This train.txt looks like it’s the same format other users use. Maybe there’s something wrong with whitespace characters? Here’s the code I use to generate train.txt and test.txt: 

from os import listdir
from os.path import join

# stage is 'test' or 'train'
positive_dir = join(wnid_dir, 'images', FLAGS.dataset, stage + '-positive')
negative_dir = join(wnid_dir, 'images', FLAGS.dataset, stage + '-negative')
with open(join(wnid_dir, 'images', FLAGS.dataset, stage + '.txt'), 'w') as f:
    # Positive images get label 1, negative images get label 0.
    for name in listdir(positive_dir):
        f.write(name + ' 1\n')
    for name in listdir(negative_dir):
        f.write(name + ' 0\n')

Here’s an explanation of this code: I store images in directories named train-positive, test-positive, train-negative, and test-negative, and then I symlink the images in train-* into train/ and those in test-* into test/.

I’ve attached the script I use to generate the LMDB database from the train.txt, test.txt, train/ and test/. What else can help diagnose the problem?

create_imagenet.sh

Dmitry Ulyanov

Dec 4, 2014, 4:27:21 PM
to caffe...@googlegroups.com
Hello, did you change the number of classes in train_val.proto? I had a similar problem because I forgot to do that.

Dmitry 

Rose Perrone

Dec 4, 2014, 4:30:03 PM
to Dmitry Ulyanov, caffe...@googlegroups.com
Yes, I made sure that the last layer that has a “num_output” field (fc8 in the case of the BVLC reference net) has num_output=2.

Mohamed Omran

Dec 4, 2014, 4:33:44 PM
to Rose Perrone, caffe...@googlegroups.com
As a first step you could try reading out the images and the labels from the lmdb to verify that they're indeed being shuffled and/or being stored correctly.

Here's some python code I have lying around which reads out the first image, and can be extended to read several:

import os
import sys
import lmdb
import numpy as np
import scipy.misc.pilutil

caffe_root = '<INSERT PATH HERE>'
sys.path.append(os.path.join(caffe_root, 'python'))
import caffe.io
from caffe.proto import caffe_pb2

db_path = '<INSERT PATH HERE>'
if not os.path.exists(db_path):
    raise Exception('db not found')

lmdb_env = lmdb.open(db_path)  # equivalent to mdb_env_open()
lmdb_txn = lmdb_env.begin()  # equivalent to mdb_txn_begin()
lmdb_cursor = lmdb_txn.cursor()  # equivalent to mdb_cursor_open()
lmdb_cursor.first()  # equivalent to mdb_cursor_get()
value = lmdb_cursor.value()
key = lmdb_cursor.key()

datum = caffe_pb2.Datum()
datum.ParseFromString(value)
print(key, datum.label)  # show the stored label next to the record's key
image = caffe.io.datum_to_array(datum)  # array of shape (channels, height, width)
image = np.transpose(image, (1, 2, 0))  # reorder to (height, width, channels) for display
scipy.misc.pilutil.imshow(image)
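
To extend this from the first record to every record, one option is to iterate over the cursor and tally the labels. The following is a minimal sketch under stated assumptions, not a tested recipe: it assumes the `lmdb` Python package and Caffe's `caffe_pb2` module are importable, and the helper names `lmdb_records`, `caffe_label`, and `count_labels` are made up for illustration.

```python
from collections import Counter

def lmdb_records(db_path):
    """Yield every (key, value) pair stored in the LMDB at db_path."""
    import lmdb  # imported lazily so count_labels can be used without it
    env = lmdb.open(db_path, readonly=True)
    with env.begin() as txn:
        # Iterating a cursor walks the database in key order.
        for key, value in txn.cursor():
            yield key, value
    env.close()

def caffe_label(value):
    """Parse a serialized Caffe Datum and return its integer label."""
    from caffe.proto import caffe_pb2  # assumes caffe's python dir is on sys.path
    datum = caffe_pb2.Datum()
    datum.ParseFromString(value)
    return datum.label

def count_labels(records, parse_label=caffe_label):
    """Count how many records carry each label."""
    return Counter(parse_label(value) for _key, value in records)

# Example usage (the path is a placeholder):
# print(count_labels(lmdb_records('<INSERT PATH HERE>')))
```

A heavily lopsided count, or a total that is far smaller than the number of lines in train.txt, would point at a problem in how the database was created rather than in the network definition.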




Rose Perrone

Dec 5, 2014, 12:17:26 AM
to caffe...@googlegroups.com, ros...@gmail.com
Thanks so much for your help!! The problem was that some of the filenames in train.txt and test.txt contained spaces. The convert_imageset tool, which creates the LMDB databases, simply stops including images when it hits a filename that contains a space, and it does so silently. Apparently the IMAGE_DATA layer also has this behavior. The following is the relevant code. "infile" is either train.txt or test.txt.

```
  string filename;
  int label;
  while (infile >> filename >> label) {
    lines.push_back(std::make_pair(filename, label));
  }
```
Only negative images contained spaces, so I had been training on about 2226 positive images and 9 negative images. In about a week I'll write a PR to make the failure not silent.
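
The mechanism behind the silent stop: `infile >> filename` reads a single whitespace-delimited token, so for a name that contains a space the next token is the rest of the name rather than an integer. Extracting `label` then fails, the stream enters a failed state, and the `while` loop ends without an error. Until the tool reports this itself, a list file can be checked before conversion. The sketch below is one way to do that; the function names and the sample filenames are hypothetical, and the count it returns only approximates what the C++ loop would keep in unusual cases.

```python
def is_int(text):
    """Return True if text parses as an integer."""
    try:
        int(text)
        return True
    except ValueError:
        return False

def check_list_lines(lines):
    """Return (entries_before_first_bad, bad_lines) for 'filename label' lines.

    A line counts as bad unless it splits on whitespace into exactly
    two fields and the second field is an integer.
    """
    bad_lines = []
    good_before_first_bad = 0
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2 or not is_int(fields[1]):
            bad_lines.append((lineno, line.rstrip('\n')))
        elif not bad_lines:
            good_before_first_bad += 1
    return good_before_first_bad, bad_lines

# Hypothetical file contents; the third name contains a space.
sample = [
    "a.jpg 1\n",
    "b.jpg 1\n",
    "my dog.jpg 0\n",
    "c.jpg 0\n",
]
kept, bad = check_list_lines(sample)
print("entries before first bad line:", kept)
for lineno, text in bad:
    print("line %d: %r" % (lineno, text))
```

For a real file, `check_list_lines(open('train.txt'))` would do the same check line by line. Renaming the offending files, or symlinking them under names without spaces, is one way to make every line parse.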

Mohamed Omran

Dec 5, 2014, 11:17:18 AM
to Rose Perrone, caffe...@googlegroups.com
P.S. I used this with image data where the channel ordering was RGB (as opposed to BGR), which might have to be taken into account when displaying the images.

Wei Guo

Feb 8, 2015, 11:52:40 PM
to caffe...@googlegroups.com, ros...@gmail.com

The code reads only a single image. How can I read all the images stored in an LMDB file?
Is there any documentation on reading from or writing to an LMDB file using Python?

Thanks,

dan sil

May 6, 2015, 10:40:14 AM
to caffe...@googlegroups.com, ros...@gmail.com
I also have the same problem, and I haven't been able to fix it.
Could you upload a sample of the wrong labelling and the corrected one?
Thanks

Antonio Paes

May 6, 2015, 11:07:24 AM
to caffe...@googlegroups.com, ros...@gmail.com
It would help a lot.

Thanks