Here's my understanding so far.
Caffe pools all the data into a foreground window list and a background window list when the network is set up:
I1107 09:37:03.030887 16320 window_data_layer.cpp:157] Number of images: 4171
I1107 09:37:03.030887 16320 window_data_layer.cpp:161] class 0 has 450454 samples
I1107 09:37:03.030887 16320 window_data_layer.cpp:161] class 1 has 4014 samples
I1107 09:37:03.030887 16320 window_data_layer.cpp:165] Amount of context padding: 16
The code below from window_data_layer.cpp then draws each batch from those two pools at a fixed foreground fraction. The draws are random with replacement, so the minority class (class 1 above) is effectively oversampled to fill its share of the batch. So, to answer my own question: Caffe does not rebalance the dataset itself, but it builds balanced batches to avoid bias.
Further, the network above reported 96% accuracy, which, given the skew above, is a strong indicator that the classes are still unbalanced and are dominating the metric.
Hope this helps anyone in the future.
void WindowDataLayer<Dtype>::load_batch(Batch<Dtype>* batch) {
  // At each iteration, sample N windows where N*p are foreground (object)
  // windows and N*(1-p) are background (non-object) windows
  ...
  const int num_fg = static_cast<int>(static_cast<float>(batch_size)
      * fg_fraction);
  const int num_samples[2] = { batch_size - num_fg, num_fg };

  int item_id = 0;
  CHECK_GT(fg_windows_.size(), 0);
  CHECK_GT(bg_windows_.size(), 0);

  // sample from bg set then fg set
  for (int is_fg = 0; is_fg < 2; ++is_fg) {
    for (int dummy = 0; dummy < num_samples[is_fg]; ++dummy) {
      // sample a window
      timer.Start();
      const unsigned int rand_index = PrefetchRand();
      vector<float> window = (is_fg) ?
          fg_windows_[rand_index % fg_windows_.size()] :
          bg_windows_[rand_index % bg_windows_.size()];
      bool do_mirror = mirror && PrefetchRand() % 2;