box3x3(uniform float image[32][32], int x, int y) takes varying x and y parameters and gets its parallelism from those vectors. With an SSE4 target it handles 4 pixels at a time, so you need to organise the outer loops to supply pixels in chunks of 4 (or 8, or 16, depending on your target).
This loop will do the job:
for (uniform int ii = 0; ii < numRows; ii++) { // note uniform counter
    // note varying counter: each program instance starts at its own column
    for (int jj = programIndex; jj < numCols; jj += programCount) {
        // On the first iteration (SSE4) this handles pixels (0,0), (1,0), (2,0), (3,0),
        // i.e. jj is (0,1,2,3) and the uniform ii = 0 is broadcast to the varying int (0,0,0,0).
        mO[ii][jj] = box3x3(mI, jj, ii);
    }
}
If both counters are varying, each step handles only the 4 diagonal points out of the 16 in a 4x4 area: (0,0), (1,1), (2,2), (3,3).