If the relative positions of the cameras in the array are fixed, why does it matter whether the camera rig is hand held or being rotated? Yes, you need to estimate only the rotation of the whole rig when it is being rotated, and rotation + translation when it is hand held.
This is straightforward to implement without requiring anything fancy.
Let's call the pose (position and orientation) of the center of the array at the i^th frame captured by the rig S_i, and the relative transformation from the center of the rig to the k^th camera P_k. If you know P_k, it is held fixed; otherwise it is a variable in the optimization.
So the actual pose of the k^th camera in the i^th frame is the composition P_k S_i.
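To make the composition concrete, here is a minimal sketch in plain Python. It assumes each transformation is a rigid transform stored as a (rotation matrix, translation) pair, that S_i maps world coordinates into the rig frame, and that P_k maps the rig frame into camera k's frame. The post does not fix these conventions, and the helper names (rot_z, apply, compose) are made up for illustration.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z axis (illustrative helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(T, x):
    """Apply the rigid transform T = (R, t) to a 3D point: R x + t."""
    R, t = T
    return [sum(R[r][c] * x[c] for c in range(3)) + t[r] for r in range(3)]

def compose(A, B):
    """Return A B, the transform that applies B first and then A."""
    RA, _ = A
    RB, tB = B
    R = [[sum(RA[r][k] * RB[k][c] for k in range(3)) for c in range(3)]
         for r in range(3)]
    # A(B(x)) = RA (RB x + tB) + tA, so the new translation is A applied to tB.
    return (R, apply(A, tB))

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Assumed example values: the rig is rotated 90 degrees and shifted in frame i,
# and camera k sits at a small fixed offset from the rig center.
S_i = (rot_z(math.pi / 2), [1.0, 0.0, 0.0])
P_k = (IDENTITY, [0.0, 0.1, 0.0])

X = [1.0, 0.0, 0.0]                      # a 3D point in world coordinates
x_cam = apply(compose(P_k, S_i), X)      # the same point in camera k's frame
```

Composing first and then applying gives the same result as applying S_i and then P_k one after the other, which is what "P_k S_i" means here.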
So your cost functor should take these two transformations and the 3D position of the point as inputs, compute the projection of the point into the image, and compare it to your observed projection.
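A rough sketch of what such a cost functor computes, again in plain Python rather than in any particular solver's API. It assumes a simple pinhole camera with a single focal length and no distortion, transforms stored as (rotation matrix, translation) pairs, and the convention that S_i maps world to rig and P_k maps rig to camera. The function names and the focal parameter are hypothetical.

```python
def apply(T, x):
    """Apply the rigid transform T = (R, t) to a 3D point: R x + t."""
    R, t = T
    return [sum(R[r][c] * x[c] for c in range(3)) + t[r] for r in range(3)]

def reprojection_residual(P_k, S_i, X, observed, focal=1.0):
    """Residual for one observation of point X by camera k in frame i.

    P_k, S_i : rigid transforms as (R, t) pairs (assumed convention)
    X        : 3D point in world coordinates
    observed : measured (u, v) image coordinates
    focal    : assumed pinhole focal length, no distortion
    """
    x_cam = apply(P_k, apply(S_i, X))    # world -> rig -> camera k
    u = focal * x_cam[0] / x_cam[2]      # perspective division
    v = focal * x_cam[1] / x_cam[2]
    return [u - observed[0], v - observed[1]]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
NO_MOTION = (IDENTITY, [0.0, 0.0, 0.0])

# With identity transforms, the point (1, 2, 4) projects to (0.5, 1.0)
# for focal = 2, so a matching observation gives a zero residual.
r = reprojection_residual(NO_MOTION, NO_MOTION, [1.0, 2.0, 4.0],
                          observed=(0.5, 1.0), focal=2.0)
```

In a real problem you would have one such residual per observation, with S_i shared by all cameras in frame i and P_k shared by all frames for camera k.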
And you have complete flexibility to hold the P_k constant if you know them, and to hold part of S_i (the translation) constant if the rig is only rotating. If you do not know P_k precisely, you can add another term to the optimization problem which penalizes its deviation from your best estimate for it.
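The penalty term can be as simple as a weighted difference between the current P_k and your estimate. The sketch below assumes P_k is parameterized as a 6-vector (3 rotation parameters such as angle-axis, followed by 3 translation components) and subtracts component-wise, which is only a reasonable approximation for rotations when the deviation is small. The function name, the parameterization, and the weight are illustrative assumptions.

```python
def prior_residual(params, estimate, weight):
    """Residual penalizing deviation of P_k from a best estimate.

    params   : current 6-vector for P_k (assumed: 3 rotation + 3 translation)
    estimate : your best estimate of P_k, same layout
    weight   : how strongly you trust the estimate (larger = stiffer)
    """
    return [weight * (p - e) for p, e in zip(params, estimate)]

# Assumed example: the current P_k differs from the estimate by 0.1 in the
# first rotation parameter and by 0.5 in the last translation component.
current = [0.1, 0.0, 0.0, 1.0, 2.0, 3.0]
best_guess = [0.0, 0.0, 0.0, 1.0, 2.0, 2.5]
prior = prior_residual(current, best_guess, weight=2.0)

# The squared norm of this residual is what gets added to the total cost
# alongside the reprojection terms.
penalty = sum(v * v for v in prior)
```

A very large weight effectively pins P_k to the estimate, while a weight of zero leaves it free, so this sits between the "known and constant" and "unknown and variable" cases.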
Hope this helps,
Sameer