We did the same sort of thing about a year ago with a PTZ security camera.
The basic idea is that you have one client whose job is to control the camera and convert its streaming video into a WebRTC media stream (the messier part).
For simply controlling the camera, the client took data channel messages and converted them into REST API calls that the security camera understood.
The latency of the REST API made interactive control not so great, but for going to preset positions that wouldn't be an issue.
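
To give a rough idea of the control side, here is a minimal sketch of the translation step in Python. It is not our actual code. The JSON message shape (`type`, `id`, `pan`, `tilt`), the base URL, and the `/goto` and `/move` endpoints are all made up for illustration, so you would swap in whatever your camera's REST API really expects. It only builds the request; actually sending it (and wiring it to the data channel's message handler) is left out.

```python
import json
from typing import Tuple
from urllib.parse import urlencode

# Hypothetical base URL; a real camera will have its own address and paths.
CAMERA_BASE_URL = "http://camera.local/api/ptz"


def message_to_request(raw: str) -> Tuple[str, str]:
    """Translate one data channel message (a JSON string) into (HTTP method, URL)."""
    msg = json.loads(raw)
    kind = msg.get("type")

    if kind == "preset":
        # Go to a stored preset position; REST latency matters little here.
        query = {"preset": int(msg["id"])}
        return "POST", f"{CAMERA_BASE_URL}/goto?{urlencode(query)}"

    if kind == "move":
        # Interactive pan/tilt; each message costs one REST round trip.
        query = {"pan": float(msg["pan"]), "tilt": float(msg["tilt"])}
        return "POST", f"{CAMERA_BASE_URL}/move?{urlencode(query)}"

    raise ValueError(f"unsupported message type: {kind!r}")


if __name__ == "__main__":
    print(message_to_request('{"type": "preset", "id": 3}'))
```

On the controlling side you would send something like `{"type": "preset", "id": 3}` over the data channel, and the camera-side client would call `message_to_request` on each incoming message before issuing the HTTP request.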