And I wish I could be the one to try it with you!
I'm probably misunderstanding something here, and you should still try this, but here are the problems I see. It sounds like you're trying to activate two different capture modes at once (Multiple Exposure mode and Interval Timer Shooting mode), and I don't see how the two can mix. Setting Multiple Exposure mode to "single photo" only means that you want to create one in-camera composite rather than leaving Multiple Exposure mode on indefinitely. When you enable the Multiple Exposure feature, you must set the "number of shots" to at least 2. That's the whole idea behind the Multiple Exposure capture mode: it needs a minimum of two exposures to blend into a single "multiple exposure" image.
Having said that, if you enable Multiple Exposure mode, you will get an in-camera composite (a double exposure) as a single file that cannot be separated in post-processing. So if anything moves between those two frames, that movement or change in position is baked into that one file and you're stuck with it.
If you insist on using the camera's automated features to do what you're suggesting, the Interval Timer Shooting mode is probably your better option (forgetting Multiple Exposure altogether). But once you have those 9 images, you still have to blend them together somehow, which leads you back to a layer stack in Photoshop and manually blending the images, replacing the people in each frame with parts of the other images. Before activating the Interval Timer mode, set your exposure manually as you normally would and leave the camera in Manual. The camera will then use that exposure consistently across all 9 frames. Also set the focus before activating the intervalometer, then switch the camera to manual focus so there is no chance of the focus changing between frames.
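For what it's worth, here is a rough sketch of one way the blending step could be automated rather than masked by hand: take the per-pixel median across the frames. To be clear, this is a different technique from the manual layer blending described above, and it assumes the frames line up perfectly (tripod, manual exposure, manual focus). The function name median_stack and the tiny grayscale "frames" are made up purely for illustration.

```python
from statistics import median


def median_stack(frames):
    """Return the per-pixel median of equally sized grayscale frames.

    Each frame is a list of rows; each row is a list of brightness values.
    Assumes the frames are already aligned pixel for pixel.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height = len(frames[0])
    width = len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must have the same dimensions")
    # For every pixel position, keep the middle value seen across the frames.
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 1x4 scene: background brightness 200, and a person
# (brightness 50) standing in a different spot in each of three frames.
frames = [
    [[50, 200, 200, 200]],
    [[200, 50, 200, 200]],
    [[200, 200, 50, 200]],
]
clean = median_stack(frames)
print(clean)
```

The catch is that the median only drops a person from a given spot if that spot is clear in more than half of the frames. Anyone who lingers in one place for most of the sequence will survive the blend, which is one more reason to time the captures yourself.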
Buddy, if I were trying to do this, I don't think I'd use any automated features; I'd make the captures manually. The post-processing is the most difficult part of this, so I wouldn't leave it to chance by letting the camera capture whatever it sees at a set interval.
Beware of long exposures with this. I understand the idea behind a long exposure 'dissolving' the appearance of a solid person, but there will be ghosting nonetheless. If you have multiple overlapping ghosts across 9 frames, eventually that ghosting will build up into a 'solid mass' and that's going to create a muddy mess when you try to composite the images.
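To put a rough number on that build-up, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is mine, not anything the camera or Photoshop reports: each ghost hides a fixed fraction of the background, and overlapping ghosts stack like semi-transparent layers. The function name combined_opacity and the 20% figure are hypothetical.

```python
def combined_opacity(ghost_opacity, overlapping_frames):
    """Opacity of several overlapping ghosts stacked as semi-transparent layers.

    Simplified model: each ghost lets through (1 - ghost_opacity) of what is
    behind it, so n overlapping ghosts let through (1 - ghost_opacity) ** n.
    """
    if not 0.0 <= ghost_opacity <= 1.0:
        raise ValueError("ghost_opacity must be between 0 and 1")
    if overlapping_frames < 0:
        raise ValueError("overlapping_frames must not be negative")
    return 1.0 - (1.0 - ghost_opacity) ** overlapping_frames


# Hypothetical faint ghost that hides 20% of the background in one frame.
for n in (1, 3, 9):
    print(n, round(combined_opacity(0.2, n), 2))
```

Under those assumptions, a ghost that is barely noticeable in one frame ends up roughly 87% opaque if it overlaps in all 9 frames, which is the 'solid mass' problem in a nutshell.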
I hope your idea above works and I'd love to see what you come up with!