I'm not sure what you mean by "target model".
Merging multiple samples from the same person to cover as much of their voice characteristics as possible is a good idea.
Once you have covered the different phonemes that could plausibly show up in their speech, though, further merging won't improve speaker recognition much.
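
To make that concrete, here is a minimal sketch of enrollment by averaging per-sample embeddings. The names (`embed`, `enroll`, `verify`) and the toy spectral embedding are my own inventions for illustration; real systems use MFCC-based i-vectors or neural x-vectors instead:

```python
import numpy as np

def embed(signal, n_bands=32):
    """Toy stand-in for a real speaker embedding: log energy per
    frequency band of the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([b.mean() for b in bands]))

def enroll(recordings):
    """Average the embeddings of several samples into one speaker model.
    Phonetically diverse samples widen coverage, but the average stops
    changing much once the phonemes are already covered."""
    model = np.mean([embed(r) for r in recordings], axis=0)
    return model / np.linalg.norm(model)

def verify(model, recording, threshold=0.7):
    """Accept the speaker if cosine similarity to the model is high enough."""
    e = embed(recording)
    e /= np.linalg.norm(e)
    return float(np.dot(model, e)) >= threshold

# Usage with stand-in recordings (random noise, just to show the flow):
rng = np.random.default_rng(0)
samples = [rng.standard_normal(16000) for _ in range(3)]
model = enroll(samples)
print(verify(model, samples[0]))
```

The point of the averaging is exactly the saturation described above: each new sample nudges the mean embedding less and less once the existing samples already span the speaker's phoneme inventory.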
The variability of the human voice is so large that I doubt any system could cope with extreme variations (yelling vs. a soft tone).
A slight change should be fine for most systems.
The difficulty lies in recording the user under similar conditions:
- distance to the microphone (impacts the signal-to-noise ratio; see the sketch after this list)
- surrounding noise (outdoor noise is very hard: it's not just a steady background that you could somehow discard, someone else might be passing by and speaking at the same time)
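
As a rough sketch of the first point, signal-to-noise ratio can be estimated from a speech segment plus a noise-only segment captured under the same conditions (the function name here is my own, not from any particular toolkit):

```python
import numpy as np

def snr_db(speech, noise):
    """Estimate the signal-to-noise ratio in dB from a speech segment
    and a noise-only segment recorded under the same conditions."""
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)
```

In free-field conditions, doubling the distance to the microphone roughly quarters the direct speech power (inverse-square law) while the ambient noise power stays about the same, so the SNR drops by around 6 dB.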
The human brain is very good at filtering out irrelevant information (we don't really notice the noise), but teaching that to a computer is a whole different story...