If we are talking strictly about the authentication stage (where an initially unknown utterance is compared to the voice prints), 2-5 seconds should be acceptable, though I would aim for the upper end of that range.
However, duration alone doesn't capture what makes a good utterance. You should also consider how much of the recording is actually speech (pauses between sentences and how quickly one speaks both matter) as well as the content (a proper sentence covering lots of phonemes is a lot better than saying the same word 3 times).
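To get a rough feel for how much of a clip is actually speech, something like the sketch below can help. It is only a crude energy threshold, not the VAD that Recognito or ALIZE use internally; the function name `speech_seconds`, the 20 ms frame size, and the 0.01 RMS threshold are all my own assumptions, and samples are assumed to be floats in the range -1.0 to 1.0.

```python
import math

def speech_seconds(samples, sample_rate, frame_ms=20, rms_threshold=0.01):
    """Estimate seconds of voiced audio with a crude per-frame energy threshold.

    Assumptions (not taken from any library): samples are floats in [-1.0, 1.0],
    and a frame counts as speech when its RMS is at or above rms_threshold.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame_ms is too small for this sample rate")
    voiced_frames = 0
    # Only full frames are considered; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= rms_threshold:
            voiced_frames += 1
    return voiced_frames * frame_len / sample_rate

# Synthetic check: 1 s of a 440 Hz tone followed by 1 s of silence at 8 kHz.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
clip = tone + [0.0] * rate
print(speech_seconds(clip, rate))
```

On real recordings the threshold would need tuning against the background noise level, so treat the number it returns as an estimate only.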
If you are talking about the enrollment stage, you might need more data. You mention 4 recordings of 2-5 s each for every user, so your total recording time could be anywhere between 8 and 20 seconds (and that is before you perform VAD, i.e. voice activity detection, which removes the non-speech portions). I would recommend narrowing that range a bit, aiming for at least 10-15 s perhaps.
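The arithmetic behind that range, plus a simple check against the lower end of my suggested target, can be sketched as follows. The function names and the example durations are made up for illustration, and the 10 s default is just the rough figure from above, not a hard requirement of any tool.

```python
def enrollment_range(n_recordings, min_s, max_s):
    """Return (shortest, longest) possible total recording time in seconds."""
    return n_recordings * min_s, n_recordings * max_s

def enough_enrollment_speech(net_speech_durations, target_s=10.0):
    """True when the speech left after VAD adds up to at least target_s seconds.

    target_s is an assumed rule of thumb, not a documented threshold.
    """
    return sum(net_speech_durations) >= target_s

# 4 recordings of 2-5 s each, before any VAD.
print(enrollment_range(4, 2, 5))  # (8, 20)

# Hypothetical per-recording speech durations after VAD, in seconds.
print(enough_enrollment_speech([1.6, 2.1, 3.0, 2.4]))  # False
print(enough_enrollment_speech([3.2, 3.5, 4.1, 2.9]))  # True
```

The point is that four recordings near the 2 s end can easily fall under 10 s of usable speech once pauses are removed.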
I am relatively new to Speaker Recognition, so take my advice with a grain of salt. My estimates primarily come from experience using Recognito as well as other tools (such as ALIZE-LIA_RAL).