Hi Raihan,
Regarding your first question, segment-based PLDA training just means that each segment (natural utterance) is treated as a unique speaker. That is, if someone speaks a couple of sentences, pauses for a few seconds, then says another sentence, we would treat the two or three ivectors from the first stretch as one speaker, and the one or two from the second stretch as a different speaker.
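To make that concrete, here's a toy sketch (not Kaldi code; the segment names, dimensions, and counts are all made up) of how you'd turn per-segment ivectors into pseudo-speaker labels for PLDA training:

```python
import numpy as np

# Hypothetical example: each "segment" is a pause-delimited stretch of speech.
# For segment-based PLDA training we pretend each segment is its own speaker,
# so every ivector extracted from one segment shares a pseudo-speaker label.
segments = {
    "utt1-seg1": [np.random.randn(100) for _ in range(3)],  # 3 ivectors -> pseudo-speaker 0
    "utt1-seg2": [np.random.randn(100) for _ in range(2)],  # 2 ivectors -> pseudo-speaker 1
}

ivectors, labels = [], []
for speaker_id, (seg, vecs) in enumerate(segments.items()):
    for v in vecs:
        ivectors.append(v)
        labels.append(speaker_id)  # one pseudo-speaker label per segment

X = np.stack(ivectors)   # (5, 100) training matrix
y = np.array(labels)     # pseudo-speaker labels: [0, 0, 0, 1, 1]
```

The point is just the labeling scheme: no real speaker identities are needed, because each segment plays the role of a speaker.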
For your second question, you would want to use the pre-trained model. That said, unless you have a particular reason to use ivectors, I'd strongly recommend xvectors, which are neural-network-based embeddings, in contrast to the statistical ivectors. They can more or less be used interchangeably, but the performance of xvectors greatly exceeds that of ivectors, and we have pre-trained models available on the Kaldi website. I personally would not consider using anything besides a pre-trained xvector model for any speaker ID or diarization task.
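Either kind of embedding gets scored the same way downstream. As a rough illustration (the embeddings here are random stand-ins; in practice you'd extract them with a pre-trained xvector model, and you'd normally score with PLDA rather than plain cosine):

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings (xvectors or ivectors)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

np.random.seed(0)
enroll = np.random.randn(512)                 # enrollment embedding (made up)
same = enroll + 0.1 * np.random.randn(512)    # slightly perturbed copy -> high score
diff = np.random.randn(512)                   # unrelated embedding -> low score

# cosine_score(enroll, same) will be near 1.0, and well above
# cosine_score(enroll, diff) for these synthetic vectors.
```

The embedding extractor is where xvectors win; the scoring back end (cosine or PLDA) is largely interchangeable between the two.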
For question 3, I'm not exactly sure how to replicate that functionality. I believe you would need to modify the code to concatenate feature vectors. I think the xvector extractor code actually does something along those lines automatically. Of course, if you download an xvector model, it will already include a good PLDA as well.
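If it helps, the kind of concatenation I mean is frame splicing: stacking each frame with its neighbors. Here's a minimal sketch (the function name, context width, and feature sizes are my own, not Kaldi's):

```python
import numpy as np

def splice(feats, context=2):
    """Concatenate each frame with its +/- `context` neighbors,
    repeating the first/last frame to pad at the edges."""
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], context, axis=0),
                             feats,
                             np.repeat(feats[-1:], context, axis=0)])
    # each output row is (2*context + 1) consecutive frames flattened together
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)])

mfcc = np.random.randn(10, 23)   # 10 frames of 23-dim features (made-up sizes)
spliced = splice(mfcc)           # shape (10, 115): 5 frames x 23 dims each
```

The xvector network's early layers effectively do this kind of context stacking internally, which is why you may not need to change anything yourself.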
Ultimately, as far as I know, for any speaker-ID-type application the best thing to do is give your model as much data as possible, which is why using a pre-trained model is desirable unless you have access to a tremendous amount of data yourself. You can usually squeeze a bit of extra performance out of a PLDA model with some kind of in-domain adaptation, but that is unlikely to compare to simply throwing more data at it.
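By in-domain adaptation I mean things as simple as re-centering your embeddings with statistics from unlabeled in-domain data before scoring. This is just a sketch of that idea with toy numbers, not Kaldi's actual adaptation code:

```python
import numpy as np

def adapt_center(embeddings, indomain):
    """Shift embeddings by the in-domain mean so they match the
    (roughly zero-mean) data the PLDA model was trained on."""
    mu = indomain.mean(axis=0)   # estimated from unlabeled in-domain data
    return embeddings - mu

rng = np.random.default_rng(0)
indomain = rng.normal(loc=3.0, size=(200, 8))  # toy in-domain set with a mean offset
test = rng.normal(loc=3.0, size=(5, 8))        # test embeddings from the same domain
adapted = adapt_center(test, indomain)         # now roughly zero-mean
```

Re-centering (and similar tricks like re-whitening) helps at the margin, but it only corrects a domain shift; it doesn't add the discriminative information that more training data would.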
—Matt