Hey Márton,
First, don't worry, you are in the right place :-)
I'm not a speaker recognition specialist and I've never worked with language detection before, so I don't have much to say (where are the speaker specialists on this list :-P ??).
About the amount of data needed to train the background models (UBM, TV matrix, etc.), there is no precise answer.
It depends on the conditions in which you want to operate your language detection system.
Roughly speaking, you should have a good number of utterances from many different speakers.
Regarding your question about i-vectors, I suggest you have a look at this paper (Front-end factor analysis for speaker verification).
In short, from step 4 onward you are dealing with the 'i-vectors' (which are 1-D feature vectors).
So the whitening, LDA, WCCN, etc. are all done with the 'i-vectors' from the training set ('train_world.lst' in your example).
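In case it helps, here is a rough NumPy sketch of that post-processing chain (whitening -> LDA -> WCCN) applied to a training set of i-vectors. All the names and the synthetic data are just illustrative, not from any particular toolkit, and the labels could be languages or speakers depending on your task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training set": 200 i-vectors of dimension 20, 10 classes
X = rng.normal(size=(200, 20))
y = np.repeat(np.arange(10), 20)

# 1) Whitening: remove the training mean and decorrelate with the
#    training covariance (C^{-1/2} via eigendecomposition)
mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
Xw = (X - mean) @ W

# 2) LDA: project onto the directions that best separate the classes
#    (classic between/within scatter-matrix formulation)
classes = np.unique(y)
mu = Xw.mean(axis=0)
dim = Xw.shape[1]
Sw = np.zeros((dim, dim))  # within-class scatter
Sb = np.zeros((dim, dim))  # between-class scatter
for c in classes:
    Xc = Xw[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    Sb += len(Xc) * np.outer(mc - mu, mc - mu)
vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(-vals.real)
A = vecs[:, order[: len(classes) - 1]].real  # at most n_classes - 1 dims
Xl = Xw @ A

# 3) WCCN: normalize by the average within-class covariance
#    (Cholesky factor of its inverse)
Ww = np.zeros((Xl.shape[1], Xl.shape[1]))
for c in classes:
    Xc = Xl[y == c]
    Ww += np.cov(Xc - Xc.mean(axis=0), rowvar=False)
Ww /= len(classes)
B = np.linalg.cholesky(np.linalg.inv(Ww))
Xfinal = Xl @ B

# Length-normalization before scoring is also common practice
Xfinal /= np.linalg.norm(Xfinal, axis=1, keepdims=True)
```

The key point is that the mean, whitening matrix, LDA projection, and WCCN matrix are all estimated on the training set only, and then re-applied as-is to enrollment and test i-vectors.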
Cheers
Tiago