Hi, I have been researching this area for the past few months (for my master's degree), and based on all the experiments I've tried, my advice is to try models that are good at "alignment".
In some experiments, I used an ASR model to recognise the non-native speakers' phoneme sequences, then used a toolkit extracted from Festival TTS to get the "correct" phoneme sequence of the English sentence, and finally compared the two sequences directly to find the errors. The problem is that the phoneme recognition result is usually poor, and there is a trade-off between the ASR task and the error-detection task: if the ASR model does well, it learns to "correct the errors" itself, which means that even when a non-native speaker is not pronouncing a word accurately, the model sometimes still outputs the correct phoneme. Such situations make the direct phoneme-sequence comparison unreliable.
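The direct-comparison step itself is simple: align the recognised sequence against the canonical one and report the mismatches. A minimal sketch using Python's standard `difflib` (the phoneme symbols and function name here are illustrative, not from my actual code):

```python
import difflib

def diff_phonemes(recognized, reference):
    """Align two phoneme sequences and report substitutions,
    insertions, and deletions relative to the reference."""
    sm = difflib.SequenceMatcher(a=reference, b=recognized, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            errors.append((tag, reference[i1:i2], recognized[j1:j2]))
    return errors

# e.g. a learner saying "sink" for "think": TH -> S substitution
ref = ["TH", "IH", "NG", "K"]
hyp = ["S", "IH", "NG", "K"]
print(diff_phonemes(hyp, ref))  # [('replace', ['TH'], ['S'])]
```

Of course, this whole pipeline is only as good as the recognised sequence fed into it, which is exactly the weakness described above.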
Then I tried the traditional GOP (goodness of pronunciation) methods; I strongly recommend you read the relevant papers.
Usually you need a model to do forced alignment first, and then compute a GOP score for each phoneme. The simplest approach is to set a threshold: if the GOP is higher than the threshold, the phoneme is wrong; otherwise it's correct. Some papers use GOPs as phoneme features and train an error-classification model (if you have enough labelled data), but I don't think this is a good idea, because it's hard to label the data (try listening to a non-native speaker's audio and labelling which phonemes are right or wrong, and you'll see how hard it is), and different people have different ideas about what counts as a good or bad pronunciation.
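To make the thresholding concrete, here is a sketch of the classic Witt & Young style GOP: the duration-normalised gap between the log-likelihood of the canonical phone and that of the best competing phone over the frames that forced alignment assigned to it (a large gap means the model preferred some other phone, i.e. a likely mispronunciation). The frame scores, phone labels, and threshold value below are all made up for illustration:

```python
def gop(frame_logprobs, phone, threshold=1.0):
    """frame_logprobs: one dict per aligned frame, mapping each
    candidate phone to its frame log-likelihood under the acoustic
    model. Returns (gop_score, is_mispronounced)."""
    nf = len(frame_logprobs)
    # log-likelihood of the canonical phone over its aligned frames
    ll_target = sum(f[phone] for f in frame_logprobs)
    # log-likelihood of the best-scoring phone at each frame
    ll_best = sum(max(f.values()) for f in frame_logprobs)
    score = abs(ll_target - ll_best) / nf  # normalise by duration
    return score, score > threshold       # big gap => flag as wrong

# two aligned frames where "TH" is clearly beaten by "S"
frames = [{"TH": -5.0, "S": -1.0}, {"TH": -4.0, "S": -1.0}]
print(gop(frames, "TH"))  # (3.5, True)
```

Note there are several GOP variants in the literature (likelihood-ratio based, posterior based); this is just the simplest form to show why a single threshold can work at all.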
As for the dataset, I mixed non-native and native (LibriSpeech) public English datasets into one giant combined corpus to train my model.
If you want to try the GOP-based methods, then you should focus on models that are good at forced alignment. Some state-of-the-art models, like LSTM/CTC, are not designed for alignment. I tried a CTC model once: it had a very low WER on the LibriSpeech test_clean set (around 4%), but the alignment results on the same data set were very bad (now I know this is partly because LSTMs have a time-delay problem). I'm using a TDNN model now, and it works well (there are many TDNN recipes in Kaldi; maybe you can take a look at egs/fisher_english/s5/local/chain/run_tdnn.sh).
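Once you have a Kaldi alignment, you typically export it as a phone-level CTM file ("utterance channel start duration phone", times in seconds) and map each phone back to frame indices before scoring it. A rough sketch, assuming the usual 10 ms frame shift (the example CTM lines are fabricated):

```python
def parse_ctm(lines, frame_shift=0.01):
    """Turn Kaldi-style phone CTM lines into
    (phone, start_frame, end_frame) spans."""
    spans = []
    for line in lines:
        utt, chan, start, dur, phone = line.split()
        s = round(float(start) / frame_shift)
        e = s + round(float(dur) / frame_shift)
        spans.append((phone, s, e))
    return spans

ctm = ["utt1 1 0.00 0.12 TH", "utt1 1 0.12 0.08 IH"]
print(parse_ctm(ctm))  # [('TH', 0, 12), ('IH', 12, 20)]
```

Those per-phone frame spans are exactly what the GOP computation above consumes.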
Pronunciation error detection is hard. I tried my best, but I think the final performance is still faaaaar from satisfying... Anyway, I believe you'll learn a lot from your project.
Hope it helps.
Shin