Kaldi is an advanced LVCSR (large-vocabulary continuous speech recognition) system based on WFSTs. While you could do what you want in Kaldi, it's really not worth the effort IMO. Why not simply use something like scikit-learn or Keras?
If you really want to use Kaldi for some reason, then you should treat phonemes as words. Make a lexicon that maps each phoneme to itself, and make a grammar/LM that simply passes the phonemes through as they are. If you feel like it, you can even train a bigram LM on the phoneme sequences using SRILM or something similar.
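As a rough illustration of the "phonemes as words" idea, here is a minimal Python sketch that builds the identity lexicon lines (in the usual `word phone1 phone2 ...` layout of a Kaldi `lexicon.txt`) and counts phoneme bigrams from transcripts. The function names and the sample transcripts are made up for illustration, and the bigram counting is only a stand-in for what a real toolkit like SRILM would do (no smoothing, no ARPA output). How you handle silence phones depends on your setup.

```python
from collections import Counter

def identity_lexicon(transcripts):
    """Return lexicon lines that map each phoneme to itself ("aa aa")."""
    phones = sorted({p for utt in transcripts for p in utt.split()})
    return [f"{p} {p}" for p in phones]

def bigram_counts(transcripts):
    """Count phoneme bigrams, padding each utterance with <s> and </s>."""
    counts = Counter()
    for utt in transcripts:
        seq = ["<s>"] + utt.split() + ["</s>"]
        counts.update(zip(seq, seq[1:]))
    return counts

# Hypothetical phoneme transcripts, one utterance per string.
transcripts = ["k ae t", "d ao g", "k ao t"]

for line in identity_lexicon(transcripts):
    print(line)

for (prev, cur), n in sorted(bigram_counts(transcripts).items()):
    print(prev, cur, n)
```

The lexicon lines can be written to the dictionary directory's lexicon file, and the phoneme transcripts themselves serve as the "word-level" text for LM training.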
BTW, this is the same procedure you would use with any LVCSR system (HTK, Julius, Sphinx), but phoneme recognition has little practical use apart from research.