Kaldi is a toolkit that allows you to train acoustic models for ASR; it
mostly relies on external tools for language modeling (with some recent
additions to the toolkit related to RNN language models). These are all
low-level tools (scripts and C++ binaries) that you need to get familiar
with if you want to use the toolkit. Kaldi also provides different
decoders that you can use for evaluation or as a starting point for
some sort of production ASR system.
You will have to get your hands dirty, so to speak, and read and learn a
lot; don't expect a quick fix where you can grab a few things and
have a working system in a couple of days. If you don't have a solid
background in machine learning, Unix, shell scripting/Python, C++, and
signal processing (listed in order of importance), it will be an uphill
battle.
Here is what you could do:
1) Review the Kaldi documentation
2) Read the HTK Book from Cambridge University to get a general idea of how HMM-based ASR systems work
3) Read "Weighted Finite-State Transducers in Speech Recognition" (Mohri, Pereira, and Riley)
4) Read about language modeling (n-grams, RNNs, etc.); Google "A Bit of Progress in Language Modeling" (Goodman)
5) Kaldi comes with a number of recipes (in the egs directory), i.e. scripts
that take care of acoustic model training. Run a few sample recipes
(Kaldi for Dummies, mini_librispeech), and follow the scripts, their
outputs, log files, etc. to gain a better understanding of what the
different tools do. There is quite a bit of overlap between egs (and the
scripts are structured so that code is not duplicated); the parts
that deal with external data, i.e. converting data to the format Kaldi
expects, are specific to the data source. Those scripts are
typically in a subdirectory called local of the specific recipe. You will
want to focus on nnet3 type models.
6) When you get stuck, go back to the documentation, search the forum for
answers, and then if you are still stuck ask specific questions on the
forum. When asking questions, provide enough information so that
whoever is trying to help you has the full context (what you did, what
the expected outcome was, what the error was, log files, any details about
mods you made, etc.). It's unlikely somebody on the forum will hold your
hand through all the steps to train/test your models, especially if your
questions make it clear that you did not get familiar with the toolkit
from the resources already available.
I think it's okay to go through step 5 without covering some of the
previous steps (e.g. skipping 3 and 4). But if you are planning to do
anything serious with the toolkit, you should review those papers as
well.
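
To make the data preparation part of step 5 a bit more concrete, here is a
minimal Python sketch of the kind of output a recipe's local/ data prep
script has to produce. The file names wav.scp, text and utt2spk are the ones
Kaldi data directories use; everything else (the function name, the input
tuples, the paths, the Spanish sample transcripts) is made up for
illustration. Real recipes do more than this, so treat it as a rough
sketch, not as how any particular recipe does it.

```python
import os
import tempfile


def write_kaldi_data_dir(utterances, out_dir):
    """Write wav.scp, text and utt2spk for a Kaldi-style data directory.

    `utterances` is a hypothetical input format chosen for this sketch:
    an iterable of (speaker_id, utt_name, wav_path, transcript) tuples.
    Returns the sorted list of utterance ids that were written.
    """
    os.makedirs(out_dir, exist_ok=True)

    rows = []
    for spk, name, wav, transcript in utterances:
        # Prefixing the utterance id with the speaker id keeps the
        # utterance order and the speaker order consistent after sorting.
        utt_id = f"{spk}-{name}"
        rows.append((utt_id, spk, wav, transcript))

    # Kaldi expects these files sorted by key (C-locale order); plain
    # Python string sorting matches that for ASCII ids.
    rows.sort(key=lambda r: r[0])

    def path(fname):
        return os.path.join(out_dir, fname)

    with open(path("wav.scp"), "w", encoding="utf-8") as wav_scp, \
         open(path("text"), "w", encoding="utf-8") as text, \
         open(path("utt2spk"), "w", encoding="utf-8") as utt2spk:
        for utt_id, spk, wav, transcript in rows:
            wav_scp.write(f"{utt_id} {wav}\n")      # id -> audio file
            text.write(f"{utt_id} {transcript}\n")  # id -> transcript
            utt2spk.write(f"{utt_id} {spk}\n")      # id -> speaker

    return [r[0] for r in rows]


# Example with made-up paths and transcripts.
with tempfile.TemporaryDirectory() as tmp:
    ids = write_kaldi_data_dir(
        [
            ("spk2", "utt1", "/data/audio/spk2_utt1.wav", "hola mundo"),
            ("spk1", "utt1", "/data/audio/spk1_utt1.wav", "buenos dias"),
        ],
        os.path.join(tmp, "train"),
    )
    print(ids)
```

In an actual recipe you would then typically generate spk2utt with
utils/utt2spk_to_spk2utt.pl and check the result with
utils/validate_data_dir.sh before moving on to feature extraction.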
I don't know if there are acoustic and language models for Spanish
available for download; even if there are, you will most likely need to
understand quite a few details about Kaldi before these assets would
become useful.
This may be blunt, but I hope it's helpful.
Good luck!
Ogi