Integrate Kaldi into existing Java/Hadoop workflow

vfisher

Sep 28, 2017, 12:50:56 PM
to kaldi-help
Hi All,

We have a batch voice-processing workflow written in Java on Hadoop. It currently uses another speech recognition package that comes with a Java API. We now want to try Kaldi in the same Java/Hadoop-based workflow.

I've got a Java JNI program that passes command-line args through to Kaldi. But Kaldi takes its input from either a local file or stdin, neither of which fits the typical Hadoop programming paradigm. Hadoop input files live in the Hadoop Distributed File System (HDFS) rather than the local file system, and Hadoop calls your methods and passes the data in as arguments, instead of the method reading directly from a file.

The approach I'm considering now is a Java wrapper that reads the .wav file into a byte array and passes that array to Kaldi as input via a JNI call. This would require extending the util/kaldi-io classes to handle byte-array input (or some other in-memory buffer); I think I would have to add a new input type to util/kaldi-io.*.

Just wondering whether anyone else has already done something like this, or whether the Kaldi team is already working on it.

Thanks,
Vick



Daniel Povey

Sep 28, 2017, 12:56:49 PM
to kaldi-help
I think if you wanted to have that kind of interface, it would be
easier to do it all at the C++ level than by wrapping the binaries.
Things that can read from disk (like WavHolder or whatever it's
called) usually have some kind of Read() call that can accept a
std::istream, and you can make it an istringstream that works from a
string. The binaries usually have pretty simple code that you can
copy and modify to work with the same input provided in a different
format.
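
A minimal sketch of the istringstream idea, kept independent of Kaldi so it compiles on its own. `WavInfo`, `ReadWavHeader` and `ReadWavHeaderFromBytes` are hypothetical names made up for this illustration, and the parser only handles the canonical PCM header layout. The point is that a reader written against std::istream does not care whether the bytes came from a file or from a buffer handed over by a JNI caller; in Kaldi itself the wave-reading classes should play the role of `ReadWavHeader` here, but check the signatures in your checkout before relying on that.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type for this sketch (not a Kaldi class).
struct WavInfo {
  uint16_t num_channels;
  uint32_t sample_rate;
  uint16_t bits_per_sample;
};

// Reads a little-endian unsigned integer of num_bytes bytes from the stream.
static uint32_t ReadLE(std::istream &is, int num_bytes) {
  uint32_t value = 0;
  for (int i = 0; i < num_bytes; ++i) {
    int c = is.get();
    if (c == std::char_traits<char>::eof())
      throw std::runtime_error("unexpected end of WAV data");
    value |= static_cast<uint32_t>(static_cast<unsigned char>(c)) << (8 * i);
  }
  return value;
}

// Consumes four bytes and checks that they match the expected chunk tag.
static void ExpectTag(std::istream &is, const char *tag) {
  char buf[4];
  if (!is.read(buf, 4) || std::string(buf, 4) != tag)
    throw std::runtime_error(std::string("expected tag ") + tag);
}

// Stand-in for a Read(std::istream&) style call: parses the header of a
// canonical PCM WAV file ("RIFF", size, "WAVE", "fmt ", 16-byte format chunk).
WavInfo ReadWavHeader(std::istream &is) {
  ExpectTag(is, "RIFF");
  ReadLE(is, 4);  // overall chunk size, unused here
  ExpectTag(is, "WAVE");
  ExpectTag(is, "fmt ");
  if (ReadLE(is, 4) < 16) throw std::runtime_error("fmt chunk too short");
  ReadLE(is, 2);  // audio format code, unused here
  WavInfo info;
  info.num_channels = static_cast<uint16_t>(ReadLE(is, 2));
  info.sample_rate = ReadLE(is, 4);
  ReadLE(is, 4);  // byte rate, unused here
  ReadLE(is, 2);  // block align, unused here
  info.bits_per_sample = static_cast<uint16_t>(ReadLE(is, 2));
  return info;
}

// Entry point a JNI layer could call once it has copied the Java byte[]
// into a native buffer: wrap the bytes in an istringstream and reuse the
// stream-based reader unchanged.
WavInfo ReadWavHeaderFromBytes(const std::vector<char> &bytes) {
  std::istringstream is(std::string(bytes.begin(), bytes.end()));
  return ReadWavHeader(is);
}
```

Copying the buffer into a std::string costs one extra copy per file, which is probably acceptable for batch work; a custom streambuf over the original buffer would avoid it.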

However, if you are serious about using Kaldi I'd suggest installing
GridEngine and using NFS for files, at least for your initial work, so
you understand the normal Kaldi workflow. It seems to me that you
are trying to fit a square peg into a round hole, and I wonder
whether the Hadoop aspect is really adding any value.

Dan

vfisher

Sep 28, 2017, 1:09:07 PM
to kaldi-help
Thanks, Dan. First of all, yes: square pegs and round holes, though not of my choosing. It's a sunk-cost situation: we already have Hadoop, plus a lot of other code written for it that runs before and after Kaldi. We do have an HPC system where we run Kaldi on Moab for training, but that system won't allow us to host our continuous production process and data.

I've been looking at WavHolder, but it seems to come into play only after quite a bit of command-line option handling and other setup. I'd have to rewrite all of that too if I want to keep that functionality and call it from JNI. Well, thanks for your perspective. We could still go with a more standard setup if this is too painful.

Daniel Povey

Sep 28, 2017, 1:14:15 PM
to kaldi-help
You could bypass the command-line options processing by just
hardcoding everything in the code, or passing things in from the
Hadoop layer. The options processing just sets things in various
config classes or in variables.
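
A rough sketch of what that could look like, assuming a plain options struct filled in directly rather than by the command-line parser. `DecodeOptions`, its fields, its default values and the setting keys are all hypothetical placeholders for this illustration; they are not Kaldi's actual config classes or recommended values.

```cpp
#include <map>
#include <string>

// Hypothetical options struct, standing in for the config classes that the
// command-line parsing would normally fill in. Defaults are hardcoded and
// purely illustrative.
struct DecodeOptions {
  float beam = 13.0f;
  int max_active = 7000;
  float acoustic_scale = 0.1f;
};

// Builds the options from key/value settings handed over by the calling
// layer (for example, values the Java side read from its job configuration).
// Keys that are absent keep the hardcoded defaults above.
DecodeOptions OptionsFromSettings(
    const std::map<std::string, std::string> &settings) {
  DecodeOptions opts;
  auto it = settings.find("beam");
  if (it != settings.end()) opts.beam = std::stof(it->second);
  it = settings.find("max-active");
  if (it != settings.end()) opts.max_active = std::stoi(it->second);
  it = settings.find("acoustic-scale");
  if (it != settings.end()) opts.acoustic_scale = std::stof(it->second);
  return opts;
}
```

With an empty map this is the "hardcode everything" variant; passing a few overrides from the Hadoop layer gives the other variant without any argv handling.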

You might want to unpack the decoder wrappers a bit, if you're using
those (I refer to any function with 'Wrapper' in its name). I.e.
follow the code in and figure out what it's doing in your particular
use-case, and you could write a modified version of the decoder
program, that doesn't use the wrapper, to verify that you understood
it correctly. The unpacked version would be easier to invoke from
Hadoop.