My first idea was to use a frequency input to the voice recorder, but that required the data itself to be uploaded (why not add another voice recording method?).
Even when I did that, the text was still mandatory, not optional.
So my question is: why not make vocab boxes with an additional button that plays the uploaded data?
Ex: When an anchor is shown and a branch needs to be selected, the branch WILL require text, or it will be shown blank.
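
To make the example concrete, here is a rough sketch of the behavior I mean. It is not the tool's actual code: the `Branch` class, its fields, and the file path are all names I made up for illustration. It only shows that a branch with audio but no text ends up with a blank label, and how a play button could depend on the audio alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    """Hypothetical model of one selectable branch under an anchor."""
    text: str = ""                     # label shown when choosing a branch
    audio_path: Optional[str] = None   # uploaded recording, if any (assumed field)

    def display_label(self) -> str:
        # Behavior as described above: without text the label is blank,
        # even if audio has been uploaded.
        return self.text

    def has_play_button(self) -> bool:
        # Suggested behavior: offer a play button whenever audio exists,
        # regardless of whether text was entered.
        return self.audio_path is not None


audio_only = Branch(audio_path="uploads/word.mp3")  # made-up path
print(repr(audio_only.display_label()))
print(audio_only.has_play_button())
```

With something like this, an audio-only branch would still be selectable by ear instead of showing up as an empty box.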