> I am currently trying to get our department to buy us a high spec mac that we can use for FAVE in-house (we already have it up and running on various laptops) so that we can train the undergrads and postgrads on it and it will be available for them to use for dissertations/projects etc.
The way FAVE is built, it’s somewhat hard to get significantly better performance from a higher-spec machine. FAVE runs fastest on a computer with a very fast processor and a very fast hard disk (RAM doesn’t seem to matter much, assuming you have at least a few GB), but nowadays even the cheapest Mac laptops have “very fast” drives (i.e., SSDs), and you can upgrade the Mac Mini drive to an SSD for US$ 200. And even the priciest processors, like those in the baseline Mac Pro (US$ 2,999), are only going to be about 1.5x faster than the one in a baseline Mac Mini (which costs about one-fifth as much). Both FAVE-align and FAVE-extract are written to process data serially, so they don’t take advantage of the Mac Pro’s strengths unless your lab intends to regularly run more than 4 batches at a time (and if that’s the case, you might want to just purchase more low-end Macs instead). That’s not to discourage you from buying hardware to run alignment and extraction, of course.
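To make the price/performance point concrete, here is a rough sketch of the arithmetic in Python. It uses only the figures quoted above (US$ 2,999, roughly 1.5x, roughly one-fifth the price) and assumes a single serial job; the function name and the idea of "speed per dollar" are mine, purely for illustration, and it ignores multi-batch workloads entirely.

```python
# Rough price/performance comparison for a single serial FAVE job.
# All inputs are the approximate figures from the message above, not benchmarks.

MAC_PRO_PRICE_USD = 2999.0  # baseline Mac Pro price quoted above
PRICE_RATIO = 5.0           # the Mac Mini costs about one-fifth as much
PRO_SPEEDUP = 1.5           # Mac Pro is about 1.5x faster on one serial job


def mini_advantage_per_dollar(pro_price=MAC_PRO_PRICE_USD,
                              price_ratio=PRICE_RATIO,
                              pro_speedup=PRO_SPEEDUP):
    """Return how many times more serial speed per dollar the Mini offers.

    The Mini's speed is taken as the unit (1.0); the Pro's is pro_speedup.
    """
    mini_price = pro_price / price_ratio
    mini_speed_per_dollar = 1.0 / mini_price
    pro_speed_per_dollar = pro_speedup / pro_price
    return mini_speed_per_dollar / pro_speed_per_dollar


if __name__ == "__main__":
    print(f"Mini advantage per dollar: {mini_advantage_per_dollar():.2f}x")
```

Under these assumptions the Mini comes out a bit over three times better per dollar for one job at a time, which is why several cheap machines may serve a lab better than one expensive one.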
> Leung & Zue (1984) also provide an estimate of about 30 seconds per phone. They indicate that it took two experienced transcribers about 15 hours to phonetically transcribe 65 sentences, corresponding to approximately 1950 phones (= 27.7 seconds / phone).
>
> Leung, Hong C. and Victor W. Zue. 1984. A procedure for automatic alignment of phonetic transcriptions with continuous speech. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1984), pp. 73-76.
Here’s how J.P. Hosom (Speaker-independent phoneme alignment using transition-dependent states. Speech Communication 51: 352-368, 2009) summarized that study (I can’t find the original paper, either; does anyone on the list have a PDF of it?):
Leung and Zue (1984) evaluated five American English sentences from the Harvard list of phonetically-balanced sentences, aligned by two people. Manual alignment required about 30 s per phoneme, and they reported approximately 80% agreement within 10 ms, 87% agreement within 15 ms, and 93% agreement within 20 ms.
Under ideal conditions (high-quality recordings, native adult speakers, little interspeaker overlap, etc.), it takes maybe 10 minutes to do word-level (“orthographic”) transcription of one minute of speech. By my back-of-the-envelope calculation (making educated guesses about phonemes per second, etc.), orthographic transcription would thus be at least 10x faster than manual phoneme alignment.
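For anyone who wants to redo that back-of-the-envelope calculation with their own numbers, here is a minimal Python sketch. The 30 s per phoneme and 10 minutes per minute of speech come from the figures above; the phonemes-per-second rate is purely a guess on my part (a hypothetical default of 10), so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope: manual phoneme alignment vs. orthographic transcription.

ALIGN_SEC_PER_PHONEME = 30.0          # manual alignment, per Hosom's summary
TRANSCRIBE_MIN_PER_SPEECH_MIN = 10.0  # word-level transcription, ideal conditions
ASSUMED_PHONEMES_PER_SEC = 10.0       # ASSUMPTION: an educated guess, not a measurement


def transcription_speedup(phonemes_per_sec=ASSUMED_PHONEMES_PER_SEC):
    """Return how many times faster transcription is than manual alignment,
    measured on one minute of speech."""
    speech_sec = 60.0
    align_sec = speech_sec * phonemes_per_sec * ALIGN_SEC_PER_PHONEME
    transcribe_sec = TRANSCRIBE_MIN_PER_SPEECH_MIN * 60.0
    return align_sec / transcribe_sec


if __name__ == "__main__":
    # Try a range of guesses for the speech rate.
    for rate in (5.0, 10.0, 15.0):
        print(f"{rate:>4.0f} phonemes/s -> {transcription_speedup(rate):.0f}x")
```

Even with a conservative guess of 5 phonemes per second, the ratio stays above 10x, so the conclusion does not hinge on the exact speech rate.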
Kyle