Best current tools for manually aligning audio with text


Weston Ruter

Dec 3, 2013, 2:02:53 AM
to hyper...@googlegroups.com
I've got a friend who is doing work with some minority languages, ones which do not have any speech-recognition language models built which could be used to align spoken audio with corresponding text. The goal of the hyperaudio text-audio alignment is to help with teaching literacy.

What are the best tools out there today for manually aligning audio with text? I'm thinking of something that is crowd-sourced and has the ability for multiple passes to further improve the data. In terms of UI, I was thinking of having an audio player (perhaps slowed down to 0.5x) which starts out with the first word in the text highlighted. Each time the audio starts to speak the next word, the data-entry person could hit the spacebar to advance the cursor to the next word, and in doing so the timings for each word would be collected for the corresponding audio. Is there anything like this or better currently available for use?
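
To make the spacebar idea concrete, here is a minimal sketch of the bookkeeping it would need. It is not an existing tool: `TapAligner` and `merge_passes` are hypothetical names, and it assumes each press arrives as wall-clock seconds since playback began, with no pauses, so that multiplying by the playback rate gives the position in the original audio. Taking the median across passes is just one possible way to combine several people's attempts.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import List, Tuple


@dataclass
class TapAligner:
    """Collects one start time per word from spacebar presses (hypothetical helper)."""
    words: List[str]
    playback_rate: float = 0.5  # assumption: audio is played at half speed
    # The first word is highlighted when playback starts, so it begins at 0.0.
    starts: List[float] = field(default_factory=lambda: [0.0])

    def tap(self, elapsed_seconds: float) -> None:
        """Record a press; elapsed_seconds is wall-clock time since playback began."""
        if len(self.starts) >= len(self.words):
            return  # every word already has a start time; ignore extra presses
        # Convert wall-clock time to a position in the original audio.
        self.starts.append(elapsed_seconds * self.playback_rate)

    def timings(self) -> List[Tuple[str, float]]:
        """Pair each word reached so far with its start time in the original audio."""
        return list(zip(self.words, self.starts))


def merge_passes(passes: List[List[float]]) -> List[float]:
    """Combine several passes over the same text: median start time per word."""
    return [median(word_starts) for word_starts in zip(*passes)]
```

In a browser the press handler could read the player's current position directly instead of converting from wall-clock time, which would also cope with pauses.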

Thanks!

Mark Boas

Dec 3, 2013, 10:08:32 AM
to hyper...@googlegroups.com
Hi Weston,

Part of what we are doing with Hyperaudio is providing methods for people to transcribe their own material. They can then submit it to our free time-aligning tool, which aligns media timings to text given the transcript and the media. However, for now, our time-aligner is designed to align English. That said, it may work to a certain degree with other languages in future, and as soon as it does we plan to support them. We will also provide a tool to allow people to easily clean up the results of whatever the time-aligner produces.

We are still working hard on getting to beta, but we're pretty close, and you can try a minimal version of the transcript maker here: http://hyperaud.io/lab/ha-maker/ Start typing and the video will pause after 4 seconds and resume when you stop. You can also adjust playback rate and pause interval. This doesn't do the timing. I've tried something similar to the spacebar method here: http://happyworm.com/blog/2010/12/05/drumbeat-demo-html5-audio-text-sync/ It's OK but not fantastically accurate; however, it may be a good way to get approximate timing.

Mark

Mark

Dec 10, 2013, 5:45:29 AM
to hyper...@googlegroups.com
Thanks Brian - it's a prototype just now, so it's great to get feedback from people like yourself. We'll certainly add pause functionality into the beta version which is coming soon.

Cheers

Mark


On 10 December 2013 04:37, Brian Cellars <gma...@gmail.com> wrote:
Hi Mark, 
That's a very, very interesting tool, and is exactly what my brother is looking for. I think we could also implement it for people learning languages. One big suggestion. There needs to be an 'over-ride' pause button, so that if you need to stop for whatever reason (talk to someone, go for a pee, have a drink, etc), then you can pause the video (or audio) and then resume typing when you're ready. As it is, it resumes playing when you stop typing and the only way to keep it paused is to keep typing. I'm sure that's an easy fix.
Cheers,
Brian

--
You received this message because you are subscribed to the Google Groups "hyperaudio" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hyperaudio+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Nic Wistreich

Dec 10, 2013, 7:04:15 AM
to hyper...@googlegroups.com
With the syncing issue, would it be possible to use the chunks as a first pass sync, and then use a second method to exactly position each word within that?

For instance, the user sets chunks to last 3 seconds, and when they've typed up that chunk, they hit something like shift-return. That jumps to the next chunk of audio and also links that chunk of text to the timecode. Then, when the video is transcribed and roughly aligned, the user can perfect each chunk.

Made a quick mockup - http://visuali.st/bin/hyperaudio_position.png - so here the Position section would play the audio of a 3-sec chunk in a loop starting from the highlighted word. The arrow keys can nudge the position of that word back and forwards frame-by-frame, and when it’s right the user hits tab to jump to the next word.

Obv that adds more complexity, but it could output frame accurate synching.
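
A rough sketch of how that two-pass idea could be modelled, purely as an illustration: the `Chunk` class, the 25 fps frame rate and the even spreading of words across the chunk as a starting guess are all assumptions, not anything Hyperaudio does.

```python
from dataclasses import dataclass
from typing import List

FPS = 25  # assumed frame rate; one nudge moves a word by 1/FPS seconds


@dataclass
class Chunk:
    """A short span of media linked to the text typed for it (hypothetical)."""
    start: float     # timecode of the chunk in seconds (first pass)
    duration: float  # chunk length in seconds, e.g. 3.0
    words: List[str]

    def __post_init__(self) -> None:
        if not self.words:
            raise ValueError("a chunk needs at least one word")
        # First-pass guess: spread the words evenly across the chunk.
        step = self.duration / len(self.words)
        self.word_starts = [self.start + i * step for i in range(len(self.words))]

    def nudge(self, index: int, frames: int) -> float:
        """Second pass: move one word by whole frames, kept inside the chunk."""
        moved = self.word_starts[index] + frames / FPS
        earliest = self.start
        latest = self.start + self.duration
        self.word_starts[index] = min(max(moved, earliest), latest)
        return self.word_starts[index]
```

Here the arrow keys would call `nudge(index, 1)` or `nudge(index, -1)`, and tab would simply move on to the next index.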

cheers
Nic

Mark Boas

Dec 11, 2013, 4:11:47 PM
to hyper...@googlegroups.com
Hi Nic,

Yes, something like this might work better than we can even predict. Although we do have access to algorithms to do the aligning for us, it would be interesting to see how easy we can make the manual process with the use of some clever UI. Certainly the time-alignment software we are using works better if the media and transcripts are chunked up into small pieces, so the more timing information we can feed the algorithm, the better.

Thanks for taking the time to put together the sketch - it's really heartening to get such constructive feedback and we'll certainly be taking all into account.

Best

Mark

Nic Wistreich

Dec 11, 2013, 7:52:05 PM
to hyper...@googlegroups.com
Hi Mark,

Very happy if it’s of any use. It seems such a helpful and promising tool - indeed as Zev said, the most usable transcription tool I’ve seen.

cheers
Nic

Mark

Dec 16, 2013, 3:46:53 PM
to hyper...@googlegroups.com
There seems to be some issue with this group where posts are getting deleted (I'm pretty well versed in Google Groups, having used it for jPlayer for several years, so I don't *think* it's me). I'm replying to this thread as a way of exposing the two posts on the forum.


On 12 December 2013 20:38, Zev Averbach <z...@avtranscription.com> wrote:
There are two things that currently exist resembling what you're describing, Nic:  Amara's browser-based transcriber, and ProTranscript.  The manual line by line timing in Amara (not word by word), as well as the process you outlined, is of course a second pass, which is much less efficient than the algorithm Mark and team are leveraging, but it's free!

Mark Boas

Dec 16, 2013, 4:01:12 PM
to hyper...@googlegroups.com
Nic Wrote : "With the syncing issue, would it be possible to use the chunks as a first pass sync, and then use a second method to exactly position each word within that?

For instance, the user sets chunks to last 3 seconds, and when they've typed up that chunk, hit something like shift-return. That jumps to the next chunk of audio and also links that chunk of text to the timecode. Then, when the video is transcribed and roughly aligned, the user can perfect each chunk

Made a quick mockup - http://visuali.st/bin/hyperaudio_position.png - so here the Position section would play the audio of a 3-sec chunk in a loop starting from the highlighted word. The arrow keys can nudge the position of that word back and forwards frame-by-frame, and when it’s right the user hits tab to jump to the next word.

Obv that adds more complexity, but it could output frame accurate synching.

cheers
Nic"

Zev replied: "There are two things that currently exist resembling what you're describing, Nic:  Amara's browser-based transcriber, and ProTranscript.  The manual line by line timing in Amara (not word by word), as well as the process you outlined, is of course a second pass, which is much less efficient than the algorithm Mark and team are leveraging, but it's free!"

------

Nic, we're certainly looking at getting some timing information from the manual transcription process, so yes that is a valid idea. We also have a time alignment tool at our disposal that takes text and media as inputs and then marries them up to produce timings. The time alignment tool we're using works better with smaller chunks, so there's a definite advantage to chunking stuff up.

Zev - yes, Amara's subtitling tool is great and in fact we took a lot of ideas from it. You can also use it to create subtitles and then submit them to the Hyperaudio Converter http://hyperaud.io/lab/ha-converter/

The tool we're most looking forward to finishing off is the Hyperaudio Cleaner, which allows you to tweak roughly put-together transcripts. We hope to show you a big improvement on this later: http://hyperaud.io/lab/ha-cleaner/v07/ (the idea is that you click on a word, hit the adjust radio button and then tweak the timing of that word with the slider) - I think there's a lot more we can do here and we will definitely be taking ideas on board.

Keep it coming! :)

Mark


Nic Wistreich

Dec 16, 2013, 5:29:57 PM
to hyper...@googlegroups.com
Mark: Yes something seems up - I only got Zev’s 12 Dec reply today, and I think my original message didn’t appear in the group until a day or two after I sent it (last week).

Zev: I've long been a fan of Amara / Universal Subs. I don’t remember it being so suited to synching a transcript to exact words & frames so that you could do a clean cut/edit in the way hyperaudio is talking of - tho it sounds like that’s what HACleaner is for.

Zev Averbach

Dec 17, 2013, 2:44:52 PM
to hyper...@googlegroups.com
Yes, thanks for clarifying, Nic.  Those other services at least illustrate different approaches, interface-wise, but aren't word-granular.  Last I heard, Amara was at least entertaining the notion of doing auto-timing, and they do have a beta interactive transcript piece floating around.

Re: Google Groups -- Stack Exchange might be a little friendlier?