Speech and navigation messages


Michael Cooper

May 21, 2014, 1:17:25 AM
to flysig...@googlegroups.com

Hi All—

 

I think I’ve nearly got the timing issues worked out with speech and 10 Hz logging. It was a bit harder than I thought, because the microSD card occasionally pauses for about 75 ms before allowing the next command to continue. There’s no way to avoid that pause, so our audio buffer has to last at least 75 ms. Tom’s suggestion to go to a lower rate gets us most of the way there—up to 64 ms. Other changes I’ve been experimenting with will allow us, I think, to change to the “tiny” FatFS mode, which frees up enough memory to double the size of the audio buffer, giving 128 ms. I think this will be sufficient.
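As a sanity check on those numbers, the buffer arithmetic can be sketched in C. The 512- and 1024-sample sizes below are assumptions for illustration, not the firmware's actual constants:

```c
#include <stdint.h>

/* Milliseconds of audio a buffer of n samples holds at a given sample rate.
 * Sizes used below (512 before "tiny" FatFS, 1024 after) are assumptions. */
static uint32_t buffer_ms(uint32_t samples, uint32_t rate_hz)
{
    return (samples * 1000u) / rate_hz;
}
```

At 7812 Hz, 512 samples last about 65 ms (just short of the 75 ms SD pause) and 1024 samples last about 131 ms, which matches the "64 ms" and "128 ms" figures above.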

 

So—if the timing issues are actually solved, then the next question in my mind is how to make the speech mechanism as versatile as possible. Right now, in response to messages from the GPS module, the firmware creates a string of characters, e.g., “123.4”. This string is then handled one character at a time in UBX_Task, where it is turned into a series of words, e.g., “one two three point four”. The asynchronous nature of this system—that the string is created in one place and then handled without blocking in another place—allows us to keep everything else running while the FlySight is speaking.
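A minimal sketch of that producer/consumer split might look like the following. Speech_Say and Speech_Tick are illustrative names, not the actual firmware API:

```c
#include <string.h>

/* One routine writes the string; UBX_Task-style code drains it one
 * character per call, so nothing blocks while the FlySight speaks. */
static char speech_buf[16];
static const char *speech_pos = speech_buf;

static void Speech_Say(const char *s)   /* producer: set the string */
{
    strncpy(speech_buf, s, sizeof(speech_buf) - 1);
    speech_buf[sizeof(speech_buf) - 1] = '\0';
    speech_pos = speech_buf;
}

static char Speech_Tick(void)           /* consumer: next character, or NUL */
{
    return *speech_pos ? *speech_pos++ : '\0';
}
```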

 

One thing I like about the way the string is being handled right now is that it’s very easy to use for some of the most common cases, i.e., if you want the FlySight to speak a number, then you need only write that number into the string in the obvious way—there is no need to add a routine for converting numbers into a series of filenames to be spoken.

 

However, what I’m wondering is how that will extend to navigational messages. If we want the FlySight, for example, to say, “Turn left 90 degrees,” then we need to turn those words into a series of symbols. One way to do it might be to use letters. For example, we could set the string to “tl90d” and add a few files to the AUDIO folder:

 

t.wav     “turn”

l.wav     “left”

d.wav    “degrees”

 

Then without modifying the way we parse the string, we would hear “turn left nine zero degrees”. Pretty close. I suppose my only concern is how this might be extended, e.g., if we wanted the FlySight to say “turn left ninety degrees” instead. Maybe we could come up with single-character symbols for each word we want it to say, but I wonder if that is ultimately going to be limiting.
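The per-character lookup could be sketched along these lines; the function name and the dot.wav entry for the decimal point are assumptions:

```c
#include <ctype.h>
#include <stdio.h>

/* Map one speech symbol to a filename in the AUDIO folder.
 * Digits and letters map directly: '9' -> "9.wav", 't' -> "t.wav". */
static const char *symbol_to_file(char c)
{
    static char name[8];
    if (isalnum((unsigned char)c)) {
        snprintf(name, sizeof(name), "%c.wav", c);
        return name;
    }
    if (c == '.')
        return "dot.wav";   /* assumed name for the decimal point */
    return NULL;            /* unknown symbol: skip it */
}
```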

 

That said, single-character symbols are a very simple way of handling things, and until we start to run out of symbols, maybe there’s no good reason to go to a more complex system. These limitations really only apply to “procedurally generated” speech. There will be a lot more flexibility, e.g., with spoken alarms—the user should be able to specify any filename and a condition under which it should be spoken, without having the same limitations we’re talking about above. If you wanted the FlySight to read a poem to you at pull time, you’d just have to record the poem as a single file and tell the FlySight to play it at a particular altitude.

 

Any thoughts?

 

Michael

Ahti Legonkov

May 21, 2014, 5:04:34 AM
to flysig...@googlegroups.com

On 21 May 2014 08:17, Michael Cooper <mic...@flysight.ca> wrote:
...

Maybe we could come up with single-character symbols for each word we want it to say, but I wonder if that is ultimately going to be limiting.

...

Any thoughts?

Just one. I don't think that making the navigational messages more sophisticated than "turn left nine zero degrees" would be helpful. For example, in English, distinguishing between "ninety" and "nineteen" (or 9-0 vs. 4-20-10 in French; it may be worse in other languages) can be difficult if there's a lot of noise around. Even pilots, when talking numbers, spell them out one digit at a time to avoid confusion. The simpler version also makes it easier to translate the audio into different languages.

--
lego

Luke Hederman

May 21, 2014, 8:45:07 AM
to flysig...@googlegroups.com
Hi Michael,

I think the spoken feedback should be kept short, clear and simple. I had considered spoken words for 20, 30, 400, 500, 7000, 8000, etc., but I don't think it's worth the extra complication, and there is potential to mix up similar sounds like 15 and 50.
Rather than "Turn left 90 degrees", I was thinking "nine zero left", potentially followed by "one point five miles", or, using the clock method, "nine o'clock, one point five miles".
To that end, I think single-character codes for the wav files are sufficient for now.

Some suggestions for characters:
a - ten
b - eleven
c - twelve
f - feet
l - left
r - right
j - meters
k - kilometers
m - miles
o - o'clock
(some suitable symbol) - [silent pause]
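For reference, that table might be sketched in C as follows; the underscore for the silent pause is just a placeholder for whatever symbol gets chosen:

```c
#include <stddef.h>

/* Luke's proposed single-character codes as a lookup table (sketch only). */
static const char *symbol_word(char c)
{
    switch (c) {
    case 'a': return "ten";     case 'b': return "eleven";
    case 'c': return "twelve";  case 'f': return "feet";
    case 'l': return "left";    case 'r': return "right";
    case 'j': return "meters";  case 'k': return "kilometers";
    case 'm': return "miles";   case 'o': return "o'clock";
    case '_': return "";        /* silent pause (placeholder symbol) */
    default:  return NULL;      /* unmapped symbol */
    }
}
```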

To avoid scope creep I think extra bells and whistles like poetry at break-off can wait until the subsequent release if required ;-)
--
You received this message because you are subscribed to the Google Groups "FlySight Developers" group.

Will Glynn

May 21, 2014, 1:36:49 PM
to flysig...@googlegroups.com
I was thinking of replacing characters with #defines like [SPEECH_NINE, SPEECH_ZERO, SPEECH_LEFT], then adding helper functions/macros to turn (int)90 into the appropriate group of symbols and enqueue speech for playback. This queue can be represented as an int8_t array (just like a string) and converted into descriptive filenames using a map as appropriate. I think this would keep the code and the filenames clear, even as the list of things FlySight can say grows past the set of things representable by individual characters.
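A rough sketch of that idea, with illustrative enum names and a helper that expands an integer into digit symbols:

```c
enum { SPEECH_ZERO, SPEECH_ONE, SPEECH_TWO, SPEECH_THREE, SPEECH_FOUR,
       SPEECH_FIVE, SPEECH_SIX, SPEECH_SEVEN, SPEECH_EIGHT, SPEECH_NINE,
       SPEECH_LEFT, SPEECH_RIGHT };

/* Queue of symbols awaiting playback, stored just like a string. */
static signed char speech_queue[16];
static int speech_len;

static void enqueue_digits(int n)       /* most significant digit first */
{
    if (n >= 10)
        enqueue_digits(n / 10);
    speech_queue[speech_len++] = (signed char)(SPEECH_ZERO + n % 10);
}

static int speech_say_int(int n)        /* returns number of symbols queued */
{
    speech_len = 0;
    enqueue_digits(n);
    return speech_len;
}
```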

Then I got to thinking that "play this file" and "play this tone" are really the same thing, so maybe these two functions should share a common playback queue.

Then I thought that maybe the entire audio subsystem should be modeled as a priority queue, so e.g. breakoff plays over navigation which plays over tones. This way, each audio source only has to consider what it wants to play (rather than the state of the entire system) and the user always hears the desired output.

Then I thought that maybe the audio system should ask the various sources what, if anything, they want to play right at this moment. That would have similar effects compared to the priority queue concept, but using an inversion of the control flow rather than extra buffers.
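The polling model might be sketched like this, with all names hypothetical: each source reports what it wants to play right now, and the mixer takes the first answer in priority order.

```c
#include <stddef.h>

typedef const char *(*AudioSourceFn)(void);   /* returns a filename, or NULL */

/* Poll sources in priority order (index 0 = highest) and play the first
 * request, so e.g. breakoff wins over navigation, which wins over tones. */
static const char *pick_audio(const AudioSourceFn *sources, int n)
{
    for (int i = 0; i < n; i++) {
        const char *req = sources[i]();
        if (req)
            return req;
    }
    return NULL;    /* nothing to play */
}

/* Example sources: breakoff currently silent, navigation wants "l.wav". */
static const char *src_breakoff(void) { return NULL; }
static const char *src_nav(void)      { return "l.wav"; }

static const char *demo_pick(void)
{
    const AudioSourceFn s[] = { src_breakoff, src_nav };
    return pick_audio(s, 2);
}
```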

Then I got far enough from the original concept that I didn't submit any patches :-/

--Will

Tom van Dijck

May 21, 2014, 1:36:55 PM
to flysig...@googlegroups.com
I entirely agree with Luke here; simple is better and less ambiguous.
That said, using the first letter of a word might lead to collisions too. Why not just let the user type in the entire sentence? We can easily parse that back to byte code. Or maybe there is a tool for people to create audio sets, which would solve that problem entirely.
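Parsing whole words back down to the existing one-byte codes might look roughly like this; the table contents are assumptions:

```c
#include <stddef.h>
#include <string.h>

/* Compile a user-written word to its one-byte symbol, avoiding
 * first-letter collisions because the whole word is matched. */
static char word_to_code(const char *w)
{
    static const struct { const char *word; char code; } table[] = {
        { "turn", 't' }, { "left", 'l' }, { "right", 'r' }, { "degrees", 'd' },
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(w, table[i].word) == 0)
            return table[i].code;
    return '\0';    /* unknown word */
}
```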

Tom.

Michael Cooper

May 21, 2014, 7:12:05 PM
to flysig...@googlegroups.com

I’ve just merged a pull request that includes a lot of changes related to audio/log timing. The full description is here:

 

https://github.com/flysight/flysight/pull/35

 

I set up a few pins on PORTF to indicate two major error conditions:

 

1. An audio buffer overrun. In the past, these have resulted in occasional “clipped” speech.

2. A log buffer overrun. Essentially, this is what was causing invalid lines in the CSV files previously.

 

With these pins set up, I was able to hook the board up to a logic analyzer and run a long “torture” test with a particularly long audio file being played repeatedly. In about 3 hours of logging, there were no errors at all with the logging rate set to 10 Hz, so I think we’ve finally nailed down the timing issues.

 

As part of these changes, I’ve reduced the audio sample rate from 31250 Hz down to 7812 Hz (one quarter). With interpolation, I don’t think there is a significant change in audio quality, but I would welcome any input you guys have. The updated audio files are included in the GitHub repository.
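Linear interpolation between adjacent samples is one way such an upsampling step could work when playing 7812 Hz audio at the original 31250 Hz output rate (4x). This is a sketch only; the firmware's actual interpolation may differ:

```c
#include <stdint.h>

/* Reconstruct one of the 4 output samples between stored samples s0 and s1.
 * step is 0..3: step 0 returns s0, step 4 would return s1. */
static uint8_t interp_sample(uint8_t s0, uint8_t s1, uint8_t step)
{
    return (uint8_t)(s0 + ((int16_t)(s1 - s0) * step) / 4);
}
```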

 

I’m going to keep speech very simple for now. I’ll produce a couple of extra files aimed at navigation, and then we’ll see if there is any need to make things more complex.

 

Michael
