Least intrusive way of adding (encoded) audio support to noVNC

Luis Héctor Chávez

Feb 6, 2021, 4:29:09 PM
to noVNC
Hello!

I saw that there is some progress in [1] to support audio in noVNC and the associated issue is marked as "patchwelcome"[4]. There's one tiny thing that would be nice: allowing the audio to be compressed. The main problem is that the only official RFB protocol extension related to audio[2] expects an uncompressed stream of raw samples of fixed size, so there's currently no standard-compliant way to achieve this.
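To put numbers on that (my own back-of-the-envelope, not from the issue): raw PCM at CD quality works out to roughly 1.4 Mbps, an order of magnitude more than a lossy codec needs:

```javascript
// Bit rate of the raw sample stream the QEMU Audio extension mandates,
// to show why allowing compression matters.
function rawBitrate(sampleRateHz, bitsPerSample, channels) {
  return sampleRateHz * bitsPerSample * channels; // bits per second
}

const cdQuality = rawBitrate(44100, 16, 2); // 1411200 b/s, ~1.4 Mbps
// A lossy codec such as Opus typically gets comparable perceived
// quality at roughly 64-128 kb/s.
```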

Which brings me to the actual question that I want to ask: if, as an embedder of noVNC, I were to add encoded audio support to noVNC in the least intrusive / most palatable way possible, what would be the way to go? Some alternatives I've considered so far:
  1. Add support for specifying custom decoders and message handlers in RFB's constructor's option parameter. This should be fairly unintrusive from noVNC's perspective (most of the implementation would live outside of the noVNC codebase), but would expose the `sock` object to the outside since the custom message handler would potentially need to send messages (e.g. for negotiating the supported codecs). It also does not interact with the rest of the RFB community, which is probably good in the immediate short term (to get an implementation out of the gate faster), but not so good in the long term (pretty sure other folks would benefit from getting this standardized, as the discussion in the above issue indicates). But maybe this bootstraps the process: get a working implementation first, then start the standardization discussion, and finally formalize the implementation in noVNC. See the appendix at the end of this email for what it would look like.
  2. Help get the PR for the standard QEMU Audio extension merged following the standard as closely as possible, and sniff the implementation server name for a magic string and then decode the stream based on the information from the server name. This feels a bit hacky :(
  3. Add a new entry in the list of QEMU Audio sample formats. This might require getting something going in the rfbproto codebase, but they seem to want to have a reference implementation[3] before standardizing stuff, which leaves us in a chicken-and-egg situation. It might also require a change in QEMU itself, and I don't know if they even have a need for that. The other downside is that there would be no way of negotiating what codecs / container formats are supported by the client.
  4. A combination of multiple of the above?
  5. Some other approach I did not consider?
Please advise on what the preferred course of action is and let's get audio support for everyone!


Appendix:

potential diff for docs/EMBEDDING.md for option 1:

diff --git a/docs/EMBEDDING.md b/docs/EMBEDDING.md
index 1050014..a6e2ecf 100644
--- a/docs/EMBEDDING.md
+++ b/docs/EMBEDDING.md
@@ -71,6 +71,37 @@ query string. Currently the following options are available:
 * `logging` - The console log level. Can be one of `error`, `warn`, `info` or
   `debug`.
 
+* `decoders` - A dictionary of rectangle encoding numbers to instances of
+  classes that have the `decodeRect` function:
+  ```javascript
+  class MyDecoder {
+      // Returns true if the rect could be decoded.
+      decodeRect(x, y, width, height, sock, display, depth);
+  }
+
+  // ...
+  decoders: {
+      0x1234: new MyDecoder(),
+  },
+  ```
+* `message_handlers` - A dictionary of message handlers that can decode
+  messages of a specific message ID:
+  ```javascript
+  class MyMessageHandler {
+      // Called with an instance of Websock when the connection is
+      // available, or null when it's not.
+      setSocket(sock);
+
+      // Returns true if the message could be decoded.
+      handleMessage();
+  }
+
+  // ...
+  message_handlers: {
+      0xFC: new MyMessageHandler(),
+  }
+  ```
+
 ## HTTP Serving Considerations
 ### Browser Cache Issue

Pierre Ossman

Feb 8, 2021, 4:38:10 AM
to no...@googlegroups.com, Luis Héctor Chávez
On 06/02/2021 22:29, Luis Héctor Chávez wrote:
>
> Which brings me to the actual question that I want to ask: if, as an
> embedder of noVNC, I were to add encoded audio support to noVNC in the
> least intrusive / most palatable way possible, what would be the way to go?
> Some alternatives I've considered so far:
>
> ...
>
> 5. Some other approach I did not consider?
>
We'd like to handle it cleanly, so that means a protocol extension.
Either building on the QEMU one, or something new.

Compression is not the only thing missing in the QEMU extension, so a
fresh start with a more complete audio protocol could be interesting.

You'd also need server support for this. Do you have one in mind you
intend to add support for this to?

Regards
--
Pierre Ossman Software Development
Cendio AB https://cendio.com
Teknikringen 8 https://twitter.com/ThinLinc
583 30 Linköping https://facebook.com/ThinLinc
Phone: +46-13-214600

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?

Luis Héctor Chávez

Feb 8, 2021, 7:32:07 AM
to Pierre Ossman, no...@googlegroups.com


On Mon, Feb 8, 2021, 1:38 AM Pierre Ossman <oss...@cendio.com> wrote:
On 06/02/2021 22:29, Luis Héctor Chávez wrote:
>
> Which brings me to the actual question that I want to ask: if, as an
> embedder of noVNC, I were to add encoded audio support to noVNC in the
> least intrusive / most palatable way possible, what would be the way to go?
> Some alternatives I've considered so far:
>
> ...
>
>     5. Some other approach I did not consider?
>
We'd like to handle it cleanly, so that means a protocol extension.
Either building on the QEMU one, or something new.

i am not opposed to building a new one. how does someone go about this?


Compression is not the only thing missing in the QEMU extension, so a
fresh start with a more complete audio protocol could be interesting.

You'd also need server support for this. Do you have one in mind you
intend to add support for this to?

right now i have a proxy that interposes in front of TigerVNC for ease of implementation, but i would be willing to move that code into the TigerVNC codebase as a PR. let me ask them if they would accept such patches.

Luis Héctor Chávez

Feb 8, 2021, 7:56:57 AM
to noVNC
On Monday, February 8, 2021 at 4:32:07 AM UTC-8 Luis Héctor Chávez wrote:


On Mon, Feb 8, 2021, 1:38 AM Pierre Ossman <oss...@cendio.com> wrote:
On 06/02/2021 22:29, Luis Héctor Chávez wrote:
>
> Which brings me to the actual question that I want to ask: if, as an
> embedder of noVNC, I were to add encoded audio support to noVNC in the
> least intrusive / most palatable way possible, what would be the way to go?
> Some alternatives I've considered so far:
>
> ...
>
>     5. Some other approach I did not consider?
>
We'd like to handle it cleanly, so that means a protocol extension.
Either building on the QEMU one, or something new.

i am not opposed to building a new one. how does someone go about this?


Compression is not the only thing missing in the QEMU extension, so a
fresh start with a more complete audio protocol could be interesting.

Forgot to ask, what else is missing from your POV? The only other thing that I could think of was some sort of mechanism to let the server know the client is lagging behind too much so that it can drop some packets, but I may need to go educate myself on how other protocols do it.

Pierre Ossman

Feb 8, 2021, 8:16:14 AM
to no...@googlegroups.com, Luis Héctor Chávez
On 08/02/2021 13:31, Luis Héctor Chávez wrote:
> On Mon, Feb 8, 2021, 1:38 AM Pierre Ossman <oss...@cendio.com> wrote:
>> We'd like to handle it cleanly, so that means a protocol extension.
>> Either building on the QEMU one, or something new.
>>
>
> i am not opposed to building a new one. how does someone go about this?
>

Just start hacking, that's usually the best way forward. :)

What we'd like is working code, and documentation for rfbproto.

> Forgot to ask, what else is missing from your POV? The only other thing
> that I could think of was some sort of mechanism to let the server know the
> client is lagging behind too much to drop some packets, but I may need to
> go educate myself how other protocols do it.

Primarily the buffer handling, yes. Normally audio should be "pull", not
"push". I.e. the playing sound card should own the clock and control the
data rate.

In a use case like this you also want low latency, and latency feedback,
since you want graphics and user actions to be in sync with the audio.
However low latency usually means low buffering, which could result in
stuttering.
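A minimal sketch of that pull model in JavaScript (illustrative names only, not noVNC or RFB API): the network side pushes decoded samples into a ring buffer, while the audio callback, driven by the sound card's clock, pulls fixed-size frames and pads with silence on underrun:

```javascript
// Hypothetical pull-model buffer: the audio callback owns the clock and
// pulls at the card's rate; the network side merely fills the buffer.
class AudioRingBuffer {
  constructor(capacity) {
    this.buf = new Float32Array(capacity);
    this.readPos = 0;
    this.writePos = 0;
    this.length = 0; // samples currently buffered
  }
  // Network side: push decoded samples, dropping the oldest on overflow
  // so latency stays bounded.
  push(samples) {
    for (const s of samples) {
      this.buf[this.writePos] = s;
      this.writePos = (this.writePos + 1) % this.buf.length;
      if (this.length === this.buf.length) {
        this.readPos = (this.readPos + 1) % this.buf.length; // overwrite oldest
      } else {
        this.length++;
      }
    }
  }
  // Audio callback: pull exactly n samples; missing samples stay at 0
  // (silence) on underrun instead of stalling the card.
  pull(n) {
    const out = new Float32Array(n);
    for (let i = 0; i < n; i++) {
      if (this.length > 0) {
        out[i] = this.buf[this.readPos];
        this.readPos = (this.readPos + 1) % this.buf.length;
        this.length--;
      }
    }
    return out;
  }
}
```

Dropping the oldest samples on overflow keeps latency bounded at the cost of an audible glitch; a real client would likely prefer to adapt buffering or resample slightly to reconcile the sender's and the card's clocks.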

Audio is unfortunately a complex beast since it is a real time problem.

Ideally we'd piggy back on something existing. Perhaps there is an
existing protocol we can encapsulate in VNC to handle things?

We here at Cendio use PulseAudio, and we handle it separately and not
embedded in the VNC stream. It is very capable, but it is also a very
complex beast. So something simpler should hopefully exist.

>>
>> You'd also need server support for this. Do you have one in mind you
>> intend to add support for this to?
>>
>
> right now i have a proxy that interposes in front of TigerVNC for ease of
> implementation, but would be willing to move that code over to that
> codebase as a PR. let me ask them if they would accept such patches.
>

We would. :)

Note that TigerVNC has no audio features at all right now, so there's
probably quite a bit of infrastructure needed.

Luis Héctor Chávez

Feb 8, 2021, 9:59:31 PM
to noVNC
On Monday, February 8, 2021 at 5:16:14 AM UTC-8 oss...@cendio.com wrote:
On 08/02/2021 13:31, Luis Héctor Chávez wrote:
> On Mon, Feb 8, 2021, 1:38 AM Pierre Ossman <oss...@cendio.com> wrote:
>> We'd like to handle it cleanly, so that means a protocol extension.
>> Either building on the QEMU one, or something new.
>>
>
> i am not opposed to building a new one. how does someone go about this?
>

Just start hacking, that's usually the best way forward. :)

What we'd like is working code, and documentation for rfbproto.

> Forgot to ask, what else is missing from your POV? The only other thing
> that I could think of was some sort of mechanism to let the server know the
> client is lagging behind too much to drop some packets, but I may need to
> go educate myself how other protocols do it.

Primarily the buffer handling, yes. Normally audio should be "pull", not
"push". I.e. the playing sound card should own the clock and control the
data rate.

In a use case like this you also want low latency, and latency feedback,
since you want graphics and user actions to be in sync with the audio.
However low latency usually means low buffering, which could result in
stuttering.

hmm although having a pull model for audio might introduce a lot of latency due to all the extra roundtrips (I was thinking about this since WebRTC also uses a push model, as opposed to DASH/LL-DASH, which are pull models and have 1-40s worth of latency). folks on the other side of the earth might appreciate a latency reduction of 200-300msec by avoiding those roundtrips.

but on the other hand, i'd rather make this a data-driven decision instead of going on a hunch. i already have a push proof-of-concept implementation ready (since it was piggy-backing on the QEMU audio extension), so i can compare it with a pull implementation and see which one performs better over the internet.
Audio is unfortunately a complex beast since it is a real time problem.

Ideally we'd piggy back on something existing. Perhaps there is an
existing protocol we can encapsulate in VNC to handle things?

We here at Cendio use PulseAudio, and we handle it separately and not
embedded in the VNC stream. It is very capable, but it is also a very
complex beast. So something simpler should hopefully exist.

The PoC i have uses PulseAudio's simple API to capture the audio data, which is then transcoded and muxed into https://www.w3.org/TR/mse-byte-stream-format-webm/

right now i'm in the process of going through the specs of streaming protocols that are amenable to the Media Source Extensions (e.g. DASH/LL-DASH, mostly from this list https://developer.mozilla.org/en-US/docs/Web/Guide/Audio_and_video_delivery/Live_streaming_web_audio_and_video).

Pierre Ossman

Feb 9, 2021, 2:06:14 AM
to no...@googlegroups.com, Luis Héctor Chávez
On 09/02/2021 03:59, Luis Héctor Chávez wrote:
>
> hmm although having a pull model for audio might introduce a lot of latency
> due to all the extra roundtrips (I was thinking about this since WebRTC
> also uses a push model, as opposed to DASH/LL-DASH, which are pull models
> and have 1-40s worth of latency). folks on the other side of the earth might
> appreciate a latency reduction of 200-300msec by avoiding these.
>

That must be an implementation issue. All the professional stuff is pull
AFAIK. I've mostly seen DASH used for video, and not interactive stuff.
So latency was likely less important there than stuttering.

Luis Héctor Chávez

Feb 9, 2021, 8:23:35 AM
to noVNC
On Monday, February 8, 2021 at 11:06:14 PM UTC-8 oss...@cendio.com wrote:
On 09/02/2021 03:59, Luis Héctor Chávez wrote:
>
> hmm although having a pull model for audio might introduce a lot of latency
> due to all the extra roundtrips (I was thinking about this since WebRTC
> also uses a push model, as opposed to DASH/LL-DASH, which are pull models
> and have 1-40s worth of latency). folks on the other side of the earth might
> appreciate a latency reduction of 200-300msec by avoiding these.
>

That must be an implementation issue. All the professional stuff is pull
AFAIK. I've mostly seen DASH used for video, and not interactive stuff.
So latency was likely less important there than stuttering.

do you have any references to specs / implementations for the professional stuff? i'd like to avoid past mistakes and follow other well-established patterns and conventions if possible.

Pierre Ossman

Feb 9, 2021, 9:40:49 AM
to no...@googlegroups.com, Luis Héctor Chávez
On 09/02/2021 14:23, Luis Héctor Chávez wrote:
>
>>
>> That must be an implementation issue. All the professional stuff is pull
>> AFAIK. I've mostly seen DASH used for video, and not interactive stuff.
>> So latency was likely less important there than stuttering.
>
>
> do you have any references to specs / implementations for the professional
> stuff? i'd like to avoid past mistakes and follow other well-established
> patterns and conventions if possible.
>

PulseAudio's standard API, ALSA and JACK are the first that come to
mind. Apple's CoreAudio likely also has such an approach. It's been
some time since I was involved in audio stuff so it's a bit fuzzy. :)

(I think Windows' old WaveOut API also has that approach, but it is a
rather horrible API so not really something to get inspired by. :))

Luis Héctor Chávez

Feb 21, 2021, 11:12:21 PM
to noVNC
Sorry for the delay, I was cleaning stuff up. Here are the protocol extension docs: https://github.com/replit/rfbproxy#replit-audio-rfb-extension. It supports both pull and push models, although I got terrible results with the pull model due to the extra latency and the jitter introduced by the browser and the underlying AudioBuffer's state machinery, which doesn't allow chunks to be appended at arbitrary times. With the push model the audio was consistently smooth and i was able to cap it to an acceptable ~500ms of total latency (it could go as low as 300msec, but depending on where on the internet this was hosted it started breaking up).

Also, here's the PR for the noVNC support: https://github.com/novnc/noVNC/pull/1525, with only the push model supported due to the above limitations being extra obvious in the browser. maybe for standalone applications where there is better control of the underlying buffer it might make sense to revisit the pull model.
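The latency cap can be sketched as: when the buffered audio runs too far ahead of the playback position, seek forward and keep a small cushion to absorb jitter (a simplified illustration, not the actual PR code; names and thresholds are made up, with `currentTime` and `bufferedEnd` standing in for the media element's playback position and the end of its buffered range):

```javascript
// Sketch of capping playback latency in a push-model client.
const MAX_LATENCY = 0.5;    // seconds of buffering tolerated
const TARGET_LATENCY = 0.3; // cushion to keep after trimming

function trimTarget(currentTime, bufferedEnd) {
  const latency = bufferedEnd - currentTime;
  if (latency > MAX_LATENCY) {
    return bufferedEnd - TARGET_LATENCY; // seek forward to here
  }
  return currentTime; // within budget: leave playback alone
}
```

In a browser client the result would be assigned to the media element's `currentTime` whenever the check fires, trading a small audible skip for bounded latency.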

Pierre Ossman

Feb 22, 2021, 3:37:24 AM
to no...@googlegroups.com, Luis Héctor Chávez
On 22/02/2021 05:12, Luis Héctor Chávez wrote:
> Sorry for the delay, I was cleaning stuff up. Here's the protocol extension
> docs: https://github.com/replit/rfbproxy#replit-audio-rfb-extension, it

Could you submit that as a PR to rfbproto so we can sort out the details
and get it officially documented?

And have you allocated those numbers with IANA?

> supports both pull and push models, although I got terrible results with
> the pull model due to the extra latency and the jitter introduced by the
> browser and the underlying AudioBuffer's state machinery, which doesn't
> allow chunks to be appended at arbitrary times. With the push model the
> audio was consistently smooth and was able to cap it to an acceptable
> ~500ms of total latency (could go as low as 300msec, but depending on where
> on the internet this was hosted it started tearing up).
>

That's a huge latency. For things to feel synchronised you generally
need to go under 100 ms.

Luis Héctor Chávez

Feb 22, 2021, 8:02:16 AM
to Pierre Ossman, no...@googlegroups.com


On Mon, Feb 22, 2021, 12:37 AM Pierre Ossman <oss...@cendio.com> wrote:
On 22/02/2021 05:12, Luis Héctor Chávez wrote:
> Sorry for the delay, I was cleaning stuff up. Here's the protocol extension
> docs: https://github.com/replit/rfbproxy#replit-audio-rfb-extension, it

Could you submit that as a PR to rfbproto so we can sort out the details
and get it officially documented?

will do!


And have you allocated those numbers with IANA?

i haven't. how does one do that?


> supports both pull and push models, although I got terrible results with
> the pull model due to the extra latency and the jitter introduced by the
> browser and the underlying AudioBuffer's state machinery, which doesn't
> allow chunks to be appended at arbitrary times. With the push model the
> audio was consistently smooth and was able to cap it to an acceptable
> ~500ms of total latency (could go as low as 300msec, but depending on where
> on the internet this was hosted it started tearing up).
>

That's a huge latency. For things to feel synchronised you generally
need to go under 100 ms.

yeah :( forgot to mention, those 300msec are all in the browser's audio buffer, so there's nothing else that i can do on the protocol or server side.

Luis Héctor Chávez

Feb 22, 2021, 9:56:02 AM
to noVNC
On Monday, February 22, 2021 at 5:02:16 AM UTC-8 Luis Héctor Chávez wrote:


On Mon, Feb 22, 2021, 12:37 AM Pierre Ossman <oss...@cendio.com> wrote:
On 22/02/2021 05:12, Luis Héctor Chávez wrote:
> Sorry for the delay, I was cleaning stuff up. Here's the protocol extension
> docs: https://github.com/replit/rfbproxy#replit-audio-rfb-extension, it

Could you submit that as a PR to rfbproto so we can sort out the details
and get it officially documented?

will do!

done!
And have you allocated those numbers with IANA?

i haven't. how does one do that?

i *think* i found how to. done!