Voice


daniel miller

Jan 22, 2009, 2:06:33 PM
to realxtend-a...@googlegroups.com
I was wondering if voice has been considered as a feature for the
viewer. I have an extensive background in audio/video streaming and
codec work (I was a contributor at xiph.org, makers of ogg/vorbis and
the new ogg video stuff in Firefox 3.1).

It seems like this is an important feature to think about, since it's
something SL offers that no alternative platform has matched yet.

-danx0r


On Thu, Jan 22, 2009 at 5:37 AM, Ryan McDougall <semp...@gmail.com> wrote:
>
> Finally read the citations here:
> http://rexdeveloper.org/wiki/index.php?title=Abstraction_levels_-_world_object_structure
>
> Excellent choice to go with an entity system.
>
> Not only is Smoke compatible with ES, if you read the source code you
> notice it *is* in fact an ES, even if not advertised as such. An ES
> seems to be a requirement for an efficient component-based system
> running on multiple threads.
>
> Cheers,
>

Ryan McDougall

Jan 23, 2009, 2:15:04 AM
to realxtend-a...@googlegroups.com
Yes voice and video are extremely important, and we have two people
investigating the Telepathy framework:
http://telepathy.freedesktop.org/wiki/

We also did a prototype using the SIP protocol directly, but we
weren't impressed with the status of high-level libraries or the
community support for non-telephony applications. That said, Telepathy
does have SIP support through Sofia-SIP, so that would not rule things
out.

Another project we had a lot of interest in was sip-communicator.org,
which aims to be a Skype replacement using SIP, but it is in Java,
which may pose some challenges for integrating with C++/Python
applications.

If you have ideas, we are *more* than happy to hear all about them.
Especially if you see something we've missed.

Cheers,

daniel miller

Jan 23, 2009, 11:37:20 PM
to realxtend-a...@googlegroups.com
On audio: my first thought would be to go with something fully open,
like ogg/vorbis/speex. Audio needs to be really tightly integrated;
it's not just "virtual world + skype" -- it's more like audio in a
game. You want to be able to fully control things like
synchronization, spatialization, panning, volume, and so on. That
could be very difficult with a typical SIP type of app; something like
xiph's stuff gives you much more low-level control. Of course, the
trade-off is a steeper learning curve, and you need someone who can
really dig in and understand this stuff.

Video has somewhat different requirements. I was involved in Theora,
so naturally I would recommend it as a starting point (another
xiph/ogg offering). But the truth is that in video, you have to deal
with interoperability with all the formats out there.

In both cases, I think we need some work on the requirements side.
For instance, audio includes speech, music, and ambient effects, each
of which has its own issues.

-dan

Ryan McDougall

Jan 24, 2009, 4:16:18 AM
to realxtend-a...@googlegroups.com
I may not get precisely what you are going for. Can you start by
laying out some use cases? Requirements, as you said. Feel free to run
with your own requirements, as they might be more developed than ours,
which do kinda look like "virtual world + skype".

Cheers,

daniel miller

Jan 24, 2009, 3:03:22 PM
to realxtend-a...@googlegroups.com
Well, maybe I'm confusing voice and audio -- they're separate features
after all. I have not had occasion to use voice on SL, so I may be
ass-talking here.

I guess the main thing that differentiates our application from Skype
is that it is part of a virtual environment; it's not a phone call
(unless you do a private voice chat, which I assume works more like a
Skype call). Here's a bit of a review of SL voice that discusses the
experience:

"But voice in Second Life is better than Skype: The sound is
three-dimensional. You hear the sound directionally. The avatar's
voice seems to be coming from the direction the avatar is in relation
to your avatar. If the speaker's avatar is coming from your left, you
hear the voice from your left. If the speaker's avatar is coming from
the right, you hear the voice from the right. If the avatar is far
away, their voice is softer, if it's closer, it's louder. As you walk
or otherwise move around, the sound seems to change direction
appropriately."

So assuming we want at least as good an experience as Linden has
provided, the challenge is to integrate the voice feature into the
user's audioscape. I'd like to see further improvements such as echo,
reverb, giving prims acoustic reflection properties, and so on.
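
To give a rough idea of what I mean by spatialization (this is only an
illustrative sketch in Python; the function, the rolloff curve, and the
range figure are all made up, not a proposal for the actual
implementation): the client computes a gain and a pan from the relative
position of the speaking avatar, then applies them to the decoded
samples before handing them to the mixer.

import math

def spatialize(listener_pos, listener_forward, speaker_pos,
               ref_dist=1.0, max_dist=60.0):
    # Toy model: returns (gain, pan) for one voice source.
    # Gain follows a simple inverse-distance rolloff; pan comes from the
    # angle between the listener's facing direction and the speaker.
    dx = speaker_pos[0] - listener_pos[0]
    dy = speaker_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dy)
    if dist > max_dist:
        return 0.0, 0.0                      # out of earshot
    gain = ref_dist / max(dist, ref_dist)    # softer the farther away
    angle = math.atan2(dy, dx) - math.atan2(listener_forward[1],
                                            listener_forward[0])
    pan = math.sin(angle)                    # in [-1, 1]; sign depends on axes
    return gain, pan

Echo, reverb, and reflection would then be extra processing stages in
the same per-source chain.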

Since there is no real concept of initiating a call, I don't see what
place SIP has in this scenario. SIP is mostly about connecting two
people together over p2p. Our challenge seems to be mostly about
getting a good audio/speech codec, and integrating it into the whole
system.

Now I could see why you might want to look at a solution like
Telepathy if your position is that you don't want to worry about
synchronization, buffering, streaming and so on. But that brings up
an interesting point. As I said in my last post, the platform really
needs a concept of time, synchronization, and a separation of packet
delivery from event injection.

Assuming we were to buy into the need to think about this kind of
architecture, implementing the buffering and synchronization necessary
for voice and other media streaming would flow naturally.

It's my opinion that cobbling together unrelated streaming
technologies -- this lib for video, another for music, a 3rd for voice
-- is going to create huge headaches further on. In my somewhat
humble opinion, a virtual world platform IS a streaming platform, and
should be built from the bottom up with the concept of time,
synchronization, packet buffering and event posting schedules firmly
in place. Then we can build the features we need from a single,
manageable base of functionality.

-dan

Ryan McDougall

Jan 24, 2009, 3:45:12 PM
to realxtend-a...@googlegroups.com
I believe completely that it is a streaming platform -- streaming a 3D
world instead of video or audio. And the day may yet come where a
use-case demands that we reimplement buffering for 3D and audio and
video together, and do some cool things with that...

However the use case you present can be handled reasonably with SIP or
related technologies:

* When a user enters a region with 3D spatial audio on, it joins a
session on a central audio mixing server devoted to that region. That
is, it "calls" a special voice conferencing server.

* When a user is on a regular voice conference server, everyone's
voice is at 100% volume. It gets mixed with the other participant's
voices and the resulting stream is returned to all participants.

* However our mixing server reads positional information from the
region. Instead of one conference mixing one stream for all, our
server tracks clusters of people, and adds them dynamically and
transparently to a conference based on relative location. This is
where SIP signalling and call routing becomes important.

* Each conference mixes one stream for each recipient. In the ith
stream, the mixer applies an attenuating function for all participants
!= i, mixes them together and streams them back to participant i.

Of course one can imagine all sorts of optimization techniques such as
local mixing, P2P, early culling, etc.; or swapping SIP for some other
protocol such as Multi-user Jingle
(http://telepathy.freedesktop.org/wiki/MultiUserJingle); but the basic
idea remains the same.
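
To illustrate the per-recipient mixing step (a rough Python sketch; the
names and data shapes are invented just to show the shape of the loop,
not how a production mixer would be written):

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mix_for_recipient(recipient, positions, frames, attenuate):
    # positions: {user_id: (x, y, z)} read from the region
    # frames:    {user_id: list of float samples for this time slice}
    # attenuate: function(distance) -> gain in [0, 1]
    out = [0.0] * len(next(iter(frames.values())))
    for uid, samples in frames.items():
        if uid == recipient:
            continue              # don't echo the speaker back to themselves
        gain = attenuate(distance(positions[recipient], positions[uid]))
        for i, s in enumerate(samples):
            out[i] += gain * s
    return out

The mixer would run this once per participant per audio frame, then
encode and stream mix i back to participant i.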

If we use existing voice and video conferencing solutions, we can
re-use a lot of code. If we redo streaming entirely, it would be a
major undertaking that would require a serious use-case to justify. I
am not averse to doing so, it'd just take some serious discussion.

Cheers,

daniel miller

Jan 24, 2009, 4:12:53 PM
to realxtend-a...@googlegroups.com
On Sat, Jan 24, 2009 at 12:45 PM, Ryan McDougall <semp...@gmail.com> wrote:
>
> However the use case you present can be handled reasonably with SIP or
> related technologies:
>

I appreciate that any project requires tradeoffs between build vs. buy
(or use existing FOSS), and time to market vs. elegance and perfection
of design.

My philosophy is that it's OK to skimp on the implementation, but
things like network protocols (i.e. the API you present to the world)
should really be done right to the degree possible.

> * When a user enters a region with 3D spatial audio on, it joins a
> session on a central audio mixing server devoted to that region. That
> is, it "calls" a special voice conferencing server.
>
> * When a user is on a regular voice conference server, everyone's
> voice is at 100% volume. It gets mixed with the other participant's
> voices and the resulting stream is returned to all participants.
>
> * However our mixing server reads positional information from the
> region. Instead of one conference mixing one stream for all, our
> server tracks clusters of people, and adds them dynamically and
> transparently to a conference based on relative location. This is
> where SIP signalling and call routing becomes important.
>
> * Each conference mixes one stream for each recipient. In the ith
> stream, the mixer applies an attenuating function for all participants
> != i, mixes them together and streams them back to participant i.

what I don't understand yet is, what is the server doing, what is the
client doing, and how are they communicating? Also, are we talking
about something that is Linden compatible, or a totally new protocol?

Personally, I don't see the advantage of introducing VOIP protocols
such as SIP into the wire protocol, but perhaps that's a done deal, in
which case I'm whistling into the wind, as we say.

Regardless, I still think the approach of treating the client as a
streaming platform will simplify our lives tremendously. I would
settle for an architecture that at least gives me some hope of more
fully integrating the various pieces down the road.

-dan

Ryan McDougall

Jan 24, 2009, 4:40:21 PM
to realxtend-a...@googlegroups.com
On Sat, Jan 24, 2009 at 11:12 PM, daniel miller <danb...@gmail.com> wrote:
>
> On Sat, Jan 24, 2009 at 12:45 PM, Ryan McDougall <semp...@gmail.com> wrote:
>>
>> However the use case you present can be handled reasonably with SIP or
>> related technologies:
>>
>
> I appreciate that any project requires tradeoffs between build vs. buy
> (or use existing FOSS), and time to market vs. elegance and perfection
> of design.
>
> My philosophy is that it's OK to skimp on the implementation, but
> things like network protocols (i.e. the API you present to the world)
> should really be done right to the degree possible.

Which is fine and easy to agree with. However I'd need to see a
proposal before I can evaluate it.

>> * When a user enters a region with 3D spatial audio on, it joins a
>> session on a central audio mixing server devoted to that region. That
>> is, it "calls" a special voice conferencing server.
>>
>> * When a user is on a regular voice conference server, everyone's
>> voice is at 100% volume. It gets mixed with the other participant's
>> voices and the resulting stream is returned to all participants.
>>
>> * However our mixing server reads positional information from the
>> region. Instead of one conference mixing one stream for all, our
>> server tracks clusters of people, and adds them dynamically and
>> transparently to a conference based on relative location. This is
>> where SIP signalling and call routing becomes important.
>>
>> * Each conference mixes one stream for each recipient. In the ith
>> stream, the mixer applies an attenuating function for all participants
>> != i, mixes them together and streams them back to participant i.
>
> what I don't understand yet is, what is the server doing, what is the
> client doing, and how are they communicating? Also, are we talking
> about something that is Linden compatible, or a totally new protocol?

Vivox, in SL, uses SIP as far as I know. However, SIP is just a
signalling protocol; it almost always implies RTP
(http://en.wikipedia.org/wiki/Real-time_Transport_Protocol).

> Personally, I don't see the advantage of introducing VOIP protocols
> such as SIP into the wire protocol, but perhaps that's a done deal, in
> which case I'm whistling into the wind, as we say.

Nothing is a done deal as long as there is an alternative. What do you
propose in the way of specifics?

> Regardless, I still think the approach of treating the client as a
> streaming platform will simplify our lives tremendously. I would
> settle for an architecture that at least gives me some hope of more
> fully integrating the various pieces down the road.

Sounds like something I can agree to, but I'd need to see more details.

> -dan

Cheers,


Ryan McDougall

Jan 24, 2009, 5:22:38 PM
to realxtend-a...@googlegroups.com
Just wanted to make clear that I don't intend to bake voice or IM into
a single custom UDP protocol like LL did.

I propose that components of the system use whatever protocol is best
suited for the application. It is the logic on the client and/or
server that turns what those components report into a coherent data
model that represents the state of the world.

For example IM uses XMPP, voice/video uses its own streaming protocol
like SIP/RTP, asset downloads use HTTP, 3D world streaming is
OpenSim/SL, etc.
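
As a purely illustrative sketch (invented class names; the real
components would wrap actual XMPP and SIP/RTP libraries), the idea is
that each component owns its own transport and simply reports into a
shared model:

class WorldModel:
    # Shared data model that all components report into.
    def __init__(self):
        self.state = {}          # {entity_id: {component_name: data}}

    def apply(self, entity_id, component, data):
        self.state.setdefault(entity_id, {})[component] = data

class XmppChat:
    # Would wrap a real XMPP library; here it only forwards what it hears.
    def __init__(self, world):
        self.world = world

    def on_message(self, sender, text):
        self.world.apply(sender, "chat", text)

class SipVoice:
    # Would wrap SIP/RTP; here it only records that a stream exists.
    def __init__(self, world):
        self.world = world

    def on_stream(self, sender, stream_handle):
        self.world.apply(sender, "voice", stream_handle)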

There are many wonderfully designed protocols out there, and I don't
have the hubris to presume I can do better without some supporting
evidence.

daniel miller

Jan 25, 2009, 4:35:01 PM
to realxtend-a...@googlegroups.com
>> Personally, I don't see the advantage of introducing VOIP protocols
>> such as SIP into the wire protocol, but perhaps that's a done deal, in
>> which case I'm whistling into the wind, as we say.
>
> Nothing is a done deal as long as there is an alternative. What do you
> propose in the way of specifics?

OK, this is very back-of-the-envelope, but just to have it on record:

My basic approach would be to think of voice as simply another part of
the general packet protocol. When audio is detected on the user's mic
(i.e. above some silence threshold), you compress it and send it in a UDP
packet to the server, with a time stamp. The server gets voice
packets from clients, figures out who is within listening range, and
sends the appropriate packets (still with compressed speech) back out
to the clients. So decoding voice is a client operation similar to
displaying an avatar with animation: when the client gets a packet
with speech data, it knows where the avatar that is talking is
located, so it can compute distance and direction. It decodes the
data, applies volume/pan/effects appropriately, and renders the audio
to the mixer.

The aspect of all this that is not presently in the architecture is
the time stamp. However, as I've mentioned before, time stamps
*should* be part of the architecture for a multitude of reasons. Once
you accept that, the idea of implementing your own audio 'streaming'
-- i.e. early packet delivery with buffering and timing -- doesn't seem
so radical (at least to me).
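
To make that concrete, here's a hypothetical sketch (Python, every name
and figure invented; Speex is assumed only because it's the obvious
speech codec): the voice packet is just another timestamped message,
and the server's job reduces to range culling and relaying.

from dataclasses import dataclass

@dataclass
class VoicePacket:
    avatar_id: str
    timestamp: float      # sender clock, mapped onto shared time on receipt
    payload: bytes        # one compressed speech frame (e.g. Speex)

HEARING_RANGE = 60.0      # made-up figure

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def relay_voice(packet, positions, send):
    # Server side: forward the still-compressed packet to every client
    # within earshot.  positions: {avatar_id: (x, y, z)},
    # send: function(avatar_id, packet).
    src = positions[packet.avatar_id]
    for avatar_id, pos in positions.items():
        if avatar_id == packet.avatar_id:
            continue
        if dist(src, pos) <= HEARING_RANGE:
            send(avatar_id, packet)

The receiving client already knows where packet.avatar_id is, so it
decodes, spatializes, and schedules playback at packet.timestamp plus a
small jitter buffer.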

-dan

Ryan McDougall

Jan 25, 2009, 5:18:28 PM
to realxtend-a...@googlegroups.com
I think you've touched on a couple of issues here:

1. What defines the reX platform? What is the keystone to build the
platform on, and to build our future successes, commercial or
otherwise?

Is it the (custom) network protocol? Or a suite of services (defining
their own "standard" protocols as necessary) running in a common
application framework?

Or

Is the API on the wire, or in the source code?

Or

Are we more like Second Life, or OpenSim?

I get the feeling you believe the former, and I am inclined to the latter.

2. The relative cost-benefit analysis of elevating voice packets to
the same level of fidelity as 3D.

Humans are very sensitive to subtle discrepancies in graphics from one
frame to the next. Humans are also quite sensitive to discrepancies in
voice from one phoneme to the next.

However I have yet to see a convincing argument that:
- Phonemes need updating at 30+ frames per second
- Humans are sensitive to discrepancies between visual frames and phonemes
- Humans are sensitive to discrepancies between perceived visual
distance and audible distance
- 3D voice is as critical to 3D VW as 3D video

3. Your proposal sounds a bit like a modified TURN server:
http://en.wikipedia.org/wiki/Traversal_Using_Relay_NAT, which if you
threw in a couple 3D positional packets in addition to speex packets,
would work transparently using SIP/RTP

As I said before, I am open and up to anything. I just need to see the
rationale laid out clearly.

Cheers,

Ryan McDougall

Jan 25, 2009, 5:19:32 PM
to realxtend-a...@googlegroups.com
ps. Adding timestamps to the protocol sounds like something very
useful for a large number of reasons. What do you propose?

Cheers,

Ben Lindquist

Jan 26, 2009, 9:44:18 AM
to realxtend-a...@googlegroups.com
You should have a look at what the lg3d-wonderland folks built for their audio.  They have a voice-bridge process which acts as a 3d spatial conference bridge for SIP calls.  The wonderland server updates the bridge with avatar positions in real time, and the clients are using SIP.  The thing even supports bridging in regular phone calls; an avatar can carry a phone call around as a sphere, from conversation to conversation in the world...
 
Ben
 
ps - timestamps in the virtual world protocol seem like a very valuable thing to me, as you will see if you look at MXP

Ryan McDougall

Jan 26, 2009, 10:42:06 AM
to realxtend-a...@googlegroups.com
On Mon, Jan 26, 2009 at 4:44 PM, Ben Lindquist <arko...@gmail.com> wrote:
> You should have a look at what the lg3d-wonderland folks built for their
> audio. They have a voice-bridge process which acts as a 3d spatial
> conference bridge for SIP calls. The wonderland server updates the bridge
> with avatar positions in real time, and the clients are using SIP. The
> thing even supports bridging in regular phone calls; an avatar can carry a
> phone call around as a sphere, from conversation to conversation in the
> world...

Sounds like a feature list of SIP. I wonder if their server could be
used lock, stock, and barrel?

> Ben
>
> ps - timestamps in the virtual world protocol seem like a very valuable
> thing to me, as you will see if you look at MXP

I've heard of MXP. ;)


daniel miller

Jan 27, 2009, 8:13:28 PM
to realxtend-a...@googlegroups.com
On Sun, Jan 25, 2009 at 2:18 PM, Ryan McDougall <semp...@gmail.com> wrote:

> 1. What defines the reX platform? What is the keystone to build the
> platform on, and to build our future successes, commercial or
> otherwise?
>
> Is it the (custom) network protocol? Or a suite of services (defining
> their own "standard" protocols as necessary) running in a common
> application framework?
>
> Or
>
> Is the API on the wire, or in the source code?
>
> Or
>
> Are we more like Second Life, or OpenSim?
>
> I get the feeling you believe the former, and I am inclined to the latter.

Perhaps that's true. My inclination is to focus on "facts on the
ground", -- things as they actually are, rather than what should,
would, or could be the case. In my numerous years in this industry, I
have seen countless projects that aspired to be the 'platform', but in
reality most of those projects turned out to really be applications,
not a 'platform' in the deep sense of the word.

What does it even mean to speak of a platform these days? Is
Microsoft Windows a platform? Firefox? HTML? Linux? Or, perhaps,
the platform is defined by the de facto protocols that are actually
used by real-world applications to achieve information exchange with
their users. By that standard, today's platform is the list of
web-oriented protocols supported by a 'typical' user you wish to
support. In other words, right now, the most established platform is
Firefox, Safari, or IE, augmented by the plugins and media helpers
that are well supported on each (Flash, A/V formats, etc.). Note that
this can include apps that run outside the browser, such as Acrobat or
Windows Media.

In my mind, what we are trying to do here is to extend the platform to
enable new modalities of communication. How does that happen? It
happens when a downloadable piece of software is compelling enough for
potential users to install it. Once that starts to happen, there will
eventually be a critical mass of adoption, and the new media type will
become popular enough to be considered part of the
overall platform that is the internet as it is typically used.
The de facto standard at that point will be the precise way that
popular plug-in parses and sends packets. All the talk about
standards and modularity will be moot in the face of the need for
real-world apps to stay compatible with that one piece of software,
warts and all. In that sense, the API is defined by what that first
software does in every corner case, and the rest is just talk. SL is
a contender for this, but I believe it has not achieved critical mass
yet, so the field is still open.

My personal goal is to have as much impact as possible on what that first
popular piece of software in this space does, to ensure that it can support
the kinds of applications I want to build.

I acknowledge that this process can be influenced by a well-designed
FOSS project, and therefore issues of modularity and extensibility can
have an impact. However, in the real world, most of the new protocols
that have become de facto standards have started life as proprietary
systems (Flash, Acrobat, Windows Media, Quicktime -- and the new
social network protocols). To compete with this phenomenon, an open
source VW client needs to work well in the real world, and provide the
features people want. This is more important than issues of
modularity or extensibility. I saw all the attempts at open-source
web streaming audio and video fail precisely because they failed to
understand this fact of life.

> However I have yet to see a convincing argument that:
> - Phonemes need updating at 30+ frames per second
> - Humans are sensitive to discrepancies between visual frames and phonemes
> - Humans are sensitive to discrepancies between perceived visual
> distance and audible distance
> - 3D voice is as critical to 3D VW as 3D video

Well, I would pretty much disagree on all these points, at least in
the long term. In the video world, we struggled for years to achieve
proper audio/video synchronization, to make it good enough to pass
muster with typical viewers who were used to analog TV transmission
(which, for all its faults, usually managed to keep lips in synch).

The reason you might think these things are not relevant yet is
because right now, avatars don't actually talk with their lips. The
user doesn't have fine-grained timing control of their avatar, so sync
is sloppy and "good enough for rock&roll". If this project is going
to be successful long-term, it should be architected for what will be
possible in 3, 5, 7, or 10 years, not just what is happening today.

I've seen motion capture solutions that claim to be moving towards
consumer use, ie < $100 to capture your face and upper body. Once
that sort of thing is possible, the issues of timing and
synchronization of avatar movement and voice will become much more
critical. If our architecture fails to adapt to that requirement
because its fundamental architecture is missing the necessary
components, that would be a bad thing.

>
> 3. Your proposal sounds a bit like a modified TURN server:
> http://en.wikipedia.org/wiki/Traversal_Using_Relay_NAT, which if you
> threw in a couple 3D positional packets in addition to speex packets,
> would work transparently using SIP/RTP

I'm not really following your point here -- perhaps because of my own
lack of knowledge when it comes to low-level network stuff. I don't
care if the bottom layer is client-->server-->client or real P2P or
some mixture. I don't care if we use SIP to initiate things, RTP to
stream, etc. -- all that stuff is fine with me. What matters to me is
performance and capability. If an existing protocol gives us that,
fine. If not, I say punt it.


>
> As I said before, I am open and up to anything. I just need to see the
> rationale laid out clearly.
>
> Cheers,
>

I hope I'm not sounding argumentative. I just want this project to
have the best possible chance of great success. In the end, decisions
will be made, and I will decide whether to continue my involvement or
not. I won't hang around pissing and moaning if things go a different
way. I'm just ranting now because the decision process appears to
remain open.

Cheers,
dan

Ryan McDougall

Jan 28, 2009, 10:34:42 AM
to realxtend-a...@googlegroups.com
I don't believe there is anything to disagree with, except an
assertion that popular protocols always start out proprietary.

I would argue, as you do, that popular protocols start out *working*,
i.e. solving a real problem that many people have.

>> However I have yet to see a convincing argument that:
>> - Phonemes need updating at 30+ frames per second
>> - Humans are sensitive to discrepancies between visual frames and phonemes
>> - Humans are sensitive to discrepancies between perceived visual
>> distance and audible distance
>> - 3D voice is as critical to 3D VW as 3D video
>
> Well, I would pretty much disagree on all these points, at least in
> the long term. In the video world, we struggled for years to achieve
> proper audio/video synchronization, to make it good enough to pass
> muster with typical viewers who were used to analog TV transmission
> (which, for all its faults, usually managed to keep lips in synch).
>
> The reason you might think these things are not relevant yet is
> because right now, avatars don't actually talk with their lips. The
> user doesn't have fine-grained timing control of their avatar, so sync
> is sloppy and "good enough for rock&roll". If this project is going
> to be successful long-term, it should be architected for what will be
> possible in 3, 5, 7, or 10 years, not just what is happening today.
>
> I've seen motion capture solutions that claim to be moving towards
> consumer use, ie < $100 to capture your face and upper body. Once
> that sort of thing is possible, the issues of timing and
> synchronization of avatar movement and voice will become much more
> critical. If our architecture fails to adapt to that requirement
> because its fundamental architecture is missing the necessary
> components, that would be a bad thing.

Yes, if/when avatars speak with lips, the issues in syncing will cause problems.

However, before we can get to the future, we'll have to catch up with
the present. Which means Skype. I use it every day to speak with my
daughter in Japan. I'd much rather be using my own code to do so. If
we can't even do streaming video that well, what hope do we have of
surpassing it?

I respect that you don't want us to close doors to a future where
lipsyncing is important, but I really don't see that as being the case
at this moment.

>>
>> 3. Your proposal sounds a bit like a modified TURN server:
>> http://en.wikipedia.org/wiki/Traversal_Using_Relay_NAT, which if you
>> threw in a couple 3D positional packets in addition to speex packets,
>> would work transparently using SIP/RTP
>
> I'm not really following your point here -- perhaps because of my own
> lack of knowledge when it comes to low-level network stuff. I don't
> care if the bottom layer is client-->server-->client or real P2P or
> some mixture. I don't care if we use SIP to initiate things, RTP to
> stream, etc. -- all that stuff is fine with me. What matters to me is
> performance and capability. If an existing protocol gives us that,
> fine. If not, I say punt it.

Fair enough, I misunderstood your point then. I thought you saw
existing solutions as a waste of time.

I see it as a necessary stepping stone: when we have a basic level of
voice through existing solutions, we then have the luxury of trying to
measure its weaknesses and optimize, which may or may not mean
embedding voice in the 3D stream.

Actually, based on my thoughts up until now, it's rather more likely
that in the long term I would move away from SL/UDP towards
custom/RTP/SIP for 3D. Whether custom == MXP or not, it's just too
early to tell. If we can't do the basics, then there is no money for
later.

>> As I said before, I am open and up to anything. I just need to see the
>> rationale laid out clearly.
>>
>> Cheers,
>>
>
> I hope I'm not sounding argumentative. I just want this project to
> have the best possible chance of great success. In the end, decisions
> will be made, and I will decide whether to continue my involvement or
> not. I won't hang around pissing and moaning if things go a different
> way. I'm just ranting now because the decision process appears to
> remain open.

I'm usually the one worrying about being argumentative! :)

The decision making process is open, but I am still having a hard time
finding out precisely what you'd have us do. The fact of the matter is
that internally in reX we have goals and milestones that come as a
requirement of receiving money to work on reX. They have to be
respected.

Part of the plan I have created for respecting those milestones is
creating an iterative, evolutionary way of getting from "here" to
"there". That iterative plan calls for using modrex, so we have a
server to program against, a viewer so that the protocol may become
changeable, and finally a protocol that we can customize to evolve from
LL to what we think is important.

However, if you look closely, modrex is still not finished, and the
viewer will take at least a year to become useful. By the time that
year is up, we may yet run out of money.

So my question is:
1. What precisely do you propose?
2. How can we do it in one year so as to ensure we will receive
funding for the next years?
3. How would you allocate our limited resources over the year to
ensure that above outcome?

Using a high-level toolkit and existing protocols will save work and
thus ensure that we can have a viewer within a reasonable amount of
time that is the least bit useful. It also ensures high code quality
and modularity, so that if/when we decide to make more fundamental
changes to the protocol (see step 3 above), it is not a serious
re-engineer.

> Cheers,
> dan

Cheers,


daniel miller

Jan 30, 2009, 11:39:32 PM
to realxtend-a...@googlegroups.com
Soon, I think we should put this thread to rest. I don't want an
ongoing debate to subtract from your valuable time moving ahead. I'll
respond briefly as best I can to your last points.

[Ryan:]
...


> Fair enough, I misunderstood your point then. I thought you saw
> existing solutions as a waste of time

Of course not. My issue is entirely about what you gain vs. lose by
choosing an existing solution.


>
> I see it as a necessary stepping stone: when we have a basic level of
> voice through existing solutions, we then have the luxury of trying to
> measure its weaknesses and optimize, which may or may not mean
> embedding voice in the 3D stream.

Ok, so here's my issue. Once the protocol is out there, it's highly
unlikely that we will be able to just switch it to some new protocol.
SL went through terrible pain every time they wanted to upgrade their
client/server comms, and they were a small company with a limited,
devoted userbase and 100% control of the technology on both sides of
the wire. I suppose there could be some fancy new protocol extensions
with complex backward-compatibility hacks -- that's what will be
necessary. However, beyond technical issues, if this stuff gets
popular, there will be massive inertia against introducing even minor changes.


>
> Actually, based on my thoughts up until now, it's rather more likely
> that in the long term I would move away from SL/UDP towards
> custom/RTP/SIP for 3D. Whether custom == MXP or not, it's just too
> early to tell. If we can't do the basics, then there is no money for
> later.

I think I'd be happier with one protocol for 3D and voice, whether
it's SL/UDP or some RTP variant. My main objection is to completely
separate pipelines for 3D, video, music, and voice.

> The decision making process is open, but I am still having a hard time
> finding out precisely what you'd have us do. The fact of the matter is
> that internally in reX we have goals and milestones that come as a
> requirement of receiving money to work on reX. They have to be
> respected.

I totally understand that. I guess my main basis for argument is a
gut feeling that in this particular case (voice),

A) it's not as hard as you think to roll our own, and

B) the penalty of going with a black-box solution is going to be worse
than expected.

These are obviously judgement calls; neither of us can claim perfect knowledge.

> So my question is:
> 1. What precisely do you propose?

I propose developing (or perhaps choosing a 3rd party) time-based
event system, initially for the 3D pipeline, and also utilizing it for
voice, music, and video streaming as well (not necessarily with first
release). Server-side plugins would allow existing formats to be
integrated into the world (ie streaming quicktime video on a prim,
etc).

> 2. How can we do it in one year so as to ensure we will receive
> funding for the next years?

Like any other proposal, we would make estimates as to person-time
necessary to achieve subtasks, figure out the dependency structure,
and propose a timeline and milestones

> 3. How would you allocate our limited resources over the year to
> ensure that above outcome?

My argument is basically that by putting a bit of extra work in the
beginning (designing our own message pipeline), we will reap rewards
later in the project because integration of the parts will go much
more smoothly. Again, it's a judgement call, and I'm not the main
judge, just making a proposal.

> Using a high-level toolkit and existing protocols will save work and
> thus ensure that we can have a viewer within a reasonable amount of
> time that is the least bit useful. It also ensures high code quality
> and modularity, so that if/when we decide to make more fundamental
> changes to the protocol (see step 3 above), it is not a serious
> re-engineer.

I feel the corners of my lips curling up in a smile, and I'm asking
myself why. I have made the argument you are making many times.
Sometimes we rolled more of our own; sometimes we integrated more OPC
(Other People's Code). It's a question of balance. In my proposal,
we wouldn't try to develop our own codecs, for instance. OPC will be
there; the question is, how big a black box do you want to depend on?

I guess my gut is telling me to keep the black boxes small and easily
manageable. A big-ass SIP-based VOIP solution is not going to fit
that description, but a little, optimized voice codec will.

Peace,
dan

Ryan McDougall

Jan 31, 2009, 2:29:16 PM
to realxtend-a...@googlegroups.com
Well let me just say that I don't consider this thread any waste of my
time. I think it's very important for one to have one's assumptions
challenged, especially early in the design phase. Even if you don't
agree in the end, it is important to thoroughly consider the options
you didn't choose.

On Sat, Jan 31, 2009 at 6:39 AM, daniel miller <danb...@gmail.com> wrote:
>
> Ok, so here's my issue. Once the protocol is out there, it's highly
> unlikely that we will be able to just switch it to some new protocol.
> SL went through terrible pain every time they wanted to upgrade their
> client/server comms, and they were a small company with a limited,
> devoted userbase and 100% control of the technology on both sides of
> the wire. I suppose there could be some fancy new protocol extensions
> with complex backward-compatibility hacks -- that's what will be
> necessary. However, beyond technical issues, if this stuff gets
> popular, there will be massive inertia against introducing even minor changes.

One of the great things about reX is we don't have the burden of LL's
business model to support. Changing protocols will not be fun, but it
won't be a financial concern.


>> So my question is:
>> 1. What precisely do you propose?
>
> I propose developing (or perhaps choosing a 3rd party) time-based
> event system, initially for the 3D pipeline, and also utilizing it for
> voice, music, and video streaming as well (not necessarily with first
> release). Server-side plugins would allow existing formats to be
> integrated into the world (ie streaming quicktime video on a prim,
> etc).
>
>> 2. How can we do it in one year so as to ensure we will receive
>> funding for the next years?
>
> Like any other proposal, we would make estimates as to person-time
> necessary to achieve subtasks, figure out the dependency structure,
> and propose a timeline and milestones
>

Unfortunately this is the crux of the issue, with the details to be
filled in making all the difference in the world.

>> 3. How would you allocate our limited resources over the year to
>> ensure that above outcome?
>
> My argument is basically that by putting a bit of extra work in the
> beginning (designing our own message pipeline), we will reap rewards
> later in the project because integration of the parts will go much
> more smoothly. Again, it's a judgement call, and I'm not the main
> judge, just making a proposal.

I am willing to agree in principle, but not without details, and those
are details I can't really propose myself. We would need to start another
thread with your hypothetical plan, and we can crunch the numbers. However,
the window for making a decision is closing soon: by the end of February,
to be exact.

Cheers,

Toni Alatalo

Feb 1, 2009, 8:20:28 AM
to realxtend-a...@googlegroups.com
Ryan McDougall wrote:

> On Sat, Jan 31, 2009 at 6:39 AM, daniel miller <danb...@gmail.com> wrote:
>
>> Ok, so here's my issue. Once the protocol is out there, it's highly
>> unlikely that we will be able to just switch it to some new protocol.
>>
> One of the great things about reX is we don't have the burden of LL's
> business model to support. Changing protocols will not be fun, but it
>

To restate the points there a bit:

The plan, or at least my understanding of it, was initially indeed to
'just switch to some new protocol' a bit later. And like Ryan said in
one of the posts, to first implement the core and key components for the
viewer so that we'll have something to use the new protocol with.

So for the first stage no new protocol work has been planned, but rather
implementing the essential parts of the currently used SL protocol and
Rex extensions, and having an early version of the viewer that
functions using those against the existing server functionality
(rexserver refactored to modrex, which basically introduces no protocol
changes). Probably at that stage the NG viewer will not get popular yet,
as it'll lack functionality vs. the current Linden-based viewer.

After that the idea is to start introducing new architecture,
protocol(s) etc. - perhaps implementing some of the ideas that have been
discussed here earlier (and are implemented in e.g. MXP).

I guess only after that the new work is really "out there" -- before
that it'll remain of interest for developers only, and users are better
off with the legacy tools (except the ones that can't use the current
viewer at all due to gfx driver probs in how the two renderers are
mashed together).

Now that the planning has progressed, the initial idea of first
implementing the legacy protocol and then switching to something new is
also being reconsidered, in light of the new information from the
research and the practice of prototyping and planning the actual
implementations. I don't know exactly how Ryan sees it now, but my
feeling is that we are starting to need more detailed ideas of what will
stay and what new things will come first and how (and I'm not referring
to things like asset downloading, which are kinda clear already, but more
to what the 'wire protocol' will be). So perhaps now during February we
need to start digging there too (and not only into software arch issues
for implementing the core of the legacy, like currently). I'm not sure,
but ready to stand corrected :)

Regarding the specific question of audio and voice .. well in games
audio is traditionally quite similar to e.g. textures and particle
systems, in the sense that there is a pre-existing file ('asset') that
is the data (tex.png, fire.particle, or shot.wav) and the game logic
instantiates them during play (e.g. in a networked game when someone
shoots, a 'shot' event is sent to all clients which then show and play
the corresponding e.g. particle effect and audio clip). So no audio, nor
graphics for that matter, are transmitted in the protocol - just events
(and the assets are preinstalled in the game client from e.g. a DVD;
downloading from a server is identical, the client just needs the file
before it can play). And typically separate solutions have been used for
speech, like Teamspeak, which is essentially the same as Skype for the
purposes of this discussion - i.e. separate protocols etc. that don't
have anything to do with the game.
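
To make the traditional model concrete, a tiny sketch (the asset names
and the audio/scene interfaces are just placeholders for whatever the
engine exposes): the wire carries only the event, and each client plays
its preinstalled clip locally.

PRELOADED_CLIPS = {"shot": "sounds/shot.wav"}   # shipped with the client

def on_network_event(event, audio, scene):
    # Client side: the event only names an asset; no audio travels on
    # the wire.
    if event["type"] == "shot":
        audio.play(PRELOADED_CLIPS["shot"], position=event["position"])
        scene.spawn_particles("fire.particle", event["position"])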

Videoconferencing is of course a different matter, and I at least
partially agree with the point made that SL with its positional audio
may be a good solution for that. Rex does have previous experience in
implementing audio similar to that, with the Speex-based slvoice
replacement that was made more than a year ago (I'm not sure of the
current status there, other than that Mikko P. was recently fixing some
server things related to voice).

A final note about one thing Dan said:

> Server-side plugins would allow existing formats to be
> integrated into the world (ie streaming quicktime video on a prim,
> etc).

I don't think that should be the only model, that the viewer would only
be connected to the region which would transfer all the data.

I think the current model, which more resembles a Web browser, where the
viewer can be connected to many servers at the same time, using several
protocols and formats, to form the overall view of a place, should be at
least considered for the future as well. There are many servers out
there in the world that provide e.g. video streams (YouTube, Vimeo, ...),
and I don't think we want world servers relaying those to the viewers,
encapsulated in their own protocol, to be the only choice. I've had good
experiences on the Web building my own custom views of heavy data hosted
elsewhere: hosting just light HTML on my own small server to compose the
view, and having the visitors' browsers stream the embedded videos
directly from e.g. YouTube, slideshows from Flickr, etc.

But having timestamps in all packets is certainly a big +1 :)
In fact the game network library, RakNet, that we've used before also
had nicely (and automagically for the programmer) a shared time for all
the participants .. http://www.jenkinssoftware.com/ is about RakNet in
general (it seems to have 'voice communication' nowadays too :o) ..
http://www.jenkinssoftware.com/raknet/manual/timestamping.html is
about how they deal with time.

~Toni

daniel miller

Feb 1, 2009, 4:13:47 PM
to realxtend-a...@googlegroups.com
On Sun, Feb 1, 2009 at 5:20 AM, Toni Alatalo <ant...@kyperjokki.fi> wrote:

> A final note about one thing Dan said:
>
>> Server-side plugins would allow existing formats to be
>> integrated into the world (ie streaming quicktime video on a prim,
>> etc).
>
> I don't think that should be the only model, that the viewer would only
> be connected to the region which would transfer all the data.

On reflection, I totally agree. There are two very different use
cases, which I will explain below.


>
> I think the current model, which more resembles a Web browser, where the
> viewer can be connected to many servers at the same time, using several
> protocols and formats, to form the overall view of a place, should be at
> least considered for the future as well. There are many servers out
> there in the world that provide e.g. video streams (YouTube, Vimeo, ...),
> and I don't think we want world servers relaying those to the viewers,
> encapsulated in their own protocol, to be the only choice. I've had good
> experiences on the Web building my own custom views of heavy data hosted
> elsewhere: hosting just light HTML on my own small server to compose the
> view, and having the visitors' browsers stream the embedded videos
> directly from e.g. YouTube, slideshows from Flickr, etc.

OK, here's how it breaks down in my mind:

A streaming audio/video file (e.g. a YouTube clip) is a concrete unit, which
may have its own protocols and delivery mechanism. If our "browser"
has support for that datatype (perhaps as a plug-in), it can render it
and integrate it into the rest of the rendering pipeline.

As for voice, I see an avatar's speech as being part of its 'stream'
of data, which includes motion commands, animated behaviors, etc.
Therefore, I think it makes sense to design (if no such design exists
yet) a protocol that properly manages all of an avatar's (or other
entity's) behavior in an integrated fashion, including proper
synchronization. That includes voice because avatars can speak.

As a separate matter, we may want to support Skype conferencing, as we
would support certain types of video streaming or other data formats.
It's perfectly reasonable that this would be a separate protocol,
since Skype is a mature, well-supported application.

If you go back to some of the stuff I was posting a while ago about
"behavior servers", the idea of separate datastreams coming from
different places on the net makes perfect sense. My only point here
really is that an avatar's behavior and voice should be considered a
single data stream, or at least should be properly synchronized.

> But having timestamps in all packets is certainly a big +1 :)
> In fact the game network library, RakNet, that we've used before also
> had nicely (and automagically for the programmer) a shared time for all
> the participants .. http://www.jenkinssoftware.com/ is about RakNet in
> general (it seems to have 'voice communication' nowadays too :o) ..
> http://www.jenkinssoftware.com/raknet/manual/timestamping.html is
> about how they deal with time.

If we want to integrate something like a YouTube stream or a Skype
call, packet timing is inherent in those protocols. We can think of
our NG client as an application that integrates multiple static and
streaming datasources. The issue of how we synchronize those streams
is an interesting one. Typical streaming apps such as video and VOIP
have implicit latency due to buffering strategies, but these are
typically not exposed at the GUI/App layer. The assumption is that
these are internal matters, and the plugins deal with setting up
buffers and client/server timing, presumably to minimize delay as much
as possible given the network conditions and other parameters.

But now we have a new design goal, which is to integrate these
multiple streams such that they are mutually synchronized in a way
that creates the right experience for the user. This is something a
web browser simply doesn't do, so it's an area where we need to be
careful using the browser model as our mental image. A browser is
inherently a 2D, document-style interface, and the time component is
not something explicitly managed in an overall fashion. A virtual
world browser needs to always be thinking about time, and managing the
process of integrating its various information sources with a
well-defined notion of time, latency, and synchronization.
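
A hypothetical sketch of what that could look like at the lowest level
(Python, every name invented; the delay figure is arbitrary): every
incoming item, whether a voice frame, an animation keyframe, or a video
frame, carries a timestamp and is released to its consumer only when the
shared clock catches up to it.

import heapq
import itertools
import time

class TimedEventQueue:
    # Buffers timestamped items from any stream and releases them in
    # timestamp order.  `delay` is the playout latency that absorbs
    # network jitter; a real client would tune it per stream type.
    def __init__(self, delay=0.1):
        self.delay = delay
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal timestamps

    def push(self, timestamp, item):
        # timestamp is assumed to be already mapped onto the local clock
        heapq.heappush(self._heap, (timestamp, next(self._counter), item))

    def pop_due(self, now=None):
        now = time.monotonic() if now is None else now
        due = []
        while self._heap and self._heap[0][0] + self.delay <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due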

The voice issue is really just a small subset of this larger question.

Ryan McDougall

Feb 2, 2009, 11:51:46 AM
to realxtend-a...@googlegroups.com

I think I can agree with everything said herein, by both Toni and Dan. :)

Cheers,
