I appreciate that any project requires tradeoffs between build vs. buy
(or use existing FOSS), and time to market vs. elegance and perfection
of design.
My philosophy is that it's OK to skimp on the implementation, but
things like network protocols (i.e. the API you present to the world)
should really be done right, to the degree possible.
> * When a user enters a region with 3D spatial audio on, it joins a
> session on a central audio mixing server devoted to that region. That
> is, it "calls" a special voice conferencing server.
>
> * When a user is on a regular voice conference server, everyone's
> voice is at 100% volume. It gets mixed with the other participants'
> voices and the resulting stream is returned to all participants.
>
> * However our mixing server reads positional information from the
> region. Instead of one conference mixing one stream for all, our
> server tracks clusters of people, and adds them dynamically and
> transparently to a conference based on relative location. This is
> where SIP signalling and call routing becomes important.
>
> * Each conference mixes one stream for each recipient. In the ith
> stream, the mixer applies an attenuating function for all participants
> != i, mixes them together and streams them back to participant i.
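Just to check my reading of that last bullet, here is a minimal sketch
of what I take the per-recipient mix to be. The names and the
inverse-distance attenuation curve are my own, purely illustrative:

    # Minimal sketch of the per-recipient mix described above, assuming a
    # simple inverse-distance attenuation. All names are made up.
    import math

    FRAME_SIZE = 160  # e.g. 20 ms of 8 kHz samples per mixing interval

    def attenuation(distance, rolloff=1.0, min_gain=0.0):
        # Placeholder falloff; a real mixer would use a tuned curve.
        return max(min_gain, 1.0 / (1.0 + rolloff * distance))

    def mix_for_recipient(recipient, participants):
        # Each participant has .position (x, y, z) and .frame, a list of
        # FRAME_SIZE PCM samples for the current mixing interval.
        mixed = [0.0] * FRAME_SIZE
        for p in participants:
            if p is recipient:
                continue  # never send a participant their own voice back
            gain = attenuation(math.dist(p.position, recipient.position))
            for i, sample in enumerate(p.frame):
                mixed[i] += gain * sample
        return mixed  # the i-th stream: everyone but participant i, attenuated
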
What I don't understand yet is: what is the server doing, what is the
client doing, and how are they communicating? Also, are we talking
about something that is Linden-compatible, or a totally new protocol?
Personally, I don't see the advantage of introducing VOIP protocols
such as SIP into the wire protocol, but perhaps that's a done deal, in
which case I'm whistling into the wind, as we say.
Regardless, I still think the approach of treating the client as a
streaming platform will simplify our lives tremendously. I would
settle for an architecture that at least gives me some hope of more
fully integrating the various pieces down the road.
-dan
> 1. What defines the reX platform? What is the keystone to build the
> platform on, and to build our future successes, commercial or
> otherwise?
>
> Is it the (custom) network protocol? Or a suite of services (defining
> their own "standard" protocols as necessary) running in a common
> application framework?
>
> Or
>
> Is the API on the wire, or in the source code?
>
> Or
>
> Are we more like Second Life, or OpenSim?
>
> I get the feeling you believe the former, and I am inclined to the latter.
Perhaps that's true. My inclination is to focus on "facts on the
ground" -- things as they actually are, rather than what should,
would, or could be the case. In my many years in this industry, I
have seen countless projects that aspired to be the 'platform', but in
reality most of them turned out to be applications, not a 'platform'
in the deep sense of the word.
What does it even mean to speak of a platform these days? Is
Microsoft Windows a platform? Firefox? HTML? Linux? Or, perhaps,
the platform is defined by the de facto protocols that are actually
used by real-world applications to exchange information with their
users. By that standard, today's platform is the set of web-oriented
protocols supported by the 'typical' user you wish to reach. In other
words, right now the most established platform is Firefox, Safari, or
IE, augmented by the plugins and media helpers that are well supported
on each (Flash, A/V formats, etc.). Note that
this can include apps that run outside the browser, such as Acrobat or
Windows Media.
In my mind, what we are trying to do here is to extend the platform to
enable new modalities of communication. How does that happen? It
happens when a downloadable piece of software is compelling enough for
potential users to install it. Once that starts to happen, a critical
mass of adoption eventually builds, and the new media type becomes
popular enough to be considered part of the overall platform that is
the internet as it is typically used.
The de facto standard at that point will be the precise way that
popular plug-in parses and sends packets. All the talk about
standards and modularity will be moot in the face of the need for
real-world apps to stay compatible with that one piece of software,
warts and all. In that sense, the API is defined by what that first
software does in every corner case, and the rest is just talk. SL is
a contender for this, but I believe it has not achieved critical mass
yet, so I believe the field is still open.
My personal goal is to have as much impact as possible on what that
first popular piece of software in this space does, to ensure that it
can support the kinds of applications I want to build.
I acknowledge that this process can be influenced by a well-designed
FOSS project, and therefore issues of modularity and extensibility can
have an impact. However, in the real world, most of the new protocols
that have become defacto standards have started life as proprietary
systems (Flash, Acrobat, Windows Media, Quicktime -- and the new
social network protocols). To compete with this phenomenon, an open
source VW client needs to work well in the real world, and provide the
features people want. This is more important than issues of
modularity or extensibility. I saw all the attempts at open-source
web streaming audio and video fail precisely because they failed to
understand this fact of life.
> However I have yet to see a convincing argument that:
> - Phonemes need updating at 30+ frames per second
> - Humans are sensitive to discrepancies between visual frames and phonemes
> - Humans are sensitive to discrepancies between perceived visual
> distance and audible distance
> - 3D voice is as critical to 3D VW as 3D video
Well, I would pretty much disagree on all these points, at least in
the long term. In the video world, we struggled for years to achieve
proper audio/video synchronization, to make it good enough to pass
muster with typical viewers who were used to analog TV transmission
(which, for all its faults, usually managed to keep lips in synch).
The reason you might think these things are not relevant yet is
because right now, avatars don't actually talk with their lips. The
user doesn't have fine-grained timing control of their avatar, so sync
is sloppy and "good enough for rock&roll". If this project is going
to be successful long-term, it should be architected for what will be
possible in 3, 5, 7, or 10 years, not just what is happening today.
I've seen motion capture solutions that claim to be moving towards
consumer use, i.e. < $100 to capture your face and upper body. Once
that sort of thing is possible, the issues of timing and
synchronization of avatar movement and voice will become much more
critical. If our system fails to adapt to that requirement because
its fundamental architecture is missing the necessary components,
that would be a bad thing.
>
> 3. Your proposal sounds a bit like a modified TURN server:
> http://en.wikipedia.org/wiki/Traversal_Using_Relay_NAT, which if you
> threw in a couple 3D positional packets in addition to speex packets,
> would work transparently using SIP/RTP
I'm not really following your point here -- perhaps because of my own
lack of knowledge when it comes to low-level network stuff. I don't
care if the bottom layer is client-->server-->client or real P2P or
some mixture. I don't care if we use SIP to initiate things, RTP to
stream, etc. -- all that stuff is fine with me. What matters to me is
performance and capability. If an existing protocol gives us that,
fine. If not, I say punt it.
>
> As I said before, I am open and up to anything. I just need to see the
> rationale laid out clearly.
>
> Cheers,
>
I hope I'm not sounding argumentative. I just want this project to
have the best possible chance of great success. In the end, decisions
will be made, and I will decide whether to continue my involvement or
not. I won't hang around pissing and moaning if things go a different
way. I'm just ranting now because the decision process appears to
remain open.
Cheers,
dan
[Ryan:]
...
> Fair enough, I misunderstood your point then. I thought you saw
> existing solutions as a waste of time
Of course not. My issue is entirely about what you gain vs. lose by
choosing an existing solution.
>
> I see it as a necessary stepping stone: when we have basic level of
> voice through existing solutions, we then have the luxury of trying to
> measure its weaknesses and optimize, which may or may not mean
> embedding voice in 3D stream.
Ok, so here's my issue. Once the protocol is out there, it's highly
unlikely that we will be able to just switch it to some new protocol.
SL went through terrible pain every time they wanted to upgrade their
client/server comms, and they were a small company with a limited,
devoted userbase and 100% control of the technology on both sides of
the wire. I suppose there could be some fancy new protocol extensions
with complex backward-compatibility hacks -- that's what will be
necessary. However, beyond the technical issues, if this stuff gets
popular there will be massive inertia against introducing even minor changes.
>
> Actually, based on my thoughts up until now, it's rather more likely
> that in the long term I would move away from SL/UDP towards
> custom/RTP/SIP for 3D. Whether custom == MXP or not, it's just too
> early to tell. If we can't do the basics, then there is no money for
> later.
I think I'd be happier with one protocol for 3D and voice, whether
it's SL/UDP or some RTP variant. My main objection is to having
completely separate pipelines for 3D, video, music, and voice.
> The decision making process is open, but I am still having a hard time
> finding out precisely what you'd have us do. The fact of the matter is
> that internally in reX we have goals and milestones that come as a
> requirement of receiving money to work on reX. They have to be
> respected.
I totally understand that. I guess my main basis for argument is a
gut feeling that in this particular case (voice),
A) it's not as hard as you think to roll our own, and
B) the penalty of going with a black-box solution is going to be worse
than expected.
These are obviously judgement calls; neither of us can claim perfect knowledge.
> So my question is:
> 1. What precisely do you propose?
I propose developing (or perhaps adopting a 3rd-party) time-based
event system, initially for the 3D pipeline, and later using it for
voice, music, and video streaming as well (not necessarily in the
first release). Server-side plugins would allow existing formats to be
integrated into the world (i.e. streaming QuickTime video on a prim,
etc.).
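To make that a bit more concrete, here's a minimal sketch of the kind
of timestamped event envelope I have in mind. The field names are
invented, purely illustrative:

    # Hypothetical timestamped event envelope, shared by 3D state updates,
    # voice, music, and video frames alike. Field names are invented.
    from dataclasses import dataclass

    @dataclass
    class Event:
        timestamp_ms: int    # world-clock time the payload refers to
        source_id: str       # avatar, object, or stream that produced it
        kind: str            # e.g. "move", "anim", "voice", "video"
        payload: bytes       # codec frame, transform data, etc.

    # A movement update and a voice frame for the same avatar carry the
    # same clock, so the receiver can schedule them together:
    events = [
        Event(1000, "avatar42", "move", b"...transform..."),
        Event(1000, "avatar42", "voice", b"...speex frame..."),
    ]
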
> 2. How can we do it in one year so as to ensure we will receive
> funding for the next years?
Like any other proposal, we would estimate the person-time needed for
each subtask, figure out the dependency structure, and propose a
timeline and milestones.
> 3. How would you allocate our limited resources over the year to
> ensure that above outcome?
My argument is basically that by putting a bit of extra work in the
beginning (designing our own message pipeline), we will reap rewards
later in the project because integration of the parts will go much
more smoothly. Again, it's a judgement call, and I'm not the main
judge, just making a proposal.
> Using a high-level toolkit and existing protocols will save work and
> thus ensure that we can have a viewer within a reasonable amount of
> time that is the least bit useful. It also ensures high code quality
> and modularity, so that if/when we decide to make more fundamental
> changes to the protocol (see step 3 above), it is not a serious
> re-engineer.
I feel the corners of my lips curling up in a smile, and I'm asking
myself why. I have made the argument you are making many times.
Sometimes we rolled more of our own; sometimes we integrated more OPC
(Other People's Code). It's a question of balance. In my proposal,
we wouldn't try to develop our own codecs, for instance. OPC will be
there; the question is, how big a black box do you want to depend on?
I guess my gut is telling me to keep the black boxes small and easily
manageable. A big-ass SIP-based VOIP solution is not going to fit
that description, but a little, optimized voice codec will.
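For instance, the sort of "small black box" I mean is nothing more than
this -- a hypothetical wrapper, not any particular library's API:

    # Narrow wrapper around a voice codec (e.g. Speex); the rest of the
    # system only ever sees encode/decode of raw frames. Hypothetical API.
    from abc import ABC, abstractmethod

    class VoiceCodec(ABC):
        frame_samples = 160  # e.g. 20 ms at 8 kHz

        @abstractmethod
        def encode(self, pcm_frame: bytes) -> bytes:
            """Compress one frame of raw PCM."""

        @abstractmethod
        def decode(self, packet: bytes) -> bytes:
            """Decompress one packet back to raw PCM."""

    # Swapping codecs later means re-implementing this class, nothing more.
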
Peace,
dan
On Sat, Jan 31, 2009 at 6:39 AM, daniel miller <danb...@gmail.com> wrote:
>
> Ok, so here's my issue. Once the protocol is out there, it's highly
> unlikely that we will be able to just switch it to some new protocol.
> SL went through terrible pain every time they wanted to upgrade their
> client/server comms, and they were a small company with a limited,
> devoted userbase and 100% control of the technology on both sides of
> the wire. I suppose there could be some fancy new protocol extensions
> with complex backward-compatibility hacks -- that's what will be
> necessary. However, beyond the technical issues, if this stuff gets
> popular there will be massive inertia against introducing even minor changes.
One of the great things about reX is that we don't have the burden of
LL's business model to support. Changing protocols will not be fun,
but it won't be a financial concern.
>> So my question is:
>> 1. What precisely do you propose?
>
> I propose developing (or perhaps adopting a 3rd-party) time-based
> event system, initially for the 3D pipeline, and later using it for
> voice, music, and video streaming as well (not necessarily in the
> first release). Server-side plugins would allow existing formats to be
> integrated into the world (i.e. streaming QuickTime video on a prim,
> etc.).
>
>> 2. How can we do it in one year so as to ensure we will receive
>> funding for the next years?
>
> Like any other proposal, we would estimate the person-time needed for
> each subtask, figure out the dependency structure, and propose a
> timeline and milestones.
>
Unfortunately this is the crux of the issue, with the details to be
filled in making all the difference in the world.
>> 3. How would you allocate our limited resources over the year to
>> ensure that above outcome?
>
> My argument is basically that by putting a bit of extra work in the
> beginning (designing our own message pipeline), we will reap rewards
> later in the project because integration of the parts will go much
> more smoothly. Again, it's a judgement call, and I'm not the main
> judge, just making a proposal.
I am willing to agree in principle, but not without details, and those
are details I can't really propose myself. We would need to start
another thread with your hypothetical plan, and then we can crunch the
numbers. However, the window for making a decision is closing soon --
by the end of February, to be exact.
Cheers,
To restate the points there a bit:
The plan, or at least my understanding of it, was initially indeed to
'just switch to some new protocol' a bit later. And like Ryan said in
one of the posts, to first implement the core and key components of the
viewer so that we'll have something to use the new protocol with.
So for the first stage no new protocol work has been planned; instead we
implement the essential parts of the currently used SL protocol and the
Rex extensions, and have an early version of the viewer that functions,
using those, against the existing server functionality (rexserver
refactored to modrex, which basically introduces no protocol changes).
Probably at that stage the NG viewer will not get popular yet, as it'll
lack functionality vs. the current Linden-based viewer.
After that the idea is to start introducing new architecture,
protocol(s), etc. - perhaps implementing some of the ideas that have
been discussed here earlier (and are implemented in e.g. MXP).
I guess only after that is the new work really "out there" -- before
that it'll remain of interest to developers only, and users are better
off with the legacy tools (except the ones that can't use the current
viewer at all due to gfx driver problems in how the two renderers are
mashed together).
Now that the planning has progressed, the initial idea of first
implementing the legacy protocol and then switching to something new is
also being reconsidered, in light of the new information from the
research and the practice of prototyping and planning the actual
implementations. I don't know exactly how Ryan sees it now, but my
feeling is that we are starting to need more detailed ideas of what will
stay, what new things will come first, and how (and I'm not referring
to things like asset downloading, which are kinda clear already, but
more to what the 'wire protocol' will be). So perhaps now during
February we need to start digging there too (and not only into software
arch issues for implementing the core of the legacy, like currently).
I'm not sure, but ready to stand corrected :)
Regarding the specific question of audio and voice .. well, in games
audio is traditionally quite similar to e.g. textures and particle
systems, in the sense that there is a pre-existing file ('asset') that
is the data (tex.png, fire.particle, or shot.wav) and the game logic
instantiates them during play (e.g. in a networked game when someone
shoots, a 'shot' event is sent to all clients, which then show and play
the corresponding particle effect and audio clip). So no audio, nor
graphics for that matter, are transmitted in the protocol - just events
(and the assets are preinstalled in the game client from e.g. a DVD;
downloading from a server is identical, the client just needs the file
before it can play it). And typically separate solutions have been used
for speech, like Teamspeak, which is essentially the same as Skype for
the purposes of this discussion - i.e. separate protocols etc. that
don't have anything to do with the game.
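A toy sketch of that model, just to show how little actually goes over
the wire (all names made up):

    # Toy sketch: assets are preinstalled on the client, and the protocol
    # only carries small events that reference them. Names are made up.
    preloaded_assets = {
        "shot.wav": b"...pcm data...",          # installed or downloaded once
        "fire.particle": b"...particle def...",
    }

    def play(clip, at):
        print("playing", len(clip), "bytes of audio at", at)

    def spawn(effect, at):
        print("spawning particle effect at", at)

    def handle_event(event):
        # e.g. {"type": "shoot", "who": "player7", "pos": (1.0, 2.0, 3.0)}
        if event["type"] == "shoot":
            play(preloaded_assets["shot.wav"], at=event["pos"])
            spawn(preloaded_assets["fire.particle"], at=event["pos"])

    handle_event({"type": "shoot", "who": "player7", "pos": (1.0, 2.0, 3.0)})
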
Videoconferencing is of course a different matter, and I at least
partially agree with the point made that SL with positional audio may
be a good solution for that. Rex does have previous experience in
implementing audio similar to that, with the Speex-based slvoice
replacement that was made more than a year ago (I'm not sure of the
current status there, other than that Mikko P. was recently fixing some
server things related to voice).
A final note about one thing Dan said:
> Server-side plugins would allow existing formats to be
> integrated into the world (ie streaming quicktime video on a prim,
> etc).
I don't think that should be the only model, i.e. that the viewer would
only be connected to the region, which would transfer all the data.
I think the current model, which more resembles a Web browser (where
the viewer can be connected to many servers at the same time, using
several protocols and formats, to form the overall view of a place),
should be at least considered for the future as well. There are many
servers out there in the world that provide e.g. video streams (YouTube,
Vimeo, ..) and I don't think we want world servers relaying those to the
viewers, encapsulated in their own protocol, to be the only choice. I've
had good experiences on the Web building my own custom views of heavy
data hosted elsewhere: hosting just light HTML on my own small server to
compose the view, and having the visitors' browsers stream the embedded
videos directly from e.g. YouTube, slideshows from Flickr, etc.
But having timestamps in all packets is certainly a big +1 :)
In fact the game network library RakNet, which we've used before, also
nicely (and automagically for the programmer) provided a shared time for
all the participants .. http://www.jenkinssoftware.com/ is about RakNet
in general (it seems to have 'voice communication' nowadays too :o) ..
http://www.jenkinssoftware.com/raknet/manual/timestamping.html is about
how they deal with time.
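I won't pretend to know RakNet's internals, but conceptually the shared
time is just the usual ping/pong offset estimate, something like this
(illustrative only, not RakNet's actual API):

    # Rough sketch of estimating the offset to a server clock from one
    # request/response exchange (assumes roughly symmetric latency).
    import time

    def estimate_offset(request_server_time, local_clock=time.monotonic):
        # request_server_time is a hypothetical transport call that returns
        # the server's clock reading.
        t0 = local_clock()                   # when our request leaves
        server_time = request_server_time()  # server replies with its clock
        t1 = local_clock()                   # when the reply arrives
        rtt = t1 - t0
        return server_time - (t0 + rtt / 2.0)

    # shared_time = time.monotonic() + offset  (same on every participant,
    # modulo the error in the symmetry assumption)
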
~Toni
> A final note about one thing Dan said:
>
>> Server-side plugins would allow existing formats to be
>> integrated into the world (ie streaming quicktime video on a prim,
>> etc).
>
> I don't think that should be the only model, i.e. that the viewer would
> only be connected to the region, which would transfer all the data.
On reflection, I totally agree. There are two very different use
cases, which I will explain below.
>
> I think the current model, which more resembles a Web browser (where
> the viewer can be connected to many servers at the same time, using
> several protocols and formats, to form the overall view of a place),
> should be at least considered for the future as well. There are many
> servers out there in the world that provide e.g. video streams (YouTube,
> Vimeo, ..) and I don't think we want world servers relaying those to the
> viewers, encapsulated in their own protocol, to be the only choice. I've
> had good experiences on the Web building my own custom views of heavy
> data hosted elsewhere: hosting just light HTML on my own small server to
> compose the view, and having the visitors' browsers stream the embedded
> videos directly from e.g. YouTube, slideshows from Flickr, etc.
OK, here's how it breaks down in my mind:
A streaming audio/video file (i.e. a YouTube clip) is a concrete unit,
which may have its own protocols and delivery mechanism. If our
"browser" has support for that datatype (perhaps as a plug-in), it can
render it and integrate it into the rest of the rendering pipeline.
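Something like a per-datatype plug-in interface is what I picture here
(sketch only, names invented):

    # Hypothetical per-datatype media plug-in. The "browser" asks each
    # registered plug-in whether it can handle a source, then pulls
    # decoded frames from it for compositing into the 3D scene.
    from abc import ABC, abstractmethod
    from typing import Optional

    class MediaPlugin(ABC):
        @abstractmethod
        def can_handle(self, mime_type: str) -> bool: ...

        @abstractmethod
        def open(self, url: str) -> None: ...

        @abstractmethod
        def frame_at(self, t_ms: int) -> bytes:
            """Decoded frame nearest to t_ms, ready to texture onto a prim."""

    def pick_plugin(plugins, mime_type) -> Optional[MediaPlugin]:
        return next((p for p in plugins if p.can_handle(mime_type)), None)
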
As for voice, I see an avatar's speech as being part of its 'stream'
of data, which includes motion commands, animated behaviors, etc.
Therefore, I think it makes sense to design (if no such design exists
yet) a protocol that manages all of an avatar's (or other entity's)
behavior in an integrated fashion, including proper synchronization.
That includes voice, because avatars can speak.
As a separate matter, we may want to support Skype conferencing, as we
would support certain types of video streaming or other data formats.
It's perfectly reasonable that this would be a separate protocol,
since Skype is a mature, well-supported application.
If you go back to some of the stuff I was posting a while ago about
"behavior servers", the idea of separate datastreams coming from
different places on the net makes perfect sense. My only point here
really is that an avatar's behavior and voice should be considered a
single data stream, or at least should be properly synchronized.
> But having timestamps in all packets is certainly a big +1 :)
> In fact the game network library RakNet, which we've used before, also
> nicely (and automagically for the programmer) provided a shared time for
> all the participants .. http://www.jenkinssoftware.com/ is about RakNet
> in general (it seems to have 'voice communication' nowadays too :o) ..
> http://www.jenkinssoftware.com/raknet/manual/timestamping.html is about
> how they deal with time.
If we want to integrate something like a YouTube stream or a Skype
call, packet timing is inherent in those protocols. We can think of
our NG client as an application that integrates multiple static and
streaming datasources. The issue of how we synchronize those streams
is an interesting one. Typical streaming apps, such as video players
and VOIP clients, have implicit latency due to buffering strategies,
but this latency is typically not exposed at the GUI/app layer. The
assumption is that
these are internal matters, and the plugins deal with setting up
buffers and client/server timing, presumably to minimize delay as much
as possible given the network conditions and other parameters.
But now we have a new design goal, which is to integrate these
multiple streams such that they are mutually synchronized in a way
that creates the right experience for the user. This is something a
web browser simply doesn't do, so it's an area where we need to be
careful using the browser model as our mental image. A browser is
inherently a 2D, document-style interface, and the time component is
not something explicitly managed in an overall fashion. A virtual
world browser needs to always be thinking about time, and managing the
process of integrating its various information sources with a
well-defined notion of time, latency, and synchronization.
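To make that concrete, here is a rough sketch of what a shared playout
clock might look like. Everything here is hypothetical, just to
illustrate the idea:

    # Sketch: frames from several buffered streams are scheduled against
    # one shared playout clock, so voice, video, and avatar motion line up.
    # Assumes the caller has already mapped each stream's media timestamps
    # onto the shared local clock.
    import heapq

    class PlayoutScheduler:
        def __init__(self, playout_delay_s=0.15):
            self.delay = playout_delay_s   # common jitter margin for all streams
            self.queue = []                # (due_time, seq, stream_id, frame)
            self._seq = 0                  # tie-breaker for equal due times

        def submit(self, stream_id, media_time_s, frame):
            heapq.heappush(self.queue,
                           (media_time_s + self.delay, self._seq, stream_id, frame))
            self._seq += 1

        def poll(self, now_s):
            """Return every (stream_id, frame) whose playout time has arrived."""
            ready = []
            while self.queue and self.queue[0][0] <= now_s:
                _, _, stream_id, frame = heapq.heappop(self.queue)
                ready.append((stream_id, frame))
            return ready
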
The voice issue is really just a small subset of this larger question.
I think I can agree with everything said herein, by both Toni and Dan. :)
Cheers,