Parallel Detour navigation mesh queries

Cam

unread,

Jul 21, 2010, 9:10:57 PM7/21/10

to recastnavigation

I thought it might be better to raise this here rather than as an
issue, so anyone interested can chip in.

I'm interested in performing multi-threaded navigation mesh queries
with Detour. However currently many queries like findPath modify a
node pool and open list that are owned by dtNavMesh. This means if you
call dtNavMesh::findPath from two different threads you'll get both of
them using the same node pool etc. Concurrent reads of the navigation
mesh data is fine, as long as no other thread is writing to it at the
time, but the node pool is a problem.

One way of paralleling path finding would be to have a path finding
thread which has an input queue of path finding (or other) requests
and an output queue of results. Because only one thread is calling
findPath, there is no race condition with the node pool and open list.
However, this method is never going to scale up for new hardware with
more cores, which is a bit limiting.

Multiple worker threads executing concurrent path finding (or other
navigation) requests in my mind is the way to go. There is just the
one little issue of the shared node pool and open list.

I've played around a bit with making a parallel nav mesh tester tool,
the way I've got it working is by adding additional methods for
findPath, findDistanceToWall and findPolys around which take a
dtNodePool and dtNodeQueue as parameters. The original dtNavMesh
methods call my new ones passing the node pool and open list owned by
dtNavMesh.

I've created a ParallelNavMeshTesterTool which is a copy of
NavMeshTesterTool with multiple agents. recalc() is executed across
multiple threads using OpenMP. Each agent currently has their own node
pool and open list, but it would be reasonably easy to create a node
pool and open list per worker thread.

This all works and you can see a performance increase as the number of
worker threads is ramped up. I think the only thing I'm not happy with
are my changes to dtNavMesh. The way I've done it was to make as few
code changes as possible, but I feel like there is probably a nicer
way. One idea I had was to separate the navigation mesh data from the
navigation mesh queries. The navigation mesh data can be shared. A
navigation mesh query class could have a const pointer to the
navigation mesh data and it's own node pool and open list. This way
each worker thread could have its own dtNavMeshQuery class which is
thread safe - providing the dtNavMesh isn't being modified while
queries are executed. Splitting data and querying might be nice for
PS3 too, since an SPU version of the data class could hide the DMAing
of nav mesh data. Although there's no real reason why that couldn't be
done in the current code with some #ifdefs.

My changes are all at http://github.com/bitshifter/recastnavigation/tree/parallel
if anyone is interested, if you don't have git you can just hit the
download source button. I'm using a CMake project to build. In theory
the CMake build should work on Mac and Windows, I've only tried it on
Linux though.

Mikko Mononen

unread,

Jul 22, 2010, 2:18:43 AM7/22/10

to recastna...@googlegroups.com

Hi,

So far I have been trying to do the work on dtNavMesh so that the data
could be as read only as possible. But other than that I have not
tinkered too much about parallel stuff.

I like the idea of separating the navmesh data and the query/tools
which use it. Maybe the query class could be always thread safe, and
there would be no reason to add some defaults for the methods for a
common use case or create duplicate functions which does the same
thing by thread safely. Should make the dtNavMesh class cleaner too.

So basically the dtNavMesh would have methods to add, remove and
change data, and dtNavMeshQuery would have all the finds, getters and
querys in there.

One additional pro for the separate dtNavMesh query is that it would
also allow to add some extra costs or path find blockers per query.
_

A related problem I have not been able to solve are:
- How to handle concurrent data changes?
- Is the data allowed to change while the query methods are running?

Locks, task queues?

Adding and removing tiles happen relatively seldom, but changing
polygon flags and types can happen more often. I wonder if locks are
going to work since setting a polygon flag might block until a path
request is finished.

My idea with queues was that all the queries and data changes would go
into one work queue and data change tasks would work like flush, so
they would wait for all the previous tasks to finish until they are
executed. So normally everything would work in parallel.

I don't want to that much functionality to Detour, though. Everybody
has their own job managers and I would rather allow Detour to be
integrated to those. I don't wan to add a thread library to Detour
either. I also want to keep the general use case as simple as it is
now.

These ideas probably reveal my total lack of knowledge on parallel
programming :)

--mikko

Bill Bierman

unread,

Jul 22, 2010, 4:07:06 AM7/22/10

to recastna...@googlegroups.com

The answer I think to most of these issues is simple: mutex mutex mutex

For multiple threads to share a node pool (which you want since they
are so large), you need only wrap the functions which change them in a
mutex. We use recast on World of Warcraft and to do some of the
longest possible pathfinds take a matter of milliseconds. Obviously
there wouldn't be concurrent pathfinds going on, but it would allow
different threads to share the data.

As for Mikko's issues of concurrent data changes and whether the data
is allowed to change during a query, a mutex or two or three would
solve the concurrency issue, and basically the answer to changing the
data during a query is no. Sure you lose some performance by sleeping
a thread while you wait for a lock, but even if you could find a
graceful way to recover from changing the data mid-query, I would
guess the overhead required to do so would make it even less useful
than the delay in sleeping for the mutex.

Cam

unread,

Jul 22, 2010, 7:42:56 AM7/22/10

to recastnavigation

I disagree that mutexes are the answer. They are 'an' answer but one
size doesn't fit all. Throwing a mutex in each access of the node pool
and open list would add quite a lot of overhead to path finding. If
you are using the default of 2048 nodes, that's about 37k for a node
pool and 16k for the open list. Times that by 4 worker threads is
212k. That's not a huge amount of memory, even on a console. Also if
you mutex up the node pool, the more worker threads you add the more
contention you will get on the node pool. Throw a bit of memory at it
problem and there is no concurrency problem.

I think Mikko is right to not want to add to much of this to Recast/
Detour. C++ has no standard concurrency primatives, so they need to be
implemented for each platform. This is an additional hassle for closed
platforms like game consoles as the implementation can't be shared.
Also different projects can have pretty different requirements. For
example the game I'm working on a node pool of 2048 is sufficient and
the path finding requests are reasonably snappy (although I really
need some better stats on this). I imagine this is quite different to
path finding across World of Warcraft maps. I'm mostly interested in
changing the API to make it a little more flexible, for all use cases.

As for updates to the navigation mesh I think these are best done
sequentially. If queries are happening, then queue up nav mesh changes
and apply them when no other thread is accessing the data. If a
polygon flag change is going to invalidate all the paths then perhaps
just cancelling running path finding jobs would be the way to go.
Another option might be limiting path finding jobs to a number of
iterations per run, so long running jobs would stop after visiting
1000 nodes or so at which point the nav mesh could be updated and
potentially partial paths invalidated.

On Jul 22, 8:07 pm, Bill Bierman <b...@thebiermans.org> wrote:
> The answer I think to most of these issues is simple: mutex mutex mutex
>
> For multiple threads to share a node pool (which you want since they
> are so large), you need only wrap the functions which change them in a
> mutex. We use recast on World of Warcraft and to do some of the
> longest possible pathfinds take a matter of milliseconds. Obviously
> there wouldn't be concurrent pathfinds going on, but it would allow
> different threads to share the data.
>
> As for Mikko's issues of concurrent data changes and whether the data
> is allowed to change during a query, a mutex or two or three would
> solve the concurrency issue, and basically the answer to changing the
> data during a query is no. Sure you lose some performance by sleeping
> a thread while you wait for a lock, but even if you could find a
> graceful way to recover from changing the data mid-query, I would
> guess the overhead required to do so would make it even less useful
> than the delay in sleeping for the mutex.
>

> >> My changes are all athttp://github.com/bitshifter/recastnavigation/tree/parallel

Bill Bierman

unread,

Jul 22, 2010, 2:06:17 PM7/22/10

to recastna...@googlegroups.com

I suppose it depends how often you're attempting a pathfind. For us,
we find a large path and take anywhere from 15 seconds to 45 minutes
to walk it. Unless you're calling on an average of once per second, I
wouldn't think the addition of an "if (lock) { ... }" would add that
much overhead. Again this does limit you so that only one thread can
be pathfinding at once. Your numbers about the node pool don't seem
right to me, but I haven't looked at it for quite some time. If it's
true that the additional memory footprint would be negligible, then I
would certainly agree that it would be better to give each thread its
own node pool and space for the open list. That would probably be all
that is necessary to allow for concurrent pathfinds since that is the
only place the data would be changing.

C# has standard syntax for locking objects and makes things really
easy, but it's just syntactic sugar that you don't really need. You
can easily enough do something like:

bool dataMutex;
void addData() {
while (dataMutex) sleep(x);
dataMutex = true;
...
dataMutex = false;
}

This would have essentially the same effect of queuing up changes to
the nav mesh data, but without the need to devise a uniform structure
to build an actual queue of changes to be applied.

Gotisch

unread,

Jul 22, 2010, 4:50:46 PM7/22/10

to recastna...@googlegroups.com

You should at least declare that bool as volatile, no? Else you are just begging for threading problems.
(see http://msdn.microsoft.com/en-us/library/x13ttww7%28v=VS.80%29.aspx )

but this is getting a bit off-topic :)

Mikko Mononen

unread,

Jul 22, 2010, 5:26:55 PM7/22/10

to recastna...@googlegroups.com

Hi,

Yes, it is right there moving along the tangent towards the off-topic planet ;)

I think the case Bill has can be easily handled by just making a
request queue and handle all that stuff on separate thread if
multithreading is necessary. The case Cam has is that there are
potentially many path requests in flight all the time. Adding mutex
there will kill the performance, and using just single path finder
with queue will not scale.

As a first step, I think I will separate the data and the algorithms.
That should allow similar use that you guys are doing currently, as
well as it should allow to have multiple navigation queries in flight
at the same time. It would be up to the user to sync the data updates.

If there are more input on how the data updates could be handled
nicely, I'm interested to hear that. Or any thoughts on the
separation, although I think it is pretty straight forward.

--mikko

Bjoern Knafla

unread,

Jul 22, 2010, 6:51:14 PM7/22/10

to recastnavigation

I don't know much about Detour and Recast but have thought for quite
some time about parallelizing pathfinding and Mikko asked me if I'd
like to chime in, too.

How does Detour handle pathfinding requests?
1. Do you call a function to request a path from a start- to an end-
node and when the pathfinding is done the function returns with a
result?
2. Do you create a pathfinding request instance and call a function
again and again to iteratively search for the path till the end-node
is found?
3. Does the pathfinding work with hierarchies based on the tiles
mentioned in the Detour description?

While I don't know the answer to these questions some general thoughts
(basically a summary of Cam's, Mikko's, and Bill's points from above):

1. Don't put threads or a task pool, etc, into Detour but make it easy
to run pathfinding workloads in parallel from the parallelism runtime
used by the program using Detour.
2. Fine-grained locking to support uncoordinated reading and writing
is a base for disaster: hard to debug and most often won't scale.
3. As I see it (w/o knowing the answers to the questions above)
pathfinding is an iterative process. During each main loop cycle you
run through multiple stages (not the order but the stage-after-stage
structure is important):
3.1 Stage 1: Update the navigation mesh (dynamic changes)
3.2 Stage 2: If open pathfinding requests are stored or known inform/
flag them if they are affected by the changes made
3.3 Stage 3: Collect new pathfinding requests or pathfinding-change
requests (changed end-nodes, etc). Requests are deferred and not
processed when posting them.
3.4 Stage 4: Process all pathfinding requests
3.5 Stage 5: Make pathfinding results or iterative results or results
for the next tile available
3.6 Stage 6: Clear out all closed pathfinding requests and prepare for
the next main loop cycle
3.1 Stage 1: Back to stage 1.

Each stage has clear read or write semantics for the navmesh, flags,
or the pathfinding requests. Document to run these stages one after
the other. As the stages are easy to understand the Detour user should
be able to adapt Detour to his parallelism needs.

Eventually make it possible to pass full batches of navmesh-change
requests, pathfinding-, or pathfinding-change requests at once. At
least document to the user that he should collect all request types
per thread/task but only hand them to Detour from one thread at a time
- he is responsible to synchronize the access. (Requests should be
easily identified via an id or a handler).

If it is possible that each pathfinding request only reads shared
navmesh data while only changing its own request data, then these can
easily run in parallel: have a way to ask Detour for all open requests
ready to run (in parallel). The user can decide himself how to
distribute these requests to threads or a task pool.

If a pathfinding request uses some helper structures (a cache that is
cleaned before and after working on the request and beginning with the
next) then document how to create such a helper structure per thread
or task, so a user can create one per worker thread and reuse it for
all pathfinding requests run per worker.

If Detour works tile- or hierarchy-based it might make sense to add
sub-stages to the processing stage. One sub-stage could be to pre-
process and then sort all requests into buckets associated with the
different tiles / hierarchy-nodes. Then enable the Detour user to
access the open requests bucket per bucket. Buckets can be processed
in parallel, and requests per bucket can also run in parallel. However
as all requests sorted into a bucket use the same shared (read-only)
navmesh data this might help with cache reuse and therefore memory
bandwidth (the scarcest resource today and in the future).

After all requests have been processed the result stage is entered.
Now the user can read the complete paths or just the next step of an
unfinished path. If a path is completed the user should be able to
copy it out of Detour, so detour can free the resources at the end of
its per-main-loop-cycle pipeline.

If it must be possible to run long-running requests then offer a way
to clone the navmesh data (or to share unchanging tiles while having
an copy for changing tiles) - this way long running requests operate
on un-changing data (with certain error rates) while iterative
requests are decoupled and can include changes according to the stages
described above.

Based on tiles it might also be possible to run the navmesh changes in
parallel. During the navmesh change stage offer the tile-change jobs
like the requests in the main pathfinding processing stage so the user
can scheduler them herself.

So far short summary:
- Prepare and document to run Detour in stages with clear semantics
for reading and writing of the different data.
- Document that posting requests, changes, and reading data must be
sequentially and should be protected by the user of Detour.
- Evtl offer batching for requests.
- Request are always deferred (to a later stage).
- Enable the user to access all open requests to organize and
coordinate their computation based on her own parallelization
solution.
- Allow copying of shared navmesh data (exploit tiles) so long-running
requests and short and iterative requests are able to work
concurrently without a need for synchronization.

Result:
- The clear documentation of stages and how to use them is a clear
parallelization guide for the user. No guessing, no uncertainness.
Pure win ;-)
- All synchronization is external to Detour.
- Parallel processing is possible and easily adaptable to the users
runtime system.
- Even when running Detour sequentially the stages with their
coordinated memory accesses to mainly homogeneous data should be
faster than random pathfinding processing.

I am very curious to learn more about Detour and to see where this is
going and where implementation specifics force different solutions
from the one we all drafted here.

Cheers,
Bjoern

On Jul 22, 11:26 pm, Mikko Mononen <memono...@gmail.com> wrote:
> Hi,
>

> Yes, it is right there moving along the tangent towards the off-topic planet ;)
>
> I think the case Bill has can be easily handled by just making a
> request queue and handle all that stuff on separate thread if
> multithreading is necessary. The case Cam has is that there are
> potentially many path requests in flight all the time. Adding mutex
> there will kill the performance, and using just single path finder
> with queue will not scale.
>
> As a first step, I think I will separate the data and the algorithms.
> That should allow similar use that you guys are doing currently, as
> well as it should allow to have multiple navigation queries in flight
> at the same time. It would be up to the user to sync the data updates.
>
> If there are more input on how the data updates could be handled
> nicely, I'm interested to hear that. Or any thoughts on the
> separation, although I think it is pretty straight forward.
>

> --mikkoOn Thu, Jul 22, 2010 at 11:50 PM, Gotisch <goti...@gmail.com> wrote:
> > You should at least declare that bool as volatile, no? Else you are just
> > begging for threading problems.
> > (seehttp://msdn.microsoft.com/en-us/library/x13ttww7%28v=VS.80%29.aspx)
>
> > but this is getting a bit off-topic :)
>

Mikko Mononen

unread,

Jul 23, 2010, 5:09:44 PM7/23/10

to recastna...@googlegroups.com

Thanks Bjoern!

Lots of interesting bits to chew on. I really like the idea of
read/write semantics.

I'm on vacation this week, but I will try to craft a better reply once
I've got a bit time to think about that stuff.

To quickly answer your question, Detour currently does all the queries
as one-shot. My plan is to make it iterative, but only once I have
good interface to do so (i.e. I dont want the navmesh class to have
state).

The current detour navmesh class also has a bit too much stuff in it.
It contains the navigation data which is always treated read-only,
except on several methods which clearly indicate that they add, remove
or change the data.

In addition to that it also contains the open and closed lists, which
are changed by the path finder (and some other searches). I think it
is time to separate the classes as discussed earlier so that it would
be possible to have multiple path find queries running in parallel and
the data would be just data.

There are basically two kinds of queries. Some require "state", like
A* and some directly adjust the output without need for any support
data structures. One such is string pulling another it finding the
nearest navmesh poly. In general the ones that have state also are
really unpredictable on time vs. the others have much more predictable
execution time. It would be interesting to solve, how to make the work
load more consistent, but at first I would like to make a small step
towards concurrency.

There is really nice presentation by the Kynapse guys from GDC'10
regarding parallelism and navigation. Coudn't find the link, though.

The read/write semantics and stages made me think how this all ties to
my idea of navigation loop, and it did get some ideas flowing... so I
think I have my head filled with ideas to get me busy for the rest of
my vacation next week :)

--mikko

Bjoern Knafla

unread,

Jul 23, 2010, 5:36:52 PM7/23/10

to recastna...@googlegroups.com

Hi Mikko,

I believe the presentation you are talking about is here: http://area.autodesk.com/gdc/ondemand2

One-shot queries can make it tricky to handle parallelism/concurrency if they take so long that other queries aren't possible during a frames time.
If some queries have to be delayed to another frame and navmesh changes are mixed in it is easy to get non-deterministic outputs under parallelism (change-request and path-planning-requests are collected from different threads and therefore the order between the queries from different threads isn't deterministic anymore - during different replays a certain path-planning-request might see a change while during another replay the change-request happens afterwards).

Oh, there is another way to enable parallel request execution other than making requests accessible for an outside parallelism runtime:
the Detour user needs to provide callback-functions/objects for task scheduling/dispatching that are then called internally by Detour.
Imo the first method (offer tasks) allows better parallelization by the user - though it publishes the internal request-representation...

Cheers and have a nice vacation!
Bjoern

Cam

unread,

Jul 23, 2010, 7:26:18 PM7/23/10

to recastnavigation

Bjoern's description of the stages is inline with what I had in mind.
If one thread "owns" detour and controls the other threads that access
the navigation mesh there is not much need for synchronisation.

In my particular case I'm not writing to the nav mesh ever so that
simplifies things somewhat. I haven't even got as far as batching path
finding requests right now, path finding isn't showing up in my
profiling but I think that's probably because other systems like
physics and animation are taking too much time. In any case it would
be possible for me to take advantage of parallelism by updating all my
agents at the same time without deferring path finding calls. They
have input buffer and output buffer for their update, so they don't
write to shared resources during their update. This parallel agent
update could include finding a path. I do need to gather better
statistics on how much time path finding takes though. I suspect that
while most path find calls are short that there will be spikes and
that I need to be queuing path finding requests and processing them
over more than a single frame.

I don't know if duplicating the nav mesh data would be a viable option
for me, my largest nav mesh is a bit over 500k, I don't have that much
memory to throw around that I could afford to duplicate it a bunch
times. I think I'd be more in favour of controlling the find finding
iterations, so the path find can run for a maximum number of
iterations per frame, or the path finding worker threads could be
interrupted because there are pending nav mesh changes. Of course if
you have the memory, duplicating the nav mesh data is a lot simpler!
On the other hand, a path might be invalid if it's being calculated
with old data.

Cheers,

Cam

On Jul 23, 10:51 am, Bjoern Knafla <bjoernkna...@googlemail.com>
wrote:

> ...
>
> read more »

Reply all

Reply to author

Forward