Advice, please


Peter Wilson

Jan 13, 2021, 7:57:56 PM
to golang-nuts
Folks
I have code in C which implements a form of discrete event simulator, optimised for a clocked system in which most objects accept input on the positive edge of a simulated clock, do their appropriate internal computations, and store the result internally. On the negative edge of the clock, each object takes its stored internal state and 'sends' it on to the appropriate destination object.

This is modelled using a linked list of objects, each of which has a phase0 and a phase1 function; the list is traversed twice per clock, calling the appropriate function on each pass.

It all works fine. On a uniprocessor. If we have one processor object and one memory object, with the processor implementing a standard fetch-decode-execute interpreter loop, and playing with simulated caches, reading or writing on cache misses, we can get 20-30 MIPS on a Mac Mini.

So since (much!) more performance is wanted, implementing this for a multiprocessor seems a good idea. Especially since every computer and its dog is multicore. Using go rather than C also sounds like a good idea.

So the sketch of the Go implementation is that I would have three threads - main, t0, and t1 (more for a real system, but two suffice for explanation):
- main sets stuff up, and t0 and t1 do the simulation work
- main has to initialise, set up any needed synchronization mechanism, and start t0 and t1
- t0 and t1 wait until main says it's OK, then both traverse all objects in the list. t0 runs the function if it's an even-numbered object, and t1 if it's an odd-numbered one. No mutation of state by concurrent threads.
- main loops, as do t0 and t1; t0 and t1 signal that they've finished; when they have, main tells them to start the next traversal

So, after a long ramble, given that I am happy to waste CPU time in busy waits (rather than have the overhead of scheduling blocked goroutines), what is the recommendation for the signalling mechanism when all is done in Go and everything's a goroutine, not a thread?

My guess is that creating specialist blocking 'barriers' using sync/atomic (an atomic.Operation seems to be around 4 ns on my Mac Mini) is the highest-performance mechanism. There's a dearth of performance information on channel communication, waitgroup, mutex etc. use, but what I have seen suggests that sending/receiving on a channel might be on the order of 100 ns; since in C we iterate twice through the list in 30-40 ns, this is a tad high (yes, fixable by modelling a bigger system, but).
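For concreteness, here is a minimal sketch of the kind of barrier being contemplated - a sense-reversing spin barrier built on sync/atomic. The type, the worker loop, and the Gosched yield are all illustrative assumptions, not measured code from this thread:

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// SpinBarrier is a sense-reversing barrier for a fixed number of
// participants. Arrive spins instead of parking the goroutine, so it
// assumes at least as many Ps (GOMAXPROCS) as participants.
type SpinBarrier struct {
	n     int32  // number of participants
	count int32  // arrivals so far in the current generation
	gen   uint32 // generation; advancing it releases the spinners
}

// Arrive returns once all n participants have called it.
func (b *SpinBarrier) Arrive() {
	gen := atomic.LoadUint32(&b.gen)
	if atomic.AddInt32(&b.count, 1) == b.n {
		// Last one in: reset for the next phase, then release everyone.
		atomic.StoreInt32(&b.count, 0)
		atomic.AddUint32(&b.gen, 1)
		return
	}
	for atomic.LoadUint32(&b.gen) == gen {
		runtime.Gosched() // yield while spinning; drop this for a pure busy wait
	}
}

func main() {
	const workers = 2 // t0 and t1
	b := &SpinBarrier{n: workers}
	done := make(chan struct{})
	for w := 0; w < workers; w++ {
		go func(id int) {
			for cycle := 0; cycle < 1000; cycle++ {
				// phase0 work for this worker's objects would go here
				b.Arrive()
				// phase1 work would go here
				b.Arrive()
			}
			done <- struct{}{}
		}(w)
	}
	for w := 0; w < workers; w++ {
		<-done
	}
	fmt.Println("simulation finished")
}

Whether this spin actually beats a parked goroutine is exactly what would need measuring.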

I know that premature optimisation is a bad thing, but I'd prefer to ask for advice than try everything.

many thanks for any help

-- P


Robert Engels

Jan 13, 2021, 8:17:17 PM
to Peter Wilson, golang-nuts
This https://github.com/robaho/go-concurrency-test might help. It has some relative performance metrics for some sync primitives. In general, though, Go uses “fibers”, which don't match up well with busy loops - and the scheduling overhead can be minimal.

On Jan 13, 2021, at 6:57 PM, Peter Wilson <peter....@bsc.es> wrote:

Folks

Nikolay Dubina

Jan 13, 2021, 8:25:14 PM
to golang-nuts
Any primitive in the `sync` package will do. I would go for two `RWMutex`es, one for each goroutine, or two unbuffered channels, one for each goroutine. However, AFAIK, in Go you can't force a goroutine to start executing; Go will try to wake up any unblocked goroutine as soon as possible, though.
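A minimal sketch of the unbuffered-channels variant, with runPhase standing in for the real list traversal (all names here are illustrative):

package main

const nWorkers = 2 // t0 and t1 from the sketch above

// runPhase stands in for "traverse my share of the object list".
func runPhase(id, cycle int) { _, _ = id, cycle }

func main() {
	start := make([]chan int, nWorkers)     // main -> worker: go-ahead
	done := make([]chan struct{}, nWorkers) // worker -> main: finished
	for i := range start {
		start[i] = make(chan int)
		done[i] = make(chan struct{})
		go func(id int) {
			for cycle := range start[id] { // block until main says it's OK
				runPhase(id, cycle)
				done[id] <- struct{}{}
			}
		}(i)
	}
	for cycle := 0; cycle < 4; cycle++ {
		for _, ch := range start {
			ch <- cycle // release every worker
		}
		for _, ch := range done {
			<-ch // wait for all workers before the next traversal
		}
	}
}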

Robert Engels

Jan 13, 2021, 10:59:00 PM
to Nikolay Dubina, golang-nuts
Why limit yourself to two? Use N routines and have each process every Nth element in the list. 
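For instance, a strided split along these lines, assuming the list has been flattened into a slice (Object and Phase0 are illustrative stand-ins):

type Object struct{ state, next int }

func (o *Object) Phase0() { o.next = o.state + 1 } // placeholder compute

// phase runs worker id's share of one half-clock: worker id handles
// objects id, id+n, id+2n, ... so no two workers touch the same object.
func phase(objects []*Object, id, n int) {
	for i := id; i < len(objects); i += n {
		objects[i].Phase0()
	}
}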

On Jan 13, 2021, at 7:25 PM, Nikolay Dubina <nikolay.d...@gmail.com> wrote:



Kurtis Rader

Jan 13, 2021, 11:04:15 PM
to Robert Engels, Nikolay Dubina, golang-nuts
On Wed, Jan 13, 2021 at 7:58 PM Robert Engels <ren...@ix.netcom.com> wrote:
Why limit yourself to two? Use N routines and have each process every Nth element in the list. 

You missed this statement in the original message of this thread:

So the sketch of the Go implementation is that I would have three threads - main, t0, and t1 (more for a real system, but two suffice for explanation)

They were deliberately simplifying the problem to eliminate irrelevancies, such as the number of CPU cores available; although they did assume at least two cores :-)

--
Kurtis Rader
Caretaker of the exceptional canines Junior and Hank

Robert Engels

Jan 13, 2021, 11:31:31 PM
to Kurtis Rader, Nikolay Dubina, golang-nuts
I forgot. But it's important to mention that different synchronization methods perform differently under contention, so what works well for 2 might be really poor with 64. 

On Jan 13, 2021, at 10:04 PM, Kurtis Rader <kra...@skepticism.us> wrote:



Kurtis Rader

Jan 13, 2021, 11:47:43 PM
to Robert Engels, Nikolay Dubina, golang-nuts
On Wed, Jan 13, 2021 at 8:31 PM Robert Engels <ren...@ix.netcom.com> wrote:
I forgot. But it's important to mention that different synchronization methods perform differently under contention, so what works well for 2 might be really poor with 64. 

Yes, but the ratio of reads to writes is usually more important with respect to which synchronization method is optimal. 

David Riley

Jan 14, 2021, 10:34:05 AM
to Peter Wilson, golang-nuts
On Jan 13, 2021, at 7:21 PM, Peter Wilson <peter....@bsc.es> wrote:
> So, after a long ramble, given that I am happy to waste CPU time in busy waits (rather than have the overhead of scheduling blocked goroutines), what is the recommendation for the signalling mechanism when all is done in go and everything's a goroutine, not a thread?

This is similar to something I'm working on for logic simulation, and I'd been thinking about the clocked simulation as well. I'll be interested in your results; since I'm also considering remote computation (and GPU computation, which might as well be remote) I'm currently going with the idea of futures driven by either channels or sync.Cond. That may not be as efficient for your use case.

> My guess is that creating specialist blocking 'barriers' using sync/atomic (an atomic.Operation seems to be around 4 ns on my Mac Mini) is the highest-performance mechanism. There's a dearth of performance information on channel communication, waitgroup, mutex etc. use, but what I have seen suggests that sending/receiving on a channel might be on the order of 100 ns; since in C we iterate twice through the list in 30-40 ns, this is a tad high (yes, fixable by modelling a bigger system, but).

My advice would be to implement the easiest method possible that's not likely to box you in, profile it, and see where your bottlenecks are. In my case, so far, the delays introduced by IPC mechanisms (and also allocations) are absolutely dwarfed by just the "business logic" crunching the logical primitives. So far it's not worth trying to improve the IPC on the order of nanoseconds (it would be a nice problem to have) because the work done in each "chunk" is big enough that it's not worth worrying about.

This also leads me to the next part, which is that if you have lots of little operations and you're worried about the time spent on IPC for each little thing, you'll probably get the easiest and best performance gains by trying to batch them so that you can burn through lots of similar operations at once before trying to send a slice over a channel or something.
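A sketch of that batching shape - one channel operation amortized over a whole slice of work. Event, process, and batchSize are illustrative assumptions:

type Event struct{ id int }

func process(Event) {} // stand-in for the per-item business logic

const batchSize = 256

func producer(work chan<- []Event, items []Event) {
	for len(items) > 0 {
		n := batchSize
		if n > len(items) {
			n = len(items)
		}
		work <- items[:n] // one channel send covers n items
		items = items[n:]
	}
	close(work)
}

func consumer(work <-chan []Event, done chan<- struct{}) {
	for batch := range work {
		for _, ev := range batch {
			process(ev) // burn through the batch with no further IPC
		}
	}
	done <- struct{}{}
}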

As always, do a POC implementation and then profile it. That's the only productive way to optimize things at this scale, and Go has EXCELLENT profiling capabilities built in.


- Dave

Pete Wilson

Jan 14, 2021, 11:07:31 AM
to David Riley, golang-nuts
Dave

Thanks for thoughts.

My intent is to use non-custom barriers and measure scalability, customising if need be. Not busy-waiting the goroutines leaves processor time for other stuff (like visualization, looking for patterns, etc.).
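A sketch of that stock-library route, using sync.WaitGroup as the barrier; doPhase and the counts are illustrative assumptions:

package main

import "sync"

const nWorkers = 4

func doPhase(id, phase int) { _, _ = id, phase } // stand-in for real work

func main() {
	var wg sync.WaitGroup
	release := make([]chan int, nWorkers)
	for i := range release {
		release[i] = make(chan int)
		go func(id int) {
			for phase := range release[id] {
				doPhase(id, phase)
				wg.Done()
			}
		}(i)
	}
	for cycle := 0; cycle < 1000; cycle++ {
		for phase := 0; phase < 2; phase++ { // phase0, then phase1
			wg.Add(nWorkers)
			for _, ch := range release {
				ch <- phase
			}
			wg.Wait() // barrier: all workers finished this half-clock
		}
	}
}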

There’s also the load-balancing effect; code pretending to be a core fetching an instruction from an icache and waiting for registers to be available has a longer path length than a simple memory access, so one has to avoid the whole set waiting on the longest pole. Simulating vector operations brings with it similar problems.

The ‘batch the instructions’ thing is standard practice even in sequential simulators, where the overheads of walking the list, finding the data structure for the core, and fetching an instruction are sufficient that if you let the simulated core fetch-and-execute 10, 100, or 1000 instructions you’ll get usefully improving performance.

This has to be traded off against the fact that multicore computer systems have lots of contention, and if you batch stuff up you risk losing sight of the details of that contention (which means the utility of the model can be iffy). This can be attacked by ‘variable batches’, letting other engines catch up, etc. etc., but it all makes things more complicated (which is unwise). So assuring oneself of good scalability with one instruction at a time is good due diligence.

If this works as desired, it should turn out to be open source. I’m a novice at writing Go, so the first pass will be as simple as practical rather than polished, idiomatic Go.

I’m also considering making a ‘pure Go’ version - a goroutine is an object, one to one. They’re clocked. They communicate by channels. This would be for folk who want to ‘model a system architecture’ by more or less directly dropping their vision into Go, rather than having to create weird data structures. Given such a thing, it would not be infeasible to create the faster (well, hopefully) model automagically. [Did this way back in the early ’90s, using a homebrew language which introduced the idea of ‘clocked channels’. Unfortunately I wasn’t very good at writing compilers and the tool was… unstable.]

— P





Pete Wilson

Jan 14, 2021, 11:09:03 AM
to Robert Engels, Nikolay Dubina, golang-nuts
Only because I had started out my explanation, in an earlier attempt, trying to use ’naked’ read and write atomic fences, and having exactly three (main, t0, t1) exposed all that needed to be exposed of the problem.
Yes, if this scales, there will be more than three goroutines.


On Jan 13, 2021, at 9:58 PM, Robert Engels <ren...@ix.netcom.com> wrote:

Why limit yourself to two? Use N routines and have each process every N in the list. 

On Jan 13, 2021, at 7:25 PM, Nikolay Dubina <nikolay.d...@gmail.com> wrote:





Pete Wilson

Jan 14, 2021, 11:14:41 AM
to Robert Engels, golang-nuts
I have decided to believe that scheduling overhead is minimal, and only customize if this is untrue for my workload.
[I don’t like customizing. Stuff in the standard library has been built by folk who have done this in anger, and the results have been widely used; plus principle of least surprise.]
Thanks!

— P 


On Jan 13, 2021, at 7:16 PM, Robert Engels <ren...@ix.netcom.com> wrote:

This https://github.com/robaho/go-concurrency-test might help. It has some relative performance metrics of some sync primitives. In general though Go uses “fibers” which don’t match up well with busy loops - and the scheduling overhead can be minimal. 




Pete Wilson

Jan 14, 2021, 11:17:10 AM
to Robert Engels, Kurtis Rader, Nikolay Dubina, golang-nuts
N in this case will be similar to the value of GOMAXPROCS, on the assumption that scaling that far pays off. 

I would love to have the problem of having a 128 core computer…. (Though then if scaling tops out at 32 cores one simply runs 4 experiments..)

— P


Bakul Shah

Jan 17, 2021, 12:14:53 AM
to Peter Wilson, golang-nuts
You may be better off maintaining two state *arrays*: one holding the current values as input, and one for writing the computed outputs. At the "negative" clock edge you swap the arrays. Since the input array is never modified during the half clock cycle when you compute, you can divide it up into N concurrent computations, provided a given output cell is written by just one thread. So then the synchronization point is when all the threads are done with their computing. That is when you swap the I/O state arrays, advance the clock, and unblock all the threads to compute again. You can do this with N+1 channels. The tricky part may be in partitioning the computation into more or less equal N parts.
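A minimal sketch of that double-buffering scheme, assuming the state fits in two int slices and that step is a placeholder per-cell update rule (the counts and names are illustrative):

package main

const (
	nCells   = 1 << 10
	nWorkers = 4
)

// step computes cell i's next value from the read-only current state.
func step(cur []int, i int) int { return cur[i] + 1 } // placeholder rule

func main() {
	cur := make([]int, nCells)  // inputs: never written during a half-clock
	next := make([]int, nCells) // outputs: each cell owned by one worker
	start := make([]chan struct{}, nWorkers)
	done := make(chan struct{}) // N start channels + 1 done channel
	for w := range start {
		start[w] = make(chan struct{})
		go func(w int) {
			lo, hi := w*nCells/nWorkers, (w+1)*nCells/nWorkers
			for range start[w] {
				for i := lo; i < hi; i++ {
					next[i] = step(cur, i) // reads cur, writes only its own cells
				}
				done <- struct{}{}
			}
		}(w)
	}
	for clock := 0; clock < 100; clock++ {
		for _, ch := range start {
			ch <- struct{}{} // positive edge: everyone computes
		}
		for range start {
			<-done // wait until all workers have finished
		}
		cur, next = next, cur // negative edge: swap the state arrays
	}
}

The swap in main is safe because the channel operations order it strictly between one compute phase and the next.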


Pete Wilson

Jan 17, 2021, 10:11:54 AM
to Bakul Shah, golang-nuts
The problem is that N or so channel communications twice per simulated clock seem to take much longer than the time spent doing useful work.

Go isn’t designed for this sort of work, so it’s not a complaint to note it’s not as good as I’d like it to be. But the other problem is that processors are still architected as if most code is sequential and synchronisations (and communications) are extremely rare, and so no problem if they’re as slow as molasses (which atomic operations are).

Your description is basically what I’m trying to do, except that I’m using local storage rather than the array elements. It’s not clear that exchanging array pointers is any quicker than having a 2-phase loop; the problem is still the barriers.

P


On Jan 16, 2021, at 11:14 PM, Bakul Shah <ba...@iitbombay.org> wrote:

You may be better off maintaining two state *arrays*: one holding the current values as input, and one for writing the computed outputs. At the "negative" clock edge you swap the arrays. Since the input array is never modified during the half clock cycle when you compute, you can divide it up into N concurrent computations, provided a given output cell is written by just one thread. So then the synchronization point is when all the threads are done with their computing. That is when you swap the I/O state arrays, advance the clock, and unblock all the threads to compute again. You can do this with N+1 channels. The tricky part may be in partitioning the computation into more or less equal N parts.




Robert Engels

Jan 17, 2021, 10:22:20 AM
to Pete Wilson, Bakul Shah, golang-nuts
If there is very little work to be done, then have N “threads” do M partitioned pieces of work. If M is 10x N, you’ve decimated the synchronization cost. 

On Jan 17, 2021, at 9:11 AM, Pete Wilson <peter....@bsc.es> wrote:

The problem is that N or so channel communications twice per simulated clock seems to take much longer than the time spent doing useful work.

Pete Wilson

Jan 17, 2021, 10:50:22 AM
to Robert Engels, Bakul Shah, golang-nuts
That’s exactly the plan.

The idea is to simulate perhaps the workload of a complete chiplet. That might be (assuming no SIMD in the processors, to keep the example light) 2K cores. Each worklet (worklet = the work done for one simulated core per phase) is perhaps 25-50 ns.

The simplest mechanism is probably that on finishing work, every worker sends a message to main; when it’s got all the messages, it sends a message to each of the workers. Nice and simple. But it seems as though a channel communication is of the order of 100 ns, so I’m eating 200 ns per phase change in each worker.

With 10 executing cores and 2K simulated cores, we get to do around 200 worklets per phase per executing core. At 25 ns per worklet that’s 5 microseconds of work per worker, and losing 200 ns out of that will still let the thing scale reasonably to some useful number of cores.

But tools are more useful if they’re relatively broad-spectrum. If I want to have 20 worklets per core per phase (running a simulation on a subset of the system, to gain simulation speed), I’m now using ~200 ns out of 500 ns of work, which is not a hugely scalable number at all. It would probably run slower than a standard single-core sequential implementation regardless of the number of cores, so not a Big Win.

Were the runtime.Unique() function to exist (a hypothetical scheduler call that, for some number of goroutines and cores, allows a goroutine to declare that it should be the sole workload for a core; limited to a fairly large subset of available cores), I could spinloop on an atomic load, emulating waitgroup/barrier behaviour without any scheduler involvement and with times closer to the 10 ns level (when worklet path lengths were well balanced).

I’d also welcome the news that under useful circumstances channel communication is only (say) 20ns. That’d simplify things beauteously. (All ns measured on a ~3GHz Core i7 of 2018)
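The number is easy to measure on a given box with the standard testing package. A sketch of such a benchmark, in a file like chan_bench_test.go (the package name is an assumption); run it with `go test -bench=.`:

package sim

import "testing"

// BenchmarkPingPong times one unbuffered send+receive round trip
// between two goroutines; halve the result for a one-way estimate.
func BenchmarkPingPong(b *testing.B) {
	ping := make(chan struct{})
	pong := make(chan struct{})
	go func() {
		for range ping {
			pong <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ping <- struct{}{}
		<-pong
	}
	close(ping)
}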

[Historical Note: When I were a young lad, I wrote quite a bit of stuff in occam, so channel-y stuff is lapsed second nature - channels just work - and all this building barrier stuff is terribly unnatural. So my instincts are (were?) good, but platform performance doesn’t seem to want to play along]


On Jan 17, 2021, at 9:21 AM, Robert Engels <ren...@ix.netcom.com> wrote:

If there is very little work to be done, then have N “threads” do M partitioned pieces of work. If M is 10x N, you’ve decimated the synchronization cost. 



Robert Engels

Jan 17, 2021, 11:07:48 AM
to Pete Wilson, Bakul Shah, golang-nuts
If you use GOMAXPROCS as a subset of physical cores and have the same number of routines you can busy wait in a similar fashion. 
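A sketch of that arrangement, combined with a busy-wait loop like the earlier barrier sketch. Note that runtime.LockOSThread pins a goroutine to an OS thread, not to a physical core, so this is best-effort; spinWorker is an illustrative stand-in:

package main

import "runtime"

const nWorkers = 4

// spinWorker stands in for a busy-waiting simulation worker.
func spinWorker(id int, done chan<- struct{}) {
	// ... spin-barrier-driven phase loop would live here ...
	done <- struct{}{}
}

func main() {
	runtime.GOMAXPROCS(nWorkers) // exactly one P per worker, no spares
	done := make(chan struct{})
	for i := 0; i < nWorkers; i++ {
		go func(id int) {
			runtime.LockOSThread() // 1:1 goroutine-to-OS-thread while it runs
			defer runtime.UnlockOSThread()
			spinWorker(id, done)
		}(i)
	}
	for i := 0; i < nWorkers; i++ {
		<-done
	}
}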

On Jan 17, 2021, at 9:50 AM, Pete Wilson <peter....@bsc.es> wrote:

That’s exactly the plan.

David Riley

Jan 17, 2021, 11:12:07 AM
to Pete Wilson, Bakul Shah, golang-nuts
This particular problem is akin to the way GPUs work; in GPU computing, you generally marshal all your inputs into one buffer and have the outputs from a single kernel written to an output buffer (obviously there can be more on either side). The execution overhead of a single invocation is large, so batching things up to run massively in parallel in one run is essential.

On a CPU, you'll also get benefits from this sort of arrangement due to things such as data locality (for caching purposes) and reduced IPC overhead.

Barriers can be reasonably simple to implement, depending on your use case (this may not describe yours). Think about an event-driven simulator such as a VHDL/Verilog simulator: push your new events, keyed by their scheduled time, into a priority queue (generally a heap, so O(log n) time per insertion), and pop them (pre-execution) into your batch until you've reached the end of a given time slice (assuming events cannot push back into their own time slice, which is usually bad behavior anyway). That lets you simulate one "clock" at a time without having to worry about the actual clock, and also lets you simulate things like propagation delay without weird special-casing; see the sketch below.
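A sketch of that event queue on the standard container/heap package, draining one time slice into a batch at a time (Event and the timestamps are illustrative assumptions):

package main

import (
	"container/heap"
	"fmt"
)

// Event is one scheduled action at a simulated time.
type Event struct {
	at  uint64 // simulated time of the event
	run func()
}

// eventQueue implements heap.Interface, ordered by Event.at.
type eventQueue []*Event

func (q eventQueue) Len() int            { return len(q) }
func (q eventQueue) Less(i, j int) bool  { return q[i].at < q[j].at }
func (q eventQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *eventQueue) Push(x interface{}) { *q = append(*q, x.(*Event)) }
func (q *eventQueue) Pop() interface{} {
	old := *q
	ev := old[len(old)-1]
	*q = old[:len(old)-1]
	return ev
}

func main() {
	q := &eventQueue{}
	heap.Push(q, &Event{at: 30, run: func() { fmt.Println("t=30") }})
	heap.Push(q, &Event{at: 10, run: func() { fmt.Println("t=10") }})
	heap.Push(q, &Event{at: 10, run: func() { fmt.Println("t=10 too") }})
	for q.Len() > 0 {
		now := (*q)[0].at // earliest pending timestamp
		var batch []*Event
		for q.Len() > 0 && (*q)[0].at == now {
			batch = append(batch, heap.Pop(q).(*Event))
		}
		for _, ev := range batch {
			ev.run() // the batch could equally be fanned out to workers
		}
	}
}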

In any case, this approach doesn't require barriers so much as just managing work batches. And, using this approach, if you ever do decide to do computation on something like OpenCL, you'll be lined up to process it that way.


- Dave

Pete Wilson

Jan 17, 2021, 11:33:55 AM
to Robert Engels, Bakul Shah, golang-nuts
I’d observed that, experimentally, but I don’t think it’s guaranteed behaviour :-(


On Jan 17, 2021, at 10:07 AM, Robert Engels <ren...@ix.netcom.com> wrote:

If you use GOMAXPROCS as a subset of physical cores and have the same number of routines you can busy wait in a similar fashion. 




Bakul Shah

Jan 17, 2021, 11:47:21 AM
to Pete Wilson, Robert Engels, golang-nuts
I’d be tempted to just use C for this. That is, generate C code from a register level description of your N simulation cores and run that. That is more or less what “cycle based” verilog simulators (used to) do. The code gen can also split it M ways to run on M physical cores. You can also generate optimal synchronization code.

With linked lists you’re wasting half of the memory bandwidth and potentially the cache. Your number of elements isn’t going to change, so a linked list doesn’t buy you anything. An array is ideal from a performance PoV.

On Jan 17, 2021, at 7:49 AM, Pete Wilson <peter....@bsc.es> wrote:

That’s exactly the plan.

Pete Wilson

Jan 17, 2021, 12:04:26 PM
to Bakul Shah, Robert Engels, golang-nuts
Yes, that is possible.
The simulated cores are already generated functions in C.
It’s my experience that if you can leverage an existing concurrency framework then life is better for everyone; Go’s is fairly robust, so this is an experiment to see how close I can get. A real simulation system has to have a good inner engine; but it also needs/wants visualisation, logging, stats gathering, … all of which are simpler to write and get correct if one simply leverages a working infrastructure.
It would be very simple to generate Go functions instead of C functions.
Plus I get the other advantages of Go over C (better library, more concise, etc. etc.).

So, if it works fast enough, life is good.
If it isn’t, then I can try a custom barrier (as Robert has pointed out) and measure again
And if that isn’t fast enough, then back to C. Which would be regrettable.

As to linked lists:
- They’re a wonderful thing in a single-core implementation. Not mentioned in the problem expression was that although the population of simulatable objects (“agents”) doesn’t change during runtime, some may do a lot less work than others; a memory bank only does something when a miss has filtered through the intervening caches, for example.
- Just ‘polling’ an agent to see if it’s active or not takes appreciable time, so cluttering up the collection of agents with agents that have a 0.1% duty cycle is a Bad Thing for performance; you need to remove them from the collection. In a single-processor sequential simulation, a linked list lets you remove an agent very cheaply.
- In Go, the built-in slice lets me exchange an agent I don’t want any more with the last one and shrink the size, as sketched below; this is no quicker than cutting an agent out of a linked list, and walking down an array of pointers still requires me to get a pointer before playing with the agent. (Experiments in C with slice-like data structures showed there was no performance difference compared with the linked list.)
- (Agents removed from the active collection are re-inserted when something arrives for them, of course.)
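A sketch of that swap-with-last removal on a slice of agent pointers; Agent is an illustrative stand-in. Removal is O(1) at the cost of ordering, which doesn't matter for an active set:

type Agent struct{ active bool } // illustrative stand-in

// remove swaps element i with the last element and shrinks the slice.
// The active list loses its ordering, which is harmless here.
func remove(agents []*Agent, i int) []*Agent {
	last := len(agents) - 1
	agents[i] = agents[last]
	agents[last] = nil // clear the vacated slot so the GC can reclaim it
	return agents[:last]
}

// reinsert appends a woken agent back onto the active list.
func reinsert(agents []*Agent, a *Agent) []*Agent {
	return append(agents, a)
}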

— P

On Jan 17, 2021, at 10:46 AM, Bakul Shah <ba...@iitbombay.org> wrote:

I’d be tempted to just use C for this. That is, generate C code from a register level description of your N simulation cores and run that. That is more or less what “cycle based” verilog simulators (used to) do. The code gen can also split it M ways to run on M physical cores. You can also generate optimal synchronization code.

With linked lists you’re wasting half of the memory bandwidth and potentially the cache. Your # of elements are not going to change so a linked list doesn’t buy you anything. An array is ideal from a performance PoV.



Kevin Chadwick

Jan 18, 2021, 7:40:37 AM
to golang-nuts
On 1/17/21 4:46 PM, Bakul Shah wrote:
> I’d be tempted to just use C for this. That is, generate C code from a register
> level description of your N simulation cores and run that. That is more or less
> what “cycle based” verilog simulators (used to) do. The code gen can also split
> it M ways to run on M physical cores. You can also generate optimal
> synchronization code.
>
> With linked lists you’re wasting half of the memory bandwidth and potentially
> the cache. Your # of elements are not going to change so a linked list doesn’t
> buy you anything. An array is ideal from a performance PoV.

Potentially forgive me and ignore this message if inappropriate as I haven't
been paying close attention to this thread at all really.

However, it strikes me that arrays are perfectly usable in Go, similarly to how
you might use a global array with TinyGo to ensure memory usage limits are not
breached. What is the issue with using an array in Go? Even a global one, *IF*
suited to the task at hand, dishing the work out to workers with a scheme a
little more complex than odd/even etc. as required?

With the benefit that an off-by-one etc. causes a panic and not something
potentially worse?

Bakul Shah

Jan 18, 2021, 1:50:59 PM
to Kevin Chadwick, golang-nuts
Go arrays are perfectly usable. My comment had more to do with the fact that
you'd just have a fixed number of threads and if synchronization using channels
is not fast enough, you'd have to roll your own, in which case Go doesn't
really buy you much. I'd just keep the simulator code as a separate process
that does nothing but simulation and have it communicate with a separate
program that does visualization, logging etc. which can be in Go. Pete Wilson
in his response said that simulated cores are already generated functions in C
so this is not a big step. I'd just have the code generator generate the whole
simulator or something like it. Just different tradeoffs to consider.
