
GA144 use cases


dunno

Aug 31, 2015, 3:20:21 AM
After reading documentation on GA144 I wonder what kind of application
could benefit from using it.

Can anyone provide real-world examples?

--
dunno

rickman

Aug 31, 2015, 9:55:44 AM
I thought I had an application for the GA144 once, but when push came to
shove I couldn't verify that it had enough performance to implement an
SDRAM controller. Green Arrays won't provide certain information about
the timing of the I/O and internode comms that would have let me do a
paper analysis. Their suggestion was to build the design and see if
it worked, lol. That completely ignores the various issues of timing in
hardware.

I also considered using it in a product I am supplying to a major
networking company. But again, I couldn't verify the I/O would be fast
enough to implement a 33 MHz serial port similar to SPI. This
particular app could be done mostly on a conventional MCU, but there are
two hardware interfaces that would exceed the capabilities of a 100 MHz
CPU. As it turns out, the processors in the GA144 aren't really that
much faster; there are just a lot more of them. Since the handling of an I/O port
can't be spread over multiple processors, this design wasn't any more
suitable for the GA144 than an ARM CM3. Since neither will do the job
this design will continue to use an FPGA.

--

Rick

Syd Rumpo

Aug 31, 2015, 11:49:27 AM
I've thought long and hard about this, but haven't been able to think of
anything useful which can't be done better another way. Maybe it's just
me, but GreenArrays don't seem to have come up with anything substantive
either.

A shame, I like stack machines.

Cheers
--
Syd

rickman

Aug 31, 2015, 6:35:10 PM
Yeah, they didn't really provide much flexibility, because they provided
processing power and not much else. You can trade some of that
processing for memory or for dedicated peripherals. But at some point
you just need the sorts of things the other embedded processors have...
that's why they have them.

--

Rick

Dennis Ruffer

Aug 31, 2015, 10:08:37 PM
On Monday, August 31, 2015 at 12:20:21 AM UTC-7, dunno wrote:
> Can anyone provide real-world examples?

IntellaSys had in mind hearing aids and wireless speakers. I've also heard of video in the automotive industry (forget which company at the moment).

Encryption and signal processing are obvious candidates.

E.g. anything streaming, but I'm not sure it ever was faster than other technologies.

DaR

Mux

Sep 2, 2015, 1:57:32 PM
A pipelined Mandelbrot :-)

Not really a real-world application but something I'd love to write. With an external oscillator for frequency stability, you should be able to dedicate two nodes to display timing, another to control, and one (or more) acting as a FIFO. The remaining nodes each calculate one iteration of a Mandelbrot set. Each node would take the result from the previous one, perform an iteration and pass on the result to the next one until it ends up in the FIFO. You'd be able to fill / catch up during hblank and vblank.
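The dataflow is easy to sketch in ordinary code (Python here; real GA144 nodes would run arrayForth with 18-bit fixed-point words, so this is only the shape of the idea, and the stage function and iteration budget are invented for illustration):

```python
# Sketch of the pipeline: each "node" performs one Mandelbrot iteration
# z -> z^2 + c and passes the running state to the next node.  A pixel's
# value is how many stages it survived before |z| exceeded 2.

MAX_ITER = 32  # one pipeline stage per iteration

def stage(state):
    """One pipeline node: a single Mandelbrot iteration."""
    c, z, count, escaped = state
    if escaped:
        return state                    # escaped pixels pass through unchanged
    z = z * z + c
    if abs(z) > 2.0:
        return (c, z, count, True)      # mark the pixel as escaped here
    return (c, z, count + 1, False)

def mandelbrot(c):
    """Push one pixel's state through all MAX_ITER stages."""
    state = (c, 0j, 0, False)
    for _ in range(MAX_ITER):           # in hardware these run concurrently
        state = stage(state)
    return state[2]                     # the iteration count = pixel value
```

Once the pipeline is full, every stage works on a different pixel at the same moment, so it delivers one finished pixel per stage-time, which is what would let it keep up with the display outside the blanking intervals.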


-Y

rickman

Sep 2, 2015, 2:17:11 PM
There are any number of toy applications. I suppose you can generate
your own display from the GA144, so a "real time" Mandelbrot set display
might work ok. Otherwise it is hard to get the data out of the GA144.

There is some irony to the DIY attitude shown by this device. It is so
incestuous that even the high speed link will only work with other
GA144s to the best of my knowledge.

--

Rick

Mux

Sep 2, 2015, 2:28:28 PM
It's a tricky subject. Aside from the fact that the status quo won't go near anything that lacks widespread support, the hardware side of things is a little too alien for designers to leverage.

For example, had it run at 3.3 V, or at least had VCCIO at 3.3 V, interfacing it would have been a lot easier. Likewise, if there were a TQFP package version, I'd have picked them up a lot sooner. The other fact is that 'the competition' is established and has a great support community behind it, and this puts the GA in a weird category: yes, it's got a lot of power, but it's a bitch to utilize. With zero marketing, no intuitive / user-friendly tools, and a KISS approach that *might* have gone a little too far, I'm afraid there's unfortunately little hope.

Would there be any defense applications that could benefit from it? I'm guessing Chuck kinda banked on that sector or similar.

-Y

Jason Damisch

Sep 2, 2015, 3:10:31 PM
On Monday, August 31, 2015 at 12:20:21 AM UTC-7, dunno wrote:

> After reading documentation on GA144

I don't know but it sure is darn pretty.

:^)

rickman

Sep 2, 2015, 3:15:21 PM
On 9/2/2015 1:57 PM, Mux wrote:
Here's one that is feasible. Using the built-in DACs, a multichannel
signal generator could be built. I studied the noise issue a while back,
and for many apps phase truncation, as commonly implemented, is the
limiting factor in noise performance: it produces close-in spurs to the
carrier which cannot be filtered. The solution is to *not* use phase
truncation. Rather than truncate the phase, preserve enough bits to
properly represent the desired SNR and use something other than a
straight table lookup to translate the phase to the sine signal. The 18
bit word size in the GA144 is enough to get a very high SNR for a full
scale signal.
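A minimal sketch of that approach (Python floats rather than 18-bit fixed point; the table size and phase width below are arbitrary choices for illustration, not GA144 figures): keep the full phase, look up the top bits in a coarse table, and interpolate with the bits a truncating DDS would throw away.

```python
import math

PHASE_BITS = 32                       # full phase-accumulator width
TABLE_BITS = 10                       # coarse sine table: 1024 entries
TABLE = [math.sin(2 * math.pi * i / (1 << TABLE_BITS))
         for i in range(1 << TABLE_BITS)]

def dds_sample(phase):
    """Phase-to-amplitude conversion without plain phase truncation.

    The low (PHASE_BITS - TABLE_BITS) bits, which a truncating DDS would
    discard, linearly interpolate between adjacent table entries, pushing
    the close-in truncation spurs far below the table's quantization floor.
    """
    frac_bits = PHASE_BITS - TABLE_BITS
    idx = phase >> frac_bits                          # coarse table index
    frac = (phase & ((1 << frac_bits) - 1)) / (1 << frac_bits)
    a = TABLE[idx]
    b = TABLE[(idx + 1) & ((1 << TABLE_BITS) - 1)]    # wrap at table end
    return a + (b - a) * frac                         # linear interpolation
```

With a 1024-entry table the interpolation error for a sine is on the order of a few parts per million, well below what a straight 10-bit lookup would give.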

I don't recall how fast the DACs are in the GA144, but I believe they
are only 9 bits and require dithering to improve the resolution. This
means the effective sample rate would be greatly lowered. So I'm not
sure what frequency range could be accommodated with 16 to 18 bit
resolutions.

So in the end, this may only be an interesting exercise. It is exactly
this sort of experimentation that a typical MCU provider would have done
in house to show the capabilities of their devices. I'm surprised there
is not a general purpose library of DAC/ADC drivers to demonstrate the
capabilities of the GA144.

--

Rick

hughag...@gmail.com

Sep 2, 2015, 4:41:06 PM
On Wednesday, September 2, 2015 at 11:28:28 AM UTC-7, Mux wrote:
> Would there be any defense applications that could benefit from it? I'm guessing Chuck kinda banked on that sector or similar.

When I first heard about the GA144, I assumed that it was for encryption-cracking. That is the only application that I know of that involves doing a lot of calculations that are independent of each other (trying out different keys) --- most applications that involve a lot of calculation don't lend themselves to parallel processing because each calculation depends upon the results of a previous calculation.
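The point is easy to see in a toy sketch (a single-byte XOR stand-in for a real cipher; every name here is invented for illustration): each chunk of the key space can be searched with no knowledge of the others, so each chunk could go to its own node.

```python
def decrypt(ciphertext, key):
    """Toy 'cipher': XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in ciphertext)

def search_chunk(ciphertext, known_plaintext, keys):
    """Try every key in one chunk; fully independent of other chunks."""
    for key in keys:
        if decrypt(ciphertext, key) == known_plaintext:
            return key
    return None

def crack(ciphertext, known_plaintext, n_workers=4):
    """Split the key space into independent chunks (one per 'node')."""
    keyspace = range(256)
    chunk = len(keyspace) // n_workers
    for w in range(n_workers):         # each pass could run on its own core
        found = search_chunk(ciphertext, known_plaintext,
                             keyspace[w * chunk:(w + 1) * chunk])
        if found is not None:
            return found
    return None
```

Because no chunk depends on another's result, the speedup is close to linear in the number of nodes, which is exactly the property most computations lack.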

Also, there is the game of Go, which is quite difficult. This is similar to encryption-cracking except that instead of trying out different keys, you would be trying out different points on the board to play your stone.

I remember a long time ago in "Circuit Cellar Ink" there was a neural-network project described that used hundreds of boards connected together --- I think they were 8080 processors, or something similar --- that would be another interesting project for the GA144.

rickman

Sep 2, 2015, 4:50:21 PM
I can't begin to figure out what Chuck was thinking would be good
applications for this device, but I have followed his blog enough to
consider it a real possibility that he didn't plan on any specific
applications at all. The closest thing to an intended app might be the
home theater apps he worked on for ITV. That at least would have been
in the back of his mind as he moved forward with this.

But in reality I think, as someone else said long ago, Chuck doesn't
develop products so much as technology. He tinkers with ideas, and the
GA144 is the expression of some ideas. It lacks a great deal and
suffers from its uniqueness, but it is the missing bits that I think
hamper it the most. I have given this a lot of thought and I can't see
an easy fix to the issues.

BTW, the package is the least of its problems. I think that package is
a great compromise between easy prototyping and small size for small
designs. There isn't much call for low power devices in larger
packages. Just ask the FPGA guys. There are some low power FPGAs out
there, but not many in remotely prototyping-friendly packages.

--

Rick

Mux

Sep 2, 2015, 6:03:35 PM
How about audio synthesis? Given the sheer number of nodes, you should be able to generate waveforms pretty easily. Each node / group of nodes could be configured as a simple or complex generator, maybe similar to FM, or otherwise just simple square / sine / etc. waveforms.

Given that you have on-chip DACs, it could be a really nifty synth. Add a software UART, some level converters and an amp, and you could have a pretty decent audio device :-)
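The structure maps naturally onto one oscillator per node plus a mixer node. A rough sketch in Python (floating point at an assumed 44.1 kHz rate; on the GA144 these would be fixed-point loops feeding the DACs):

```python
import math

SAMPLE_RATE = 44100  # assumed output rate for this sketch

def sine_node(freq):
    """One 'node': an endless sine oscillator at freq Hz."""
    phase, step = 0.0, 2 * math.pi * freq / SAMPLE_RATE
    while True:
        yield math.sin(phase)
        phase = (phase + step) % (2 * math.pi)

def square_node(freq):
    """Another 'node' flavour: a naive square wave."""
    phase, step = 0.0, freq / SAMPLE_RATE
    while True:
        yield 1.0 if phase < 0.5 else -1.0
        phase = (phase + step) % 1.0

def mixer(nodes):
    """The mixing 'node': sum its inputs, scaled to avoid clipping."""
    gain = 1.0 / len(nodes)
    while True:
        yield gain * sum(next(n) for n in nodes)

# A two-voice patch: the mixer pulls one sample from each oscillator.
synth = mixer([sine_node(440.0), sine_node(660.0)])
samples = [next(synth) for _ in range(4)]
```

Each generator keeps only its own phase, so the per-node state is tiny, which is the kind of decomposition the GA144's small node memories would force anyway.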

-Y

rickman

Sep 2, 2015, 6:18:31 PM
On 9/2/2015 6:03 PM, Mux wrote:
> How about audio synthesis? Given the shear amount of nodes you should be able to generate waveforms pretty easily. Each node / group of nodes could be configured as a simple / complex generator, maybe similar to FM or otherwise just simple square / sine / etc. waveforms.
>
> Given that you have on-chip DAC's it could be a really nifty synth.. Adding a software UART and adding some level-converters and an amp you could have a pretty decent audio device :-)

Did you read my post? Are you happy with 9-bit audio? The DACs need to
be dithered to get higher resolution, which reduces the effective sample
rate. I'm not sure what rates work at what resolutions with the GA144
DACs. I did a simple analysis of the ADCs and they won't convert 16-bit
audio fast enough for 20 kHz bandwidth. I think you get a sample rate
of around 30 kHz at 16 bits.
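The trade-off is easy to put a number on. With proper dithering, averaging 4^n conversions buys roughly n extra bits, so each added bit costs a factor of four in rate. A back-of-the-envelope helper (the raw rate plugged in below is purely illustrative; the actual GA144 DAC update rate isn't stated in this thread):

```python
def effective_rate(raw_rate_hz, native_bits, target_bits):
    """Sample rate left after oversampling a dithered converter.

    Rule of thumb: each extra bit of resolution on a dithered converter
    costs a factor of 4 in oversampling (quantization SNR improves 3 dB
    per doubling of the averaging length, and a bit is worth ~6 dB).
    """
    extra_bits = target_bits - native_bits
    return raw_rate_hz / (4 ** extra_bits)

# E.g. a hypothetical 16.384 MHz, 9-bit DAC pushed to 16 bits is left
# with only 1 kHz of effective sample rate:
rate = effective_rate(16_384_000, 9, 16)
```

Seven extra bits means 4^7 = 16384x oversampling, which is why the effective rate collapses so quickly.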

This is why I mentioned that a large company would have provided sample
apps to control the ADCs and DACs allowing users to know what to expect
without ramping up the huge learning curve of these devices and trying
it for themselves. If GA has such a sample app I missed it.

--

Rick

Mux

Sep 2, 2015, 7:14:07 PM
> Did you read my post? Are you happy with 9 bit audio? The DACs need to
> be dithered to get higher resolution which reduces effective sample
> rate. I'm not sure what rates work at what resolutions with the GA144
> DACs. I did a simple analysis of the ADCs and they won't convert 16 bit
> audio fast enough for 20 kHz bandwidth. I think you get a sample rate
> of around 30 kHz at 16 bits.
>
> This is why I mentioned that a large company would have provided sample
> apps to control the ADCs and DACs allowing users to know what to expect
> without ramping up the huge learning curve of these devices and trying
> it for themselves. If GA has such a sample app I missed it.
>
> --
>
> Rick

Must have missed that bit... Yeah, that's not that great. Then again, you could always have an external DAC and use one of the outer nodes to push data out. With that in mind, though, hearing aids and Wi-Fi speakers surely can't be a good target then, no?

-Y

rickman

Sep 2, 2015, 8:40:29 PM
I think the dither thing can work ok. For a signal generator I would
want good resolution at higher frequencies than the GA144 might be able
to accomplish... although I'm not sure since I don't know what it will
do really.

--

Rick

Greg

Sep 2, 2015, 8:53:17 PM
In this interview: http://www.informit.com/articles/article.aspx?p=1193856
Donald Knuth says: "Let me put it this way: During the past 50 years, I've
written well over a thousand programs, many of which have substantial size.
I can't think of even five of those programs that would have been enhanced
noticeably by parallelism or multi-threading."

So maybe that's the problem: most programs that we write are sequential in
nature. Or at least our brains have a harder time solving problems by slicing
them into small independent units and then assembling the results together.

At the same time, some very bright people have sometimes said pretty stupid
things too. Maybe the future will prove Don wrong and we will all have
developed the required skills to solve problems using 144 or even thousands of cores efficiently.

rickman

Sep 2, 2015, 9:25:22 PM
On 9/2/2015 8:53 PM, Greg wrote:
> In this interview: http://www.informit.com/articles/article.aspx?p=1193856
> Donald Knuth says: "Let me put it this way: During the past 50 years, I've
> written well over a thousand programs, many of which have substantial size.
> I can't think of even five of those programs that would have been enhanced
> noticeably by parallelism or multi-threading."
>
> So maybe that's the problem, most programs that we write are sequential in
> nature. Or at least our brains have a harder time solving problems by slicing
> into small independent units and then assembling results together.
>
> At the same time, some very bright people have sometimes said pretty stupid
> things too. Maybe the future will prove Don wrong and we will all have had
> developed the required skills to solve problems using 144 or even thousands of cores efficiently.

What about multiple programs running in parallel? That is very common
on nearly every computer down to a lowly 8 bit PIC. It may not be done
with a multitasking OS, but how many times have UART interfaces been
done with interrupts? That is another task, or two actually, TX and RX.
How about polling a push button?

--

Rick

Mux

Sep 3, 2015, 1:28:08 AM
> In this interview: http://www.informit.com/articles/article.aspx?p=1193856
> Donald Knuth says: "Let me put it this way: During the past 50 years, I've
> written well over a thousand programs, many of which have substantial size.
> I can't think of even five of those programs that would have been enhanced
> noticeably by parallelism or multi-threading."
>
> So maybe that's the problem, most programs that we write are sequential in
> nature. Or at least our brains have a harder time solving problems by slicing
> into small independent units and then assembling results together.
>
> At the same time, some very bright people have sometimes said pretty stupid
> things too. Maybe the future will prove Don wrong and we will all have had
> developed the required skills to solve problems using 144 or even thousands of cores efficiently.

I think the thing that's overlooked is that we're kinda up against the (economic) wall of Moore's law, and the only way to improve performance is to go wide. So yeah, it's a lot harder to write programs that are spread out over a multitude of processors. At the same time, tools have improved as well. If you look at GPUs, there's no chance in hell you would have been able to do what they do without parallelism / multiple cores.

I really, really like the premise of the GA144 but what it really needs, like so many other things, is a killer app, or some niche that can only be handled by a large number of cores.

-Y

Paul Rubin

Sep 3, 2015, 1:53:52 AM
Greg <bor...@gmail.com> writes:
> In this interview: http://www.informit.com/articles/article.aspx?p=1193856
> Donald Knuth says: "Let me put it this way: During the past 50 years, I've
> written well over a thousand programs, many of which have substantial size.
> I can't think of even five of those programs that would have been enhanced
> noticeably by parallelism or multi-threading."

To be fair, through most of Knuth's career, computer hardware was very
expensive and almost all computers had single CPUs. So people didn't
care about parallelism as much, because they didn't have the hardware for
it. There was also no giant Internet, so people didn't care about
concurrency as much. These days everyone doing anything significant has
to think about parallelism. E.g. about half of TAOCP vol. 3 is about
munching large (for that era) datasets on tape drives connected to a
single computer. These days those problems are many orders of magnitude
bigger, and they're done on Hadoop clusters (or whatever) across
thousands of nodes in a data center.

Mark Wills

Sep 3, 2015, 3:05:19 AM
On Wednesday, 2 September 2015 23:18:31 UTC+1, rickman wrote:
> Did you read my post? Are you happy with 9 bit audio?

Hey! Don't knock it. The Emulator I and II Synths from E-Mu were
both 8-bit samplers, and they're on more pop and rock records from
the 80s and 90s than you can shake a stick at!

:-)

Stefan Mauerhofer

Sep 3, 2015, 4:27:55 AM
Hi guys

Why not buy an EvalBoard and find out yourself what the GA144 is capable of doing?

Anton Ertl

Sep 3, 2015, 6:51:06 AM
Greg <bor...@gmail.com> writes:
>In this interview: http://www.informit.com/articles/article.aspx?p=1193856
>Donald Knuth says: "Let me put it this way: During the past 50 years, I've
>written well over a thousand programs, many of which have substantial size.
>I can't think of even five of those programs that would have been enhanced
>noticeably by parallelism or multi-threading."

I think that's more a sign that he did not think about parallelism
very much.

The next sentence is: "Surely, for example, multiple processors are no
help to TeX.[1]".

And then there is a footnote (makes me wonder how that interview was
conducted) that says:

|[1] My colleague Kunle Olukotun points out that, if the usage of TeX
|became a major bottleneck so that people had a dozen processors and
|really needed to speed up their typesetting terrifically, a
|super-parallel version of TeX could be developed that uses
|"speculation" to typeset a dozen chapters at once: Each chapter could
|be typeset under the assumption that the previous chapters don't do
|anything strange to mess up the default logic. If that assumption
|fails, we can fall back on the normal method of doing a chapter at a
|time; but in the majority of cases, when only normal typesetting was
|being invoked, the processing would indeed go 12 times faster. Users
|who cared about speed could adapt their behavior and use TeX in a
|disciplined way.

And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
that one can really say that parallelizing it would not enhance it.
Also, parallelizing it would require quite a bit of effort.

>So maybe that's the problem, most programs that we write are sequential in
>nature. Or at least our brains have a harder time solving problems by slicing
>into small independent units and then assembling results together.

I don't think so. Small units are not hard, but inefficient (slower
than sequential). If you want a speedup from parallelizing,
you need relatively big units; and the unit size depends on the
number of cores and the input data, and all that is not very
compatible with the usual abstraction mechanisms that we use to conquer
the complexity of programming.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2015: http://www.rigwit.co.uk/EuroForth2015/

Syd Rumpo

Sep 3, 2015, 8:32:37 AM
On 03/09/2015 09:27, Stefan Mauerhofer wrote:
> Hi guys
>
> Why not buy an EvalBoard and find out yourself what the GA144 is capable of doing?
>
Because I'm not independently wealthy and I need to eat.

Cheers
--
Syd

Bernd Paysan

Sep 3, 2015, 9:54:30 AM
Anton Ertl wrote:
> And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
> that one can really say that paralellizing it would not enhance it.
> Also, parallelizing it would require quite a bit of effort.

Well, TeX is written in an "unmaintainable" programming language, c-web.
Let's go with the paragraphs: TeX tries to render the paragraph with several
different attempts, and chooses the best-"looking", with a metric of what is
"best". That is fairly easy to parallelize, you just run all the trials in
parallel, and then collect the results. The one with the lowest badness
wins, and the others get deleted.
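As a sketch of that structure (Python; the greedy line breaker and the squared-slack badness metric here are crude stand-ins, nothing like TeX's real algorithm): run the trials concurrently, score each, keep the winner.

```python
from concurrent.futures import ThreadPoolExecutor

def break_paragraph(words, line_width):
    """One 'trial': greedy line breaking at a given width."""
    lines, current = [], ""
    for w in words:
        if current and len(current) + 1 + len(w) > line_width:
            lines.append(current)
            current = w
        else:
            current = w if not current else current + " " + w
    if current:
        lines.append(current)
    return lines

def badness(lines, line_width):
    """Crude metric: squared leftover space per line."""
    return sum((line_width - len(line)) ** 2 for line in lines)

def best_layout(words, widths):
    """Run all trials concurrently, keep the one with the lowest badness."""
    with ThreadPoolExecutor() as pool:
        trials = list(pool.map(
            lambda width: (width, break_paragraph(words, width)), widths))
    return min(trials, key=lambda t: badness(t[1], t[0]))
```

The trials share nothing and only the final min-reduction synchronizes, which is why this pattern parallelizes cleanly.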

The problem with parallelism I have is not that I can't imagine it, but that
the overhead is too large. E.g. take net2o: I send and receive messages
(each takes time in the kernel), and I decrypt and encrypt them. With a
four-core CPU, I could run that in four tasks.

However, when trying that, I found that the IPC is killing the performance.
These tasks don't take too much time - en/decrypting a 1k block takes 10k
cycles, and the IPC then takes about as long, too, so all the gain goes down
the drain.

What we need for this sort of lightweight tasks done in parallel is a
lightweight, fast, in-hardware IPC, similar in spirit to the GA144
connections or Transputer links.

Such a link would pass up to one word (that's 64 bits now) in one go, and each
process can wait for data from a link or several links (whichever comes
first, will be used). The links would be routed links, so there's no need
to put a task on a specific core. And the registers of a waiting task would
still be kept in the register file; maybe you can have 2 active tasks per
core (SMT), and 8 waiting. If the OS decides that one task needs to be
swapped out of the core, its links become dangling, and will send a "wake me
up" token to the scheduler task (which the OS never throws out of a core).

This sort of GA144-like nanosecond turnaround for IPC is necessary; the
microseconds we need today are completely unacceptable for distributing small
tasks.

And many tasks that can be parallelized *are* small tasks.

The next issue I have is that the shared memory approach is broken: it is
way too hard to debug. I'd prefer a single-writer, multiple-reader shared
memory approach, so each process would have its writable memory, and all
other threads can only read that. You may have common shared memory for
something like the root pointers of RCU trees (you can read and copy from
other nodes into your own writable memory, and then update the root
pointer), which need more care, but most of the memory really shouldn't be
writable by every thread. A common address space is still ok.

That could be done with some more fields in the page table entry, having
access rights for the thread itself and others. The thread itself is
identified by some higher bits in the virtual address, so the property "this
thread" vs "other threads" is decided by the first level page table.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
net2o ID: kQusJzA;7*?t=uy@X}1GWr!+0qqp_Cn176t4(dQ*
http://bernd-paysan.de/

rickman

Sep 3, 2015, 11:32:13 AM
That is what I was told by GA when I was trying to push the timing of
the parallel I/O for an SDRAM interface. I asked for detailed timing
info just as you would need for any memory interface. It was suggested
that I should build it and see how fast it ran. lol Not only might
that be a big waste of my time, it wouldn't tell me how fast it would
run on any GA144 other than the one I was testing it on.

--

Rick

rickman

Sep 3, 2015, 11:32:17 AM
What you say is true. Much of what we currently do when designing
systems is based on the idea that a CPU is the expensive part of the
computer and all else is intended to optimize its use.

But I recall an early machine that had 64 compute heads. I don't recall
the name, but it may have been Eniac. I think it had 64 ALUs with SIMD.
Not the same thing as multiple CPUs, but they were thinking about
parallelism even very early on. So obviously there was one of two
problems with it. Either it was very hard to do (programming) or it
didn't give much improvement to most programming problems. Likely both.

But with today's multitasking OSs running on big iron, I can't see how
having hundreds of processors wouldn't be useful. Now the bottleneck
is in memory access. Even at the low end, the use of simple CPUs with
lots of performance and extremely low cost per unit has its advantages.
Before we see that being used much, it will take a lot of other work
to deal with the two problems discovered some 60 years ago.

--

Rick

Anton Ertl

Sep 3, 2015, 12:54:45 PM
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>> And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
>> that one can really say that paralellizing it would not enhance it.
>> Also, parallelizing it would require quite a bit of effort.
>
>Well, TeX is written in an "unmaintainable" programming language, c-web.
>Let's go with the paragraphs: TeX tries to render the paragraph with several
>different attempts, and chooses the best-"looking", with a metric of what is
>"best". That is fairly easy to parallelize, you just run all the trials in
>parallel, and then collect the results. The one with the lowest badness
>wins, and the others get deleted.

And you can do many paragraphs in parallel.

>The problem with parallelism I have is not that I can't imagine it, but that
>the overhead is too large. E.g. take net2o: I send and receive messages
>(each takes time in the kernel), and I decrypt and encrypt them. With a
>four-core CPU, I could run that in four tasks.
>
>However, when trying that, I found that the IPC is killing the performance.
>These tasks don't take too much time - en/decrypting a 1k block takes 10k
>cycles, and the IPC then takes about as long, too, so all the gain goes down
>the flush.
>
>What we need for this sort of lightweight tasks done in parallel is a
>lightweight, fast, in-hardware IPC, similar in spirit to the GA144
>connections or Transputer links.

If it's just communication, that can go pretty fast through shared
memory; on a single-socket multi-core CPU, that's a trip through the
L3 cache (<100 cycles latency for the first cache line; and you can
communicate pretty long blocks with pretty low additional overhead).

If it's synchronization, that's a bit more expensive, but also
determined by L3 latency in such a system if you do it on the user
level (with locked instructions or similar).

If you manage to use large blocks and synchronization is rare, you can
make good use of multiple cores. OTOH, if you need to create lots of
short-running tasks and combine their results, you will probably see a
lot of overhead.

>The next issue I have is that the shared memory approach is broken, it is
>way too hard to debug.

Yes, especially with the perverse consistency models that some
architectures provide. But as a primitive for better models (such as
pipelining), it's not too bad. I first thought that one would need
something like links for implementing pipelines, but it can be done
with shared memory.

> I'd prefer a single-writer, multiple reader shared
>memory approach, so each process would have its writable memory, and all
>other threads can only read that. You may have common shared memory for
>something like the root pointers of RCU trees (you can read and copy from
>other nodes into your own writable memory, and then update the root
>pointer), which need more care, but most of the memory really shouldn't be
>writable by every thread. A common address space is still ok.
>
>That could be done with some more fields in the page table entry, having
>access rights for the thread itself and others. The thread itself is
>identified by some higher bits in the virtual address, so the property "this
>thread" vs "other threads" is decided by the first level page table.

You can get that by mmap()ing the memory once read-only and once
read/write. If you want to mmap them to the same address, I think
you need to do that in two different processes.
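A quick demonstration of the double-mapping idea in Python (a single process mapping one temp file twice; the two-process, same-address variant would need more machinery):

```python
import mmap
import os
import tempfile

def demo():
    """Map one file twice: a writable view and a read-only view."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\x00" * mmap.PAGESIZE)   # back the mapping with a page
        writer = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
        reader = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)

        writer[0:5] = b"hello"                  # the single writer updates
        seen = bytes(reader[0:5])               # readers see it immediately

        try:
            reader[0] = 0                       # the read-only view...
            rejected = False
        except TypeError:                       # ...rejects writes outright
            rejected = True

        writer.close()
        reader.close()
        return seen, rejected
    finally:
        os.close(fd)
        os.remove(path)
```

The read-only mapping gives exactly the single-writer discipline Bernd asks for: any thread holding only that view simply cannot scribble on the memory.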

But that would still expose the programs to the perversities of the
consistency models.

Marcel Hendrix

Sep 3, 2015, 12:57:12 PM
Bernd Paysan <bernd....@gmx.de> writes Re: GA144 use cases

> Anton Ertl wrote:
>> And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
>> that one can really say that paralellizing it would not enhance it.
>> Also, parallelizing it would require quite a bit of effort.

The NGSPICE manual is 650 pages and takes more than two minutes to process.
When maintaining the manual (small, infrequent edits), that is too long, and
quality suffers (one avoids making a single edit, or doesn't check the
outcome).

> Well, TeX is written in an "unmaintainable" programming language, c-web.
> Let's go with the paragraphs: TeX tries to render the paragraph with several
> different attempts, and chooses the best-"looking", with a metric of what is
> "best". That is fairly easy to parallelize, you just run all the trials in
> parallel, and then collect the results. The one with the lowest badness
> wins, and the others get deleted.

That approach would suffer unbearably from Amdahl's law, wouldn't it?

> The problem with parallelism I have is not that I can't imagine it, but that
> the overhead is too large. E.g. take net2o: I send and receive messages
> (each takes time in the kernel), and I decrypt and encrypt them. With a
> four-core CPU, I could run that in four tasks.

Why not have a writable instruction set for these kind of problems?

> However, when trying that, I found that the IPC is killing the performance.
> These tasks don't take too much time - en/decrypting a 1k block takes 10k
> cycles, and the IPC then takes about as long, too, so all the gain goes down
> the flush.

I am an ngspice maintainer and see the same thing there. The actual device code
that is executed per time step is almost always so small that e.g. OpenMP
does not work. Although all cores run at 100%, the simulation takes 20%
longer. It works a little bit for some transistor models that are extremely
lengthy, but there Amdahl's law cuts in again when all the other processing is
included.
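Amdahl's law makes that ceiling concrete: if only a fraction p of the work parallelizes, n cores can never speed the whole run up past 1/(1-p), before any IPC overhead is even counted (the fractions below are made-up illustrations, not measured ngspice numbers):

```python
def amdahl_speedup(p, n):
    """Overall speedup when a fraction p of the work runs n-way parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# If only 20% of a timestep (say, the lengthy transistor models)
# parallelizes, even 1000 cores give less than a 1.25x speedup:
ceiling = amdahl_speedup(0.2, 1000)
```

And the formula assumes zero coordination cost; add the IPC overhead described above and the "speedup" can easily go below 1, which matches the all-cores-busy-yet-20%-slower observation.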

-marcel

rickman

Sep 3, 2015, 1:06:31 PM
On 9/3/2015 12:11 PM, Anton Ertl wrote:
> Bernd Paysan <bernd....@gmx.de> writes:
>> Anton Ertl wrote:
>>> And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
>>> that one can really say that paralellizing it would not enhance it.
>>> Also, parallelizing it would require quite a bit of effort.
>>
>> Well, TeX is written in an "unmaintainable" programming language, c-web.
>> Let's go with the paragraphs: TeX tries to render the paragraph with several
>> different attempts, and chooses the best-"looking", with a metric of what is
>> "best". That is fairly easy to parallelize, you just run all the trials in
>> parallel, and then collect the results. The one with the lowest badness
>> wins, and the others get deleted.
>
> And you can do many paragraphs in parallel.

I think the point is that until the previous paragraphs have been
processed, you can't finalize the current paragraph. It may have
different margins or be on a page break, etc. all dependent on previous
document code.

Like many programs, I expect there would be ways to parallelize even
word processor code if it is considered. Some have said parallel code
needs to be done at a high level, but I think it would be better done at
the detail level. But then I am thinking of low level code such as would
be run on the GA144, not parallel code run on the x86 megalith.

--

Rick

rickman

unread,
Sep 3, 2015, 1:08:43 PM9/3/15
to
Are you talking about the distortion based music? I remember an ad for
some line of stereos back in the day where they were saying their
equipment didn't add distortion.... then next to a picture of a hard
rock group, "it won't take away any either"

--

Rick

Paul Rubin

Sep 3, 2015, 2:57:05 PM
rickman <gnu...@gmail.com> writes:
> But I recall an early machine that had 64 compute heads. I don't
> recall the name, but it may have been Eniac.

You probably mean ILLIAC IV. It was an array processor built in the
1960's, but it was a unique, specialized academic research machine,
that most programmers wouldn't have thought about.

> But with today's multitasking OSs running on big iron, I can't see how
> having hundreds of processors wouldn't be useful. Now the bottle neck
> is in memory access.

Actually graphics cards now have 100's of processors with local memory,
and consumer laptop cpu's have graphics extensions with dozens of
processors. I forgot about those entirely. And there is the Xeon Phi
with 60 or so Pentium-like cores. Big-data parallelism though is done
mostly with networks of PC-like computers in a data center.

Bernd Paysan

unread,
Sep 3, 2015, 4:08:47 PM9/3/15
to
Anton Ertl wrote:
> If it's just communication, that can go pretty fast through shared
> memory; on a single-socket multi-core CPU, that's a trip through the
> L3 cache (<100 cycles latency for the first cache line; and you can
> communicate pretty long blocks with pretty low additional overhead).

No, the main problem is telling the other thread that there is some data for
it. That's using an OS IPC call, and that takes several thousand cycles.
You can pass around a lot of data between the cores; that works well.

AMD's recent GPUs have hardware support for asynchronous tasks, and that
includes activating them. That's the tough part, passing the data around is
already solved.

NVidia screwed that up completely, and as a result, using asynchronous tasks
on a modern NVidia card (with DirectX 12) doesn't work.

> If you manage to use large blocks and synchronization is rare, you can
> make good use of multiple cores. OTOH, if you need to create lots of
> short-running tasks and combine their results, you will probably see a
> lot of overhead.

Yes, and that's the problem. There are a lot of low-hanging fruits for
parallelization which are short-running tasks.

>>The next issue I have is that the shared memory approach is broken, it is
>>way too hard to debug.
>
> Yes, especially with the perverse consistency models that some
> architectures provide. But as a primitive for better models (such as
> pipelining), it's not too bad. I first thought that one would need
> something like links for implementing pipelines, but it can be done
> with shared memory.

Plus IPC, to notify the other task that there is something to do. Or you
spin-loop, which means each task blocks an entire core, even if they don't
have much work -> not a good idea.

> You can get that by mmap()ing the memory once read-only and once
> read/write. If you want to mmap them to the same address, I think
> you need to do that in two different processes.

Multiple threads in one process all see the same mapping. It works between
different processes, because they have different page tables.

If you want to avoid that overhead, you need hardware support.

> But that would still expose the programs to the perversities of the
> consistency models.

Yes. The consistency models are still a problem with single-writer,
multiple-readers.

Paul Rubin

unread,
Sep 3, 2015, 5:00:21 PM9/3/15
to
Bernd Paysan <bernd....@gmx.de> writes:
> No, the main problem is telling the other thread that there is some data for
> it. That's using an OS IPC call, and that takes several thousand cycles.
> You can pass around a lot of data between the cores; that works well.

In the uncontended case you check a futex in user space and proceed if
there's something to do, or yield to the kernel if there's nothing to
do. There's only IPC overhead if there's contention in which case you
have to make a system call.
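A minimal Python sketch of this fast path (illustrative only — a real futex
operates on a shared integer with kernel wait queues; the class and field
names here are invented): the producer touches the kernel (here modeled by
threading.Event) only when the consumer has actually parked itself.

```python
import threading

class FastPathQueue:
    """Sketch of the futex idea: the common case is a plain user-space
    append/pop; the kernel wakeup path runs only when the consumer has
    gone to sleep on an empty queue."""
    def __init__(self):
        self.items = []
        self.lock = threading.Lock()
        self.sleeping = False
        self.wakeup = threading.Event()

    def put(self, item):
        with self.lock:
            self.items.append(item)
            need_wake = self.sleeping
        if need_wake:               # slow path: consumer is parked
            self.wakeup.set()

    def get(self):
        while True:
            with self.lock:
                if self.items:
                    return self.items.pop(0)   # fast path, no "syscall"
                self.sleeping = True
                self.wakeup.clear()
            self.wakeup.wait()      # slow path: park until signaled
            with self.lock:
                self.sleeping = False
```

The loop in get() tolerates spurious wakeups by simply rechecking the queue,
which is also how real futex-based waits are written.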

Bernd Paysan

unread,
Sep 3, 2015, 5:58:59 PM9/3/15
to
Marcel Hendrix wrote:

> Bernd Paysan <bernd....@gmx.de> writes Re: GA144 use cases
>
>> Anton Ertl wrote:
>>> And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
>>> that one can really say that paralellizing it would not enhance it.
>>> Also, parallelizing it would require quite a bit of effort.
>
> The NGSPICE manual is 650 pages and takes more than two minutes to
> process. When maintaining the manual (small, infrequent edits) that is too
> long and quality suffers (one doesn't do a single edit, or one doesn't
> want to check the outcome).

That gives an estimate of how long it takes per paragraph: let's assume
10 paragraphs per page (some of them small, some larger); you maybe have
6000 paragraphs in your document. That makes 20ms per paragraph.

That's in the ballpark where the microsecond IPC is a non-issue, and where
algorithmic improvements are probably an even lower-hanging fruit.

> That approach would suffer unbearably from Amdahl's law, doesn't it?

When it's really 20ms per paragraph, and maybe 2ms per variant, no.

>> The problem with parallelism I have is not that I can't imagine it, but
>> that
>> the overhead is too large. E.g. take net2o: I send and receive messages
>> (each takes time in the kernel), and I decrypt and encrypt them. With a
>> four-core CPU, I could run that in four tasks.
>
> Why not have a writable instruction set for these kind of problems?

If Intel puts some FPGA stuff into each core (after all, they bought
Altera), and connects the FPGAs together, that would solve part of the job.
The other part, keeping some contexts in the core's large register file, but
not executing the instructions and waiting for wakeup, needs to go deeper
into the details.

However, for the crypto stuff, I'd probably rather use the FPGA to implement
the crypto primitive (Keccak) directly, that's using a few thousand LEs with
a potentially higher speedup than parallelization.

>> However, when trying that, I found that the IPC is killing the
>> performance. These tasks don't take too much time - en/decrypting a 1k
>> block takes 10k cycles, and the IPC then takes about as long, too, so all
>> the gain goes down the flush.
>
> I am an ngspice maintainer and see the same thing there. The actual device
> code that is executed per time step is almost always so small that e.g.
> openmp does not work. Although all cores run at 100%, the simulation takes
> 20% longer. It works a little bit for some transistor models that are
> extremely lengthy, but there Amdahl's law cuts in again when all other
> processing is included.

Yes, that's why we must reduce the communication overhead. Amdahl's law is
there, because communication between threads is way too expensive.

Bernd Paysan

unread,
Sep 3, 2015, 7:28:57 PM9/3/15
to
Unfortunately, the contention is the rule, and the uncontended case the
exception. The problem of asynchronous tasks is that either the producer or
the consumer waits, and that's the contended case. So either the producer
has to tell the consumer "good morning, time to wake up, I've some new data
for you", or the consumer says "good afternoon, producer, stop napping, and
give me another chunk of data".

In a non-trivial parallel program, you have tasks which run short, and tasks
which run longer, joins where you wait for all inputs to complete, and so
on. This only works well if the context switch and wake/sleep IPC by itself
is a lightweight operation.

franck....@gmail.com

unread,
Sep 4, 2015, 2:04:12 AM9/4/15
to
Hello,

Oforth parallelism model is lightweight tasks that can communicate
using channels.

Tasks are created using & on a word (or a quotation or a closure).
They are not threads, they are just small objects that can run
concurrently. They can be long-running or short-running objects.

For instance, writing this on the interpreter :
#[ "Hello world" . 5000 System sleep "Done" . ] &

will run this quotation into a separate task.

.w lists all tasks that are currently running or waiting in the system.

A task can wait on a channel using the #receive word. If the channel is empty, the
task stops until something is in the channel. When stopped, a task
does not use CPU (no spin-loop, no system call) and does not block any core or thread.

Channels are multiple-writers / multiple-readers queues.
They are the only way to synchronize tasks. There is no other mechanism for
synchronization (no CAS, no mutex, no semaphore, ... nothing). Tasks are never
waiting for other tasks, but for a resource. And if the resource is empty, the
task just stops and will resume when the resource is available.

For instance, a basic ping pong between 2 tasks :

: pongTask(n, ping, pong) // ( aChannel aChannel aInteger -- )
{
| i |
n loop: i [ ping receive pong send drop ]
}

: pingpong(n) // ( aInteger -- )
{
| ping pong i |
Channel new ->ping
Channel new ->pong
#[ pong ping n pongTask ] &
n loop: i [ i ping send drop pong receive . ]
}

This will write all integers between 1 and n.
- Two channels are created.

- A closure that runs pongTask is created and launched using &
==> pongTask is now running in parallel, waiting for something in the ping
channel.

- Each integer is sent to the ping channel.
- pongTask receives it and resends it to the pong channel.
- The pingpong function waits for the reply on the pong channel and writes the result.
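For comparison, the same ping-pong pattern can be sketched in Python, using
queue.Queue in the role of the channels (a rough analogy only — Python
threads are much heavier than Oforth's lightweight tasks):

```python
import queue
import threading

def pong_task(n, ping, pong):
    # Receive n values from the ping channel and echo each to pong.
    for _ in range(n):
        pong.put(ping.get())

def ping_pong(n):
    ping, pong = queue.Queue(), queue.Queue()
    # Launch the echo task, like `&` in the Oforth example.
    threading.Thread(target=pong_task, args=(n, ping, pong)).start()
    out = []
    for i in range(1, n + 1):
        ping.put(i)            # send i on the ping channel
        out.append(pong.get()) # wait for the echoed reply
    return out
```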

Franck

Anton Ertl

unread,
Sep 4, 2015, 4:06:24 AM9/4/15
to
m...@iae.nl (Marcel Hendrix) writes:
>Bernd Paysan <bernd....@gmx.de> writes Re: GA144 use cases
>> Well, TeX is written in an "unmaintainable" programming language, c-web.
>> Let's go with the paragraphs: TeX tries to render the paragraph with several
>> different attempts, and chooses the best-"looking", with a metric of what is
>> "best". That is fairly easy to parallelize, you just run all the trials in
>> parallel, and then collect the results. The one with the lowest badness
>> wins, and the others get deleted.
>
>That approach would suffer unbearably from Amdahl's law, doesn't it?

Does it? What is the sequential component? Probably input parsing.
Although even for that you do not have to parse everything before
formatting the first paragraph. How much is the time spent in input
parsing?

I think it might be a lot for those inputs that take a lot of time (at
least I have a TeX source that collects data from several files and
then produces a document from that, and that takes a lot more time
than the Gforth manual, which produces a document that has six times as
many pages). Looking at your numbers, the ngspice manual might also be
such a case. I guess one would need other speedup techniques for that
(and parallelizing would probably not be the one to try first).

Anton Ertl

unread,
Sep 4, 2015, 4:13:54 AM9/4/15
to
rickman <gnu...@gmail.com> writes:
>On 9/3/2015 12:11 PM, Anton Ertl wrote:
>> Bernd Paysan <bernd....@gmx.de> writes:
>>> Anton Ertl wrote:
>> And you can do many paragraphs in parallel.
>
>I think the point is that until the previous paragraphs have been
>processed, you can't finalize the current paragraph. It may have
>different margins or be on a page break, etc. all dependent on previous
>document code.

Different margins are rare for TeX documents, so you can format the
paragraph assuming the usual margins, and if the margins for the
current paragraph are as you assumed, you don't need to recompute it.
Page breaks don't influence paragraphs in TeX AFAIK, because the model
is that text is rendered to a scroll, and that is then cut into pages.
Also, TeX (at least LaTeX) takes page numbers and other references
from the last run, so you don't get any dependencies from that, either
(you may need to rerun it until it's stable, but that also happens for
the sequential version).

Anton Ertl

unread,
Sep 4, 2015, 4:24:38 AM9/4/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>> If it's just communication, that can go pretty fast through shared
>> memory; on a single-socket multi-core CPU, that's a trip through the
>> L3 cache (<100 cycles latency for the first cache line; and you can
>> communicate pretty long blocks with pretty low additional overhead).
>
>No, the main problem is telling the other thread that there is some data for
>it. That's using an OS IPC call, and that takes several thousand cycles.
>You can pass around a lot of data between the cores; that works well.

That's synchronization, and you don't need to do that through the OS;
you can also do it at user level. Unfortunately, I don't know of a
portable user-level library for this stuff.

>> If you manage to use large blocks and synchronization is rare, you can
>> make good use of multiple cores. OTOH, if you need to create lots of
>> short-running tasks and combine their results, you will probably see a
>> lot of overhead.
>
>Yes, and that's the problem. There are a lot of low-hanging fruits for
>parallelization which are short-running tasks.

However, even for that you do not need OS-level synchronization for
every task: the boss can set up lots of tasks, and each worker thread
takes a task, performs it, then calls the per-thread scheduler, which
takes and performs the next task. If there are many short tasks, a
whole batch of them should be assigned to one worker to reduce
synchronization overhead between workers.
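A sketch of that batching scheme in Python (names and batch size are
invented for illustration): workers synchronize once per batch of short
tasks instead of once per task.

```python
import threading

def run_batched(tasks, n_workers=4, batch=32):
    """Boss/worker sketch: the boss sets up a list of short tasks,
    and each worker grabs a whole batch per synchronization point,
    amortizing the overhead over many tasks."""
    results = []
    results_lock = threading.Lock()
    next_index = [0]
    index_lock = threading.Lock()

    def worker():
        while True:
            with index_lock:            # one sync per batch, not per task
                start = next_index[0]
                next_index[0] += batch
            chunk = tasks[start:start + batch]
            if not chunk:
                return                  # no work left
            done = [t() for t in chunk] # run the whole batch unsynchronized
            with results_lock:
                results.extend(done)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Result order is nondeterministic across workers, so callers that care about
order should carry an index in each task.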

>> You can get that by mmap()ing the memory once read-only and once
>> read/write. If you want to mmap them to the same address, I think
>> you need to do that in two different processes.
>
>Multiple threads in one process all see the same mapping. It works between
>different processes, because they have different page tables.

Yes, so if you want different permissions for threads of one process,
you need to map the same thing two times (at different addresses),
once read-write and once read-only.
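The double-mapping trick can be demonstrated in a few lines of Python (a
sketch within one process; file and variable names are invented): the same
backing pages appear once writable and once read-only, at different
addresses.

```python
import mmap
import os
import tempfile

# One shared buffer, mapped twice: a read-write "producer" view and a
# read-only "consumer" view of the same pages.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)
rw = mmap.mmap(fd, 4096, access=mmap.ACCESS_WRITE)
ro = mmap.mmap(fd, 4096, access=mmap.ACCESS_READ)

rw[0:5] = b"hello"
assert ro[0:5] == b"hello"      # same backing pages, seen read-only

try:
    ro[0:1] = b"x"              # the read-only view rejects writes
    write_rejected = False
except TypeError:
    write_rejected = True

os.close(fd)
os.unlink(path)
```

As the posts note, the two views live at different addresses, so pointers
into one view do not work in the other — that relocation problem is exactly
what per-thread permissions in hardware would avoid.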

Anton Ertl

unread,
Sep 4, 2015, 4:43:22 AM9/4/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>Paul Rubin wrote:
>> In the uncontended case you check a futex in user space and proceed if
>> there's something to do, or yield to the kernel if there's nothing to
>> do. There's only IPC overhead if there's contention in which case you
>> have to make a system call.
>
>Unfortunately, the contention is the rule, and the uncontended case the
>exception. The problem of asynchronous tasks is that either the producer or
>the consumer waits, and that's the contended case.

No, the contended case is when both want to change the status at the
same time.

> So either the producer
>has to tell the consumer "good morning, time to wake up, I've some new data
>for you", or the consumer says "good afternoon, producer, stop napping, and
>give me another chunk of data".

Sounds to me like you have a GA144-like model with little or no
buffering. Then of course you will have little parallelism; after
all, the GA144 was designed to let many of the cores sleep most of the
time.

But that's the wrong model for an AMD64 or ARM CPU. If you really
have lots of parallelism, then there is no need for an OS-level
synchronization after every piece of work; instead, each thread calls
its scheduler, which looks in the list of tasks to be done, and then
performs the next one, possibly producing another task (in the case of
a producer). Most of that can be done with user-level
synchronization, and if even that is too expensive, reduce it with
batching. You do need buffering for that, however.

>In a non-trivial parallel program, you have tasks which run short, and tasks
>which run longer, joins where you wait for all inputs to complete, and so
>on. This only works well if the context switch and wake/sleep IPC by itself
>is a lightweight operation.

So don't use the OS in the normal case.

Bernd Paysan

unread,
Sep 4, 2015, 9:09:55 AM9/4/15
to
Anton Ertl wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>>Paul Rubin wrote:
>>> In the uncontended case you check a futex in user space and proceed if
>>> there's something to do, or yield to the kernel if there's nothing to
>>> do. There's only IPC overhead if there's contention in which case you
>>> have to make a system call.
>>
>>Unfortunately, the contention is the rule, and the uncontended case the
>>exception. The problem of asynchronous tasks is that either the producer
>>or the consumer waits, and that's the contended case.
>
> No, the contended case is when both want to change the status at the
> same time.

No, that's not a resource lock, that's a wakeup signal. Therefore, the
contended (expensive) case is when one or the other sleeps, and the
uncontended case is when both are awake.

>> So either the producer
>>has to tell the consumer "good morning, time to wake up, I've some new
>>data for you", or the consumer says "good afternoon, producer, stop
>>napping, and give me another chunk of data".
>
> Sounds to me like you have a GA144-like model with little or no
> buffering. Then of course you will have little parallelism; after
> all, the GA144 was designed to let many of the cores sleep most of the
> time.

When the consumers are faster than the producers, your buffers won't fill up.
You can introduce arbitrary delays to keep your buffers filled, but this
also reduces performance, as other tasks will be waiting much longer for the
results.

And yes, I've a model in mind, where more compute cores are sleeping than
working; with a x64-style implementation, you would use SMT for that, not
multiple cores. Some part of the large register file is for sleeping tasks.

> But that's the wrong model for an AMD64 or ARM CPU. If you really
> have lots of parallelism, then there is no need for an OS-level
> synchronization after every piece of work; instead, each thread calls
> its scheduler, which looks in the list of tasks to be done, and then
> performs the next one, possibly producing another task (in the case of
> a producer). Most of that can be done with user-level
> synchronization, and if even that is too expensive, reduce it with
> batching. You do need buffering for that, however.

And buffering reduces performance, as it requires way more actual
parallelism.

>>In a non-trivial parallel program, you have tasks which run short, and
>>tasks which run longer, joins where you wait for all inputs to complete,
>>and so
>>on. This only works well if the context switch and wake/sleep IPC by
>>itself is a lightweight operation.
>
> So don't use the OS in the normal case.

Anton, that's the entire point of suggesting that the wake/sleep IPC should
be done in hardware.

Bernd Paysan

unread,
Sep 4, 2015, 9:13:07 AM9/4/15
to
Anton Ertl wrote:

>>Multiple threads in one process all see the same mapping. It works
>>between different processes, because they have different page tables.
>
> Yes, so if you want different permissions for threads of one process,
> you need to map the same thing two times (at different addresses),
> once read-write and once read-only

That would still give the other thread write-access to the read-write
mapping, and you have to solve the relocation problem, too. You really want
the data to sit on the same addresses,so that all pointers work without
trickery.

I don't see a way to have different permissions for threads in one process
cheaply (i.e. without having separate page tables for each thread) without
hardware support.

rickman

unread,
Sep 4, 2015, 11:49:53 AM9/4/15
to
On 9/4/2015 4:07 AM, Anton Ertl wrote:
> rickman <gnu...@gmail.com> writes:
>> On 9/3/2015 12:11 PM, Anton Ertl wrote:
>>> Bernd Paysan <bernd....@gmx.de> writes:
>>>> Anton Ertl wrote:
>>> And you can do many paragraphs in parallel.
>>
>> I think the point is that until the previous paragraphs have been
>> processed, you can't finalize the current paragraph. It may have
>> different margins or be on a page break, etc. all dependent on previous
>> document code.
>
> Different margins are rare for TeX documents, so you can format the
> paragraph assuming the usual margins, and if the margins for the
> current paragraph are as you assumed, you don't need to recompute it.
> Page breaks don't influence paragraphs in TeX AFAIK, because the model
> is that text is rendered to a scroll, and that is then cut into pages.
> Also, TeX (at least LaTeX) takes page numbers and other references
> from the last run, so you don't get any dependencies from that, either
> (you may need to rerun it until it's stable, but that also happens for
> the sequential version).

I'm not sure the paged waterfall approach works properly. Anytime you
have an illustration, table, etc, it will be fitted into the page
depending on the page break. If text is flowing around it potentially
two pages will need to be reformatted depending on which page the
illustration ends up on. How is this handled?

--

Rick

Paul Rubin

unread,
Sep 5, 2015, 1:27:13 AM9/5/15
to
Bernd Paysan <bernd....@gmx.de> writes:
> Unfortunately, the contention is the rule, and the uncontended case
> the exception. The problem of asynchronous tasks is that either the
> producer or the consumer waits, and that's the contended case.

If one of them is waiting then that's uncontended. The one that is
waiting is in a kernel sleep. Assuming both tasks actually do
significant computation (otherwise there's no point in putting them on
different cores), if they're communicating through a task queue, they
should both spend almost no time dealing with the lock (futex). The
producer generates a new task, acquires the lock, puts the task on the
queue, and releases the lock. The consumer finishes doing a task and
checks whether there's a new one by acquiring the lock and similarly
manipulating the queue. Both queue manipulations take maybe 100 cycles
and there is only contention if both processes are simultaneously doing
that. And of course you can use multiple queues etc.

New Intel processors have some kind of hardware assistance for memory
transactions (TSX extension) but I don't currently understand how it
works.

Bernd Paysan

unread,
Sep 5, 2015, 7:07:25 AM9/5/15
to
Paul Rubin wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>> Unfortunately, the contention is the rule, and the uncontended case
>> the exception. The problem of asynchronous tasks is that either the
>> producer or the consumer waits, and that's the contended case.
>
> If one of them is waiting then that's uncontended.

Like Anton, you don't get it. The wake signal "contention" is when you need
a state transition, sleep to active. It's not like a mutex where contention
means "both want the same resource at the same time".

> The one that is
> waiting is in a kernel sleep. Assuming both tasks actually do
> significant computation (otherwise there's no point in putting them on
> different cores), if they're communicating through a task queue, they
> should both spend almost no time dealing with the lock (futex). The
> producer generates a new task, acquires the lock, puts the task on the
> queue, and releases the lock. The consumer finishes doing a task and
> checks whether there's a new one by acquiring the lock and similarly
> manipulating the queue. Both queue manipulations take maybe 100 cycles
> and there is only contention if both processes are simultaneously doing
> that. And of course you can use multiple queues etc.

You can only continue working with a queue that's filling up. In this case
you'll end up with the producer sleeping and waking, as the queue is not
infinite in size, and you really need a distribution of workload where the
consumer has more work to do than the producer. If the consumer has only
slightly less workload than the producer, the queue will always be empty.
That was the case with my net2o experiment: The kernel took longer to
deliver a packet through the socket layer than Keccak took to decrypt it.
So the consumer (decrypt task) always had an empty queue.

The pushing into the queue does not even require setting a lock. It's a
single writer, single reader situation. You just put your data into the
queue, update the "end" counter, and then you need to make sure the consumer
is awake. That's the only expensive operation.
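A sketch of that single-writer/single-reader queue (illustrative; the class
is invented here, and the wakeup step it leaves out is exactly the expensive
operation under discussion):

```python
class SPSCRing:
    """Single-producer/single-consumer ring buffer sketch.
    The producer writes data and then advances `tail`; the consumer
    reads and then advances `head`. With exactly one writer per
    counter no lock is needed (assuming sequentially consistent
    counter updates, as CPython provides); the one expensive step
    left in a real system is waking a sleeping peer."""
    def __init__(self, capacity=16):
        self.buf = [None] * capacity
        self.cap = capacity
        self.head = 0   # advanced only by the consumer
        self.tail = 0   # advanced only by the producer

    def push(self, item):
        if self.tail - self.head == self.cap:
            return False            # full: producer would sleep here
        self.buf[self.tail % self.cap] = item
        self.tail += 1              # publish only after the data is written
        return True

    def pop(self):
        if self.head == self.tail:
            return None             # empty: consumer would sleep here
        item = self.buf[self.head % self.cap]
        self.head += 1
        return item
```

The publication order matters: the producer stores the item before bumping
`tail`, so the consumer never observes a slot before it is filled.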

> New Intel processors have some kind of hardware assistance for memory
> transactions (TSX extension) but I don't currently understand how it
> works.

That's essentially for spin locks. If you understand how a database works,
it's easy to understand: You read several values in transaction mode, and
then you write your results into a commit buffer. Finally, you either
commit (if none of the read values have been changed in the meantime), or
you rinse and repeat.
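That read-validate-commit loop can be sketched in software (a miniature
optimistic transaction over one cell; the class and names are invented, and
a lock stands in for the hardware commit point that TSX provides):

```python
import threading

class VersionedCell:
    """Software sketch of the optimistic scheme: read under a version
    number, compute off to the side, and commit only if the version is
    unchanged; otherwise rinse and repeat."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._commit = threading.Lock()   # stands in for the hardware commit

    def transact(self, fn):
        while True:
            v, val = self.version, self.value   # transactional read
            new = fn(val)                       # compute without locking
            with self._commit:
                if self.version == v:           # nothing changed: commit
                    self.value = new
                    self.version = v + 1
                    return new
            # a concurrent commit invalidated our read: retry
```

If a concurrent commit slips in between reading the version and the value,
the version check at commit time fails and the transaction simply retries,
so no torn update can ever be committed.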

Anton Ertl

unread,
Sep 5, 2015, 7:17:07 AM9/5/15
to
I have seen examples of shape-formatted text formatted with TeX, but
I did not follow how it was done. In LaTeX without special packages
for such things, floats (figures and tables) are full-width (a column
wide or a page wide), so they do not affect paragraph formatting.
Even if you have a few paragraphs flowing around a figure, you just
need to reformat those, and can use the parallel-processed ones for
the majority of cases.

Anton Ertl

unread,
Sep 5, 2015, 7:29:18 AM9/5/15
to
Paul Rubin <no.e...@nospam.invalid> writes:
>New Intel processors have some kind of hardware assistance for memory
>transactions (TSX extension) but I don't currently understand how it
>works.

AFAIK they don't specify that, so it can be implemented in different
ways, but the little I have read about it sounded like an extended
form of load-locked/store-conditional to me: I.e., one CPU does things
in cache lines, and if another CPU accesses these cache lines in the
meantime, the transaction is rolled back to the beginning. In any
case, the minimal latency for some synchronization thing is still the
same (determined by the round-trip through the shared cache), but you
can do more sophisticated synchronization stuff with it.

Paul Rubin

unread,
Sep 5, 2015, 12:05:54 PM9/5/15
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> Even if you have a few paragraphs flowing around a figure, you just
> need to reformat those, and can use the parallel-processed ones for
> the majority of cases.

TeX is essentially an imperative programming language, e.g. any
paragraph might have a command that changes the font size. Of course
that completely changes the formatting of the next paragraph. It can
also set the equivalent of mutable variables and things like that, which
could also affect the next paragraph.

I've seen some large TeX documents like manuals organized into multiple
files, e.g. one file per chapter. I guess those could be done in
parallel. There is some kind of two-pass formatting system in LaTeX
where the first pass figures out the page numbers of cross-references so
the second pass can fill them in. I guess that can also be used to deal
with page-numbering the second chapter. I'm not sure exactly how it is
done. I could imagine some cases where there is a chance of getting a
wrong page number, so for the final production copy, you might have to
format serially to be sure.

Metafont of course should parallelize nicely, first in computing the
shapes of all the letters, then in rasterizing them.

Anton Ertl

unread,
Sep 5, 2015, 12:06:03 PM9/5/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>
>> Bernd Paysan <bernd....@gmx.de> writes:
>>>Paul Rubin wrote:
>>>> In the uncontended case you check a futex in user space and proceed if
>>>> there's something to do, or yield to the kernel if there's nothing to
>>>> do. There's only IPC overhead if there's contention in which case you
>>>> have to make a system call.
>>>
>>>Unfortunately, the contention is the rule, and the uncontended case the
>>>exception. The problem of asynchronous tasks is that either the producer
>>>or the consumer waits, and that's the contended case.
>>
>> No, the contended case is when both want to change the status at the
>> same time.
>
>No, that's not a resource lock, that's a wakeup signal.

Only one thread is active, so there can be no contention. You are
thinking about something that is not contention.

>>> So either the producer
>>>has to tell the consumer "good morning, time to wake up, I've some new
>>>data for you", or the consumer says "good afternoon, producer, stop
>>>napping, and give me another chunk of data".
>>
>> Sounds to me like you have a GA144-like model with little or no
>> buffering. Then of course you will have little parallelism; after
>> all, the GA144 was designed to let many of the cores sleep most of the
>> time.
>
>When the consumers are faster than the producers, your buffers won't fill up.
>You can introduce arbitrary delays to keep your buffers filled, but this
>also reduces performance, as other tasks will be waiting much longer for the
>results.

If the consumers are faster than the producers, run the producers on
more cores than the consumers. If you have only one producer, and
that is sequential, and the consumers combined are still faster than
the producer, you have a parallelism of <2. You can still reduce the
wakeup overhead by only sending wakeup signals when the producer has
produced enough (and stored it in a buffer) that the wakeup overhead
is small compared to the processing time. Yes, this increases
latency, so it's a balance of efficiency vs. latency; you can bound
the increase in latency by waking the consumer some fixed time after
the producer has started producing.

>And yes, I've a model in mind, where more compute cores are sleeping than
>working; with a x64-style implementation, you would use SMT for that, not
>multiple cores.

If there are not enough active threads for all the cores, the only
reason to use SMT is to conserve energy or because communication is
cheaper (the shared cache is L1 instead of L3). But if you are
interested in performance, I expect that, running two threads on two
otherwise idle cores usually gives better performance than running
them in the same core, even if they communicate. Sure, there is the
PAUSE instruction for slowing a waiting thread down, but even with
that, I expect that situations where SMT performs better than two
cores are not very common.

>And buffering reduces performance, as it requires way more actual
>parallelism.

Buffering also enables parallelism. E.g., with a conventional screen
and with vertical synchronization to avoid tearing, double buffering
means that rendering has to wait for the vsync, while with triple
buffering there can be rendering all the time.

Anyway, yes, you were talking of having lots of little tasks, which
sounds like having lots of parallelism to me. Then you switch to a
hardly-parallel problem of one sequential producer and consumers that
need little CPU. Of course these different kinds of problems need
different solutions.

>> So don't use the OS in the normal case.
>
>Anton, that's the entire point of suggesting that the wake/sleep IPC should
>be done in hardware.

Well, Intel tried that kind of thing in the iAPX 432 and in the 80286
protected mode and the hardware operations (task gates) were slow. A
context switch is slow whether it is done in hardware or in software.

You may be dreaming of hardware that has, say, dozens of sets of
process/thread states per core, most of which are sleeping most of the
time, and that can put themselves to sleep cheaply, and be woken up
cheaply, but you have to make a really good case for that to get it
included in hardware. Is your model of lots of small tasks with
little parallelism overall really that relevant?

Anton Ertl

unread,
Sep 5, 2015, 12:10:20 PM9/5/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>
>>>Multiple threads in one process all see the same mapping. It works
>>>between different processes, because they have different page tables.
>>
>> Yes, so if you want different permissions for threads of one process,
>> you need to map the same thing two times (at different addresses),
>> once read-write and once read-only
>
>That would still give the other thread write-access to the read-write
>mapping, and you have to solve the relocation problem, too.

Yes.

>You really want
>the data to sit on the same addresses, so that all pointers work without
>trickery.
>
>I don't see a way to have different permissions for threads in one process
>cheaply (i.e. without having separate page tables for each thread) without
>hardware support.

You have the hardware support, it's used for processes. There have
been various discussions about providing things in OSs that are
between threads and processes; I don't know if different memory
permissions have been touched in these discussions.

Waldek Hebisch

unread,
Sep 5, 2015, 12:32:03 PM9/5/15
to
Marcel Hendrix <m...@iae.nl> wrote:
> Bernd Paysan <bernd....@gmx.de> writes Re: GA144 use cases
>
> > Anton Ertl wrote:
> >> And the same goes for paragraphs. OTOH, TeX is so fast on modern CPUs
> >> that one can really say that parallelizing it would not enhance it.
> >> Also, parallelizing it would require quite a bit of effort.
>
> The NGSPICE manual is 650 pages and takes more than two minutes to process.
> When maintaining the manual (small, infrequent edits) that is too long and
> quality suffers (one doesn't do a single edit, or one doesn't want to check
> the outcome).

That is an unusually long time. For example the TeX book has 535
pages and takes 0.727s to process on my machine. So a 650-page
plain TeX document should take less than a second. I do not
have such a large LaTeX document, but my 44-page document takes
on the order of 0.2 seconds, so I would expect about 3 seconds
for 650 pages of LaTeX. The main difference between TeX and LaTeX
seems to be that LaTeX defines several complex macros and
macro processing starts to dominate the runtime. Macro
processing in TeX depends on continuously redefining macros
and due to this is very hard to do in parallel.

One guess is that the long typesetting times for the NGSPICE manual
are due to graphics: there are popular packages which
abuse the TeX typesetting engine to simulate a graphics engine.
Typically one wants several pictures and each picture
has an a priori known bounding box, so in principle all
pictures could be processed in parallel (and in parallel
with the main document). But due to the way pictures
are presented, TeX does not know that this is not the usual
unparallelizable case.

You may also get slowdown due to file searches: modern
macro packages are split into a lot of files spread over
a lot of directories. If the number of files (and especially
directories) in question exceeds the size of kernel caches,
there is thrashing and a huge slowdown. In that case
parallel execution does not help at all. Bigger kernel
caches or better organization of the TeX distribution
should help.

--
Waldek Hebisch
heb...@antispam.uni.wroc.pl

rickman

unread,
Sep 5, 2015, 1:00:16 PM9/5/15
to
I had a compiler problem like this. The jump was a variable-size
instruction based on the address offset. The offset value could change
the length of the instruction. The instruction length change could
change the value of the offset again in a non-resolvable loop. In the
end I had to leave the instruction format in the long form with an
offset that would fit in the short form.

--

Rick

m...@iae.nl

unread,
Sep 5, 2015, 1:52:52 PM9/5/15
to
On Saturday, September 5, 2015 at 6:32:03 PM UTC+2, Waldek Hebisch wrote:
> Marcel Hendrix <m...@iae.nl> wrote:
[..]
> > The NGSPICE manual is 650 pages and takes more than two minutes to process.
> > When maintaining the manual (small, infrequent edits) that is too long and
> > quality suffers (one doesn't do a single edit, or one doesn't want to check
> > the outcome).
>
> That is an unusually long time. For example the TeX book has 535
> pages and takes 0.727s to process on my machine. So a 650-page
> plain TeX document should take less than a second. I do not
> have such a large LaTeX document, but my 44-page document takes
> on the order of 0.2 seconds, so I would expect about 3 seconds
> for 650 pages of LaTeX. The main difference between TeX and LaTeX
> seems to be that LaTeX defines several complex macros and
> macro processing starts to dominate the runtime. Macro
> processing in TeX depends on continuously redefining macros
> and due to this is very hard to do in parallel.
[..]

Well, actually I use LyX, and it is not two minutes
but 84 seconds on my noise-tuned PC (it's probably 22 seconds
in hi-perf mode). The time is for LyX start-up and one rebuild.
I didn't change anything in the text, so maybe there is caching
going on. There are maybe 8 pictures in the manual and of course
some formulas.
( http://sourceforge.net/p/ngspice/ngspice-manuals/ci/master/tree/ )

FORTH> TIMER-RESET cr s" manual.lyx" SYSTEM .ELAPSED
84.353 seconds elapsed. ok

I don't know LaTeX / LyX and what/how it is doing, but
0.727s for 535 pages ... Where did you buy that
workstation :-)

-marcel

m...@iae.nl

unread,
Sep 5, 2015, 1:56:17 PM9/5/15
to
Because of the variable-length transputer instructions, our
metacompiler for tForth had that problem very frequently.
I see it in iForth too, but generous alignment can take care
of it -- I am only interested in speed, not in size.

-marcel

Bernd Paysan

unread,
Sep 5, 2015, 3:05:30 PM9/5/15
to
Anton Ertl wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>>That would still give the other thread write-access to the read-write
>>mapping, and you have to solve the relocation problem, too.
>
> Yes.

In a 386 system, you might be able to do that with segments... but nobody
likes segments ;-).

>>You really want
>>the data to sit on the same addresses, so that all pointers work without
>>trickery.
>>
>>I don't see a way to have different permissions for threads in one process
>>cheaply (i.e. without having separate page tables for each thread) without
>>hardware support.
>
> You have the hardware support, it's used for processes.

No, you have separate page tables for different processes, and you swap page
tables on context switch. That's a software solution. The main reason for
threads is to not have the same overhead as for processes.

You could share the read-only page tables between all threads that are not
allowed to write, but the context switch still would be an expensive
inter-process switch with a TLB flush.

> There have
> been various discussions about providing things in OSs that are
> between threads and processes; I don't know if different memory
> permissions have been touched in these discussions.

I hope so. One reason why the existing sandbox in Linux is next to useless
is that it needs a helper thread which runs the checks, and there, only the
register file is private data. That doesn't work very well.

For that purpose, using one of the unused rings could help - the hardware
has support for separate permissions in different rings. Therefore, the
kernel-based sandbox (with code compiled by the Berkeley packet filter
stuff) is better, but the awful part here is that now the supervisor part
executes code in kernel space. Any red pill code not only can break out of
the sandbox, but then owns the kernel.

Bernd Paysan

unread,
Sep 5, 2015, 4:04:15 PM9/5/15
to
Anton Ertl wrote:
>>No, that's not a resource lock, that's a wakeup signal.
>
> Only one thread is active, so there can be no contention. You are
> thinking about something that is not contention.

"Contention" here is when someone has to wait. The producer needs space in
the queue - if the queue is full, it's a contention for the producer
("congestion", write access is blocked). The consumer needs data in the
queue, if the queue is empty, it's a contention for the consumer, read
access is blocked.

It's not the same sort of contention you have with equal processes, it's
"pipeline contention".

> If the consumers are faster than the producers, run the producers on
> more cores than the consumers. If you have only one producer, and
> that is sequential, and the consumers combined are still faster than
> the producer, you have a parallelism of <2.

I don't quite see the problem of having a parallelism of, say, 1.8, and
being able to use it. When the wake/sleep operation costs me as much as
processing one block, I can't use it. The overall number of available cores
today is a small integer, so I can't have 10 producers and 9 consumers
running in parallel.

> You can still reduce the
> wakeup overhead by only sending wakeup signals when the producer has
> produced enough (and stored it in a buffer) that the wakeup overhead
> is small compared to the processing time. Yes, this increases
> latency, so it's a balance of efficiency vs. latency; you can bound
> the increase in latency by waking the consumer some fixed time after
> the producer has started producing.

net2o's timing-based flow control breaks with that approach. You have to
process each packet as soon as possible, and finally, when a packet arrived
unencrypted in its destination, you take the time of arrival.

There are cases where you can do that, though.

>>And yes, I've a model in mind, where more compute cores are sleeping than
>>working; with a x64-style implementation, you would use SMT for that, not
>>multiple cores.
>
> If there are not enough active threads for all the cores, the only
> reason to use SMT is to conserve energy or because communication is
> cheaper (the shared cache is L1 instead of L3). But if you are
> interested in performance, I expect that, running two threads on two
> otherwise idle cores usually gives better performance than running
> them in the same core, even if they communicate. Sure, there is the
> PAUSE instruction for slowing a waiting thread down, but even with
> that, i expect that situation where SMT performs better than two cores
> are not very common.

My situation is different: I have running and sleeping processes, and I want
to wake/sleep them quickly. Using different cores and having them all
spin-loop is consuming energy and a waste of resources, because the threads
are waiting. So I put them all as "semi-active" on an SMT core, where the
sleeping threads only consume registers in the register file, but can start
running within a nanosecond.

SMT with more than one active thread per core is also faster with your usual
pipeline bubbles like mispredicted branches and waits for cache accesses (L2
and beyond), where one thread has all resources for itself.

For real-time tasks, just being able to run at all, even at a slower pace,
is better than having to wait for the next time slice.

>>And buffering reduces performance, as it requires way more actual
>>parallelism.
>
> Buffering also enables parallelism. E.g., with a conventional screen
> and with vertical synchronization to avoid tearing, double buffering
> means that rendering has to wait for the vsync, while with triple
> buffering there can be rendering all the time.

But the right solution is to go away from the fixed refresh, and display the
buffers when they are ready. So the rendering only has to wait when the
render time is so short that it exceeds the bandwidth (e.g. 140fps), and not
when it's just faster than the minimum acceptable rate (e.g. 30fps).

> Anyway, yes, you were talking of having lots of little tasks, which
> sounds like having lots of parallelism to me. Then you switch to a
> hardly-parallel problem of one sequential producer and consumers that
> need little CPU. Of course these different kinds of problems need
> different solutions.

The little tasks are usually connected in such a way. There may be overall
way more than just two tasks, but the typical relation between these tasks
is often producer/consumer, and they aren't well balanced.

>>> So don't use the OS in the normal case.
>>
>>Anton, that's the entire point of suggesting that the wake/sleep IPC
>>should be done in hardware.
>
> Well, Intel tried that kind of thing in the iAPX 432 and in the 80286
> protected mode and the hardware operations (task gates) were slow. A
> context switch is slow whether it is done in hardware or in software.

That's why I say "use the SMT capabilities". That's how to do that cheap.

BTW: Just moving the task switch from software into microcode doesn't make
it "hardware". Neither the iAPX 432 nor the 286 PM had any hardware
capabilities for task switching; they both just had complicated microcode.
AFAIK, a Forth PAUSE with pusha/mov sp,[up+next]/popa was about two orders
of magnitude faster than a task gate call.

> You may be dreaming of hardware that has, say, dozens of sets of
> process/thread states per core, most of which are sleeping most of the
> time, and that can put themselves to sleep cheaply, and be woken up
> cheaply, but you have to make a really good case for that to get it
> included in hardware. Is your model of lots of small tasks with
> little parallelism overall really that relevant?

That's your model, not mine. Let's repeat it:

* The tasks are small, so that expensive wake/sleep operations don't work,
and moving the tasks from one core to another is also not a good idea.
Hot tasks, whose code is already in the L1 instruction cache and whose data
is in the L1 data cache, can easily be several times faster than "cold"
tasks. There's not enough work for one specific task to run all the time,
because the tasks are diverse.

* The tasks are related to each other in a consumer-producer relation,
sometimes with DAG-like structures, i.e. a task may combine the output of
two producers, or feed two consumers.

* There are enough of those tasks to keep all cores in current CPUs active
(and maybe many more), but when used with the current high-overhead IPC,
there's no benefit.

Take Marcel's ngspice as an example: You really can compute the "next
voltage/current" output for all active nodes at the same time (and that's
often only a part of the entire circuit, not all devices have fast input
voltage or current swings). The parallelism is there. Nodes depending on
fast-changing voltages need to recalculate their output more often when
triggered, but those nodes which stay mostly the same don't need to
calculate often (then the linear solver is sufficient). The number of
different device models can easily be 20 or more, with
device-parameter-dependent paths (CMOS, NMOS, different gate thicknesses,
Zener, Schottky and "normal" diodes, NPN, PNP bipolars, several types of
capacitors, including parasitics, and resistors, including parasitics).

AMD's GCN GPUs have to some extent what I'm asking for: you can run several
asynchronous tasks, and feed data from one to the next, context switching is
cheap.

http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading

It does considerably improve performance. This is a rather new approach for
GPU-style parallelism; it's not just 1000s of cores, each doing the same
thing on different data. The number of different contexts active at the
same time might be a bit on the low side now, but that stuff is just new.

Bernd Paysan

unread,
Sep 5, 2015, 4:11:14 PM9/5/15
to
m...@iae.nl wrote:

> I don't know LaTeX / LyX and what/how it is doing, but
> 0.727s for 535 pages ... Where did you buy that
> workstation :-)

Sounds a bit fast. Gforth's texi manual, just one run, takes 1,329s user
time here with texi2dvi (3GHz Core i7). That's almost 300 pages.

Waldek Hebisch

unread,
Sep 5, 2015, 7:42:47 PM9/5/15
to
The timings are on a 2.4 GHz Core 2. This is a quad-core machine.
AFAIK TeX uses only one core, but there is some possibility
for parallel execution of TeX and the OS. Note that the 535
pages are plain TeX. IME LaTeX is usually slower, frequently
3 to 4 times slower (I mean a single run of LaTeX).
Apparently LyX introduces substantial additional overhead.

--
Waldek Hebisch
heb...@math.uni.wroc.pl

Paul Rubin

unread,
Sep 5, 2015, 8:25:44 PM9/5/15
to
Waldek Hebisch <heb...@math.uni.wroc.pl> writes:
> Note that the 535 pages are plain TeX. IME LaTeX is usually slower,
> frequently 3 to 4 times slower

Hmm, that might also explain the gforth manual (and the emacs manual
which I'm more used to) formatting comparatively slowly. They're both
in Texinfo, which is a quite complicated TeX macro package that makes
TeX look like a predecessor of Scribe (another markup-based formatter)
with roots in MIT history. So Texinfo may also slow things down
compared to plain TeX.

Formatting big documents in TeX used to be an inconveniently slow
operation like a big compilation, but on today's hardware TeX's speed is
usually not a significant problem.

Anton Ertl

unread,
Sep 6, 2015, 10:27:16 AM9/6/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>m...@iae.nl wrote:
>
>> I don't know LaTeX / LyX and what/how it is doing, but
>> 0.727s for 535 pages ... Where did you buy that
>> workstation :-)
>
>Sounds a bit fast. Gforth's texi manual, just one run, takes 1,329s user
>time here with texi2dvi (3GHz Core i7). That's almost 300 pages.

Hmm, it's

1.181u 0.354s 1.430r 107.35%

on my 1.9GHz Core i3-3227U laptop.

On our shiny new Core i7-4790K box (not overclocked), it's

0.868u 0.128s 0.989r 100.00%

I am a bit disappointed at the low speedup from the 4790K, which
should be at least twice as fast. Maybe different software versions.

Anyway, plenty fast. makeinfo is a bigger problem than TeX.

Anton Ertl

unread,
Sep 6, 2015, 10:36:38 AM9/6/15
to
rickman <gnu...@gmail.com> writes:
>On 9/5/2015 12:05 PM, Paul Rubin wrote:
>> I've seen some large TeX documents like manuals organized into multiple
>> files, e.g. one file per chapter. I guess those could be done in
>> parallel. There is some kind of two-pass formatting system in LaTeX
>> where the first pass figures out the page numbers of cross-references so
>> the second pass can fill them in. I guess that can also be used to deal
>> with page-numbering the second chapter. I'm not sure exactly how it is
>> done. I could imagine some cases where there is a chance of getting a
>> wrong page number, so for the final production copy, you might have to
>> format serially to be sure.

Serial formatting does not help, because for forward references you
would have to look into the future. If the references don't resolve
as predicted, you get a warning and you just rerun LaTeX. Normally
(and I have never seen any deviation from this) it is fixed after the
second run. If it is not resolved after several runs, the solution is
to change the source such that this does not happen.

>I had a compiler problem like this. The jump was a variable-size
>instruction based on the address offset. The offset value could change
>the length of the instruction. The instruction length change could
>change the value of the offset again in a non-resolvable loop. In the
>end I had to leave the instruction format in the long form with an
>offset that would fit in the short form.

That's a classic problem. The usual solution is to start with all
offsets short, do a pass, and remember those that don't fit; then repeat
with long offsets for those that did not fit. Repeat until
everything fits in the space allocated to it. One can construct
funny problems where this is suboptimal, but in the usual case, this
is also optimal.
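This fixed-point iteration (usually called branch relaxation) can be sketched in a few lines. The encoding below (a 2-byte short jump reaching -128..+127 bytes, a 5-byte long jump) is a hypothetical example, not any particular instruction set:

```python
# Iterative branch relaxation: start with all jumps short, widen any
# jump whose offset doesn't fit, and repeat until a fixed point is
# reached (widening one jump can push another out of range).
SHORT, LONG = 2, 5   # hypothetical sizes: short jump 2 bytes, long 5

def relax(jumps, sizes):
    """jumps: {instruction index: target instruction index}.
    sizes: byte size of each instruction (ignored for jump instrs)."""
    widths = {i: SHORT for i in jumps}           # optimistic start
    changed = True
    while changed:
        changed = False
        # compute instruction addresses under the current widths
        addr, a = [], 0
        for i, s in enumerate(sizes):
            addr.append(a)
            a += widths.get(i, s)
        for i, t in jumps.items():
            off = addr[t] - (addr[i] + widths[i])  # PC-relative offset
            if widths[i] == SHORT and not -128 <= off <= 127:
                widths[i] = LONG                 # widen; may ripple
                changed = True
    return widths

# a jump over 130 one-byte instructions must take the long form...
print(relax({0: 131}, [2] + [1] * 130 + [1]))   # {0: 5}
# ...while a jump over 50 fits in the short form
print(relax({0: 51}, [2] + [1] * 50 + [1]))     # {0: 2}
```

Because widths only ever grow, the loop terminates; the "funny problems" mentioned above are cases where this monotone approach widens a jump that a cleverer global assignment could have kept short.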

Anton Ertl

unread,
Sep 6, 2015, 10:49:35 AM9/6/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>>>I don't see a way to have different permissions for threads in one process
>>>cheaply (i.e. without having separate page tables for each thread) without
>>>hardware support.
>>
>> You have the hardware support, it's used for processes.
>
>No, you have separate page tables for different processes, and you swap page
>tables on context switch. That's a software solution.

Well, switching between threads with only one hardware thread running
is also a software solution.

The corresponding hardware solution is to have two cores running the
two threads or processes, or two SMT threads on a core, and for that
it does not matter whether the OS sees the two hardware threads as
processes or as threads.

>The main reason for
>threads is to not have the same overhead as for processes.

I think that the main reason for threads is to have less work when you
want to share things, in particular memory. No need to create a
shared-memory object or a file, and map it into the address space,
etc.

>You could share the read-only page tables between all threads that are not
>allowed to read, but the context switch still would be an expensive inter-
>process switch with a TLB flush.

Other architectures have an address-space ID (ASID) to avoid flushing
the TLB on a context switch, but I guess Intel/AMD found efficient
ways to work without that; or have they? The introduction of AMD64
would have been a good time to add ASIDs, but they didn't.

Anton Ertl

unread,
Sep 6, 2015, 11:41:52 AM9/6/15
to
Bernd Paysan <bernd....@gmx.de> writes:
>Anton Ertl wrote:
>>>No, that's not a resource lock, that's a wakeup signal.
>>
>> Only one thread is active, so there can be no contention. You are
>> thinking about something that is not contention.
>
>"Contention" here is when someone has to wait.

It does not help if you use established terms with a different
meaning. When someone talks about locks (and mutexes (including
futexes) are kinds of locks), contention has a specific meaning, and
when only one thread is accessing the lock, there is no contention.

>It's not the same sort of contention you have with equal processes, it's
>"pipeline contention".

Googling for "pipeline contention" did not give me appropriate hits,
so it seems that is not an established term.

>> If the consumers are faster than the producers, run the producers on
>> more cores than the consumers. If you have only one producer, and
>> that is sequential, and the consumers combined are still faster than
>> the producer, you have a parallelism of <2.
>
>I don't quite see the problem of having a parallelism of say 1.8, and be
>able to use it. When the wake/sleep operation costs me as much as
>processing one block, I can't use it.

Make the blocks longer, and you can. Or, if you don't want to do
that, and you are certain about the 1.8 parallelism, use busy waiting
with PAUSE. It would be cool if you could slow the clock for the core
with the faster thread, but I think that's not possible (having only
one thread on the core and that using PAUSE might be a sign to the
hardware to slow the clock down, but I doubt that this is done yet).

>The overall number of available cores
>today is a small integer, so I can't have 10 producers and 9 consumers
>running in parallel.

On some Xeons you can, but anyway, you get different problems when you
want to exploit a parallelism of 1.8 on a four-core CPU, or a
parallelism of 19.

>> You can still reduce the
>> wakeup overhead by only sending wakeup signals when the producer has
>> produced enough (and stored it in a buffer) that the wakeup overhead
>> is small compared to the processing time. Yes, this increases
>> latency, so it's a balance of efficiency vs. latency; you can bound
>> the increase in latency by waking the consumer some fixed time after
>> the producer has started producing.
>
>net2o's timing-based flow control breaks with that approach. You have to
>process each packet as soon as possible, and finally, when a packet arrived
>unencrypted in its destination, you take the time of arrival.

So when your machine is busy and may context-switch your net2o threads
out in favor of something else, your flow-control approach breaks.
Hmm. Maybe it needs some more work to be less brittle.

>>>And yes, I've a model in mind, where more compute cores are sleeping than
>>>working; with a x64-style implementation, you would use SMT for that, not
>>>multiple cores.
>>
>> If there are not enough active threads for all the cores, the only
>> reason to use SMT is to conserve energy or because communication is
>> cheaper (the shared cache is L1 instead of L3). But if you are
>> interested in performance, I expect that, running two threads on two
>> otherwise idle cores usually gives better performance than running
>> them in the same core, even if they communicate. Sure, there is the
>> PAUSE instruction for slowing a waiting thread down, but even with
>> that, i expect that situation where SMT performs better than two cores
>> are not very common.
>
>My situation is different: I've running and sleeping processes, and I want
>to wake/sleep them quickly. Using different cores and having them all spin-
>loop is consuming energy and a waste of resources, because the threads are
>waiting.

Yes, but it's what is available.

>So I put them all as "semi-active" on a SMT core, where the
>sleeping threads only consume registers in the register file, but can start
>running within a nanosecond.

Sure, but if you have such sleep/wakeup instructions, there is no
need to restrict this to SMT. An otherwise idle core can also sleep
with the thread state in the register file etc, and then wake up in a
nanosecond.

>>>And buffering reduces performance, as it requires way more actual
>>>parallelism.
>>
>> Buffering also enables parallelism. E.g., with a conventional screen
>> and with vertical synchronization to avoid tearing, double buffering
>> means that rendering has to wait for the vsync, while with triple
>> buffering there can be rendering all the time.
>
>But the right solution is to go away from the fixed refresh, and display the
>buffers when they are ready.

If you have a screen that supports that, fine. Most screens don't,
yet. Anyway, this is an example of how having more buffers helps
parallelism. I can also give others: E.g., CPUs have write buffers,
OSs have write buffers, hard disk drives have write buffers, ... They
also have read buffers, and data from disk is stored in buffers in RAM
instead of being caught from the wire by the CPU.

>BTW: Just moving the task switch from software into microcode doesn't make
>it "hardware". Neither iAPX 432 or 286 PM had any hardware capabilities for
>task switching, they both just had complicated microcode. AFAIK, a Forth
>PAUSE with pusha/mov sp,[up+next]/popa was about two orders of magnitude
>faster than a task gate call.

It also does much less. But it seems that, what the task gate does
more is not actually worth the price (although there are probably
still fans of advanced security concepts that disagree).

>> You may be dreaming of hardware that has, say, dozens of sets of
>> process/thread states per core, most of which are sleeping most of the
>> time, and that can put themselves to sleep cheaply, and be woken up
>> cheaply, but you have to make a really good case for that to get it
>> included in hardware. Is your model of lots of small tasks with
>> little parallelism overall really that relevant?
>
>That's your model, not mine. Let's repeat it:
>
>* The tasks are small, so that expensive wake/sleep operations don't work,
>and moving the tasks from one core to the other is also not a good idea.
>Hot tasks which are already in the L1 code cache and data in L1 data cache
>can easily be several times faster than "cold tasks". There's not enough
>work for one specific task to run all the time, because the tasks are
>diverse.
>
>* The tasks are related to each other in a consumer-producer relation,
>sometimes with DAG-like structures, i.e. a task may combine the output of
>two producers, or feed two consumers.
>
>* There are enough of those tasks to keep all cores in current CPUs active
>(and maybe many more), but when used with the current high-overhead IPC,
>there's no benefit.

But for that problem, you don't need much OS-level stuff. If you can
keep all the cores (and hardware threads) busy, the solution is to
create one OS-level thread per hardware thread and have a user-level
scheduler within each thread that schedules the tasks and coordinates
with the other schedulers (using user-level means) to balance the load
etc.
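A minimal sketch of such a user-level scheduler, using Python generators as tasks; the structure is illustrative only, and a real implementation would pin one such loop per hardware thread and add cross-scheduler load balancing:

```python
from collections import deque

# User-level (cooperative) scheduler: many small tasks multiplexed onto
# one OS thread, so "wake" and "sleep" are just queue operations with no
# kernel involvement.

class Scheduler:
    def __init__(self):
        self.ready = deque()
    def spawn(self, gen):
        self.ready.append(gen)
    def run(self):
        while self.ready:
            task = self.ready.popleft()
            try:
                next(task)               # run until the task yields
                self.ready.append(task)  # still runnable: requeue
            except StopIteration:
                pass                     # task finished

log = []

def worker(name, steps):
    for i in range(steps):
        log.append((name, i))
        yield                            # cooperative "sleep"

s = Scheduler()
s.spawn(worker("a", 2))
s.spawn(worker("b", 2))
s.run()
print(log)   # [('a', 0), ('b', 0), ('a', 1), ('b', 1)]
```

Wake/sleep here costs a deque operation, which is the point: the expensive part only appears once a scheduler runs out of ready tasks and has to involve the OS.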

I don't see a need for special hardware for that problem. It's only
when you cannot keep everything busy that OS-level stuff (or busy
waiting) comes into play.

Bernd Paysan

unread,
Sep 6, 2015, 4:51:02 PM9/6/15
to
Anton Ertl wrote:

> Bernd Paysan <bernd....@gmx.de> writes:
>>Anton Ertl wrote:
>>>>No, that's not a resource lock, that's a wakeup signal.
>>>
>>> Only one thread is active, so there can be no contention. You are
>>> thinking about something that is not contention.
>>
>>"Contention" here is when someone has to wait.
>
> It does not help if you use established terms with a different
> meaning. When someone talks about locks (and mutexes (including
> futexes) are kinds of locks), contention has a specific meaning, and
> when only one thread is accessing the lock, there is no contention.

Well, I don't remember any specific term for this. The discussion in
computer science is always about resource sharing, and sleeping apparently
isn't even under consideration. Probably because power management wasn't an
issue 40 years ago, when the terms were invented.

But now, you have to run to finish and then go to sleep, because power
budget is a problem. And we have multiple cores and need parallelism.

>>It's not the same sort of contention you have with equal processes, it's
>>"pipeline contention".
>
> Googling for "pipeline contention" did not give me appropriate hits,
> so it seems that is not an established term.

The reason why I use "contention" is that it fits the way you think about
contention with a lock: if there is contention, one task has to go to
sleep. And a pipeline is a shared resource, with one writer and one reader.
The difference from a lock lies in the contention case, because the two
ends access the pipeline differently.

Due to the different situations, pipelines can either jam or stall.

If you don't like the term "contention", give me a reference to the current
computer science term. If there is none, write a paper to create one.

>>I don't quite see the problem of having a parallelism of say 1.8, and be
>>able to use it. When the wake/sleep operation costs me as much as
>>processing one block, I can't use it.
>
> Make the blocks longer, and you can.

In net2o, the block size depends on the size of Internet packets; there's a
fixed limit. Actually, I think this suggestion is silly, because the
smaller the blocks we can handle efficiently, the higher the potential
parallelism; this is just a consequence of Amdahl's law. And the
question is whether the high overhead we have now is really necessary, and
what can be done about it.

The discussion is not about how more parallelism can be achieved in
different situations, because these different situations already do well.

> Or, if you don't want to do
> that, and you are certain about the 1.8 paralellism, use busy waiting
> with PAUSE. It would be cool if you could slow the clock for the core
> with the faster thread, but I think that's not possible (having only
> one thread on the core and that using PAUSE might be a sign to the
> hardware to slow the clock down, but I doubt that this is done yet).

Unfortunately, you don't know in advance which path is actually used how
often, so full busy-waiting is out of the question. Busy-waiting during
activity is possible, but so far, I haven't found a robust way to switch
over between busy and non-busy waiting. The Linux kernel futex() call,
which wakes up a thread when you write a memory location from another
thread, could do that.
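A hedged sketch of that switch-over: spin for a bounded number of iterations while the event may arrive soon, then fall back to a blocking wait. `threading.Event` stands in for the futex fast path here, and the spin limit and names are arbitrary illustrative choices:

```python
import threading

flag = threading.Event()
data = []

def wait_hybrid(spin_limit=1000):
    # Fast path: busy-wait briefly, hoping the producer fires soon.
    for _ in range(spin_limit):
        if flag.is_set():
            return "spun"
    # Slow path: block (the role futex(FUTEX_WAIT) plays on Linux).
    flag.wait()
    return "blocked"

def producer():
    data.append(42)      # publish the data first...
    flag.set()           # ...then "futex wake" the waiter

t = threading.Thread(target=producer)
t.start()
how = wait_hybrid()      # either "spun" or "blocked", timing-dependent
t.join()
print(data[0])           # 42
```

The robustness problem mentioned above shows up in choosing `spin_limit`: too small and every wait pays the blocking cost, too large and the spinner burns the power budget the sleep was supposed to save.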

> So when your machine is busy and may context-switch your net2o threads
> out in favor of something else, your control-flow approach breaks.
> Hmm. Maybe it needs some more work to be less brittle.

No, this works perfectly. When net2o gets kicked out, it detects the
slowdown and reports that to the other side, which adjusts its sender rate.

If I change the internal operation of net2o, even though it's active all the
time, a falsely reported "I'm slow", just because slow IPC forces me to fill
up an internal pipeline before I start running decryption, is really not
what I want. I would have more processing power, and could actually process
packets faster!

>>My situation is different: I've running and sleeping processes, and I want
>>to wake/sleep them quickly. Using different cores and having them all
>>spin-loop is consuming energy and a waste of resources, because the
>>threads are waiting.
>
> Yes, but it's what is available.

And I'm not happy with what's available and try to think how to improve
that.

>>So I put them all as "semi-active" on a SMT core, where the
>>sleeping threads only consume registers in the register file, but can
>>start running within a nanosecond.
>
> Sure, but if you have such such sleep/wakeup instructions, there is no
> need to restrict this to SMT. An otherwise idle core can also sleep
> with the thread state in the register file etc, and then wake up in a
> nanosecond.

The point with SMT is that you can have more than one sleeping thread per
core. In a non-SMT situation, you can only have one sleeping thread per
core, and completely sleeping cores are often not startable in a nanosecond.
That's because sleeping cores have reduced voltages. Running SMT cores have
full voltage, and therefore can go full speed from zero in a nanosecond.

> But for that problem, you don't need much OS-level stuff. If you can
> keep all the cores (and hardware threads) busy, the solution is to
> create one OS-level thread per hardware thread and have a user-level
> schedulers within each thread that schedules the tasks and coordinates
> with the other schedulers (using user-level means) to balance the load
> etc.

Works fine when you can run the thread all the time, but this is completely
out of the question with the power budget problems we have today...

If you don't have an OS at all, there's already support for sleep/wake IPC
for hardware threads (by monitoring memory in the core during sleep, and
writing to that memory region for wake), so with my own OS, I could do
something.

> I don't see a need for special hardware for that problem. It's only
> when you cannot keep everything busy that OS-level stuff (or busy
> waiting) comes into play.

Indeed, but that's a very typical situation.

Anton Ertl

unread,
Sep 7, 2015, 4:05:14 AM9/7/15
to
an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Bernd Paysan <bernd....@gmx.de> writes:
>>m...@iae.nl wrote:
>>
>>> I don't know LaTeX / LyX and what/how it is doing, but
>>> 0.727s for 535 pages ... Where did you buy that
>>> workstation :-)
>>
>>Sounds a bit fast. Gforth's texi manual, just one run, takes 1.329s user
>>time here with texi2dvi (3GHz Core i7). That's almost 300 pages.
>
>Hmm, it's
>
>1.181u 0.354s 1.430r 107.35%
>
>on my 1.9GHz Core i3-3227U laptop.
>
>On our shiny new Core i7-4790K box (not overclocked), it's
>
>0.868u 0.128s 0.989r 100.00%

That's wrong, these numbers probably came from a 2.8GHz Xeon X3460
(Lynnfield/Nehalem).

The Core i7-4790K takes:

real 0m0.479s
user 0m0.404s
sys 0m0.004s

>Anyway, plenty fast. makeinfo is a bigger problem than TeX.

Still valid:-)

Stefan Mauerhofer

unread,
Sep 7, 2015, 6:18:47 AM9/7/15
to
I suggest we return to our topic, the GA144. (Please open another thread for discussing the speed of TeX interpreters).

The GA144 has some unique features that distinguish it from other devices. First, it is asynchronous: it has no clock or any sort of time reference. This makes it a little more difficult to use the chip in applications that require a time base, like communications.

However, a couple of nodes can start and drive a crystal oscillator. They can even stop the crystal from oscillating when it is not needed. This makes the chip uniquely suitable for applications where low power is a must. E.g. imagine a data acquisition station where the data is transmitted only once a day: the communication crystal needs to run for only a few minutes per day.

I don't know any other device that can stop and restart a quartz crystal.

Raimond Dragomir

unread,
Sep 7, 2015, 6:31:37 AM9/7/15
to
> However a couple of nodes can start and drive a crystal oscillator. It can even stop the crystal from oscillating when not needed. This makes it uniquely suitable for application where low power is a must. E.g. imagine a data acquisition station where the data is transmitted only once a day. The communication crystal needs to run only for a few minutes during a day.
>
> I don't know any other device that can stop and restart a crystal quartz.

Many Cortex-M devices claim sub-microampere sleep power budgets these days, including an RTC for precise wake times. I don't know if they stop the crystals or not, and I don't care. Give me power figures so I can compare the things.

forther

unread,
Sep 7, 2015, 2:21:11 PM9/7/15
to
On Monday, August 31, 2015 at 7:08:37 PM UTC-7, Dennis Ruffer wrote:
> On Monday, August 31, 2015 at 12:20:21 AM UTC-7, dunno wrote:
> > Can anyone provide real-world examples?
>
> IntellaSys had in mind hearing aids and wireless speakers. I've also heard of video in the automotive industry (forget which company at the moment).
>
> Encryption and signal processing are obvious candidates.
>
> E.g. anything streaming, but I'm not sure it ever was faster than other technologies.
>
> DaR

They actually made a working one: https://docs.google.com/presentation/d/15eouEljeZ7Wp0qUBk-FpPR2DLnqUAavRTY80EcleUmA/edit?usp=sharing

forther

unread,
Sep 7, 2015, 2:26:51 PM9/7/15
to
On Wednesday, September 2, 2015 at 3:03:35 PM UTC-7, Mux wrote:
> How about audio synthesis? Given the sheer number of nodes you should be able to generate waveforms pretty easily. Each node / group of nodes could be configured as a simple / complex generator, maybe similar to FM or otherwise just simple square / sine / etc. waveforms.
>
> Given that you have on-chip DACs it could be a really nifty synth. Adding a software UART, some level-converters and an amp, you could have a pretty decent audio device :-)
>
> -Y

https://docs.google.com/presentation/d/1pgamdRQnmXLCIVCCPNvfiGBjXVrsa9WmyvRphCRtgaE/edit?usp=sharing

rickman

unread,
Sep 7, 2015, 3:06:58 PM9/7/15
to
Yes, crystals are stopped and started on any processor that is shooting
for really low power levels while sleeping. They either use a second
crystal (32.768 kHz) or an internal slow RC clock to wake up by. The
high speed crystal clock takes some milliseconds to restart, and a low
speed crystal clock takes a significant part of a second to restart and
stabilize. I doubt anything done with the GA144 will significantly
improve on this. In fact, they never got a general purpose high speed
crystal clock to work. I think Chuck punted and used a resonator. The
initial accuracy of the crystal stimulus is too critical for the GA144
tolerances.

Did anyone else explore the world of crystal oscillators with the GA144?

--

Rick

Stefan Mauerhofer

unread,
Sep 9, 2015, 12:41:06 AM9/9/15
to
Hi Rick

Thanks for the info about other low power devices. It seems quite obvious to shut down the high speed clock and run a low speed clock to monitor the wake-up condition.

Regarding your question: Yes, I did. I could start and keep running crystals up to 16 MHz without any external electronics or manipulating the voltage or temperature.

rickman

unread,
Sep 9, 2015, 2:54:21 AM9/9/15
to
Did you ever publish your code? Chuck was not able to start a 10 MHz
crystal oscillator due to the lack of adequate resolution in his timing
method, the software loop. How do you get yours going?

--

Rick

Stefan Mauerhofer

unread,
Sep 9, 2015, 6:02:03 AM9/9/15
to
Hi Rick

I am planning to publish the code this year; the problem is that currently I have other priorities.

You are right, the software loop resolution is not sufficient. By spreading the code and data over 4 nodes I could increase the average resolution of the software loop. One node holds 64 software loop counters, one calculates the in-between values with a kind of Bresenham algorithm and writes them to the data node, one controls the search algorithm, and the last node controls the I/O pin, senses the feedback from the crystal and reports the result back.

I am not happy with the result yet, because I had a 2 MHz crystal which would not run with my program, so I need to investigate further what is happening with the crystals. I was astonished that I could start and lock onto a 16 MHz crystal. I tried about 20 different crystals from 32 kHz to 18 MHz and could excite every crystal (even the 2 MHz) up to 16 MHz. Beyond 16 MHz the F18A node was too slow to create a stable frequency with a software loop.

- Stefan

rickman

unread,
Sep 9, 2015, 4:55:14 PM9/9/15
to
I'm not clear on how you obtain adequate resolution in the timing node.
I can see the possibility of using alternate pathways with different
instruction timing. Is that what you mean? I'm not sure how the
Bresenham algorithm might be used for this. Are you saying you dither ±
a loop count so that you ping the crystal at the right rate on average?
That could work, but how do you deal with the timing variations caused
by PVT (Process, Voltage and Temperature)? Start up times for a crystal
would be greatly extended if you had to sweep looking for the right rate.

I'm not an analog guy, so I can't explain in great detail how an
oscillator works. I have wondered what crystal parameters will matter
to a digital oscillator.

I see this has gotten far off topic for the group. Any interest in
discussing this in a group where others know more about such things? I
might start a discussion in sci.electronics.design.

--

Rick

Paul Rubin

unread,
Sep 9, 2015, 11:12:35 PM9/9/15
to
rickman <gnu...@gmail.com> writes:
> I see this has gotten far off topic for the group. Any interest in
> discussing this in a group where others know more about such things?

It's about a Forth chip, so I'd say it's on-topic.

rickman

unread,
Sep 9, 2015, 11:40:11 PM9/9/15
to
But it's not, really; stack chips are not automatically "Forth"
chips. The real point is that we were talking about digital oscillators,
which have nearly nothing to do with Forth, while there are other groups
where it would fit right in.

Let me say this, which at least applies to the GA144, which is as close
to a Forth chip as you can get. The whole reason we are discussing a
digital oscillator is that the philosophy of the GA144 is to provide
a minimum variety of capability which can be adapted to a variety of
needs. Virtually all embedded applications require timing information
(a clock) even if they don't need it to make the logic work. The GA144
has no provision to add a crystal oscillator, I can only assume because
it was felt this could be done with the available circuitry. But no one
at GA has bitten the bullet to demonstrate this capability. This is
true for many functions needed of MCUs used in conventional
applications. There is an assumption that each developer will be happy
to roll their own widget interface or function while more conventional
MCUs provide the needed logic or software out of the box.

It's as if the GA144 was birthed, then laid in a crib and abandoned by
the parents.

I'm curious about what the folks who developed a hearing aid application
did for a clock? I can't imagine using the ADC or DAC without a clock.

--

Rick

rickman

unread,
Sep 9, 2015, 11:43:25 PM9/9/15
to
On 9/9/2015 11:12 PM, Paul Rubin wrote:
I wrote a slightly long post continuing the conversation on the GA144
but Thunderbird lost it. Oh well...

--

Rick

rickman

unread,
Sep 10, 2015, 12:07:51 AM9/10/15
to
I guess not. Seems I misunderstood the prompts I got. Glad I didn't go
at it again...

--

Rick

Stefan Mauerhofer

unread,
Sep 10, 2015, 12:55:54 AM9/10/15
to
Since this post is about the GA144 I will post some (very short) code for this chip.

Yes Rick, in the GA144 we have no time base whatsoever. The (almost) only thing we have that goes in this direction is a counted loop. The smallest loop for an F18A is "for unext". The unext instruction is in slot 0 and its execution time is around 2 ns. If we call the counted loop with a parameter n on top of the stack, we get a delay of (n+1)*2 ns. The next longer counted loop uses (n+1), with a delay of (n+2)*2 ns.

So what can we do if we want a delay between these discrete points? We can alternate the loop values between n and n+1. If we feed the loop with an alternating sequence of values (n n+1 n n+1 ...), then we get an average delay of roughly (n + 1.5)*2 ns.

If we want a higher resolution, we must enlarge the (n n+1) pattern. I am using a pattern of size 64. I could send 63 n values and 1 n+1 value to the counted loop, which gives an average delay of (n + 1 + 1/64) * 2 ns. If I send 62 n and 2 n+1 values, the average delay is (n + 1 + 2/64) * 2 ns. We can go on until we reach 1 n and 63 n+1 values (an average delay of (n + 1 + 63/64) * 2 ns). This is the basic mechanism to increase the time resolution of a counted loop.

The Bresenham algorithm is used to distribute the n and n+1 values as evenly as possible.

Starting a crystal means pumping energy into it, converting electrical energy into mechanical energy. The energy is transferred most easily if we pump at the crystal's resonance frequency. If we don't hit the resonance frequency, the energy dissipates and the crystal will not oscillate.

Since we don't know the exact delay of a unext loop, which depends on voltage, temperature and chip fabrication, we must scan through a range of delay values until we hit the crystal's resonance frequency.

I hope this explains the method to start a 16 MHz crystal with a GA144.

rickman

unread,
Sep 10, 2015, 1:43:13 AM9/10/15
to
On 9/10/2015 12:55 AM, Stefan Mauerhofer wrote:
> Since this post is about the GA144 I will post some (very short) code for this chip.
>
> Yes Rick, in the GA144 we have no time base whatsoever. The (almost) only thing we have that goes into this direction is a counted loop. The smallest loop for a F18A is "for unext". The unext instruction is in slot 0 and its execution time is around 2 ns. If we call the counted loop with a parameter n on top of the stack we have an delay of (n+1)*2 ns. The next longer counted loop is (n+1) with a delay of (n+2)*2 ns.
>
> So what can we do if we want a delay between these discrete points? We could alternate the loop values between n and n+1. If we feed the loop with an alternate sequence of values n n+1 (n n+1 n n+1 ...), then we get a average delay that is roughly (n + 1.5)*2 ns.
>
> If we want a higher resolution then we must enlarge the (n n+1) pattern. I am using a pattern of size 64. I could send 63 n and 1 n+1 value to the counted loop. This gives me a delay of (n + 1 + (1/64)) * 2 ns delay. If I send 62 n and 2 n+1 then the average delay is (n + 1 + (2/64)) * 2 ns. we could go on until we reach 1 n and 63 n+1 values (average delay of (n + 1 + (63/64)) * 2 ns. This is the basic mechanism to increase the time resolution of a counted loop.
>
> The Bresenham algorithm is used to distribute the n and n+1 values as evenly as possible.

I don't recall exactly the Bresenham algorithm, but I'd be willing to
bet it ends up being the same as an NCO, numerically controlled
oscillator. The average frequency is the ratio of two integers, the
step size and the counter modulus. This is an often used algorithm for
generating a frequency with as much precision as required.


> Starting a crystal means to pump energy into the crystal, thus converting electrical energy into mechanical energy. The energy is most easily transmitted if we pump the energy with the crystal's resonance frequency. If we don't hit the resonance frequency then our energy will dissipate and the crystal will not oscillate.

I seem to recall the info I read at GA on their oscillator attempt said
it put out a ping, a minimum-width impulse, into the crystal. That is
likely more than enough for maintaining the energy in the crystal, but
to start it a square wave might work better, putting more energy in more
quickly. Which did you do at the start?


> Since we don't know the exact delay of a unext loop, which is dependent of the voltage, temperature and chip fabrication, we must scan through a range of delay values until we hit the crystal's resonance frequency.

Do you stimulate the crystal for a fixed amount of time at a given
frequency? Or do you "look" for the crystal oscillating in response to
your stimulus? I guess it is likely both, stimulate for a time and then
look. If no input back from the crystal, move on to the next frequency.
What range of timing values do you cover?


> I hope this explains the method to start a 16 MHz crystal with a GA144.

Yes, this is as expected. The interesting part will be to see how long
it takes to get a crystal started in the worst case, which is likely at
one temperature extreme or the other.

I know that many MCU makers don't specify their crystal oscillators very
well. I have had to ask manufacturers for more info, which in one case
found its way into the data sheet. The trick here will be to get well
specified operation for specified crystals. But then I am thinking like
a potential GA customer, and this is not being done by GA, but by you.

--

Rick

Stefan Mauerhofer

unread,
Sep 10, 2015, 2:49:43 AM9/10/15
to
> I don't recall exactly the Bresenham algorithm, but I'd be willing to
> bet it ends up being the same as an NCO, numerically controlled
> oscillator. The average frequency is the ratio of two integers, the
> step size and the counter modulus. This is an often used algorithm for
> generating a frequency with as much precision as required.

Yes, you are right. I am using this algorithm.

> I seem to recall the info I read at GA on their oscillator attempt put
> out a ping, a minimum width impulse into the crystal. That is likely
> more than enough for maintaining the energy in the crystal, but to start
> it a square wave might work better to put more energy in more quickly.
> Which did you do at the start?

I am using a square wave for starting the crystal.

> Do you stimulate the crystal for a fixed amount of time at a given
> frequency? Or do you "look" for the crystal oscillating in response to
> your stimulus? I guess it is likely both, stimulate for a time and then
> look. If no input back from the crystal, move on to the next frequency.
> What range of timing values do you cover?

The crystal is stimulated for a fixed number of square-wave cycles (to avoid measuring the elapsed time). Then the pin is switched to input to check whether a transition is coming back from the crystal. Initially I used far too long wires to connect the crystal to the EvalBoard, so a nearby power line could induce a 60 Hz signal, messing up the whole process.
If no transition is detected for a certain time, the next frequency is selected and the whole process starts again.


> Yes, this is as expected. The interesting part will be to see how long
> it takes to get a crystal started in the worst case which is likely one
> or the other temperature extreme.

The time to start the crystal depends on the scanning range and resolution. For 1 MHz crystals I don't need an increment of 1/64; 1/2 or 1/4 is sufficient for those.

> I know that many MCU makers don't specify their crystal oscillators very
> well. I have had to ask manufacturers for more info which in one case
> found it's way into the data sheet. The trick here will be to get well
> specified operation for specified crystals. ...

Yes, it is one thing to make a crystal work in the lab and another to provide safe operation of a crystal in a product sold by the thousands or millions. I am still not pleased with my code; especially the failure with the 2 MHz crystal worries me. I could start the 2 MHz xtal but could not keep it running. I get some feedback, but I think I did not stimulate the crystal enough to keep it going. I will improve my code not to halt on the first sign of resonance but to measure how many times the crystal oscillates by itself without pumping in any energy. Then I can find the frequency with the maximum stimulus response and hopefully get the 2 MHz crystal running with the improved algorithm.

Anyway, I am still having a lot of fun with the GA144, although I feel a bit claustrophobic programming a chip with so little memory per node :-)
On the other hand, it has some unique features which distinguish it from any other chip I know. And if I have problems with the unorthodox IDE I get help from GA; they have a hotline.

-Stefan

rickman

unread,
Sep 10, 2015, 2:43:16 PM9/10/15
to
> Yes, it is different to make a crystal work in the lab or to provide safe operations of a crystal in a product sold by the thousands or millions. I am still not pleased with my code, especially the failure with the 2 MHz crystal worries me. I could start the 2 MHz xtal but could not keep it running.. I get some feedback but I think I did not stimulate the crystal enough to keep it running. I will improve my code not to halt on the first sign of resonance but to measure how many times the crystal oscillates by itself without pumping in any energy. Then I can find the maximum stimulus response frequency and hopefully make run also the 2 MHz algorithm with the improved algorithm.
>
> Anyway, I am having still a lot of fun with the GA144, although I feel a bit claustrophobic programming a chip with so little memory per node :-)
> On the other hand it has some unique features which distinguishes it from any other chip I know. And if I have some problems with the unorthodox IDE I get help from GA. They have a hotline.

I had typed a reply with more comments and questions, but my computer
shut down and I lost it. I am tired of fighting this Lenovo piece of
crap. I'm going out to get a new one. Be back later...

--

Rick

rickman

unread,
Sep 10, 2015, 4:10:20 PM9/10/15
to
On 9/10/2015 2:49 AM, Stefan Mauerhofer wrote:
1/64 or 1/4 of what? That fraction of a step size (1 count of UNEXT),
achieved by dithering?


>> I know that many MCU makers don't specify their crystal oscillators
>> very well. I have had to ask manufacturers for more info which in
>> one case found it's way into the data sheet. The trick here will
>> be to get well specified operation for specified crystals. ...
>
> Yes, it is different to make a crystal work in the lab or to provide
> safe operations of a crystal in a product sold by the thousands or
> millions. I am still not pleased with my code, especially the failure
> with the 2 MHz crystal worries me. I could start the 2 MHz xtal but
> could not keep it running.. I get some feedback but I think I did not
> stimulate the crystal enough to keep it running. I will improve my
> code not to halt on the first sign of resonance but to measure how
> many times the crystal oscillates by itself without pumping in any
> energy. Then I can find the maximum stimulus response frequency and
> hopefully make run also the 2 MHz algorithm with the improved
> algorithm.

Ok, so a 1 MHz crystal works, as do several up to 16 MHz, but not the 2
MHz. Do you have detailed info on this crystal? If it starts but dies
off after switching from square wave to "ping" mode, maybe the ping
needs to be longer than the minimum on/off time, which is what I assume
you are using. A tiny impulse doesn't put much energy into the crystal,
and the 2 MHz part may have a higher ESR, which increases the damping
factor.

In addition to "working", I would like to understand how the digital
oscillator interacts with the crystal in a systematic way. There is a
lot of info on crystals and most of it is not terribly complete. I have
found one or two references that seem to cover the topic thoroughly. It
is not always easy to construct a crystal oscillator that works under a
range of conditions (usually PVT, but in this case also the crystal
specs).


> Anyway, I am having still a lot of fun with the GA144, although I
> feel a bit claustrophobic programming a chip with so little memory
> per node :-) On the other hand it has some unique features which
> distinguishes it from any other chip I know. And if I have some
> problems with the unorthodox IDE I get help from GA. They have a
> hotline.

I have three chips sitting here that I don't know what to do with. It
seems everything about this part is done halfway. Even the low cost
mounting option, the Schmartboard, has no provision for decoupling caps.


****
Very frustrating working with an unreliable computer. In addition to the
several issues this computer has had since day one, it now has an
intermittent software problem that shuts down all networking. I can't
even reach the router or modem when it is acting up. I was letting the
battery run down to see if that would help reset the condition and was
typing a reply when the computer just quit without any of the several
warnings I am supposed to get. Crap!

I got all ticked off and started yelling about driving up to the city to
get a new machine. I started the computer back up and after three days
of very limited Internet it is working again. :P I'm still pissed, but
I'll wait until tomorrow to get the new machine.

--

Rick

Stefan Mauerhofer

unread,
Sep 12, 2015, 6:15:11 AM9/12/15
to
Am Donnerstag, 10. September 2015 22:10:20 UTC+2 schrieb rickman:
>
> I have three chips sitting here that I don't know what to do with. It
> seems every thing about this part is done halfway. Even the low cost
> mounting option has no provision for decoupling caps, the Schmartboard.
>

Hi Rick

I recommend you buy an EvalBoard from GA. It works and you don't have to worry about caps.

rickman

unread,
Sep 12, 2015, 10:28:39 AM9/12/15
to
I don't have $500 worth of interest in the device at this point.

--

Rick

Greg

unread,
Sep 13, 2015, 4:05:17 PM9/13/15
to
I am a software guy, so this discussion about crystals is way over my head at this point in time.
I did run into some GA144-related posts here though, where it was claimed that getting the crystal oscillating reliably would have huge marketing potential. Some went as far as calling this a Cana of Galilee moment (Jesus started his ministry with the first public miracle there).
So isn't what Stefan is describing here exactly that?

I took the arrayForth school classes, which are done quite professionally, and found the chip quite original and interesting. Most programming documentation, though, assumes Forth knowledge, so I decided to learn Forth before buying the EvalBoard. I am itching to buy it sooner; are there enough people interested in the chip that we could maybe start a dedicated group, where clueless people like me could also be helped along the way by the more advanced people?

rickman

unread,
Sep 13, 2015, 6:59:32 PM9/13/15
to
On 9/13/2015 4:05 PM, Greg wrote:
> I am a software guy so this discussion about crystals is way over my
> head, at this point in time. I did run into some GA144 related posts
> here though, where it was claimed that if one managed to get the
> crystal oscillating reliably, that had huge markerting potential.
> Some going as far as calling this a Cana of Galilee like moment
> (Jesus started his ministry with the first public miracle there). So
> what Stefan is talking in here, isn't that?

No, Stefan is just saying you can pay GA $500 for the privilege of
playing with their software and chip to see if you can figure out how to
use it in a product.

I was pointing out that they worked with Schmartboards to promote a low
cost product which allowed easy access to the pins on the chip, but
didn't bother to get any decoupling caps included on the board which
would allow the user to actually use *all* 144 processors on the chip.
Obviously this was something done with virtually no investment (or
possibly *literally* none), so was not targeted to anything like the
typical low cost starter kit. Not sure what they were thinking.

The issue of the crystal oscillator is not at all a show stopper or the
great enabler of this chip. It is just another useful feature that was
left out of the design for unknown reasons. One might think it was
because the processors don't need a clock, but that is simply an issue
of controlling the timing of each processor. Nearly all embedded
processors need timing for various interfaces and algorithms. So a
clock is required. Having an oscillator on the chip saves the cost of
using an external oscillator vs. just a crystal, about $0.50 in
quantity. That does not sound like much, but in high volume products it
adds up to a significant cost. Just one of many shortcomings of the
GA144 for high volume apps.


> I took arrayforth school classes, which are done quite
> professionally, and found the chip to be quite original and
> interesting. Most programming documentation though assumes Forth
> knowledge, so I decided to learn Forth before buying the evalboard. I
> am itching to buy it earlier, are there enough people intersted in
> the chip, where we could start maybe a dedicated group, and where
> also clueless people like me could be helped along the way by the
> more advanced people?

I would say I am "interested" in the chip, but not enough to spend $500.
I have several of the Schmartboard devices and have never powered them
up. Part of the reason is that I did take a look at the development
system from GA and found it to be rather labyrinthine. I spent
significant time trying to learn the chip at the lower levels where I
wanted to optimize some of the timing for an interface to SDRAM. So I
did learn the instructions and their timing, but it turns out there are
details of the I/O instruction timing missing. I wasn't able to get
that timing from GA, so I dropped my efforts.

There may be other projects for which the GA144 could be an interesting
device, but something would need to motivate me to get up and over the
learning curve for the tools. Are the videos fairly painless? I seem
to recall watching two or three and finding that progress was slow and
full of details which are not otherwise documented (perhaps only in the
source code of the system).

--

Rick

Greg

unread,
Sep 13, 2015, 9:42:21 PM9/13/15
to
I have found the videos clear and well done; they must have put quite some effort into them.
Thing is, though, that I can't see myself programming much at that level. So what you have next is colorForth, and more recently PolyForth and eForth. Any of these options involves knowing Forth, which is what I am trying to learn in my limited spare time now.
Compared to x86/x64, the F18 is extremely simple; once over the initial shock of it being so different, there isn't that much to it, actually. The arrayForth classes are structured so that after you finish, you have a pretty solid understanding of the processor instruction set and its timing. But I do not recall any timing information being given in relation to I/O.
Every time I called or emailed Green Arrays I found very receptive ears, even though I haven't bought anything from them yet. The problem, though, is that I am a software developer with a very limited understanding of the actual hardware pieces that go into the making of a computer, so I would need to team up with somebody who has that knowledge to get something going.
I will search the forum for those postings I mentioned; according to them, it seems like what Stefan has going is something big. There was something about getting the crystal to oscillate on a single pin, or something like that, whatever that means :)

rickman

unread,
Sep 13, 2015, 10:48:01 PM9/13/15
to
On 9/13/2015 9:42 PM, Greg wrote:
> I have found the videos clear and well done. They must have put quite
> some effort into them. Thing is though that I can't see myself
> programming much at that level.

Not sure what you mean by "that level"? Is this just the assembly level
for the F18A? Personally I don't find it that hard, but that is my
thing. I've built microprogrammed machines and programmed them.


> So what you have next is Coloforth
> and more recently PolyForth and Eforth. Any of these options involves
> knowing Forth which is what I am trying to do in my little spare time
> now.

Whether to use Forth or F18A assembly depends on what you are doing. My
understanding is that Polyforth runs in an interpreted mode using
multiple F18A nodes. That's great for high level application code. I
expect most of the processors will be doing specific tasks, especially
in support of I/O. That will likely need to be programmed in F18A
assembly.


> Compared to x86/x64, F18 is extremely simple, once over the
> initial shock of being so different, there isn't that much to it,
> actually.

That depends on how you look at it. There is a lot of detail that is
hard to pull out of the docs I found. But if you read between the lines
there is a lot there.


> The ArrayForth classes are structured so that after you are
> finished, you have a pretty solid understanding of the processor
> instruction set and their timing. But I do not recall any timing
> information being given in relation to I/O. Every time I
> called/emailed Green Arrays I have found very receptive ears, even
> though I haven't bought anything from them yet.

I was initially an eager and willing potential customer. But I guess
they didn't think I was working in a way that was compatible with their
methods. I was told that rather than worry about designing to timing
data, just try it and see if it would work. :) Even if it worked on
their eval board, timing has to work over PVT (process, voltage and
temperature). The one variable that can't be tested is "process". In
every device I've ever designed, timing was verified by analysis, from
data sheet timing or from timing tools with that information built in.


> Problem though is
> that I am a software developer with very limited understanding of the
> actual hardware pieces that go into the makings of a computer, so I
> would need to couple with somebody who has that knowledge to get
> something going. I will search the forum about those postings I was
> talking about, it seems like what Stefan has going is something big,
> according to those. There was something about getting the crystal to
> oscillate on a single pin, or something like that, whatever that
> means :)

Normally a crystal is used in a circuit with an amplifier. There are
two pins on the amplifier, input and output. Some passives need to be
added to set the circuit to the right conditions to oscillate at the
correct frequency. The amplifier is so simple that it is nearly
universally included in all MCU devices as well as other chips. The
GA144 doesn't do that.

In a GA144 the idea is that you can connect a crystal between an I/O pin
and ground. The I/O pin can be driven high for a short time, then put
into input mode. If the crystal is ringing at high enough amplitude,
the input can detect when the voltage rises above the 0/1 threshold and
wake up the processor. The software then drives the pin to a high for a
short time, sets the I/O to input mode again and sleeps waiting for the
next threshold crossing.

This is all good, but it requires starting with a series of pulses which
will put enough energy into the crystal so it will ring with an
amplitude high enough to cross the input threshold. That's the hard
part. My recollection is a bit fuzzy, but I think both Chuck and
someone at GA tried this and could not get it to work for the general
case of an arbitrary crystal frequency. They used a simple timing loop
which has a limited resolution of one instruction. I say limited
because crystals have a very high Q which means they will only resonate
at a frequency very close to the intended frequency. So if you try to
start it at a frequency outside of this range, the energy won't build up
in the crystal and it won't start ringing. Think of this as pushing a
swing; it's the same exact problem. If you time each push right, the
swing goes higher every cycle. If you push blindly, without syncing to
the swing's natural frequency, some of the time you are slowing it
down rather than helping.
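To make the swing analogy concrete, here is a toy numerical model (my own illustration, not GA144 code; the resonator constants, Q, and pulse width are all arbitrary) showing that a high-Q resonator only accumulates energy when the pulse rate is very close to its resonant frequency:

```python
import math

def ring_amplitude(f_drive, f_res=1.0, q=500, cycles=2000, steps=128):
    """Drive a damped oscillator (a crude crystal model) with narrow
    pulses at f_drive and return its late-time peak displacement."""
    w0 = 2.0 * math.pi * f_res
    dt = 1.0 / (f_res * steps)          # time step, in resonant periods
    damping = w0 / q                    # high Q -> very light damping
    x = v = peak = 0.0
    n = cycles * steps
    for i in range(n):
        t = i * dt
        # Short "push" once per drive period, like pulsing a pin high.
        force = 1.0 if (t * f_drive) % 1.0 < 0.05 else 0.0
        a = force - damping * v - w0 * w0 * x
        v += a * dt                     # semi-implicit Euler keeps the
        x += v * dt                     # oscillation numerically stable
        if i > 0.9 * n:
            peak = max(peak, abs(x))
    return peak
```

Drive it a few percent off frequency and the late-time amplitude collapses to a small fraction of the on-resonance value, which is exactly why a timing loop quantized to whole instructions can fail to start an arbitrary crystal.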

Stefan realized he could use a dithering technique to drive the crystal
at a closer average frequency even if each pulse isn't exactly on the
timing mark. This "average" frequency is the important part. So he has
gotten more crystals to work even if it isn't perfect. The hard part
will be characterizing this and assuring it will work over PVT.
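The dithering idea can be sketched with a Bresenham-style accumulator (an illustrative model of the technique, not Stefan's actual code): choose between two adjacent whole-instruction delays so that the *average* period lands on a fractional target that no fixed-count loop could hit:

```python
def dithered_intervals(target_ticks, n):
    """Return n integer delays (in instruction-time 'ticks') whose
    average approaches the fractional target period."""
    base = int(target_ticks)
    frac = target_ticks - base
    acc = 0.0
    intervals = []
    for _ in range(n):
        acc += frac
        if acc >= 1.0:                 # carry: stretch this period by one tick
            acc -= 1.0
            intervals.append(base + 1)
        else:
            intervals.append(base)
    return intervals
```

Each individual period is still off by up to a tick, but the long-run average sits on the target, and the crystal's high Q responds to that average rather than to any single pulse.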

There are also factors to consider regarding the stability of this
oscillator. Driving the crystal with a 1 or 2 ns wide pulse generates a
wide array of frequencies. I have no idea what this will do to the
oscillations. Many applications are not at all critical in this regard.
They just need to be within 1% of a target frequency. Others are
*very* sensitive to frequency and phase noise.

But as I said before, this crystal oscillator is not *the* factor
keeping people from using the GA144. It is one of many shortcomings
that make this an unprofitable device to use in the vast majority of
applications. In fact, I expect it is way down on the list of important
features in an MCU.

I hope this post wasn't too long.

--

Rick

Greg

unread,
Sep 14, 2015, 12:24:17 AM
to
Rick, thanks much for the explanation.
What I mean by 'that level': to some extent F18 is higher-level programming than, say, x86 assembly, but in some ways it is lower level. One of the classes explains how to do multiplication; for something like that I would certainly want something higher level that does it automatically for me. It is my understanding that ColorForth implements multiplication for you. I think ColorForth is the "assembler" in this world.
Then you have the packing of 4 instructions into an 18-bit word. One instruction takes 5 bits, so to fit 4 instructions into the 18-bit word, the last position can only hold the instructions whose last 2 bits are zero. One of those is, conveniently, nop, because that's what you need if the 4th instruction is not one of the ones you can pack in there. Again, it is my understanding that ColorForth handles this for you, putting in a nop whenever your next instruction cannot fit in the remaining 3 bits of the 18-bit word... Then you have a few instructions that REQUIRE a nop after them whenever they are not followed by certain other instructions. IIRC that has to do with the carry flag not having enough time to propagate, so a nop is needed before the next instruction can correctly be processed. I am not even sure if this is handled by ColorForth for you; I think that's something you need to keep tabs on.
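A sketch of that packing rule (purely illustrative: the opcode values below are invented, and the real F18A additionally XORs each word with a constant, which this ignores; only the 5+5+5+3 slot geometry and the nop fallback follow the description above):

```python
# Made-up opcode values for illustration; NOT the real F18A encoding.
OPS = {"ret": 0x00, "@p": 0x08, "+": 0x14, "dup": 0x1A, "nop": 0x1C}

def assemble(stream):
    """Pack a flat opcode stream into 18-bit words with slots of
    5, 5, 5 and 3 bits. Only opcodes whose low two bits are zero
    fit the final 3-bit slot; otherwise that slot gets a nop and
    the opcode starts the next word."""
    words, i = [], 0
    while i < len(stream):
        word = 0
        for shift in (13, 8, 3):              # three full 5-bit slots
            if i < len(stream):
                word |= OPS[stream[i]] << shift
                i += 1
            else:
                word |= OPS["nop"] << shift
        if i < len(stream) and OPS[stream[i]] & 0b11 == 0:
            word |= OPS[stream[i]] >> 2       # top 3 bits fill slot 4
            i += 1
        else:
            word |= OPS["nop"] >> 2           # nop's low bits are zero
        words.append(word)
    return words
```

With this toy table, a stream ending in ret packs into a single word, while one ending in dup (whose low bits are nonzero here) forces a nop into the 3-bit slot and spills into a second word.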

I've done Intel assembly in my younger years, but overall I think that even though the instruction set is so much reduced, it is harder and lower level to program this processor. That's why I can't see myself writing reasonably sized programs 'at this level'; there simply is too much low-level detail that you get bogged down in. And I didn't even throw the multi-core thing into the mix. If you read from a core while that core is also in reading mode, you've got yourself a nice freeze. I don't own an eval board yet; all this is from my recollections of the ArrayForth classes I took a while back.

rickman

unread,
Sep 14, 2015, 2:00:16 AM
to
On 9/14/2015 12:24 AM, Greg wrote:
> Rick, thanks much for the explanation. What I mean by 'that level'.
> To some extent F18 is higher level programming than say x86 assembly,
> but to some is lower level. One of the classes explains how to do
> multiplication, for something like that I would certainly want
> something higher level that does it automatically for me. It is my
> understanding that ColorForth implements multiplication for you. I
> think ColorForth is the "assembler" in this world. Then you have the
> packing of 4 instructions in an 18 bit word. One instruction takes 5
> bits, so to fit 4 instructions in the 18 bit word you can only use
> the instructions that have the last 2 bits set to zero, on the last
> position in the word. One of which is conveniently, nop. Because
> that's what you need to use, if the 4th instruction is not one of the
> other instructions you can pack in there. Again, it is my
> understanding that ColorForth handles this for you, putting in a nop
> whenever your next instruction cannot be fitted in the remaining 3
> bits of the 18 bit word... Then you have a few instructions that
> REQUIRE a nop after them, whenever they are not followed by certain
> other instructions. IIRC that has to do with carry flag not having
> enough time to propagate, so a nop is needed before the next
> instruction can correctly be processed. I am not even sure if this is
> handled by ColorForth for you, I think that's something you need to
> keep tabs on.

I guess it has been a while since I looked at the tools. I didn't
remember that this was ColorForth. I don't recall if the tool handles
the NOP or not. That is a bit tricky. It's for the ADD instruction
which takes two instruction cycles for the carry to ripple. Since this
is from loading the two stack positions to using the result, if the
previous instruction does not alter the stack the nop is not needed. So
I guess they give you the flexibility of not adding it. Sounds like a
good job for an optimizer. BTW, the nop goes *before* the plus.


> I've done Intel assembly in my younger years, but overall I think
> that even though the instruction set is so much reduced, it is harder
> and lower level to program this processor. That's why I can't see
> myself writing reasonably sized programs 'at this level' there
> simply is too much low level detail that you get bogged down into.
> And I didn't even throw into the mix, the multi-core thing. If you
> read from a core, while that core is also in reading mode, you got
> yourself a nice freeze. I don't own an eval board yet, all this is
> from my recollections from the ArrayForth classes I took a while
> back.

There is no such thing as a "reasonable" sized program in the F18A if by
"reasonable" you mean not small. With only 64 words of total RAM for
program and data there isn't much need for anything other than
assembler. It's not really hard. Just give it a try and you'll get the
hang of it. The problems come from the I/O instructions in my opinion.
At least when you are doing pin I/O... but also inter-node comms.
There are no protocols set up, so anything other than very basic comms
will require you to invent protocols. Be ready to read between the
lines on the data sheet.

--

Rick

bor...@gmail.com

unread,
Sep 14, 2015, 11:35:51 AM
to
These are some of the links I was referring to:
https://groups.google.com/d/msg/comp.lang.forth/znB0NksUPBo/FMBe5Ddpp8AJ
https://groups.google.com/d/msg/comp.lang.forth/znB0NksUPBo/mr6z8b50xzUJ
where it was said that getting the crystal to oscillate was a big deal.

rickman

unread,
Sep 14, 2015, 12:06:08 PM
to
Ok. So what do you think now?

--

Rick

bor...@gmail.com

unread,
Sep 14, 2015, 12:34:00 PM
to
Not much.

rickman

unread,
Sep 14, 2015, 1:26:00 PM
to
On 9/14/2015 12:33 PM, bor...@gmail.com wrote:
> Not much.

If you want to do something cool with a GA144, figure out just what it
can do well. Not many have been able to do that.

One application I came up with that may not have significant limitations
is a signal generator. If the output is limited to basic waveforms like
sine, square, triangle, there is no need for external RAM. The DACs can
be dithered to give good resolution at lower sample rates. The DACs
won't perform so well at higher frequencies though, 9 bits is rather
limiting.

Still, it can do audio frequencies with good SNR and higher frequencies
with lower SNR. Doing all the math at 18 bits will minimize phase
truncation, allowing the spurs to be spread out rather than sitting
close in to the carrier.
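The phase-accumulator arithmetic behind such a generator looks like this (a generic DDS sketch in Python, not GA144 code; the table size and sample rate are my arbitrary choices):

```python
import math

PHASE_BITS = 18                       # matches the F18A's 18-bit word
TABLE_BITS = 7                        # small sine table; the rest is truncated
TABLE = [math.sin(2 * math.pi * i / (1 << TABLE_BITS))
         for i in range(1 << TABLE_BITS)]

def dds(f_out, f_sample, n):
    """Direct digital synthesis: an 18-bit phase accumulator advances
    by a fixed tuning word each sample, and the top bits index a sine
    table. The discarded low phase bits are the 'phase truncation'
    that produces the spurs mentioned above."""
    tw = round(f_out / f_sample * (1 << PHASE_BITS))
    phase = 0
    out = []
    for _ in range(n):
        out.append(TABLE[phase >> (PHASE_BITS - TABLE_BITS)])
        phase = (phase + tw) & ((1 << PHASE_BITS) - 1)
    return out
```

With an 18-bit accumulator the frequency resolution at a 48 kHz sample rate is f_sample / 2^18, about 0.18 Hz per step of the tuning word.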

Get that crystal oscillator working and you'll have a stable
reference... or just use an external oscillator with guaranteed specs.

For an arbitrary waveform generator it will require enough RAM to hold
the waveforms. So that will likely need an external RAM chip. Maybe
figure out how to use the ADCs well and have a waveform capture mode.
Heck, maybe this could be the basis of an electronic instrument.

--

Rick

Greg

unread,
Sep 14, 2015, 5:17:11 PM
to
On Monday, September 14, 2015 at 12:26:00 PM UTC-5, rickman wrote:
Rick, thanks much. The EvalBoard is on its way...
http://www.greenarraychips.com/home/products/index.html
The board has some SRAM included, with dedicated nodes to talk to it.

It helps a lot to know that I can reach out to somebody who can "speak" at my
level too, thanks. Not that GreenArrays is not easily reachable, just that
they may be talking at Everest level while I am at the level where I can
barely see the forest for the trees.

rickman

unread,
Sep 14, 2015, 6:12:11 PM
to
Are you aware that they have a simulator? You can write all sorts of
code to test your skills and build some experience before adding the
complication of real hardware.

--

Rick

Greg

unread,
Sep 14, 2015, 9:15:20 PM
to
Yep, but IIRC things can only happen sequentially, so it's not as realistic a model. At least in the softsim that comes from them.