History buffer crash is still with us.


emt377

Feb 17, 2019, 10:15:36 AM2/17/19
to [PiDP-11]
The "index off the end" problem is still showing up from time to time. 

server11: ../../07.0_blinkenlight_api/historybuffer.c:128: historybuffer_idx2pos: Assertion `idx < _this->endpos' failed.
localhost: RPC: Unable to receive; errno = Connection refused
(in .//../../../07.0_blinkenlight_api/blinkenlight_api_client.c, line 296)

I'm running PiDP software from early Jan (not sure of the version off-hand, I can check later), where I had thought it was said this was fixed. It certainly seems less frequent.

oscarv

Feb 17, 2019, 6:52:53 PM2/17/19
to [PiDP-11]
Hi,

Good! In a way, at least. Someone who has the crash.

Attached are two updated source files, could you recompile with them and see if that solves the problem definitively? Joerg Hoppe expects it will. But we can't know for sure unless a crash victim tries it out... I've run 4 different machines for hundreds of hours now, but never seem to trip the bug.

Kind regards,

Oscar.

server11_crash.zip

emt377

Feb 18, 2019, 7:51:07 PM2/18/19
to [PiDP-11]
I'm not set up for building simh from source yet.  (No Linux systems at hand apart from the PiDP... odd for a man who is a Linux programmer at work).

But the assertion failure is in the blinkenbone client, no?

I eyeballed the existing code from GitHub and the offending function looks OK at first glance.  But what's the concurrency story in simh?  This smells a bit like a multithreading race hazard.

Stephen Casner

Feb 18, 2019, 8:27:44 PM2/18/19
to emt377, [PiDP-11]
On Mon, 18 Feb 2019, emt377 wrote:

> I'm not set up for building simh from source yet. (No Linux systems at
> hand apart from the PiDP... odd for a man who is a Linux programmer at
> work).

You can build on the Pi itself; that's what I've done. Not
super-speedy, but it works fine. In fact, there have been mild
assertions from some folks that code built on the Pi did not exhibit
the problem, whereas the cross-compiled code did. I don't think
that's accurate.

> But the assertion failure is in the blinkenbone client, no?

No. Going back to your earlier message, see that the crash happens in
the server11 process:

server11: ../../07.0_blinkenlight_api/historybuffer.c:128:
historybuffer_idx2pos: Assertion `idx < _this->endpos' failed.

Then that causes the simh client to complain because it can't connect
to the server any more:

localhost: RPC: Unable to receive; errno = Connection refused
(in .//../../../07.0_blinkenlight_api/blinkenlight_api_client.c, line 296)

> I eyeballed existing code from GitHub and the offending function looks ok
> on first glance. But what's the concurrency story in Simh? This smells a
> bit like it's a multithreading race hazard.

Right. There are multiple threads in the server. I discovered that
the all-lights-on flashes were caused by a multithreading race hazard.
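For illustration only, here is the general shape of that kind of hazard -- a hypothetical check-then-use on a shared ring-buffer position, NOT the actual server11 code. Two threads, no lock; the value can change between the reader's two looks at it:

/* Hypothetical sketch, NOT the server11 code: a writer thread keeps
 * advancing a shared ring-buffer position while a reader does an
 * unsynchronized check-then-use on it. */
#include <pthread.h>
#include <stdio.h>

#define CAPACITY   100u
#define ITERATIONS 100000000UL

static volatile unsigned endpos;               /* shared; no lock anywhere */

static void *writer(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        endpos = (endpos + 1) % CAPACITY;      /* wraps back to 0 regularly */
    return NULL;
}

static void *reader(void *arg)
{
    unsigned long anomalies = 0;
    (void)arg;
    for (unsigned long i = 0; i < ITERATIONS; i++) {
        unsigned seen = endpos;                /* first look */
        if (endpos < seen)                     /* second look: writer wrapped in between */
            anomalies++;
    }
    printf("%lu check-then-use anomalies seen\n", anomalies);
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

Build with gcc -pthread; whether and how often the anomaly shows up depends entirely on timing, which is exactly why these bugs feel so random.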

-- Steve

emt377

Feb 18, 2019, 9:01:05 PM2/18/19
to [PiDP-11]
I was remembering the _api_client part of the file name and wrongly concluded client => not server!


                       

emt377

Feb 18, 2019, 9:01:51 PM2/18/19
to [PiDP-11]
But of course as you pointed out, that was not the file I was looking for.

emt377

Feb 24, 2019, 3:02:53 PM2/24/19
to [PiDP-11]
OK, I find myself with a couple of hours free and the willingness to use them (work is busy with some hard programming -> I don't feel like more hacking in the evening; sad).  Can you point me to the 'right' SIMH source code, apart from the files previously attached here, for this exercise?  I am under the impression there are several variants floating around.

emt377

Feb 24, 2019, 4:35:31 PM2/24/19
to [PiDP-11]
Ah, I just noticed this statement on the `obsolescence guaranteed` web site:

./src - the entire blinkenbone source tree. makepidp.sh compiles the client/server binaries.

emt377

Feb 25, 2019, 10:54:07 PM2/25/19
to [PiDP-11]
Alrighty. Despite the preliminary whining, it all seemed straightforward enough.  All the source code and tools were already in place on the Pi; I did not realize that beforehand.

Surprisingly, rebuilding just the server didn't take much time, even on a Pi Zero.

1.  Found the directory in /opt/pidp11/src with the historybuffer.[ch] files.
2.  Dropped the replacements on top of them.
3.  Ran the "server build" shell script.
4.  Ran it again, and this time "sudo su"'d first  :-(
5.  Poked around, and it looks like the output lands in the right directory for it to be run from.
6.  Rebooted as the lazy man's way to get everything running again.

It generally takes a day or so to fail when running M+, so I'll leave it running.

I diff'd the code, and from a quick eyeballing a mutex was added around the part I suspected of having a race condition, so IMO there's a fair chance of success here.

I wonder why I'm seeing this more than others.  Anyone else using a Pi Zero?  (Zero WH, to be exact)

emt377

Feb 27, 2019, 10:09:35 PM2/27/19
to [PiDP-11]
It's tricky to know when an intermittent problem has been fixed, but so far it's been running for 48 hours (M+, idling) and the blinkenlights are still blinken.

I'm aiming for a week before I declare it fixed.

Oscar Vermeulen

Feb 28, 2019, 8:01:55 AM2/28/19
to emt377, [PiDP-11]
On Thu, 28 Feb 2019 at 04:09, emt377 <dave_li...@verizon.net> wrote:
It's tricky to know when an intermittent problem has been fixed, but so far it's been running for 48 hours (M+, idling) and the blinkenlights are still blinken.

I'm aiming for a week before I declare it fixed.

That's encouraging! Please keep me posted!

Kind regards,

Oscar.

Gerry Duprey

Feb 28, 2019, 8:06:19 AM2/28/19
to Oscar Vermeulen, emt377, [PiDP-11]
Just so you know - while a week is good (and I agree that, for
non-mission-critical software like this, it's likely more than enough), between
my two PiDP-11s I usually get 3-4 weeks before the server stuff crashes (but
it always crashes, eventually). Sometimes it's a few days before a crash, but
mostly it's on a 3-4 week basis (this was helped profoundly by changing the console
refresh rate to 20ms vs 1ms).

So just a note that a week running is not necessarily a solid "all clear" signal ;-)

Gerry

emt377

Feb 28, 2019, 8:34:10 AM2/28/19
to [PiDP-11]
On this particular system prior to installing this fix, the MTBF was of the order of 24 hours or so.  Or so I guess, since I wasn't taking notes.


Tom Lake

Feb 28, 2019, 9:10:55 AM2/28/19
to [PiDP-11]
I've been running about three weeks with no crash since updating to the Feb 4 version at


Tom L

emt377

Feb 28, 2019, 5:45:14 PM2/28/19
to [PiDP-11]
What Pi are you using?  I'm trying to figure out why I have more trouble than some others.  I wonder whether my single-cored Zero is the issue.

(I now have a 3B+ waiting to be installed, but I want to bash this bug a little more first)

sunnyboy010101

Feb 28, 2019, 6:32:53 PM2/28/19
to [PiDP-11]
My mileage has been quite different. I crashed on Jan 15, updated software. Crashed again Jan 31; restarted the machine and it's still running. So nearly a month on the Jan 15 version of things. (I didn't reload anything on Jan 31, just rebooted the machine).

We've had 3 major power outages in that time and my UPS just keeps the machine humming. Raspberry Pi 3B+, ABOX 3amp power supply. Unknown actual voltage as I can't really get at the power supply to measure it without disrupting a lot of stuff. :-)

I keep meaning to rebuild with the changes Oscar sent me on Jan 31, but I've been waiting for it to fail again first.

Tom Lake

Feb 28, 2019, 9:10:27 PM2/28/19
to [PiDP-11]
I run the Pi 3 B+.

Tom L

Jörg Hoppe

Mar 1, 2019, 2:03:36 AM3/1/19
to pid...@googlegroups.com
Crash testers,

> Just so you know - while a week is good (and I agree that, for
> non-mission-critical software like this, it's likely more than enough),
> between my two PiDP-11s I usually get 3-4 weeks before the
> server stuff crashes (but it always crashes, eventually). Sometimes
> it's a few days before a crash, but mostly it's on a 3-4 week basis (this
> was helped profoundly by changing the console refresh rate to 20ms vs 1ms).


that's an *important* thing:

To verify that the "server crash" is healed, please set the console refresh
rate to the maximum of 1ms with
sim>set realcons i=1
The bug frequency apparently increases with the refresh rate,
and we want the test runs to be as aggressive as possible.

With "interval=20" you are wasting a lot of time.

kind regards,
Joerg

emt377

Mar 1, 2019, 6:22:10 PM3/1/19
to [PiDP-11]
Oh $#%^ !   This was somewhere around 3.5 days in.

server11: ../../07.0_blinkenlight_api/historybuffer.c:128: historybuffer_idx2pos: Assertion `idx < _this->endpos' failed.
localhost: RPC: Unable to receive; errno = Connection refused
(in .//../../../07.0_blinkenlight_api/blinkenlight_api_client.c, line 296)

But now I'm doubting I am running the fixed version at all.

    if (_this->endpos > _this->startpos) {
        assert(idx < _this->endpos);   <<<<<<<<<<<<<<<<<<
        pos = _this->startpos + idx;
    } else { // end rolled around

The source file on my pidp system shows no sign of 'MUTEX' ifdefs. I am pretty damn certain I put them there.  The line number seems to correspond to the source code I've got.

"Results inconclusive".  

Will retry.... sigh.


emt377

Mar 1, 2019, 6:54:01 PM3/1/19
to [PiDP-11]
OK.  Sources copied over again, server rebuilt again, rebooted again.   Starting again.

And now I am convinced that 3 days is insufficient to reach firm conclusions - like they say, "a watched program never cores". 

emt377

Mar 2, 2019, 9:16:13 PM3/2/19
to [PiDP-11]
This weekend I upgraded my PiDP-11 to a Pi 3B+, so it's possible I will never see the problem again.  I'm running with the stock software, so I'm theoretically vulnerable to the race condition.  If it happens, I'll switch to the mutex-protected history buffer code.

Apologies for dismantling the test-bed for those historybuffer changes, but the horribly lengthy RSX-11M-PLUS startup times finally wore out my patience with the Pi Zero.

emt377

Mar 10, 2019, 9:28:04 AM3/10/19
to [PiDP-11]
The blinkenlights have been running on the Pi 3B+ for over a week now.  No crash even though I'm running on the old (non-mutex'd) code.  So, perhaps we can infer from this that the race condition only shows up when the system is starved for CPU, as on the single-cored Zero?

Gerry Duprey

Mar 10, 2019, 9:33:02 AM3/10/19
to emt377, [PiDP-11]
Well, my experience is also on a rPi 3B+ and with the refresh rate set to 20ms,
I get a crash every few weeks - sometimes longer, but it crashes. With refresh
rate set to 1, it happens much more frequently (sometimes minutes/hours,
sometimes days, but rarely much past a week or so).

My 2 pidp11s both have 3B+ and both do not run anything other than the PDP
emulator, so they are not experiencing any resource starvation/restrictions. They
usually have 3 cores nearly free, plenty of RAM and disk and are running at
reasonable temps.

Just a data point on the topic.

Gerry

oscarv

Mar 10, 2019, 10:40:02 AM3/10/19
to [PiDP-11]
Gerry,

Then, could you please test the 2 patched source files in this thread and see if that resolves the problem for you? Just copy the 2 files over into pidp11/src/.... and recompile with the makeserver.sh script.

I'd like to get some tests to see if this is indeed the fix. Jörg suspects it'll be the fix but as we do not have the crash problem here, we won't know until others try.

Kind regards,

Oscar.

oscarv

Mar 10, 2019, 10:42:23 AM3/10/19
to [PiDP-11]


On Sunday, March 10, 2019 at 2:28:04 PM UTC+1, emt377 wrote:
The blinkenlights have been running on the Pi 3B+ for over a week now.  No crash even though I'm running on the old (non-mutex'd) code.  So, perhaps we can infer from this that the race condition only shows up when the system is starved for CPU, as on the single-cored Zero?

Correct - it requires some system hiccup. Either an overloaded CPU, or a CPU in panic because of power problems (this crash bug is **heavily** correlated with not using a Pi power supply (>5.1V) but a normal one (5.00000V)). But not exclusively.

Kind regards,

Oscar.

Johnny Billquist

Mar 10, 2019, 10:45:53 AM3/10/19
to pid...@googlegroups.com
Definitely not exclusively.

I get the crash after a week or two. Nothing else ever going on on the
Pi, and the power supply is not the issue.

And the Pi is running headless, connected to my internal WiFi, running
RSX-11M-PLUS, TCP/IP but no DECnet.

And when things crash, it is only the front panel that conks out. All
the rest of the machine never has any problems.

Johnny

--
Johnny Billquist || "I'm on a bus
|| on a psychedelic trip
email: b...@softjar.se || Reading murder books
pdp is alive! || tryin' to stay hip" - B. Idol

Gerry Duprey

Mar 10, 2019, 10:56:05 AM3/10/19
to Johnny Billquist, pid...@googlegroups.com
Howdy,

On 3/10/19 10:45 AM, Johnny Billquist wrote:
>> Correct - it requires some system hiccup. Either an overloaded CPU, or a CPU
>> in panic because of power problems (this crash bug is **heavily** correlated
>> with not using a Pi power supply (>5.1V) but a normal one (5.00000V)). But not
>> exclusively.
>
> Definitely not exclusively.
>
> I get the crash after a week or two. Nothing else ever going on on the Pi, and
> the power supply is not the issue.
>
> And the Pi is running headless, connected to my internal WiFi, running
> RSX-11M-PLUS, TCP/IP but no DECnet.
>
> And when things crash, it is only the front panel that conks out. All the rest
> of the machine never have any problems.

Yep - exactly my situation too. I am running with a known/good power supply
(about 5.1V @ 3A at the Pi - tested with my meter and monitored for low-voltage on
the Pi itself), and nothing else running on the Pi. When it does happen, the PDP
simulator itself is fine and continues to run, but the front panel server is
dead. (Terminology there often takes me a second to process - my gut says the
simulator should be called a server and the front panel stuff a client, but no
big deal, of course, just a quick adjustment.)

Based solely on observation and my C/system coding background, this really looks
like a race/mutex/synchronization problem of some sort - based on the
seeming randomness, the fact that the refresh speed affects how often it occurs, and
the nature of the errors when the front panel server dies.

Unfortunately, I'll be on the road for at least the next two weeks, so I won't be able
to try Oscar's patch until I get back.

I think a lot of the difficulty in diagnosing this is that there are other, common
reasons why the whole thing may become unstable, and some of them, like single-core Pis
and weak power supplies, are super common failure sources/destabilizers; cutting
through that at the same time adds more than a little "fog" to the entire
diagnostic process.

Gerry

Oscar Vermeulen

Mar 10, 2019, 11:06:32 AM3/10/19
to Johnny Billquist, PiDP-11
Johnny,

Then definitely compile in the two patched source files. It should fix things, but I'd like confirmation before putting them into an updated version.

Kind regards,

Oscar.


Johnny Billquist

Mar 10, 2019, 11:09:20 AM3/10/19
to Gerry Duprey, pid...@googlegroups.com
Hi.
In this case the front panel is the server, and simh is just a client.
But that's just terminology, and of little relevance. :-)

> Based solely on observation and my C/system coding background, this
> really looks like either a race/mutex/synchronization problem of some
> sorts - based on the seeming randomness, the fact that refresh speed
> affects how often it occurs and the nature of the errors when the front
> panel server dies.

Oh, this reeks of a race problem all the way.
I just don't want to touch the code. After looking at it for a couple of
seconds, I realized I would much prefer to just rewrite the whole thing.
But of course, I have absolutely no time to do that, so instead I do
nothing. :-)

> Unfortunately, I'll be on the road for at least next two weeks, so won't
> be able to try oscars patch until I get back.
>
> I think a lot of the issues with diagnosing this is there are other,
> common reasons why the whole thing may become unstable and some, like
> single-core pis and weak power supplies, are super common
> failure/destablizers and cutting through that at the same time adds more
> than a little "fog" to the entire diagnostic process.

That might be true. And you might also have people just observing casual
facts and making connections that just are not there at all.
There might in fact be just one problem, and all other observations are
in fact not at all related.
But who knows. Until at least one problem is fixed, we will not know if
there are more.

Johnny Billquist

Mar 10, 2019, 11:12:06 AM3/10/19
to Oscar Vermeulen, PiDP-11
On 2019-03-10 16:06, Oscar Vermeulen wrote:
> Johnny,
>
> Then definitely compile in the two patched source files. It should fix
> things, but I'd like confirmation before putting them into an updated
> version.

Hmm. I could try it, but I should really take a minute or two and check
what it actually does too. :-)

Oscar Vermeulen

Mar 10, 2019, 11:13:25 AM3/10/19
to Johnny Billquist, PiDP-11
Johnny,

On Sun, 10 Mar 2019 at 16:12, Johnny Billquist <b...@softjar.se> wrote:
Hmm. I could try it, but I should really take a minute or two and check
what it actually does too. :-)

It fixes the race condition that we suspect is the root cause.

Kind regards,

Oscar.

emt377

Mar 10, 2019, 12:02:23 PM3/10/19
to [PiDP-11]


On Sunday, March 10, 2019 at 11:12:06 AM UTC-4, Johnny Billquist wrote:

Hmm. I could try it, but I should really take a minute or two and check
what it actually does too. :-)


It protects history buffer accesses with a mutex.

I concur with the general sentiment that this smells (based on my programmer superpowers, which I acquired when I was bitten by a mutant teletype papertape-reader) like a race condition, so this stands a fair chance of being 'the right fix'.  I don't know enough about the internals of simh itself to be sure that multithreaded access is the root cause, but I take the implementor's word on that.
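Conceptually the patched code should look something like this -- a minimal sketch of the idea as I understand it, with made-up type and function names, not the actual files Oscar attached:

/* Minimal sketch of the mutex idea, with made-up names -- not the actual
 * patch.  Both the writer's append and the reader's check-then-fetch hold
 * the same lock, so the start/end positions can't move mid-operation. */
#include <pthread.h>

typedef struct {
    unsigned startpos, endpos, capacity;
    unsigned *vals;
    pthread_mutex_t mutex;
} ringbuf_t;

/* simulator thread: append one sample, dropping the oldest when full */
void ringbuf_put(ringbuf_t *rb, unsigned v)
{
    pthread_mutex_lock(&rb->mutex);
    rb->vals[rb->endpos] = v;
    rb->endpos = (rb->endpos + 1) % rb->capacity;
    if (rb->endpos == rb->startpos)
        rb->startpos = (rb->startpos + 1) % rb->capacity;
    pthread_mutex_unlock(&rb->mutex);
}

/* panel thread: fetch the idx'th oldest sample, or 0 if out of range */
unsigned ringbuf_get(ringbuf_t *rb, unsigned idx)
{
    unsigned v = 0;
    pthread_mutex_lock(&rb->mutex);
    unsigned fill = (rb->endpos + rb->capacity - rb->startpos) % rb->capacity;
    if (idx < fill)
        v = rb->vals[(rb->startpos + idx) % rb->capacity];
    pthread_mutex_unlock(&rb->mutex);
    return v;
}

The important bit is that the range check and the fetch happen under one lock hold, so the writer can't move the pointers in between.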

Johnny Billquist

Mar 10, 2019, 12:19:32 PM3/10/19
to pid...@googlegroups.com
On 2019-03-10 17:02, 'emt377' via [PiDP-11] wrote:
>
>
> On Sunday, March 10, 2019 at 11:12:06 AM UTC-4, Johnny Billquist wrote:
>
>
> Hmm. I could try it, but I should really take a minute or two and check
> what it actually does too. :-)
>
>
> It protects history buffer accesses with a mutex.

What it conceptually does, or is intended to do, I know. I was saying
this in the sense of checking what it actually does.
Yes, it adds a mutex to protect some parts. But after a quick look, that
protection looks rather partial. So I'm not sure whether it actually removes the
problem, or just reduces it.

But I have not done a deep analysis. I've applied it, and we'll see how
things go.

Rene Richarz

Mar 11, 2019, 5:27:11 AM3/11/19
to [PiDP-11]
I am also getting the history buffer crash about once per day on my setup. I will now install the patch and will report whether it's gone.

I have quite some experience writing time-critical programs on the Raspberry Pi. One needs to be aware that the 4 individual CPUs constantly change their clock frequencies based on load, temperature and the phase of the moon. The scheduling of jobs on the individual CPUs and the clock frequency used are much more complex issues than most people can imagine. The major problem which I had to deal with in the past was that the actual i2c clock rates, which are derived from the same internal clock as the CPU rates, are neither constant nor what one requests! grrr.... It's an eye opener to look at such things with a logic analyzer. It therefore happened that a slow i2c client pulled the plug at arbitrary times.

What this means is that the occurrence of race conditions is much more complex than one would naively think. The clock speed of an individual program at a given time might also be lower if it is executed on a CPU which was in a slower state just before the process was awakened! Programs therefore need to take all possible precautions to avoid failures due to race conditions.


Stephen Casner

Mar 11, 2019, 5:19:34 PM3/11/19
to [PiDP-11]
On Sunday, March 10, 2019 at 7:56:05 AM UTC-7, Gerry Duprey wrote:
(terminology there often takes me a second to process - my gut says the
simulator should be called a server and the front panel stuff a client, but no
big deal, of course, just a quick adjustment).

In my opinion, the client/server relationship should be swapped.  The history accumulation should be performed in the simh simulator and the front panel process should query the simulation process to get the system state at the rate at which the display will be updated.  I've mentioned this before here, and have been threatening to implement it, but that project has been delayed by others.

Rene Richarz

Mar 16, 2019, 5:12:37 AM3/16/19
to [PiDP-11]


On Monday, February 18, 2019 at 12:52:53 AM UTC+1, oscarv wrote:
Attached are two updated source files, could you recompile with them and see if that solves the problem definitively? Joerg Hoppe expects it will. But we can't know for sure unless a crash victim tries it out... I've run 4 different machines for hundreds of hours now, but never seem to trip the bug.

Good news: No history buffer crashes on my system for 5 days now! The patches appear to have solved the problem.

Test conditions:
   - Running 2.11BSD with several terminals all day for the last 5 days
   - 5.17 V at the micro-USB test points of the Raspberry Pi 3B+
   - 5.16 V at the GND and +5V rails of the prototype area
   - Passive cooler on the Pi 3B+
   - CPU temperature 56-58 °C if no other heavy jobs are running on the Pi
   - Several tests with additional heavy jobs on the Pi, CPU temperature up to 70 °C
   - No backplane except my mini-backplane for the sockets

Jon Brase

May 19, 2019, 10:35:27 PM5/19/19
to [PiDP-11]
A data point on the issue, though I have no idea what it actually means:

After a good long time of never* getting this history buffer crash, I got it repeatedly while running v7 Unix last night. I had seen this thread, but not really read it, so I didn't know what error message people were getting, and the sudden development of the issue made me more nervous about the integrity of my SD card than anything.

So I went to back the card up on my desktop this evening, and got a read error half a gigabyte in, accompanied by the card disappearing from /dev. It was still readable if I unplugged it and plugged it back in, so I tried transferring files through my file browser rather than making a bulk image, and got repeated failures that required me to unplug and replug the USB adapter. I tried using a different USB adapter, and am presently pulling an image of the card, and it hasn't failed yet.

Now, there are a bunch of uncontrolled variables here. It seems really suspicious that I would have no issues with this, and then suddenly have four or five crashes in one night, and then the next night the card would give a bunch of read errors; that seems to point toward the crashes being related to a hardware failure on the card.

OTOH, it seems really suspicious that a hardware failure would show up as the exact same crash others had experienced before, and as the same crash every time. There's also some doubt as to whether a hardware failure has even occurred: a different USB <-> SD adapter seems to have eliminated the errors I was receiving while backing up the card. These two factors would indicate against it being a hardware issue.

Then again maybe the second adapter is a higher quality adapter that's retrying on errors that the other one is just passing upstream to the OS. I have no idea how much logic these things actually have in them. Then again, with the other adapter the errors were causing the card to completely disappear as far as the OS was concerned, so I'd think that would be the case on the Pi, too, where there is no adapter.

And then yet again, if the bug is caused by a race condition, things like delays due to having to retry failed disk accesses could well trigger it.



*Except maybe once: the panel stopped blinking at one point, but I wasn't in front of my monitor, so I missed any error message that might have shown up in the simh console window.

oscarv

May 20, 2019, 6:11:59 AM5/20/19
to [PiDP-11]
Jon,

That an SD card with trouble would trigger the bug in the unpatched pidp11 software fits the profile.

I'm curious what brand (not a fake?) the SD card is. The Pi fora regularly contain messages about SD cards dying. Some people have it more than once, whilst others never have any problem. Out of superstition, I keep to Kingston and Sandisk SD cards, but whether that really makes a difference I do not know.

Better back it up before the uucp setup disappears!

Kind regards,

Oscar.

Jon Brase

May 21, 2019, 3:34:26 PM5/21/19
to pid...@googlegroups.com
It's a SanDisk. No idea if it's fake or not. Just got a panel failure followed by a whole bunch of "bad block" errors from 2.11 BSD, so I think SD card failure is the issue here.

EDIT:

Correction: It's actually a monster. I thought it was a SanDisk, but it's actually from before I bought my current batch of SanDisk cards.

Jon Brase

May 21, 2019, 11:03:47 PM5/21/19
to [PiDP-11]
To be honest, though, I'm not sure I want to install the patch and correct the race condition: The server11 crashes provided a great early diagnostic of the failure of the card, and I think I managed to get everything off before the total failure of the card as a result.

Sunnyboy010101

May 22, 2019, 7:01:32 PM5/22/19
to pid...@googlegroups.com
Finally today I installed the historybuffer patch and recompiled. It
seems to be working fine.

Just to recap in detail:

1. copied the two files 'historybuffer.c' and 'historybuffer.h' from my
PC onto the raspberry pi into a directory called '/home/pi/save' (I use
WinSCP for this).

2. became root with 'sudo su -'. I'm old school and I prefer to actually be
root for this instead of just using sudo. Habit, not necessarily good
practice.

3. Found the original historybuffer files in
'/opt/pidp11/src/07.1_blinkenlight_server'

cd /opt/pidp11/src/07.1_blinkenlight_server

4. saved the original files ('cause I always do that):

mv historybuffer.c historybuffer.c.orig

mv historybuffer.h historybuffer.h.orig

5. copied the new sources into the blinkenlight directory

cp /home/pi/save/history* .

6. verified all looks good

ls -l

7. recompile the server

cd /opt/pidp11/src

./makeserver.sh

8. I had a clean compile. I checked the build directory /opt/pidp11/bin
and saw the symbolic link for server11, so I checked the files in the actual
directory.

ls -l /opt/pidp11/src/11_pidp_server/pidp11/bin-rpi/*

The date stamp showed it had been recompiled today, so I was satisfied.

9. Rebooted.

10. Everything came up fine (I'm using 2.11 BSD) so again, it all looks
good.




sunnyboy010101

May 29, 2019, 1:55:54 AM5/29/19
to pid...@googlegroups.com
OK. Very, very, very strange. As noted in my post below, I copied the 'fix' files and recompiled. I then restarted the Raspberry Pi running the PiDP-11 and it's been running since 4 PM PDT May 22.

Tonight at 9pm PDT May 28, the lights on my PiDP11 froze.

I logged in, and there was no error message at all; nothing in simh to indicate any problem, just ... frozen lights. Everything else was running just fine, including my alternate ip login for BSD2.11. Nothing at all to indicate why the lights froze, just frozen lights.

I rebooted as there was nothing else to do, and  it's up again, but this is probably the first report of frozen lights with the new code.

-R

Warren Hardy

May 29, 2019, 4:37:47 AM5/29/19
to [PiDP-11]
Funny you say that. Today that happened to me too. I shut down and rebooted. I tried to do a button reboot using the rotary switch, but that failed to work.

Bill E

May 29, 2019, 8:28:50 AM5/29/19
to [PiDP-11]


On Sunday, February 17, 2019 at 10:15:36 AM UTC-5, emt377 wrote:
The "index off the end" problem is still showing up from time to time.

Looking casually thru the code: while a mutex was added to protect fetching of values from the ring buffer, there are a number of additional places in the code where the ring pointers are tested outside of a mutex. Any of these can potentially be a problem. I can't tell for sure without knowing which threads are involved where. But any method that does a pointer test followed by a pointer action in one thread, while another thread can change the pointers, is a problem. These things can be very random; a context switch has to happen at just the right time. But, given enough time, it'll happen.

For example, the code that's failing in historybuffer_idx2pos(): if( _this->endpos > _this->startpos ) { assert( idx < _this->endpos); ....}.
Consider the case where the if() executes when endpos is at the end of the buffer. The test will be true. Then, between the time of the if() test and the assert, another thread increments endpos, which causes a wrap, setting endpos to a value less than idx. The assert then fails (rightly so). The entire if-else needs to be in a mutex, and this is clearly one of the methods that is having a threading interaction with another method that's changing the pointers.

Summary: in a threaded environment, ANY pointer tests and associated code that can run in a thread other than the one manipulating the pointers must also be in a mutex, not just the pointer manipulation code.

I could do a brute force fix, mutex every test, but that's probably overkill. The tests in methods that can execute in multiple threads do have to be protected.
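To make that concrete, here's a sketch of the shape I mean for idx2pos() -- a hypothetical type, with field names modeled on the assert message, not the repository code:

/* Sketch only: hypothetical type and names, with the whole if-else inside
 * one critical section so no other thread can move the pointers between
 * the test and the use. */
#include <assert.h>
#include <pthread.h>

typedef struct {
    unsigned startpos, endpos, capacity;
    pthread_mutex_t mutex;
} hbuf_t;

unsigned hbuf_idx2pos(hbuf_t *_this, unsigned idx)
{
    unsigned pos;

    pthread_mutex_lock(&_this->mutex);
    if (_this->endpos > _this->startpos) {
        assert(idx < _this->endpos);          /* same check as the failing line */
        pos = _this->startpos + idx;
    } else {                                  /* end rolled around */
        pos = (_this->startpos + idx) % _this->capacity;   /* assumed wrap handling */
    }
    pthread_mutex_unlock(&_this->mutex);
    return pos;
}

Locking in the caller around the whole read sequence would do the same job; the point is that the test and the arithmetic see one consistent snapshot of the pointers.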

Bill E

May 29, 2019, 10:25:28 AM5/29/19
to [PiDP-11]
Ok, putting my money where my mouth is. I've reworked historybuffer.c to lock all critical regions. I'm rebuilding now. I'll run for a week or so and post my results. If it doesn't crash, I'll offer up the revised code.

Warren Young

May 29, 2019, 11:41:03 AM5/29/19
to [PiDP-11]
On Wednesday, May 29, 2019 at 8:25:28 AM UTC-6, Bill E wrote:
Ok, putting my money where my mouth is. I've reworked historybuffer.c to lock all critical regions. I'm rebuilding now. I'll run for a week or so and post my results. If it doesn't crash, I'll offer up the revised code.

I'm glad you're trying this brute force approach. It could work, and it's diagnostic in both directions: if it works, it shows that this is where the problem is, and if it doesn't work it says you should be looking elsewhere.

However, I've just searched this message thread and don't see any mention of either Helgrind or the GCC thread sanitizer. Is there a good reason why these popular and free dynamic bug detection tools have not been tried yet?

Why are threads being used in the first place? I thought the PiDP-11 had a pair of cooperating processes, one to drive the front panel and a separate one for the simulator proper. Why are these not single-threaded programs communicating through some regulated channel, so data races can't happen?

Stephen Casner

May 29, 2019, 1:04:17 PM5/29/19
to Warren Young, [PiDP-11]
On Wed, 29 May 2019, Warren Young wrote:

> Why are threads being used in the first place? I thought the PiDP-11 had a
> pair of cooperating processes, one to drive the front panel and a separate
> one for the simulator proper. Why are these not single-threaded programs
> communicating through some regulated channel, so data races can't happen?

One motivation is to take advantage of multiple cores for increased
performance. In the "IBM 1620 Jr." project for the Computer History
Museum everything runs in one Pi without the IPC that's part of the
Blinkenlights architecture. But we need two threads because the
machine cycle-level simulation runs at a 20 microsecond cycle time and
consumes most of one core while the display update runs at a 10
millisecond cycle time on a separate core. However, we designed a
lock-free memory-based communication between the two threads so no
mutex is necessary and there are no data races.

I've said here before that what I plan to do is implement a method
similar to what we did for the 1620 where the averaging of the state
of each light is performed within the simulation and then the display
update queries that state to get the appropriate intensity value for
each update. That query could be over IPC rather than directly
through memory as we did for the 1620, but in that case the display
process would be the client and the simulation would be the server.
This eliminates the current history buffer mechanism and can be done
without a mutex.

I've let that project be stalled by other projects, but I swear I will
get to it eventually.

-- Steve

Bill E

May 29, 2019, 2:43:49 PM5/29/19
to [PiDP-11]
Ok, I've done some code tracing and dug thru the other code that calls the methods in historybuffer.c. I've now de-brute-forced my fix and I think I've identified the culprit. The failing method was failing because of a threading window, but it wasn't the direct culprit. It could get passed an index that was no longer valid. Anyway, the fix is very simple. I'm running with it now. Next update after I see if this actually fixed the problem.

Oscar Vermeulen

May 29, 2019, 3:16:28 PM5/29/19
to Bill E, [PiDP-11]
Bill,

Much appreciated! Let me know when/if I should amend the code.

Kind regards,

Oscar.


On Wed, 29 May 2019 at 20:43, Bill E <wjegr...@gmail.com> wrote:
Ok, I've done some code tracing and dug thru the other code that calls the methods in historybuffer.c. I've now de-brute-forced my fix and I think I've identified the culprit. The failing method was failing because of a threading window, but it wasn't the direct culprit. It could get passed an index that was no longer valid. Anyway, the fix is very simple. I'm running with it now. Next update after I see if this actually fixed the problem.


Johnny Billquist

May 29, 2019, 4:53:23 PM5/29/19
to pid...@googlegroups.com
Yeah, my Pi had been up for quite a while, but the day before last it
finally hung again as well. I did a quick peek at the update that was
done around February, which did improve the situation, but inspection of
the code made me suspect it wasn't enough either, just like Bill reported.

Johnny

On 2019-05-29 07:55, sunnyboy010101 wrote:
> OK. Very, very, very strange. As noted in my post below, I copied the
> 'fix' files and recompiled. I then restarted the raspberry pi running
> the PiDP11 and its been running since 4 PM PDT May 22.
>
> Tonight at 9pm PDT May 28, the lights on my PiDP11 froze.
>
> I logged in, and there was no error message at all; nothing in simh to
> indicate any problem, just ... frozen lights. Everything else was
> running just fine, including my alternate ip login for BSD2.11. Nothing
> at all to indicate why the lights froze, just frozen lights.
>
> I rebooted as there was nothing else to do, and  it's up again, but this
> is probably the first report of frozen lights with the new code.
>
> -R
>
>
> On Sunday, February 17, 2019 at 7:15:36 AM UTC-8, emt377 wrote:
>
> The "index off the end" problem is still showing up from time to time.
>
> server11: ../../07.0_blinkenlight_api/historybuffer.c:128:
> historybuffer_idx2pos: Assertion `idx < _this->endpos' failed.
> localhost: RPC: Unable to receive; errno = Connection refused
> (in .//../../../07.0_blinkenlight_api/blinkenlight_api_client.c,
> line 296)
>
> I'm running PiDP software from early Jan (not sure of the version
> off-hand, I can check later) where I had thought it was said this
> was fixed. It certainly seems less frequent.
>

Johnny Billquist

May 29, 2019, 5:07:18 PM5/29/19
to pid...@googlegroups.com
Threads are not evil. Threads are wonderful. But, as with most things
that have lots of power and possibilities, they can also easily be used
wrong, and are sometimes not trivial for people to do right.

Helgrind is a good tool, but a slowdown of 100:1 might be a problem if
you want to run it... :-)

Johnny

On 2019-05-29 17:41, Warren Young wrote:
> On Wednesday, May 29, 2019 at 8:25:28 AM UTC-6, Bill E wrote:
>
> Ok, putting my money where my mouth is. I've reworked
> historybuffer.c to lock all critical regions. I'm rebuilding now.
> I'll run for a week or so and post my results. If it doesn't crash,
> I'll offer up the revised code.
>
>
> I'm glad you're trying this brute force approach. It could work, and
> it's diagnostic in both directions: if it works, it shows that this is
> where the problem is, and if it /doesn't/ work it says you should be
> looking elsewhere.
>
> However, I've just searched this message thread and don't see any
> mention of either Helgrind
> <http://valgrind.org/docs/manual/hg-manual.html> or the GCC thread
> sanitizer
> <https://developers.redhat.com/blog/2014/12/02/address-and-thread-sanitizers-gcc/>.
> Is there a good reason why these popular and free dynamic bug detection
> tools have not been tried yet?
>
> Why are threads being used in the first place? I thought the PiDP-11 had
> a pair of cooperating processes, one to drive the front panel and a
> separate one for the simulator proper. Why are these not single-threaded
> programs communicating through some regulated channel, so data races
> can't happen?
>
> Threads are evil.
> <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf>
>

Warren Young

May 30, 2019, 12:02:54 AM5/30/19
to [PiDP-11]
On Wednesday, May 29, 2019 at 11:04:17 AM UTC-6, Stephen Casner wrote:
On Wed, 29 May 2019, Warren Young wrote:

> Why are threads being used in the first place?

One motivation is to take advantage of multiple cores for increased performance.

My understanding is that the PiDP-11 simulator is one process, and the BlinkenBone server is the other, the latter driving the PiDP-11 hardware only. Since the Linux scheduler will tend to put those on separate cores, and the simulator will suck up all or most of a core by itself, that leaves 3 more for the BlinkenBone process. If the display updating process is single-threaded, then surely one whole CPU core is enough for it?

I'm assuming a quad-core Pi here. Is that not the minimum recommended for this kit?
 
In the "IBM 1620 Jr." project for the Computer History
Museum everything runs in one Pi

What generation of Pi?
 
we need two threads because the
machine cycle-level simulation runs at a 20 microsecond cycle time and
consumes most of one core

The PiDP-8/I simulator takes ~6.5% of one Pi 3B core (1.2 GHz) when throttled to run at 333 kIPS, roughly approximating the speed of a real PDP-8/I.

By "cycle-level" I assume you mean some kind of sub-instruction level, comparable to what the PDP-8 architecture calls "steps," which SIMH doesn't try to emulate. If all else were equal, that means the 1620 simulator has more to do, but a quick scan of the machine's Wikipedia page suggests that it's roughly 10 times slower than a mainstream PDP-8, so the two advantages ought to mostly cancel out.

That's only going to happen if the 1620 simulator is as well-optimized as SIMH's PDP-8 simulator. The CPU instruction decode loop for SIMH's PDP-8 simulator is basically an 1100 line "switch" statement. No subroutine calls, no unnecessary branches. I once added a single quick subroutine call in the middle of this loop, and it measurably hurt performance. I don't remember the magnitude, but it wasn't trivial; I ended up manually inlining that function's code to restore performance.

the display update runs at a 10
millisecond cycle time on a separate core.

I suspect that if you just ran it every 20000 iterations of the instruction decoding loop, on the same thread, you wouldn't notice the overhead, especially if the simulator is throttled down to match the original machine's IPS rate.
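Something like this, I mean (a sketch with made-up function names, not simh code):

/* Sketch of the single-threaded idea, with made-up names -- not simh code. */
#define DISPLAY_EVERY 20000                    /* ~100 Hz at 2 MIPS */

extern void decode_and_execute_one(void);      /* hypothetical CPU step */
extern void update_front_panel(void);          /* hypothetical display refresh */

void run_cpu(void)
{
    unsigned long count = 0;

    for (;;) {
        decode_and_execute_one();
        if (++count % DISPLAY_EVERY == 0)
            update_front_panel();              /* same thread, no locking needed */
    }
}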
 
we designed a
lock-free memory-based communication between the two threads so no
mutex is necessary and there are no data races.


It's the same basic technique as double-buffered bitmapped graphics. There are two "display" buffers, one that the CPU decoding loop writes to, and one that the display update code reads from. About 100 times a second, it zeroes the read-from display and swaps it with the write-to display.

Technically there's a race condition here in that the display could be updating at the time of the swap, giving the equivalent of "tearing" in bitmapped graphics displays, but we put up with it for the same reason that a great many early PC games did. I'm pretty sure that Linux process switching causes bigger display jitters than this tearing does.
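For concreteness, the core of that swap looks roughly like this (a sketch with made-up names; not the 1620 Jr. source):

/* Sketch of the double-buffer idea described above, with made-up names.
 * The CPU loop bumps per-lamp counts in one buffer; the display thread
 * reads the other; a ~100 Hz tick zeroes the read buffer and swaps. */
#include <string.h>

#define NUM_LAMPS 192

static unsigned long counts_a[NUM_LAMPS], counts_b[NUM_LAMPS];
static unsigned long *write_buf = counts_a;    /* CPU loop accumulates here */
static unsigned long *read_buf  = counts_b;    /* display reads from here */

/* called from the CPU loop for each lamp that is lit this cycle */
void lamp_on(int lamp)
{
    write_buf[lamp]++;
}

/* called ~100 times a second: zero the old read buffer, then swap.
 * As noted above, a swap while the display is mid-read just gives a
 * harmless bit of "tearing". */
void swap_buffers(void)
{
    unsigned long *t = read_buf;
    memset(t, 0, sizeof(counts_a));
    read_buf = write_buf;
    write_buf = t;
}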

I've said here before that what I plan to do is implement a method
similar to what we did for the 1620 where the averaging of the state
of each light is performed within the simulation and then the display
update queries that state to get the appropriate intensity value for
each update.

Yes, that's what my PiDP-8/I incandescent lamp simulator does. 

For the PiDP-11, instead of multiple levels, I think you just need a threshold: was this LED mostly on during the last update time? Something like "set_count > instructions_executed / 2".
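In other words, something on the order of this (made-up names, just to show the shape):

/* Sketch of the threshold idea: one bit per lamp, on iff the lamp was lit
 * for more than half of the instructions in the last update interval.
 * Names are made up for illustration. */
#define NUM_LEDS 64

void compute_led_states(const unsigned long set_count[NUM_LEDS],
                        unsigned long instructions_executed,
                        unsigned char led_on[NUM_LEDS])
{
    for (int i = 0; i < NUM_LEDS; i++)
        led_on[i] = (set_count[i] > instructions_executed / 2) ? 1 : 0;
}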
 
That query could be over IPC rather than directly
through memory as we did for the 1620

Have you looked at SIMH's front panel API? It does that.

I haven't tried it myself, since it didn't exist at the time the PiDP-8/I software was originally written, and I haven't been motivated to rip all that up and replace it with a sim_frontpanel-based version.

It's still on my wishlist, though.

Warren Young

May 30, 2019, 12:28:22 AM5/30/19
to [PiDP-11]
On Wednesday, May 29, 2019 at 3:07:18 PM UTC-6, Johnny Billquist wrote:
Threads are not evil.

Did you read the linked paper? It shows mathematically why they cause problems, and how that is inherent to the computation model. It also gives alternatives.

I'd rank threads somewhere between pesticides and pestilence on a utilitarian happiness vs. misery scale.

Threads are one of those last-resort sort of things that we should reach for only when the alternative is greater human misery. I doubt that threshold was reached here.

Here we have a bug that's lasted for...what, half a year? This despite lots of attention from multiple clever people. Was the trade worth it?

Threads are wonderful.

For most applications, the alternatives listed in the linked paper are more wonderful.

Helgrind is a good tool, but a slowdown of 100:1 might be a problem if
you want to run it... :-)

The Valgrind tools are written around a software CPU emulator. That lets these tools monitor the application's behavior at an instruction level without needing kernel-level or special hardware support. The cost is a lot of performance.

I don't use these two tools much — guess why :) — but I'd expect ThreadSanitizer to have a much lower CPU impact. It's certainly true for Valgrind vs. AddressSanitizer.

Bill E

May 30, 2019, 7:43:30 AM5/30/19
to [PiDP-11]
Ah, thread wars. Regardless of various opinions, threads are used all over the place in 'modern' software. I've spent many years dealing with them. When used in isolation, i.e, no interthread communication, they are generally innocuous. Java, for example, is intrinsically multi-threaded. The real problems happen when you need to synchronize across threads, or, shudder, use the absolutely horrendous Java ThreadLocal, storage that is tied to a thread, not to a task. 'Clever' programmers use it and cause nothing but misery, usually because they forget (or don't actually know) that the storage is tied to the thread and they forget to reinitialize it, leading (in my case, a very large multiuser app) to strange behavior because there will be a context change and the new thread user will have data from the old user. Finally, as is apparent here, sharing of data that requires synchronization across threads is highly error-prone. People just don't seem to think in terms of  'this var is shared, so I need to worry about every use of it and expect its value to randomly change in the middle of a statement'.

Anyway, enough philosophy. My change hasn't made anything worse, but because of the rare condition that has to occur to cause the failure, I can't tell if the problem is actually fixed. So, if anyone wants to try it, here's the code fragment. You'll have to do your own editing; I don't want to push the whole file, and I don't want random versions floating around.

In historybuffer.c, make the beginning of historybuffer_get_average_vals() look like this:

   // defaults: all 0
    memset(_this->control->averaged_value_bits, 0, sizeof(_this->control->averaged_value_bits));
    _this->control->averaged_value = 0;

// wje - the lock must be placed before historybuffer_fill() is called.
// Otherwise, it can return an index valid at the time but that becomes invalid
// by the time historybuffer_get() is called because historybuffer_set_val() could have run in between the two
// calls, and could have caused a buffer wrap.
#ifdef USE_MUTEX
    pthread_mutex_lock(&_this->mutex) ; // inhibit concurrent writes
#endif
    last_idx = historybuffer_fill(_this) - 1;
    if (last_idx < 0)
    {
#ifdef USE_MUTEX
        pthread_mutex_unlock(&_this->mutex) ; // allow write
#endif
        return; // buffer empty, return all 0's
    }

    hbe = historybuffer_get(_this, last_idx);

That's it: it just moves the lock to before historybuffer_fill(). Why? fill() uses the start/end pointers.  idx2pos() does tests based on a value returned by fill() and is called by get_average_vals() farther down. Since fill() isn't inside the lock, it can compute an index that is then indirectly invalidated by historybuffer_set_val(), which gets called in the other thread before the lock is taken in get_average_vals(). That's my theory and I'm sticking to it. For now.

Johnny Billquist

May 30, 2019, 8:03:28 AM5/30/19
to pid...@googlegroups.com
On 2019-05-30 06:28, Warren Young wrote:
> On Wednesday, May 29, 2019 at 3:07:18 PM UTC-6, Johnny Billquist wrote:
>
> Threads are not evil.
>
>
> Did you read the linked paper? It shows /mathematically/ why they cause
> problems, and how that is inherent to the computation model. It also
> gives alternatives.

Just because you have someone trying to prove something mathematically,
it does not mean he is right. I am using threads all day long at my
work, and they work just fine thank you.

Just like bumblebees actually fly, although some people mathematically
proved it was impossible.

Besides, threads are just a software construct that gives user programs
the same kind of concurrency and issues you always have to deal with
when writing code for a kernel, where you have multiple cores and
various interrupts to deal with all the time, which have exactly the
same issues. So it's not like this is a new problem, or a problem for
which we don't know there are solutions. So the mathematical proof that
they cause problems is just meaningless. Threads pose problems, but
problems for which we already have solutions. Computing the power of two
is also a problem, but we have a solution for that too. Are you
suggesting we shouldn't do something because it is a problem?

> I'd rank threads somewhere between pesticides and pestilence on a
> utilitarian happiness vs. misery scale.
>
> Threads are one of those last-resort sort of things that we should reach
> for only when the alternative is greater human misery. I doubt that
> threshold was reached here.

And I couldn't disagree more. But if you don't want to deal with them,
you are of course free to. Meanwhile I will continue to use them
happily, and create solutions that make me warm inside.

> Here we have a bug that's lasted for...what, half a year? This despite
> lots of attention from multiple clever people. Was the trade worth it?

We might have different definitions of clever people then. Threads are
not easy to deal with, and you have lots of ways things can go wrong.
The same thing is totally true about C in general, and I think most
people who code should actually be kept away from a compiler.
Or maybe left playing with some safe language that will not allow you to
do all the crazy things C lets you do.

But just because people can't do things right in C does not mean C is a
bad language. It's just a difficult language to handle, and people need
more practice and experience.

But how does one get experience? Usually by trying and trying again. Same
thing is true with threads. Or anything else...

So I don't mind. The PiDP-11 is a hobby project. A perfect place for
people to play around and learn. Personally, it is not important enough
for me to put my time into fixing things. I'll happily let others
continue to play and learn. And yes, that means lots of attention from
multiple people. Which is good. More people get more experience. Which
is part of the whole purpose of the thing, right?

> Threads are wonderful.
>
>
> For most applications, the alternatives listed in the linked paper are
> more wonderful.

Opinions differ.

> Helgrind is a good tool, but a slowdown of 100:1 might be a problem if
> you want to run it... :-)
>
>
> The Valgrind tools are written around a software CPU emulator
> <http://www.valgrind.org/docs/callgrind2004.pdf>. That lets these tools
> monitor the application's behavior at an instruction level without
> needing kernel-level or special hardware support. The cost is a lot of
> performance.
>
> I don't use these two tools much — guess why :) — but I'd expect
> ThreadSanitizer to have a much lower CPU impact. It's certainly true for
> Valgrind vs. AddressSanitizer.

According to the documentation, ThreadSanitizer causes a 5-20x slowdown
(so better than 100x, but still not so much fun), and a 5-10x
memory increase. Which might very well blow up in your face here as well.

But these tools, along with things like Valgrind, are sometimes very useful
nonetheless.

Johnny

Geoffrey McDermott

May 30, 2019, 10:15:32 AM5/30/19
to [PiDP-11]
"Just because you have someone trying to prove something mathematically,
it does not mean he is right. I am using threads all day long at my
work, and they work just fine thank you.

Just like bumblebees actually fly, although some people mathematically
proved it was impossible."



That comment touched a nerve about mathematical proofs defining the real world.......the ENTIRE premise of the proof about the bumblebee was flawed from the start, maybe because of a misunderstanding of the actuality of aerodynamics, or because the calculations needed to prove what was being attempted were much too complex, but the result was the same.

The ONLY reason that the result was as stated was the incorrect assumption that a bumblebee flew with unmovable wings, like a conventional aircraft......the bee in question moves its wings back and forth, so it doesn't depend on forward motion through the air to generate lift.

The fact that this 'proof' continues to be assumed correct after all this time (70+ years) really pisses me off.

Mathematical proofs are like statistics.....you can prove ANYTHING with enough manipulation in either case.

END OF TIRADE...................

Johnny Billquist

May 30, 2019, 10:43:12 AM5/30/19
to pid...@googlegroups.com
Not disagreeing with anything you said. I just want to explicitly point
out that I did not think that the mathematical proof of (non)flying
bumblebees was correct (it obviously is not). I think it was also not so
long ago acknowledged that the proof was in fact incorrect, and should
be passed on to the annals of silly mistakes in history, and not be
considered current anymore.

But I like bringing it up when people think that they have a
mathematical proof for something, to remind people that such proofs are
maybe not that good proofs to start with. Mathematics is great, but
proofs in some context are much more a reflection of what the author
wants to prove rather than some objective stated problem and solution.

Which comes down to the same point you make in your final paragraph.

And for anyone reading through the proof about threads in the paper
referenced, it's by now a rather old paper (written in 2006), and the
world has moved on quite a bit since then; some of the claims in that
paper are a bit outdated, some are a bit incorrect, and the conclusions
and wishes do not at all, in fact, reflect what has actually happened
since. The person writing it is still active, but that whole research
group seem to have slowed down over the years, and in my view is not
really producing or publishing much that is relevant or interesting. (Of
course, that is a very personal opinion.) I could probably ramble on
quite a lot about what they have done, and how, but I think I should
stop while I'm ahead, and also acknowledge that this forum is not the
right place for this discussion.

sunnyboy010101

May 30, 2019, 11:08:17 AM5/30/19
to [PiDP-11]
Johnny,

I completely agree with you.

A couple of points to add to the discussion from one who has programmed since 1979...

1. This bug has taken some time to find and fix simply because, as with many bugs, it is triggered very infrequently in the running system. If the bug takes weeks to appear, the whole 'find-fix-test' cycle will be similarly extended. This bug took a long time not because the participants did anything 'not clever', but simply because there were few ways to speed up the process.

2. I am of the opinion that many of our "modern" languages were developed simply because the majority of people using C back in the 80's and 90's were utterly incapable of coding without shooting themselves in the foot.

The college I worked at in 2000 switched from C to Java as its primary coding language because industry (who hired our grads) demanded the shift: they could not hire quality entry-level C programmers and had therefore switched to Java themselves.

I wrote C for years (still do when I need low-level stuff that works), and absolutely love it.

sunnyboy010101

unread,
May 30, 2019, 11:59:13 AM5/30/19
to [PiDP-11]
Done and compiled. I'll let everyone know how it goes. I'm running 2.11BSD as my PiDP11 OS.

I made a mistake in my post on May 22. The files are in '/opt/pidp11/src/07.0_blinkenlight_api', not '07.1_blinkenlight_server'.

Bill E

unread,
May 30, 2019, 2:42:12 PM5/30/19
to [PiDP-11]



Ok, in the spirit of providing a modern solution and stopping any flame war, here's my new, current, totally PC solution:

Remove the threading. The sender task can do an HTTP POST update to the cloud, I suggest an Amazon Cloud virtual instance.
Then, the display task can do a GET from the cloud server, looking for the local ID. Of course, both sides will have to do HTML encoding. But, cycles are free, right? No overhead.

This will guarantee portability and consistency. Don't worry about latency, that's just some bourgeois obfuscation. And, you can use 200 different 3rd-party libs, each with undocumented dependencies on a vast number of other 3PP libs, to implement this elegant solution.

No more crashes. Well, ok, your panel might not update all that often, but what do you want? Usability? Highly overrated.

sunnyboy010101

unread,
May 30, 2019, 2:54:07 PM5/30/19
to [PiDP-11]
Great idea Bill. At the same time we could rewrite it all in Java and use a SOAP interface for the API. :-D

Stephen Casner

unread,
May 30, 2019, 4:03:25 PM5/30/19
to Warren Young, [PiDP-11]
On Wed, 29 May 2019, Warren Young wrote:
> On Wednesday, May 29, 2019 at 11:04:17 AM UTC-6, Stephen Casner wrote:
> > On Wed, 29 May 2019, Warren Young wrote:
> >
> > > Why are threads being used in the first place?
> >
> > One motivation is to take advantage of multiple cores for
> > increased performance.
>
> ... If the display updating process is single-threaded, then surely
> one whole CPU core is enough for it?

I was simply speaking generally, that if some task requires more oomph
than one core can provide, then splitting the task into multiple
threads allows using multiple cores.

> > In the "IBM 1620 Jr." project for the Computer History
> > Museum everything runs in one Pi
>
> What generation of Pi?

3B

> > we need two threads because the machine cycle-level simulation
> > runs at a 20 microsecond cycle time and consumes most of one core
>
> The PiDP-8/I simulator takes ~6.5% of one Pi 3B core (1.2 GHz) when
> throttled to run at 333 kIPS, roughly approximating the speed of a
> real PDP-8/I.
>
> By "cycle-level" I assume you mean some kind of sub-instruction level,
> comparable to what the PDP-8 architecture calls "steps," which SIMH doesn't
> try to emulate. If all else were equal, that means the 1620 simulator has
> more to do, but a quick scan of the machine's Wikipedia page suggests that
> it's roughly 10 times slower than a mainstream PDP-8, so the two advantages
> ought to mostly cancel out.

The 1620 panel contains approximately 192 lights, many of which
reflect the state of internal logic gates in the machine. In order to
update those lights accurately, the simulation basically needs to
implement the same logic as in the flow sequence diagrams that specify
the operation of the computer.

> > the display update runs at a 10 millisecond cycle time on a
> > separate core.
>
> I suspect that if you just ran it every 20000 iterations of the
> instruction decoding loop, on the same thread, you wouldn't notice
> the overhead, especially if the simulator is throttled down to match
> the original machine's IPS rate.

No, because the display update takes 2 milliseconds to execute, so
that would distort the display for the cycle at which the update
occurred. The display update uses I2C to feed 12 LED driver chips
each of which drives 16 LEDs at a variable intensity using separate
PWM for each LED. It might be possible to divide up the display
update work into little pieces to be done in each 20 microsecond
machine cycle time, but it is much easier to maintain timing accuracy
with separate threads running on separate dedicated cores.

The simulator thread does throttle the execution rate by waiting until
the end of the 20 microsecond interval after it finishes the work of a
cycle.
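
In rough C, the pacing looks something like this. This is just a
sketch, not our actual 1620 Jr. source; simulate_one_cycle() stands in
for the real per-cycle work:

#define _POSIX_C_SOURCE 200809L
#include <time.h>

#define CYCLE_NS 20000L                  /* 20 microseconds per machine cycle */

/* Placeholder for the real per-cycle simulation work. */
static void simulate_one_cycle(void) { }

static void add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    if (t->tv_nsec >= 1000000000L) {
        t->tv_nsec -= 1000000000L;
        t->tv_sec++;
    }
}

void simulator_thread(void)
{
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);

    for (;;) {
        simulate_one_cycle();            /* do this cycle's work */
        add_ns(&deadline, CYCLE_NS);     /* advance the absolute deadline */
        /* Sleep until the deadline; an absolute deadline avoids drift
           accumulating across cycles. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    }
}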

> > we designed a
> > lock-free memory-based communication between the two threads so no
> > mutex is necessary and there are no data races.
>
> I did that for the PiDP-8/I as well.
>
> It's the same basic technique as double-buffered bitmapped graphics. There
> are two "display" buffers, one that the CPU decoding loop writes to, and
> one that the display update code reads from. About 100 times a second, it
> zeroes the read-from display and swaps it with the write-to display.

That is similar to the technique we use with two buffers. The
simulation accumulates counts of cycles during which each light should
be ON as well as a total count of the cycles. The display intensity
is the ratio of those two counts. The simulation works on the
currently active buffer, but which one is active is determined only by
the display thread. The display thread clears the counts as it
consumes the inactive buffer, then swaps. It does not matter if there
is some variation in the number of cycles that a buffer is active.

> Technically there's a race condition here in that the display could be
> updating at the time of the swap, giving the equivalent of "tearing" in
> bitmapped graphics displays, ...

We easily avoid that problem by having the display update thread delay
at least 20 microseconds (by doing other work) between the time it
switches which buffer is active and when it begins consuming the
data. That gives time for the simulation cycle to complete its update
of the buffer that was active at the time the cycle started.
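
A rough sketch of that two-buffer scheme in C, with placeholder names
(NLIGHTS, light_on(), set_intensity()) rather than our actual code:

#define _POSIX_C_SOURCE 199309L
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define NLIGHTS 192                      /* panel lights (placeholder count) */

/* Hypothetical stand-ins for the real simulation state and LED driver. */
static int  light_on(int i)                { (void)i; return 0; }
static void set_intensity(int i, double v) { (void)i; (void)v; }

struct lightbuf {
    uint32_t on_count[NLIGHTS];   /* cycles during which each light was ON */
    uint32_t total_cycles;        /* total cycles accumulated in this buffer */
};

static struct lightbuf buf[2];
static atomic_int active = 0;     /* which buffer the simulation writes to */

/* Simulation thread: called once per 20 us machine cycle. */
void accumulate_lights(void)
{
    struct lightbuf *b =
        &buf[atomic_load_explicit(&active, memory_order_acquire)];
    for (int i = 0; i < NLIGHTS; i++)
        if (light_on(i))
            b->on_count[i]++;
    b->total_cycles++;
}

/* Display thread: called every ~10 ms. */
void refresh_display(void)
{
    int old = atomic_load_explicit(&active, memory_order_relaxed);

    /* Flip which buffer the simulation writes to from now on. */
    atomic_store_explicit(&active, old ^ 1, memory_order_release);

    /* Give the machine cycle in flight at the swap at least 20 us to
       finish writing the old buffer ("other work" in our code, a short
       sleep in this sketch). */
    struct timespec delay = { 0, 20000 };
    nanosleep(&delay, NULL);

    struct lightbuf *b = &buf[old];
    for (int i = 0; i < NLIGHTS; i++) {
        double intensity = b->total_cycles
            ? (double)b->on_count[i] / b->total_cycles : 0.0;
        set_intensity(i, intensity);      /* drive this LED's PWM level */
        b->on_count[i] = 0;               /* clear for the next swap */
    }
    b->total_cycles = 0;
}

The only shared variable is the active index: the simulation only reads
it and the display thread only writes it, and the post-swap delay
covers the one cycle that may still be updating the old buffer.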

> > of each light is performed within the simulation and then the display
> > update queries that state to get the appropriate intensity value for
> > each update.
>
> Yes, that's what my PiDP-8/I incandescent lamp simulator does.

The 1620 also used incandescent bulbs, so we are managing the change
in intensity in a manner similar to what you implemented.

> For the PiDP-11, instead of multiple levels, I think you just need a
> threshold: was this LED mostly on during the last update time? Something
> like "set_count > instructions_executed / 2".

I'm guessing that would still show too much flicker. That's the
problem with the existing PiDP11 software that motivates me to try a
different implementation. Plus I think I could make it do something
more realistic for the micro-ADR display (not truly accurate, but more
representative).

> > That query could be over IPC rather than directly
> > through memory as we did for the 1620
>
> Have you looked at SIMH's front panel API? It does that.

To drive the address and data lights more realistically I would like
to accumulate counts for the lights being on or off in each memory
cycle of an instruction. I don't think the FP API can provide that.

-- Steve

Warren Young

unread,
May 30, 2019, 11:45:42 PM5/30/19
to [PiDP-11]
On Thursday, May 30, 2019 at 6:03:28 AM UTC-6, Johnny Billquist wrote:

Just like bumblebees actually fly, although some people mathematically
proved it was impossible.

The problem with that analogy is that mathematics doesn't tell nature how to behave.

A better example is string theory, where we have amazingly complicated mathematics that may or may not have any connection to reality, and we don't know which is the case, because string theory is physically untestable within our present capabilities.

The thing is, computation is so far removed from the underlying physics that for most programmers, it's irrelevant. Computation is closer to pure mathematics from a practical standpoint.

I can give you the physics explaining how an optical mouse works, and how the USB interface works, and how the transistors within that USB controller work, but that doesn't tell you why Chrome and Firefox behave differently when you click the mouse at the same spot in the web page. There is no physics equation for web browsers that would tell us, unequivocally, which browser is correct, because there is no physics that constrains the behavior of web browsers at that level. A browser is a mathematical construct that behaves however its creators designed it.

So, when someone comes up with a mathematical argument for why a given computing construct is bad, I think the only possible refutation is a different mathematical argument.

Besides, threads are just a software construct that gives user programs
the same kind of concurrency and issues you always have to deal with
when writing code for a kernel, where you have multiple cores and
various interrupts to deal with all the time, which have exactly the
same issues.

That's why I rank threads on a continuum of evil: there are some evils — like pesticides, in my example — that result in more happiness than sorrow. Most of the planet's population would literally starve to death without pesticides, so it's worth putting up with their problems.

My general rule is that threads are best used at the infrastructure level only. Your OS kernel can be threaded, your language runtime can be threaded, your database can be threaded...but your high-level application probably shouldn't be. There is usually a safer alternative that will also work.

Even then, it should only be done for infrastructure code that is so widely used that it can attract sufficient resources to correctly do the locking and such required for proper threading. MySQL should be multi-threaded, but your hobby DBMS project shouldn't.

Threads pose problems, but
problems for which we already have solutions.

Yes, whole books full of them, which is the problem: threads are a technology that requires domain mastery to apply competently in any but the most trivial cases.

I have no problem with a world where we have a small number of excellent thread masters running around doing their magic. The problem is when we teach third-year CS majors about threads and say "This is how you do multi-processing." We should instead be teaching them the actor model, message busses, promises and futures...

Computing the power of two
is also a problem, but we have a solution for that too. Are you
suggesting we shouldn't do something because it is a problem?

False equivalency. Complete and correct solutions to exponentiation — any base, not just 2 — were set down in stone by experts decades ago, to the point where the majority of computing devices above a particular threshold of complexity have one of these solutions built into their silicon. For CPUs too small to have a built-in FP unit, we have code libraries to do it in software which were well-debugged decades ago as well. No one gets this wrong any more. We just call the built-in routine and assume the answer is correct, within the precision of the number system used.

Where is the equivalent in threading?

> Here we have a bug that's lasted for...what, half a year? This despite
> lots of attention from multiple clever people. Was the trade worth it?

We might have different definitions of clever people then. Threads are
not easy to deal with, and you have lots of ways things can go wrong.

I can't decide if you're actually agreeing with me or making a No True Scotsman argument.

While the PiDP-11 project undoubtedly attracts people who only want it as a really fancy Raspberry Pi enclosure, it should also be attracting some of the best-trained, most experienced, and most capable people in computing. If this community hasn't got the expertise to quickly squish such bugs, why should the rest of the computer science world be choosing threads as their first resort for multi-processing and concurrency problems?
 
But just because people can't do things right in C does not mean C is a
bad language. It's just a difficult language to handle, and people need
more practice and experience.

But how do one get experience? Usually by trying and trying again. Same
thing is true with threads. Or anything else...

I agree. Where I think we might disagree is how many people should be so-trained.

Few programmers these days have written their own balanced tree implementation, and if they have, it was probably in college as an assignment, not for a production system. The vast majority of programs that use such a thing are based on whatever's provided by the language runtime, and the programmer doesn't care how it's actually implemented. For most programs, it doesn't even matter if it's a hash table instead of a balanced tree. We just call it a dict, or a map, or a hash, and move on.

Why should threads be different?

Warren Young

unread,
May 31, 2019, 12:21:00 AM5/31/19
to [PiDP-11]
On Thursday, May 30, 2019 at 9:08:17 AM UTC-6, sunnyboy010101 wrote:

1. This bug has taken some time to find and fix simply because, as with many bugs, it is triggered very infrequently in the running system. If the bug takes weeks to appear, the whole 'find-fix-test' cycle will be similarly extended.

Certainly true.

But if it takes a week to happen, then over 6 months it will happen ~26 times to each competent programmer observing it. The "many eyes" have seen this bug happen thousands of times now, collectively.

The really hard bugs are those that can't be replicated, but we haven't got that problem here.

2. I am of the opinion that many of our "modern" languages were developed simply because the majority of people using C back in the 80's and 90's were utterly incapable of coding without shooting themselves in the foot.

All right, so we've had decades to shunt all of the incompetents off to other languages, leaving only competent programmers still using C. What excuse do we now have for continuing to shoot ourselves in the foot? ("We," because I'm also a C and C++ programmer, coming up on 3 decades now.)

If your response is that some people still shouldn't be using C, I think you're on a path that leads to a No True Scotsman argument: only competent C programmers should use the language, and those who perpetrate hard-to-find bugs are incompetent, so only this unachievable ideal programmer should use C.

You have only to contemplate the endless stream of security patches for a modern OS to shake belief in such an argument.

The college I worked at in 2000 switched from C to Java as its primary coding language because industry (who hired our grads) demanded the shift: they could not hire quality entry-level C programmers and had therefore switched to Java themselves.

Yes, and...?

Today, Java is one of the most popular programming platforms. Almost every app on an Android smartphone is written in a dialect of Java, and it's very popular in the enterprise world, which employs a huge number of working software developers. If you don't like Java the language, there are now a whole bunch of alternatives. The current hotness is Kotlin, most recently preceded by Clojure.

I'm not a particular fan of the Java platform myself, but your university department made the right bet.

Contrast the universities who at the same time continued to try pushing Lisp, Eiffel, Pascal, ML...

I'm a better programmer for having learned several languages I'll likely end up writing only a few thousand lines of code in, lifetime total. But if we're arguing about whether CS grads are better off learning C vs Java, I'm not playing. It's about as much worth arguing over as the result of the next coin toss.

Warren Young

unread,
May 31, 2019, 2:11:25 AM5/31/19
to [PiDP-11]
On Thursday, May 30, 2019 at 12:42:12 PM UTC-6, Bill E wrote:

Remove the threading. The sender task can do an HTTP POST update to the cloud, I suggest an Amazon Cloud virtual instance.
Then, the display task can do a GET from the cloud server, looking for the local ID. Of course, both sides will have to do HTML encoding. But, cycles are free, right? No overhead.

I see your straw man and raise you an open-source benchmark:


I get about 30000 messages per second on my Pi 3B (not 3B+) running the TCP benchmark with a 404 byte message, which is the size of struct display in the current PiDP-8/I code. We only need to send updates to the display about 1/300 that fast to fool the eye into believing it's getting a continuous update.

If that's still too much overhead, the pipe(2) version of the benchmark pushes about a quarter million messages per second here.
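
If anyone wants to reproduce the flavor of that measurement, here is a
rough, self-contained pipe(2) sketch (not the actual benchmark code)
that streams 404-byte messages from a "simulator" process to a
"display" process and prints the rate:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define MSG_SIZE 404                 /* matches the struct size quoted above */
#define NMSGS    100000

int main(void)
{
    int fd[2];
    if (pipe(fd) != 0) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {                  /* child: the "simulator" sender */
        char msg[MSG_SIZE];
        memset(msg, 0xAA, sizeof msg);
        close(fd[0]);
        for (int i = 0; i < NMSGS; i++)
            if (write(fd[1], msg, sizeof msg) != (ssize_t)sizeof msg) {
                perror("write");
                _exit(1);
            }
        close(fd[1]);
        _exit(0);
    }

    /* parent: the "display" receiver */
    char msg[MSG_SIZE];
    size_t have = 0;
    long received = 0;
    ssize_t n;
    struct timespec t0, t1;

    close(fd[1]);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd[0], msg + have, sizeof msg - have)) > 0) {
        have += (size_t)n;
        if (have == sizeof msg) { received++; have = 0; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd[0]);
    wait(NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld messages in %.3f s (%.0f msgs/s)\n",
           received, secs, received / secs);
    return 0;
}

Build it with something like "cc -O2 pipetest.c" and run it on the Pi;
the exact number will of course vary with kernel and load.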

The problem with your cloud strawman is that we don't have to deal with Internet latencies or network bandwidths for the PiDP-11. It's all IPC on the same hardware.

You might wonder why I've even bothered to consider the TCP case: it lets you run the simulator and the front panel programs on different computers. That not only lets the simulator run faster by using a quicker host than the Raspberry Pi, it also offloads the simulator's load from the Pi, so there is less context-switching overhead adding jitter to the display.

This will guarantee portability and consistency.

POSIX's proved pretty portable over the decades. :)
 
Don't worry about latency, that's just some bourgeois obfuscation.

Kernel syscall overhead isn't zero, as this benchmark shows, but I think we can afford it in this case.
 
And, you can use 200 different 3rd-party libs, each with undocumented dependencies on a vast number of other 3PP libs, to implement this elegant solution.

Or you could call pipe(2). :)