Hi Ben,
** Short Answer: think twice before porting PLN to it; you may want to rethink how PLN works first. And we actually have another (very?) interesting option.
** Long Answer (with a quasi-happy ending):
What they call a "graph" is not the same thing as what we call a "graph". What they call a "graph" is actually "memory access, multiply-add, another memory access, repeat" (see pages 9 and 12 of the PDF you attached). By contrast, what we call a graph is "memory access, integer compare, another memory access, repeat". Which maybe sounds similar, but it's really very different.
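To make the contrast concrete, here is a toy sketch of the two inner loops. Everything here is invented for illustration (the function names, the `Edge` structure); it is not from the attached PDF, just the shape of the two access patterns:

```python
from collections import namedtuple

# Hypothetical edge record for the "our kind of graph" case.
Edge = namedtuple("Edge", ["label", "destination"])

def nn_style_kernel(weights, inputs):
    """Their "graph": regular, sequential memory access feeding multiply-adds."""
    total = 0.0
    for w, x in zip(weights, inputs):  # predictable, streaming access
        total += w * x                 # multiply-add, repeat
    return total

def graph_style_kernel(neighbors, vertex, target_label):
    """Our graph: memory access, integer compare, another memory access."""
    for edge in neighbors.get(vertex, []):  # pointer-chasing access
        if edge.label == target_label:      # integer compare, repeat
            return edge.destination
    return None
```

The first loop is what the IPU's hardware is built for; the second is what the pattern matcher (below) spends its time doing.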
** The IPU in review:
So, consider neural-net gradient descent, implemented via back-propagation. Those algos march over memory in a quasi-regular way (they are vectors, after all), doing lots of multiplies and adds, maybe a few if-statements (aka "integer compares"), and a small handful of other integer operations (jump, increment some loop counter, test to see if done). Very, very few subroutine calls. The PDF says "300MB In-Processor-Memory(TM)", so it sounds like if your NN weight matrix fits in 300MB, you can do gradient descent at... RAM speed. Whatever that is. RAM speed is always a bottleneck, and it depends on how they designed it. 300MB is in the ballpark of an L2 cache, and L2 cache tends to be SRAM, not DRAM, running at 1/10th the speed of the CPU core. So much, much faster than DRAM, but slower than the CPU. Anyway, really freaking fast if you can make it fit. If it doesn't fit, then they have a very impressive switching fabric, which is really nice, because sooner or later you do have to move training data in and out. When things don't fit, the switch fabric gets the missing parts into place pretty damned fast. The chart on page 15 looks honest and 100% believable to me: dark blue is the multiply-add loop, and yellow/light-blue is moving in the next block of training data (moving around the weight vectors? Whatever.)
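Back-of-envelope arithmetic on that "fits in 300MB" claim. This assumes 4-byte fp32 weights and binary megabytes, both of which are my assumptions, not the PDF's (if the chip uses fp16, double these counts):

```python
import math

# The claimed 300MB of In-Processor-Memory, in bytes (binary megabytes assumed).
budget_bytes = 300 * 1024 * 1024

# How many fp32 (4-byte) parameters fit in that budget.
param_budget = budget_bytes // 4

# The largest square dense weight matrix (n x n) that fits.
n_max = math.isqrt(param_budget)

print(param_budget, n_max)  # 78643200 8868
```

So roughly 78 million weights, or one dense layer a bit under 9000 x 9000, before you start leaning on the switching fabric.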
** The MMU and why the IPU doesn't have one:
It's called an IPU and not a CPU because there's no MMU. Which is good and bad. It's good for hardware because MMU's are big, complex, bulky bottlenecks between the CPU and slow, slow memory (because DRAM memory is slow. Fact of life. Deal with it). It's bad because MMU's make programming really, really easy: programmers do not have to think about the physical location of their data. Any idiot with --visual basic-- python skills can write a program. Like teenagers. Smart pre-teens. The good news is that several decades of industry experience with GPU's means that the industry possesses really clever compilers and really clever libraries that can hide most of the pain of not having an MMU. It's almost fully automated. But it still requires pros who know what they're doing. TL;DR: the pros only have to port tensorflow to it once. The hardware savings on MMU complexity is huge! It eliminates lots of awkward bus and data-transfer silicon design with lots of ugly timing chains (google up Spectre and Meltdown for a flavor of modern MMU design mad skillz).
The IPU is a super-duper neural net machine, no doubt. I'm quite sure the M87-Sag A* Event-Horizon astronomers will love it too. And lots of supercomputer, weather-simulation, nuclear-bomb buffs, too.
** The pattern matcher in review:
Compare this to the pattern matcher, which is pure-integer, with no floating-point in it. The pattern matcher is conceptually very simple (practical considerations make it complex). It compares two graphs, side by side. At every vertex, it needs to compare edges -- all possible permutations of edges ("choices", actually, not permutations; there are usually fewer choices, but still more than one). Since any one of those choices might be the right one, one must save (push) the current state onto the stack, and then continue exploration at the next vertex. Repeat until match or no more choices; then pop the stack once and explore the next choice. So almost all CPU cycles are spent pushing and popping, and each push-pop is at least one subroutine call, if not 5 or 8. Each push-pop is a lot of memory access, to save the current state.
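Here is a minimal sketch of that push/pop loop, matching a small pattern graph against a data graph. The representation (`{vertex: set_of_neighbors}`) and all the names are invented for this toy; the real pattern matcher is vastly more complicated, but the cycle budget looks the same:

```python
def match(pattern, data):
    """Find a pattern->data vertex map that preserves edges, or None.

    Both graphs are dicts mapping a vertex to the set of its neighbors.
    Recursion plays the role of the explicit push/pop stack: each
    recursive call "pushes" the current partial assignment, and each
    failed choice "pops" back to try the next candidate."""
    def extend(assignment, remaining):
        if not remaining:
            return assignment          # every pattern vertex is matched
        v = remaining[0]
        for candidate in data:         # each candidate is one "choice"
            if candidate in assignment.values():
                continue               # already used for another vertex
            # Integer-compare step: do v's already-assigned neighbors
            # map onto neighbors of the candidate?
            ok = all(assignment[n] in data[candidate]
                     for n in pattern[v] if n in assignment)
            if ok:
                # "Push": recurse with the extended state.
                result = extend({**assignment, v: candidate}, remaining[1:])
                if result is not None:
                    return result
                # "Pop": fall through and explore the next choice.
        return None
    return extend({}, list(pattern))
```

Notice there is not a single floating-point operation anywhere in it: it is all hash lookups, set membership tests, and call-stack traffic.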
Can this run fast on the IPU? Sure, maybe even really fast, if it fits into 300MB. And doing 1216 of these in parallel is kick-butt. What happens if our graphs don't fit in 300MB? I have no clue, because I don't know where the rest of the graph is located, I don't know how to find it, and I don't know how to stuff it into 300MB. Maybe this is solvable. Maybe. If I can't hallucinate a good solution in 5 minutes, it means it's hard. It requires some hard thought and clever invention.
Sadly, there's no floating-point in the pattern matcher, so all those whizzy floating-point units in the IPU are sitting idle. Which is a shame. Really a shame.
What about PLN? Well, today's PLN, built on the pattern matcher, will run thousands of CPU cycles and then do a small handful of floating-point ops. The good news: almost all PLN pattern matches are really quite small (that's good: it should be fast), and of course all the current PLN demos fit into 300MB. I currently have no clue how to apply PLN to large datasets.
Are there other ways to think of PLN that do less searching and more multiplying? Can you swap inner and outer loops somehow? Nil might like to ponder this.
** An alternative that I personally find exciting:
The language-learning pipeline I built doesn't use the pattern matcher. At all. What it does do is ... large quantities of multiply-adds. Immense quantities of them. Lots of vector and matrix multiplies. Lots of summations (multiply-adds in a loop). So it's more or less a totally boring vector-matrix library, with one huge difference: the matrices are extremely sparse -- one-in-a-billion non-zero entries. And that means it is impossible to store those zeroes in any conventional form. One MUST store only the non-zero values. In graphs! Which is why I wrote it for the atomspace, instead of SciPy or GnuR or PyWhatever. My clustering code, the code that tries to do word-sense disambiguation, is totally bottlenecked doing multiply-adds and then a small handful of pointer chases to find the next numbers to multiply-add.
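The core operation looks something like this sketch, using a plain dict of `{column: value}` as a stand-in for Values hanging off Atoms. The names are invented for illustration; the point is the mix of pointer chases and multiply-adds, with the implicit zeroes never stored or touched:

```python
def sparse_dot(row_a, row_b):
    """Dot product of two sparse rows stored as {column: value} dicts.

    Only non-zero entries exist; the billions of implicit zeroes
    contribute nothing and cost nothing."""
    # Iterate over the smaller row, so work scales with the sparser side.
    if len(row_a) > len(row_b):
        row_a, row_b = row_b, row_a
    total = 0.0
    for col, value in row_a.items():
        other = row_b.get(col)      # pointer chase to find the partner entry
        if other is not None:
            total += value * other  # the multiply-add
    return total
```

Unlike the pattern matcher, this loop actually has floating-point work in it for the IPU's units to chew on.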
I'm pretty sure it could run pretty well on the IPU. Maybe even really well. I think it would be "easy" to slice it so that each slice fits in 300MB, and have it work without thrashing. I'm optimistic.
** To conclude:
I would recommend porting this to the IPU immediately, except for one minor gotcha: so far, I've personally failed to convince you personally that my algos really can do word-sense disambiguation, and that the resulting grammars really are correct, and I'm distracted by other concerns and making zero progress on such a proof. Meanwhile, Anton's team is still quite lost, dazed and confused about the issues. I think they've made some good forward progress, but so far they've only worked on the easy stuff. The hard stuff is still ahead of them, and I don't think they understand this yet, and I don't think they are prepared to tackle it. They're at the foothills of a mountain. They struggled to get up the foothills, they haven't gotten past the treeline, and they don't yet realize what's up there. So, without any actual proof that my claims are true, I'm 95% sure you're not willing to commit to this (i.e. porting ultra-super-sparse matrix math to the IPU's).
(To be clear: that would be my proposal: ultra-super-sparse matrix-math to the IPU's. I find it convenient to store those numbers as Values on Atoms, but that is a convenience, not a necessity. So I am NOT proposing a port of the atomspace to the IPU's. Just to be clear about that. (Although I would want a bulk-copy of the floats to-from the atomspace, because the atomspace is still the correct data-structure for long-term knowledge store. ))
So that is my happy ending. We can leverage IPU's, just not in the traditional opencog architecture.
-- Linas