Atomspace & Pattern Matcher on Graphcore IPU?


Ben Goertzel

Apr 7, 2019, 10:36:22 AM4/7/19
to Linas Vepstas, mandeep bhatia, Vitaly Bogdanov, opencog
I'm reading about Graphcore and their IPU graph processors...

I wonder if it would be a good idea to make the Atomspace and the Pattern Matcher run on their IPUs using their Poplar libraries? A short rundown on the IPU and Poplar is attached to this email...

Obviously there are a lot of other priorities. What I'm asking now is, for those who know more than I do about hardware: do you think we would get dramatic speedups by porting the Atomspace/PM to this hardware [understanding it would be a lot of work to do so...]?

To tell for sure would require more digging obviously, but I'm trying to form a clear opinion of whether that digging is even worthwhile...

thanks
Ben


--
Ben Goertzel, PhD
http://goertzel.org

"Listen: This world is the lunatic's sphere,  /  Don't always agree it's real.  /  Even with my feet upon it / And the postman knowing my door / My address is somewhere else." -- Hafiz

Linas Vepstas

Apr 15, 2019, 11:53:06 PM4/15/19
to Ben Goertzel, mandeep bhatia, Vitaly Bogdanov, opencog
Hi Ben,

** Short Answer: you might want to think about porting PLN to it.  You may want to rethink how PLN works, first.  And we actually have another (very?) interesting option. 

** Long Answer (with a quasi-happy ending): 
What they call a "graph" is not the same thing as what we call a "graph". What they call a "graph" is actually "memory-access, multiply-add, another memory access, repeat" (see pages 9 and 12 of the PDF you attached). By contrast, what we call a graph is "memory-access, integer compare, another memory access, repeat". Which maybe sounds similar, but it's really very different.

** The IPU in review:
So, consider neural-net gradient descent, implemented via back-propagation. Those algos march over memory in a quasi-regular way (they are vectors, after all), doing lots of multiplies and adds, maybe a few if-statements (aka "integer compares") and a small handful of other integer operations (jump, increment some loop counter, test to see if done). Very, very few subroutine calls. The PDF says "300MB In-Processor-Memory(TM)", so it sounds like if your NN weight matrix fits in 300MB, you can do gradient descent at... RAM speed. Whatever that is. RAM speed is always a bottleneck, and it depends on how they designed it. 300MB is in the ballpark of L2 cache, and L2 cache tends to be SRAM, not DRAM, running at about 1/10th the speed of the CPU core. So much, much faster than DRAM, but slower than the CPU. Anyway, really freaking fast if you can make it fit. If it doesn't fit, then you have a very impressive switching fabric, which is really nice, because sooner or later you do have to move training data in and out. When things don't fit, the switch fabric gets the missing parts into place pretty damned fast. The chart on page 15 looks honest and 100% believable to me: dark blue is the multiply-add loop, and yellow/light-blue is moving in the next block of training data (moving around the weight vectors? Whatever.)
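The memory-access pattern described above -- march over a weight matrix doing multiply-adds, with only a few integer ops for loop control -- can be sketched in a few lines (plain Python, purely illustrative; the function name and shapes are made up for this sketch):

```python
# Minimal sketch of the dense multiply-add loop that dominates
# back-propagation: regular memory access, floats, very few branches.
def sgd_step(weights, inputs, grad_out, lr=0.01):
    """One gradient-descent update for a single dense layer.

    weights:  list of rows (each a list of floats)
    inputs:   activations feeding the layer
    grad_out: gradient of the loss w.r.t. the layer's outputs
    """
    for i, row in enumerate(weights):      # loop counter, test-if-done
        g = grad_out[i]
        for j in range(len(row)):          # the inner multiply-add loop
            row[j] -= lr * g * inputs[j]   # the only "real" work
    return weights
```

This is exactly the kind of loop that runs at memory speed when the whole weight matrix fits in on-chip RAM.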

** The MMU and why the IPU doesn't have one:
It's called an IPU and not a CPU because there's no MMU. Which is good and bad. It's good for hardware because MMUs are big, complex, bulky bottlenecks between the CPU and slow, slow memory (because DRAM memory is slow. Fact of life. Deal with it). It's bad because MMUs make programming really, really easy: programmers do not have to think about the physical location of their data. Any idiot with --visual basic-- python skills can write a program. Like teenagers. Smart pre-teens. The good news is that several decades of industry experience with GPUs means that the industry possesses really clever compilers and really clever libraries that can hide most of the pain of not having an MMU. It's almost fully automated. But it still requires pros who know what they're doing. TL;DR: the pros only have to port tensorflow to it once. The hardware savings on MMU complexity are huge! It eliminates lots of awkward bus and data-transfer silicon design with lots of ugly timing chains (google up Spectre and Meltdown for a flavor of modern MMU design mad skillz).

The IPU is a super-duper neural net machine, no doubt. I'm quite sure the M87-Sag A* Event-Horizon astronomers will love it too. And lots of supercomputer, weather-simulation, nuclear-bomb buffs, too.

** The pattern matcher in review:
Compare this to the pattern matcher, which is pure-integer, with no floating-point in it. The pattern matcher is conceptually very simple (practical considerations make it complex). It compares two graphs, side by side. At every vertex, it needs to compare edges -- all possible permutations of edges ("choices", actually, not permutations; there are usually fewer choices, but still more than one). Since any one of those choices might be the right one, one must save (push) the current state onto the stack, and then continue exploration at the next vertex. Repeat until a match or no more choices, then pop the stack once and explore the next choice. So almost all CPU cycles are spent pushing and popping, and each push-pop is at least one subroutine call, if not 5 or 8. Each push-pop is a lot of memory access, to save the current state.
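The push-explore-pop loop described above can be sketched as a toy backtracking matcher (all names hypothetical; the real pattern matcher is vastly more involved, but the compare/push/pop shape is the same):

```python
# Toy backtracking subgraph matcher: integer compares and state
# push/pop dominate; there is no floating-point work at all.
def match(pattern_edges, data_edges, binding=None):
    """Find an assignment of pattern vertices to data vertices such
    that every pattern edge maps onto some data edge.
    Returns a binding dict, or None if no assignment exists."""
    binding = dict(binding or {})
    if not pattern_edges:
        return binding                      # all edges matched
    (pa, pb), rest = pattern_edges[0], pattern_edges[1:]
    for (da, db) in data_edges:             # each data edge is a "choice"
        trial = dict(binding)               # push: save current state
        ok = True
        for pv, dv in ((pa, da), (pb, db)):
            if trial.get(pv, dv) != dv:     # integer compare
                ok = False                  # conflict: abandon this choice
                break
            trial[pv] = dv
        if ok:
            result = match(rest, data_edges, trial)
            if result is not None:
                return result               # match found down this branch
    return None                             # pop: no choice worked here
```

The recursion is the stack; every level saves state before descending, exactly the push-pop-heavy, subroutine-call-heavy workload described above.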

Can this run fast on the IPU? Sure, maybe even really fast, if it fits into 300MB. And doing 1216 of these in parallel is kick-butt. What happens if our graphs don't fit in 300MB? I have no clue, because I don't know where the rest of the graph is located, I don't know how to find it, and I don't know how to stuff it into 300MB. Maybe this is solvable. Maybe. If I can't hallucinate a good solution in 5 minutes, it means it's hard. It requires some hard thought and clever invention.

Sadly, there's no floating-point in the pattern matcher, so all those whizzy floating-point units in the IPU are sitting idle. Which is a shame. Really a shame.

What about PLN? Well, today's PLN, built on the pattern matcher, will run thousands of CPU cycles and then do a small handful of floating-point ops. The good news: almost all PLN pattern matches are really quite small (that's good -- it should be fast) and of course all the current PLN demos fit into 300MB. I currently have no clue how to apply PLN to large datasets.

Are there other ways to think of PLN that do less searching and more multiplying? Can you swap inner and outer loops somehow? Nil might like to ponder this.

** An alternative that I personally find exciting
The language-learning pipeline I built doesn't use the pattern matcher. At all. What it does do is ... large quantities of multiply-adds. Immense quantities of them. Lots of vector and matrix multiplies. Lots of summations (multiply-adds in a loop). So it's more or less a totally boring vector-matrix library, with one huge difference: the matrices are extremely sparse -- one-in-a-billion non-zero entries. And that means it is impossible to store those zeroes in any conventional form. One MUST store only the non-zero values. In graphs! Which is why I wrote it for the atomspace, instead of SciPy or GnuR or PyWhatever. My clustering code, the code that tries to do word-sense disambiguation, is totally bottlenecked doing multiply-adds and then a small handful of pointer-chases to find the next numbers to multiply-add.
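A one-in-a-billion-dense matrix stored as a graph is essentially an adjacency map: each row is a vertex, each non-zero entry an edge. A sketch of the multiply-add-plus-pointer-chase loop this implies (plain Python dicts standing in for Atoms and Values; illustrative only):

```python
# Sparse matrix-vector multiply where ONLY non-zero entries are stored.
# Each row is an adjacency map (column -> weight), i.e. edges of a graph.
def sparse_matvec(rows, vec):
    """rows: {row_id: {col_id: weight}} -- the sparse matrix.
    vec: {col_id: value} -- a sparse vector.
    Returns {row_id: sum_j weight*value}, skipping all zeroes."""
    out = {}
    for r, cols in rows.items():
        total = 0.0
        for c, w in cols.items():      # pointer-chase to next non-zero
            v = vec.get(c)             # another pointer-chase
            if v is not None:
                total += w * v         # the multiply-add
        if total:
            out[r] = total
    return out
```

The work per non-zero entry is one multiply-add plus a couple of lookups -- far less regular than a dense loop, but the float unit does all the meaningful work.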

I'm pretty sure it could run pretty well on the IPU. Maybe even really well. I think it would be "easy" to slice it to make each slice fit in 300MB. And have it work without thrashing.  I'm optimistic.

** To conclude:
I would recommend porting this to the IPU immediately, except for one minor gotcha: so far, I've personally failed to convince you personally that my algos really can do word-sense disambiguation, and that the resulting grammars really are correct, and I'm distracted by other concerns and making zero progress on such a proof. Meanwhile, Anton's team is still quite lost, dazed and confused about the issues. I think they've made some good forward progress, but so far they've only worked on the easy stuff. The hard stuff is still ahead of them, and I don't think they understand this yet, and I don't think they are prepared to tackle it. They're at the foothills of a mountain. They struggled to get up the foothills, they haven't gotten past the treeline, and don't yet realize what's up there. So, without any actual proof that my claims are true, I'm 95% sure you're not willing to commit to this (i.e. porting ultra-super-sparse matrix math to the IPUs.)

(To be clear: that would be my proposal: ultra-super-sparse matrix-math to the IPU's.  I find it convenient to store those numbers as Values on Atoms, but that is a convenience, not a necessity. So I am NOT proposing a port of the atomspace to the IPU's. Just to be clear about that. (Although I  would want a bulk-copy of the floats to-from the atomspace, because the atomspace is still the correct data-structure for long-term knowledge store. ))

So that is my happy ending.  We can leverage IPU's but just not in the traditional opencog architecture. 

-- Linas



--
cassette tapes - analog TV - film cameras - you

Nil Geisweiller

Apr 16, 2019, 1:02:40 AM4/16/19
to ope...@googlegroups.com, Linas Vepstas, Ben Goertzel, mandeep bhatia, Vitaly Bogdanov
On 4/16/19 6:52 AM, Linas Vepstas wrote:
> What about PLN? Well, today's PLN, built on the pattern matcher, will
> run thousands of CPU cycles and then do a small handful of float-point
> ops. The good news: almost all PLN pattern matches are really quite very
> small (that's good- it should be fast) and of course all the current PLN
> demos fit into 300MB.  I currently have no clue how to apply PLN to
> large datasets.

I should correct that: most forward chainer queries are small because
the forward chainer is in fact an iterative chainer; it processes each
rule iteratively.

However, most backward chainer queries are big, because the entire
inference tree is queried (well, only the leaves, but that's still a
lot). However, each query is very similar; for that reason, it is likely
that some general enough form of caching/memoizing could speed that up.
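The caching/memoizing idea -- many near-identical leaf queries hitting a shared cache -- might be sketched like this (the `query` function is a hypothetical stand-in; real URE caching would need care about variable renaming and atomspace mutation):

```python
from functools import lru_cache

# Sketch: memoize a pure, side-effect-free query so that the many
# near-identical leaf queries of a backward-chaining inference tree
# are computed only once.
@lru_cache(maxsize=None)
def query(pattern):
    """Stand-in for an expensive pattern query.  `pattern` must be
    hashable (e.g. a tuple of strings) for lru_cache to work."""
    # ... an expensive graph search would go here ...
    return len(pattern)  # placeholder result

query(("Inheritance", "$X", "animal"))   # computed
query(("Inheritance", "$X", "animal"))   # served from the cache
```

The hard part, of course, is defining "the same query" up to alpha-renaming of variables, which a plain hash does not do.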

Another form of memoization that could speed up the URE is memoizing
unification; fortunately I know how to do that, because the unification
code is mostly functional. For the pattern matcher I don't have much of
a clue; maybe Andre Senna's Pattern Index can help along the way.

> Are there other ways to think of PLN that do less searching and more
> multiplying? Can you swap inner and outer loops somehow? Nil might like
> to ponder this.

Accurately calculating the second-order probabilities can take a lot of
multiplying, and is very amenable to parallelization. As for swapping
the inner and outer loops, maybe using macro rules is a way to do that;
it's likely that in practice there will be a number of macro rules that
get used a ton of times.

Nil

Linas Vepstas

Apr 16, 2019, 5:00:44 PM4/16/19
to Nil Geisweiller, opencog, Ben Goertzel, mandeep bhatia, Vitaly Bogdanov
On Tue, Apr 16, 2019 at 12:02 AM Nil Geisweiller <ngei...@googlemail.com> wrote:
On 4/16/19 6:52 AM, Linas Vepstas wrote:
> What about PLN? Well, today's PLN, built on the pattern matcher, will
> run thousands of CPU cycles and then do a small handful of float-point
> ops. 

I'm making several "meta" claims, which perhaps I should be more specific about.  First claim:

* The reason that deep-learning has been so effective and successful is that they found a way of avoiding 'useless calculations'. This is by a combination of two tricks: backpropagation and dimensional reduction. 

* Backpropagation means that an order of magnitude (factors of N or NlogN or N^2) of useless floating-point computations are eliminated and only the useful, non-redundant calculations are kept.

* Dimensional reduction is the weak spot, the Achilles heel of NN's.  It reduces the problem space to a size where results can be obtained relatively quickly; however, the reduced problem space is too small for human-level reasoning and language. 

* Using highly-sparse matrix math  instead of dimensional reduction preserves everything important, while eliminating the weaknesses of dimensional reduction.  The calculation-space remains small, but the representation space remains large/huge.  NN's have small computation-spaces (that's good, it makes them fast) but also small representation spaces (really bad, it destroys structure).

* I'm concerned that the current PLN architecture, of performing complex graph searches (using integer pattern-matching code) followed by infrequent numeric (floating-point) work, is very inefficient. That is, when compared to NNs, the search-and-traversal is a waste of time and effort, and only the floating-point computations matter (are meaningful). So I'm wondering if the PLN algo can be reformulated to be more backpropagation-like, which means the math becomes a kind of "inner loop", while the graph-traversal parts of it become the "outer loop". This last sentence is extremely imprecise: it is meant to be inspirational, not practical.

To rephrase: PLN (and symbolic-AI approaches in general) preserves the large/huge representation space. That's good. Algorithmically, my gut intuition is that traditional symbolic-AI algos (such as PLN) are extremely CPU-inefficient: they spend too much time graph-searching and graph-traversing, and not enough time actually computing things (i.e. multiplying and adding floats).

The attempt with the ultra-super-sparse matrix code is to retain the giant representational spaces of symbolic AI, while minimizing the graph-search effort. This is done by working with a single, small, simple, fixed graph -- i.e. by making multiply-add the inner loop, where the graph is held fixed, and making the outer loop be the exploration of different graph shapes, of how bigger graphs are assembled from smaller parts. I think I have a fairly clear conception of how this works for language learning; I do not yet have an equivalent conception for reasoning. However, it's important to obtain this, and for more reasons than one: I think it's a better theoretical foundation, but also, it's exactly what IPU-style machines are very efficient at computing.
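One way to picture that loop exchange: hold a small graph shape fixed and run the float math as the inner loop, with graph-shape exploration on the outside. A deliberately tiny sketch (all names and data shapes invented for illustration; this is not the actual learning code):

```python
# Illustrative loop exchange: the OUTER loop explores graph shapes
# (symbolic structure); the INNER loop is pure multiply-add over the
# weights attached to one small, fixed shape.
def score_shapes(shapes, weight_tables, features):
    """shapes: tuples of edge labels (small, fixed graph shapes).
    weight_tables: per-shape {edge: weight}.
    features: {edge: observed float}.  Returns the best-scoring shape."""
    best, best_score = None, float("-inf")
    for shape in shapes:                    # OUTER: symbolic exploration
        table = weight_tables[shape]
        score = 0.0
        for edge in shape:                  # INNER: regular float math
            score += table[edge] * features.get(edge, 0.0)
        if score > best_score:
            best, best_score = shape, score
    return best
```

Because the inner loop over any one shape is small, regular, and uniform, it is exactly the kind of work a multiply-add machine keeps fed.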

 The "sheaves" paper is trying to explain exactly how to exchange the inner and outer loops.  viz graphs are the slowly-changing things, the combinations of which sit in the outer loop (classical symbolic AI), and the weights/probabilities are rapidly updated in the inner loop (which works because everything looks regular, uniform at the local level, viz. just a vertex, and its nearest neighbors, all look "the same", simply because they are small, simple, tiny. Work with these small, tiny components, *before* these are all joined up to form some mega-graph.  This is the same trick that NN deep learning is using, except that existing NN deep learning algos also collapse/blur/average away the large scale structure (which is fundamentally wrong). NN's do this because they do not know how to identify and manage that large-scale structure.  They're blind to it. The weight-vector projections in mash together, average together any/all relationships that are more than two nearest-neighbor distances apart.

-- Linas
