>> The bio-Atomspace we are experimenting with now contains only a small
>> % of the biomedical knowledge we would like it to, which is because of
>> RAM and processing speed limitations in current OpenCog
>>
>> Recent optimizations help but don't remotely come close to solving the problem
>
>
> OK. Well, that's news to me. I try to keep everyone happy, and when there aren't any comments or complaints, I assume everyone is happy. Do you have actual examples, where you are running out of RAM, and where things are going too slow? Or is this just a gut-feel issue, for which you have no actual data?
Yes, Hedra (who is working with PLN on bio-Atomspace) is hitting these
issues all the time, and because of this she limits the amount of data
imported into the Atomspace and the scope of queries run against it
(e.g. filtering a query to focus on just a few genes rather than all
the genes of interest, etc.).
Nil is aware of this work and understands it in greater detail than I
do (from an OpenCog usage view, not a biology view), and if you're
curious to dig in, asking him is probably the best idea...
> I cannot repeat this often enough or strongly enough: the kinds of optimizations that are performed on software systems are extremely data-dependent and algorithm dependent. It is effectively impossible to perform optimizations without having a specific use case. This is a kind-of theorem of computer science.
This is way overstated IMO. For instance, optimizations made to allow
fast matrix operations on GPUs for computer graphics turned out to be
useful for all sorts of NN and other AI algorithms. Of course things
can be further optimized for specific NNs or other AI algos, but
nonetheless the more generic optimizations made w/ computer graphics
in mind were pretty helpful for optimizing AI ...
Theorems like "no free lunch" etc. operate at a level of abstraction
and extremity that doesn't really help in practical cases, I feel...
>> The neural-symbolic grammar learning that Andres Suarez and I
>> prototyped last spring, also couldn't viably be done using OpenCog for
>> similar reasons (RAM and processing speed limitations).
>
>
> No one ever complained about RAM or processing speeds, so it's kind of unfair to just bring this up a year later. I had the impression that the theory you were developing wasn't working out; I wasn't surprised, but I never fully understood it.
The theory IMO is highly promising, and we paused that work because of
other priorities not any problems w/ the ideas nor any lack of quality
in the prototype results. However to pursue that work using
Atomspace in a straightforward way would require importing way more
data into a single Atomspace than can be done in RAM on a single
current-day machine.
> This spring, I restarted work on
> https://github.com/opencog/learn -- you can review the README for the current status. I get good results. It's a big project. Things go slowly. Not enough time in the day.
Cool, I will take a look...
>> The experimentation on pattern mining from inference histories for
>> automated inference control, that Nil was doing a year ago, was
>> incredibly slow also due to Atomspace limitations.
>
>
> Ben, that is also incredibly unfair. Never-ever did you or Nil or anyone else ever complain about "atomspace limitations". So you can't just start blaming it now. If there is an actual performance problem, open a github issue, and describe it. Provide instrumentation, bottlenecks.
Some problems are too obvious and too severe for it to make sense to
take this sort of approach -- problems that clearly can't be fixed by
incremental improvements.
> I watched those projects from afar, and ... well, all I can say is "that's not how I would have done it". The fact that you had performance problems is almost surely a statement about your algorithms, and not a statement about the atomspace. The atomspace is what it is, and if you use it incorrectly, you'll get disappointing results. It's not a magic wand. It's just software, like any other kind of software.
This could have been said about all neural net AI algorithms in the
period before we had modern GPUs and their associated software tools.
But in fact many algorithms that were run in the 1980s and 1990s with
poor results were run a couple decades later on more modern hardware
and associated software frameworks, with really exciting results --
without any changes to the core algorithms, though often with some
parameter tweaks and straightforward network architecture improvements
(of the sort that are straightforward once you're able to iterate
quickly, running experiments at the appropriate scale).
So the history of AI contains a lot of cases that contradict the sort
of assertion you're making. Often algorithms that worked poorly at
one scale ended up working great at a more appropriate scale (where
"scale" means amount of data and also amount of processor and RAM,
appropriately deployed...)
>> It is possibly true that for each such case, one could design a
>> specialized architecture to support just that case, working around the
>> need for a general-purpose DAS in that particular case....
>
>
> You are describing things that sound like (to me) inadequate or inappropriate algorithms, and then switching the topic to DAS. You don't have to use the atomspace -- you could have done the inference mining on any one of a half-dozen map-reduce platforms out there -- many of them from the Apache.org people -- and you would not have gotten performance that is any better than what the atomspace provides.
That is not really true... the mining part could be done way faster
than is possible using current OpenCog tools. However these other
tools don't have the flexibility to do the inference part in any
non-convoluted way. And if we're going to set things up w/ closely
coupled back-and-forth between pattern mining of inference patterns,
and inference itself, it's nice if the two aspects are not implemented
in totally separate systems with a slow communication channel between
them...
>Nothing that I saw Shujing doing with pattern mining was any different than what anyone else in the industry does when they data-mine.
Standard datamining algorithms do not look for surprising patterns in
hypergraphs.
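To make that concrete, here is a toy sketch of the kind of surprisingness scoring involved: a composite pattern is scored by how far its observed frequency deviates from the estimate you'd get by assuming its component subpatterns occur independently. This is purely illustrative, not the actual OpenCog pattern-miner API; the function name and the exact normalization are assumptions.

```python
def surprisingness(observed_prob, component_probs):
    """Score a composite pattern by how far its observed probability
    deviates from the independence-based estimate (the product of its
    components' probabilities), normalized by the observed probability."""
    expected = 1.0
    for p in component_probs:
        expected *= p
    return abs(observed_prob - expected) / max(observed_prob, 1e-12)

# Toy example: a pattern linking two genes to the same disease occurs
# with probability 0.02, while each gene-disease link alone occurs with
# probability 0.1; independence would predict 0.01, so the composite
# pattern deviates by half of its observed probability.
score = surprisingness(0.02, [0.1, 0.1])
```

Standard itemset or sequence miners rank by raw frequency or support; ranking by deviation from an independence model over hypergraph substructures is the part they don't do out of the box.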
> Given that I don't understand the "applications such as those above", I don't know how to respond. You would have to describe those applications in engineering terms, in order to understand how they could be implemented so as to run efficiently and scalably ... without an actual description of what it is, it's not a solvable problem. There's just an insufficient amount of detail.
Yeah, of course my email did not contain full detail about these
applications; that would be infeasible to give in such a short space
and time.
Yes, agreed; the above is understood. We were doing something
similar in HK some years ago with Mandeep's gearman-based
implementation, but clearly your current system is better in various
ways.
The hard part is a ladder of requirements:
> * How do you get the shards of data onto those machines? Do you use rsync to copy files, or do you want to send them via atomspace? If you use rsync, then where will you keep the script for it?
> * Where do you keep the list of the currently-active set of 10 machines? Do you need a GUI for that? A phone app?
> * What do you do if one or more of them hasn't booted, or has crashed?
> * Are they password-protected? The atomspace is not password-protected!
>
> There are atomspace issues:
> * The simplest solution is to wait until all ten have returned results, and then join them together.
> * Another possibility is to let the results dribble in, and join them as they arrive. This is more complex, and requires more sophistication. The 10-line demo program now becomes a 100 or 200 line program.
> * What if one of the machines has crashed during processing? e.g. bad network card, failed disk, power outage?
> * Perhaps you want to load-balance, so that the slowest machine is not always the bottleneck. This requires measuring each machine to see if it is idle or not, and giving it more work if it is idle. This is non-trivial. Most engineers would do this outside of the atomspace, but you could also do it inside the atomspace if you write custom Atoms for it. Does your design require custom Atoms for load-balancing?
> * Perhaps the dataset is badly sharded, so that one of the machines is always a bottleneck. This requires not only finding the busiest machine, but then re-sharding the data. Many databases do this automatically. The conventional way in which this is done is to find a sequence of "least cruel cuts" in the Tononi sense, and move those to other machines. Find the cuts that hurt Phi the least. Talking about phi is fancy-pants buzzword-slinging, but all the people who do data-mining have a very intuitive understanding of Tononi's Phi, and have had that understanding many, many decades ago, because it's key to both software and hardware optimization. This is easy to say, but finding those cuts is hard to do. Nothing in opencog today does this automatically. However, I can imagine several possible solutions, ranging from real easy ones to really complex ones, each having pros and cons. Vendors like Oracle have had solutions for this, for decades. They've invested hundreds of man-years into it.
> * There's more. I wanted to mention concepts like "explain vacuum analyze" and "query planning" but perhaps some other day. Everyone gets to solve the query planning problem, including Hyperon. There's no free lunch.
>
> Then there are the data-design issues and meta-issues
> * Perhaps you are storing data as atoms, that should have been Values. Values are a lot faster than Atoms, but they get this performance with a set of function trade-offs.
> * Perhaps your data should not be kept in the atomspace at all. This includes audio, video live-streams, text files, medical records, and a zillion other data types.
Yes, these are all among the issues that need to be solved in order to
make an effective DAS (a goal which you don't have a great affinity
for, I understand). The fact that this long list of issues (which is
not complete ofc) still remains to be addressed means to me that your
RocksDB-based prototype is not actually very close to what we would
need for a DAS.
But could one usefully build a DAS on top of your current RocksDB
code? Surely one could, but it's not yet clear to me that's the
optimal approach... maybe it is...
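As a point of reference, the simplest join strategy in the list above (wait for every shard to answer, then merge) really is only a few lines; the complexity lives in the failure and load-balancing cases that follow it. The shard representation and query interface here are hypothetical stand-ins, not DAS or Atomspace APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def query_shard(shard, pattern):
    # Placeholder: in a real DAS this would run the pattern matcher
    # against one remote Atomspace shard, over the network.
    return {atom for atom in shard if pattern(atom)}

def scatter_gather(shards, pattern):
    """Simplest join strategy: block until every shard has answered,
    then union the result sets. No retries, no load-balancing, no
    handling of a crashed shard -- those are where the work begins."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: query_shard(s, pattern), shards)
    merged = set()
    for r in results:
        merged |= r
    return merged

# Three hypothetical shards holding string-named atoms.
shards = [{"geneA", "geneB"}, {"geneB", "geneC"}, {"geneD", "proteinX"}]
hits = scatter_gather(shards, lambda a: a.startswith("gene"))
```

The "let results dribble in" variant replaces the blocking `map` with per-shard callbacks and an incremental merge, which is exactly why it grows from a 10-line demo to a 100-200 line program.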
>> Hmm, at a high level we did guess a pattern cache was going to be
>> useful -- and Senna implemented one some time ago.
>
>
> The concept of a cache is about as generic as the concept of a loop or an if statement. You are effectively saying that Senna thought of using if-statements and loops, and implemented a program that used them. That's just crazy-making!
What I mean is that Senna implemented in 2017 a Pattern Index that
allows one to create special indices to accelerate lookup of
particular sorts of patterns in Atomspace,
https://github.com/andre-senna/opencog/blob/feature_pattern_index/opencog/learning/pattern-index/README.md
It's not the same exact idea as your pattern cache, though.
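For readers unfamiliar with it, the idea can be sketched as follows: build, once, a map from pattern templates to the atoms matching them, so that repeated lookups become dictionary hits rather than full Atomspace scans. The atom encoding and template representation below are made up for illustration; see the README linked above for the real design.

```python
from collections import defaultdict

def build_index(atoms, templates):
    """Precompute a map from each named template to the atoms it
    matches; the one-time cost of the scan is amortized over all
    subsequent lookups."""
    index = defaultdict(list)
    for atom in atoms:
        for name, predicate in templates.items():
            if predicate(atom):
                index[name].append(atom)
    return index

# Hypothetical tuple encoding of a few atoms.
atoms = [("InheritanceLink", "cat", "animal"),
         ("EvaluationLink", "eats", "cat", "fish"),
         ("InheritanceLink", "dog", "animal")]
templates = {"inheritance": lambda a: a[0] == "InheritanceLink"}
idx = build_index(atoms, templates)
```

The real Pattern Index supports far richer templates than a type check, but the trade-off is the same: index-build time and memory in exchange for fast pattern lookup.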
-- Ben