--
You received this message because you are subscribed to the Google Groups "opencog" group.
To unsubscribe from this group and stop receiving emails from it, send an email to opencog+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/opencog/CAHrUA37bz5CN%2BvDGXsy%3DFCcNjp_TY4ivNuDWN51LgdPCdawnSg%40mail.gmail.com.
Hmm... you are right that OpenCog hypergraphs have natural chunks
defined by recursive incoming sets. However, I think these chunks
are going to be too small, in most real-life Atomspaces, to serve the
purpose of chunking for a distributed Atomspace.
I.e. it is true that in most cases the recursive incoming set of an
Atom should all be in the same chunk. But I think we will probably
need to deal with chunks that are larger than the recursive incoming
set of a single Atom, in very many cases.
What happens when the results for that (new) BindLink query are spread
among multiple peers on the network in some complex way?
> I think it's a mistake to try to think of a distributed atomspace as one super-giant, universe-filling uniform, undifferentiated blob of storage.
> You don't want broadcast messages going out to the whole universe.
Not sure if you intended to imply it, but the reality of the first statement need not require the second statement. Hashes of atoms/chunks can be mapped via modulo onto hashes of peer IDs so that messages need only go to one or a few peers.
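A minimal sketch of the hash-modulo routing idea, assuming SHA-256 as the hash and string peer IDs. The names `stable_hash` and `owner_peer`, the peer IDs and the chunk keys are all hypothetical and not part of any OpenCog API.

```python
import hashlib
from typing import Sequence


def stable_hash(text: str) -> int:
    """Integer hash that is identical on every machine (unlike built-in hash())."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_peer(chunk_key: str, peer_ids: Sequence[str]) -> str:
    """Route a chunk to one peer: order peers by their hash, then take hash modulo."""
    ordered = sorted(peer_ids, key=stable_hash)
    return ordered[stable_hash(chunk_key) % len(ordered)]


# Hypothetical peer IDs and chunk keys, for illustration only.
peers = ["peer-a", "peer-b", "peer-c"]
for key in ['(Concept "gene-H")', '(Concept "gene-K")']:
    print(key, "->", owner_peer(key, peers))
```

Every node that knows the same peer list computes the same owner, so a request goes to one peer instead of being broadcast. Plain modulo reassigns most keys whenever the peer list changes. Consistent hashing is the usual way to limit that, and it is not shown here.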
On Jul 29, 2020, at 10:39 AM, Ben Goertzel <b...@goertzel.org> wrote:
On Wed, Jul 29, 2020 at 6:35 AM Abdulrahman Semrie <hsam...@gmail.com> wrote:
> I think it's a mistake to try to think of a distributed atomspace as one super-giant, universe-filling uniform, undifferentiated blob of storage.
It is not clear to me why this is a mistake.
It's a mistake because making a call from machine A to machine B is
just sooooooo much slower than making a call from machine A to machine
A ...
So if you try to ignore the underlying distributed nature of a
knowledge store, and treat it as if it was a single knowledge blob
living in one location, you will wind up making a system that is very,
very, very slow...
My Webmind colleagues and I were naive enough to try this in the late
1990s using Java 1.1 ;-)
have three different statuses: Local, Remote (in RAM on some other
machine in Distributed Atomspace) or BackedUp (disk).
But since these atomspaces need to store 100GB+ data, they need to be distributed over multiple nodes, each node holding portions of each atomspace.
> I think it's a mistake to try to think of a distributed atomspace as one super-giant, universe-filling uniform, undifferentiated blob of storage.
It is not clear to me why this is a mistake.
Anyway, I think implementing something similar to Nebula db as an initial version would have immediate benefits for projects that use the atomspace to store and process out-of-RAM data, such as the genomic data.
Is there a public document somewhere describing actual, present use-cases for distributed atomspace? Ideally with some useful guesses at performance requirements, in terms of updates per second to be processed on a single node and across the cluster, and reasonably estimated hardware specs (num cores, ram, disk) per peer?
If you think this is what I'm saying by describing Cassandra's
Realizing these limitations, one sees that for OpenCog purposes one
is likely better off with more basic key-value or column stores than with
graphDBs ...
In this case the "chunk" of information that we want to grab from the
backing-store is "the sub-metagraph giving the most intensely relevant
information about gene H" ....
-- send a Pattern Matcher query to BackingStore
-- send the Atom-chunk resulting from the query to Atomspace
We could use memory-mapped files on SSD.
Wow!
Each subsystem is function F from [past] GenericAtomSpace -> [future] GenericAtomSpace. Internally there will be functions which are ~ inverse of F. Subsystem categories are mapped to GenericAtomSpace in their own way, in separate processes.
periodically clone the whole system state (for example, every day) to external storage for backup.
CREATE KEYSPACE MyKeySpace
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
USE MyKeySpace;
CREATE COLUMNFAMILY MyColumns (id text, Last text, First text, PRIMARY KEY(id));
INSERT INTO MyColumns (id, Last, First) VALUES ('1', 'Doe', 'John');
SELECT * FROM MyColumns;
> I'll bet you a bottle of good wine or other neuroactive substance that the existing atomspace client-server infrastructure is faster than Cassandra.
No, it won't be faster,
but you'll never be able to store an atomspace bigger than what you can fit in memory
on that single atomserver, and you'll never be able to perform more operations (on the canonical atomspace) in parallel than what that one atom server can support.
Obviously distributed systems have a performance penalty.
We don't build them because we need to go faster (at the level of a single process), we build them because we need to go bigger (in terms of storage space or parallel processes).
I've been hearing people talk about the need for distributed atomspace on and off for 8+ years,
and I've never seen an answer along the lines of "you can already have a cluster, here's the documentation on how to set it up."
Does it meet the 7 business requirements in Ben's document: https://docs.google.com/document/d/1n0xM5d3C_Va4ti9A6sgqK_RV6zXi_xFqZ2ppQ5koqco/edit ?
On Aug 4, 2020, at 23:51, Linas Vepstas <linasv...@gmail.com> wrote:
> We could use memory-mapped files on SSD.
Ohhh! I like that! This is actually a very interesting idea! And with the appropriate programmer-fu, this should not be hard to proof-of-concept, I think ... so I'm guessing, sometime before the AtomSpace starts up, replace the memory allocator by something that is allocating out of the mapped memory (I think I've seen libraries out there that simplify this).
To get scientific about it, you'd want to create a heat-map -- load up some large datasets, say, some of the genomics datasets, run one of their standard work-loads as a bench-mark, and then see which pages are hit the most often. I mean -- what is the actual working-set size of the genomics processing? No one knows -- we know that during graph traversal, memory is hit "randomly" .. but what is the distribution? It's surely not uniform. Maybe 90% of the work is done on 10% of the pages? (Maybe it's Zipfian? I'd love to see those charts...)
Hi Linas,
On Aug 4, 2020, at 23:51, Linas Vepstas <linasv...@gmail.com> wrote:
>> We could use memory-mapped files on SSD.
> Ohhh! I like that! This is actually a very interesting idea! And with the appropriate programmer-fu, this should not be hard to proof-of-concept, I think ... so I'm guessing, sometime before the AtomSpace starts up, replace the memory allocator by something that is allocating out of the mapped memory (I think I've seen libraries out there that simplify this).
I'm glad that you like the idea. I want to try to make the PoC!
We may even use sparse memory-mapped files representing a much, much bigger virtual address space than physical RAM or even SSD storage. All nodes have the same file, as previously described. Memory management functionality will deallocate unused blocks to keep the file sparse enough, so that actual storage usage stays limited. I hope this could simplify handling of an Atom's identity.
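A minimal sketch of the sparse memory-mapped file idea, using only Python's standard `mmap` module. The file name, the mapping size and the helper `open_sparse_map` are hypothetical. It shows only that a file can be extended and mapped without writing its contents. It does not replace the AtomSpace allocator and does not deallocate unused blocks.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
VIRTUAL_SIZE = 1024 * PAGE  # hypothetical size; a real experiment would use far more


def open_sparse_map(path, size):
    """Extend a file to `size` bytes without writing data, then map it read-write."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, size)      # sets the logical length only
        return mmap.mmap(fd, size)  # the mapping keeps its own handle to the file
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "atomspace.img")  # hypothetical file name
    m = open_sparse_map(path, VIRTUAL_SIZE)
    offset = 512 * PAGE
    m[offset:offset + 4] = b"atom"             # touches a single page of the mapping
    m.flush()
    readback = bytes(m[offset:offset + 4])
    logical_size = os.path.getsize(path)
    m.close()

print(readback, logical_size)
```

On filesystems that support sparse files, only the touched pages should consume disk blocks, but that depends on the filesystem and is not checked here.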
> To get scientific about it, you'd want to create a heat-map -- load up some large datasets, say, some of the genomics datasets, run one of their standard work-loads as a bench-mark, and then see which pages are hit the most often. I mean -- what is the actual working-set size of the genomics processing? No one knows -- we know that during graph traversal, memory is hit "randomly" .. but what is the distribution? It's surely not uniform. Maybe 90% of the work is done on 10% of the pages? (Maybe it's Zipfian? I'd love to see those charts...)
I would like to see the heat-map for a realistic dataset too. That's a next step.
Are you referring to this genomics dataset benchmark: https://github.com/opencog/benchmark/tree/master/query-loop or is there some bigger and better benchmark and dataset for this kind of experiment? What about https://github.com/opencog/agi-bio ?
Hi Linas,
Thank you for your interest and I’m sorry for not presenting more complete descriptions. I’m trying to figure things out.
Besides speculation about Hyperon I want to support current development concretely by doing realistic large atomspace experiments with memory-mapped files and distribution, getting acquainted with Opencog AGI platform, capabilities, challenges, requirements along the way, experience which should be valuable in whatever direction Hyperon development goes on.
On Aug 5, 2020, at 01:51, Linas Vepstas <linasv...@gmail.com> wrote:
>> Each subsystem is function F from [past] GenericAtomSpace -> [future] GenericAtomSpace. Internally there will be functions which are ~ inverse of F. Subsystem categories are mapped to GenericAtomSpace in their own way, in separate processes.
> I don't understand what you are trying to say, above.
GenericAtomSpace is renamed to UniversalAtomSpace (UAS).
Function Fstep: UAS -> UAS is in UAS and it’s total (in the Idris REPL, :total Fstep prints Opencog.Fstep is Total). It maps the past state of UAS to the future state of UAS. The core runtime engine applies Fstep in an infinite loop. Since the new Atomese and UAS are probabilistic linear dependently-typed, resource usage is fixed and terms/atoms represent a superposition of possibilities. That’s why I assumed that the core engine would benefit from having a fixed-size working set and fixed-size data structures for atomic operations.
I know that atoms in AtomSpace right now are higher-level knowledge representations. What if in our design we mirror the abstraction hierarchy of the universe as we understand it? First there is something like atoms in the sense of quantum physics, which form molecules, more complex patterns where cogistric phenomena could take place; next there are alife structures manifesting Rosennean complexity; all the way up to the cognitive level, where the system has sophisticated self-model(s), existing in perfect harmony or perfect disharmony?!
We manually encode a prior implementation of Fstep and maybe derive the ~ inverse of Fstep in the initial state of UAS. Existing algorithms and theory from the current Opencog codebase like PLN, MOSES, URE…, Opencog Background Publications BgPub and parts of the GTGI body of work are translated to the new Atomese EDSL as prior knowledge to bootstrap the system using intelligent not-yet-AGI tools.
The ultimate preferred mode of interaction with the system is at the cognitive level, with appropriate interfaces like screen, audio for voice, TCP, HTTP… for internet and web (and let’s not forget published SingularityNET services with other cognitive agents elsewhere). I guess that development of narrow-AGI applications will happen on the sub-cognitive level. If we look at the “High Level Mind Diagram” from EGI1 page 146, we will be able to exert control on this level, to intercept, inspect and modulate what’s going on between these components.
We will inspect mechanistic low level structures of UAS only for the system diagnostics and optimization of distribution strategies but not for efficient communication with the cognitive system.
For rapid engineering of OCP Hyperon and for mutual understanding between OCP Hyperon and humans, all science knowledge, mathematics and especially, as Ben’s paper “Toward a Formal Model of Cognitive Synergy” BenCS suggests, category theory (CT) should be encoded in the Fstep. I’m a complete newbie in CT. In this regard BaezRS is a pretty cool illumination of analogues from physics, topology, logic and computation, demonstrating the unifying power of CT. Some projects already try to encode CT concepts: apart from many examples from Haskell’s ecosystem, there are LeanCT and CoqCT. Robert Rosen’s (M,R)-systems are also described in CT terms RossMR.
I was also looking at flat concatenative languages as a low-level language (LLL) for the stochastic runtime. In that case the new Atomese may be a decompilation of LLL. I can see what pattern-mining may look like on LLL. It could be a very simple built-in functionality to automatically refactor executed sub-sequences of LLL words. People say that concatenative languages are too low level for normal software development. They complain about having to do stack manipulation by hand. What if LLL permeates all layers of the cognitive stack, including external use of natural language? In that case, some LLL words could be natural language words auto refactored/pattern-mined to satisfy limited bandwidth between intelligent organisms. If you wonder what non-deterministic concatenative language execution a la Prolog could look like, there is a post JoyND from the author of the language Joy. For an explanation of the equivalence between a linear language and a stack machine (LLL) see “Linear Logic and Permutation Stacks” BakerLL. LLL words could act as operators. For people who code in Haskell, “Compiling to categories” may be interesting ConalC2C. Also, GHC is getting linear types LinHS. For the relation between linear logic and linear algebra (~ GPU/TPU acceleration) see LinLA.
>> periodically clone the whole system state (for example, every day) to external storage for backup.
> Keep in mind that every now and then, the RAM layout of the AtomSpace changes. This may be due to bug fixes, new features, or simply that a new compiler decided to pack things differently (e.g. c++ std library changes, e.g. for strings, vectors, mutexes, etc).
Best regards,
Pedja
> Keep in mind that every now and then, the RAM layout of the AtomSpace changes. This may be due to bug fixes, new features, or simply that a new compiler decided to pack things differently (e.g. c++ std library changes, e.g. for strings, vectors, mutexes, etc).
Of course. If the compiler changes the data layout, the export/import (backup/restore) mentioned in my previous email will have to be performed. I hope these compiler changes do not happen frequently.
On Aug 7, 2020, at 20:30, Linas Vepstas <linasv...@gmail.com> wrote:
>> We may even use sparse memory-mapped files representing much much bigger virtual address space than physical RAM as well as SSD storage. All nodes have the same file as previously described. Memory management functionality will deallocate unused blocks to maintain the file sparse enough, keeping actual storage usage limited. I hope this could simplify handling of Atom's identity.
> Have you ever done anything like this before? Because some of what you wrote does not seem right; the kernel does page-faulting and page flush as needed. There's no "deallocation", there is only unmapping. Mem usage may be fragmented, but it's never going to be sparse, unless the mem allocator is broken.
>> Does it meet the 7 business requirements in Ben's document: https://docs.google.com/document/d/1n0xM5d3C_Va4ti9A6sgqK_RV6zXi_xFqZ2ppQ5koqco/edit ?
By deallocation of unused blocks I'm referring to hole punching. Here is the demo: https://gist.github.com/crackleware/e01519f4ec16fcba42f5c2bb6151185f I'll make a better demo with much bigger memory usage to trigger and test the paging, so we know our assumptions are correct.
--linas
Hi Linas,
On Fri, Aug 07, 2020 at 01:30:14PM -0500, Linas Vepstas wrote:
> >
> > To get scientific about it, you'd want to create a heat-map -- load up
> > some large datasets, say, some of the genomics datasets, run one of their
> > standard work-loads as a bench-mark, and then see which pages are hit the
> > most often. I mean -- what is the actual working-set size of the genomics
> > processing? No one knows -- we know that during graph traversal, memory is
> > hit "randomly" .. but what is the distribution? It's surely not uniform.
> > Maybe 90% of the work is done on 10% of the pages? (Maybe it's Zipfian?
> > I'd love to see those charts...)
>
> The "query-loop" is a subset/sample from one of the agi-bio datasets. It's
> a good one to experiment with, since it will never change, so you can
> compare before-and-after results. The agi-bio datasets change all the
> time, as they add and remove new features, new data sources, etc. They're
> bigger, but not stable.
VM page heat-map for query-loop benchmark is here:
https://github.com/crackleware/opencog-experiments/tree/c0cc508dc5757635ce6c069b20f8ae13ccf8ef8a/mmapped-atomspace
Everything gets dirty during loading. There is a "hot" subset of pages
being referenced during the processing stage. The total size of pages referenced in
the processing stage is around 150MB of 1.6GB (total allocation). The heat-map is very
crude because it groups pages in linear order, which is probably a bad
grouping. I may experiment with page grouping to get more informative graphs
(could be useful chunking research).
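A minimal sketch of how a page-access trace could be turned into the "what share of pages does most of the work" figure asked about earlier in the thread. The 4096-byte page size, the function names and the synthetic trace are assumptions. This is not the script from the repository linked above.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed page size in bytes


def page_hits(addresses, page_size=PAGE_SIZE):
    """Count how many accesses land on each page number."""
    return Counter(addr // page_size for addr in addresses)


def hot_fraction(hits, total_pages, coverage=0.9):
    """Fraction of all pages needed to cover `coverage` of accesses, hottest first."""
    total = sum(hits.values())
    covered = 0
    for n, (_, count) in enumerate(hits.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return n / total_pages
    return len(hits) / total_pages


# Synthetic trace for illustration only: one very hot page, five cold ones.
trace = [0] * 95 + [PAGE_SIZE * p for p in range(1, 6)]
hits = page_hits(trace)
print(hot_fraction(hits, total_pages=100))
```

Swapping the `addr // page_size` key for a different grouping key (for example, a chunk ID) would be one way to try non-linear page groupings.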
I also did several experimental runs where I used swap-space on NFS and NBD
(network block device). 2 cores, 1GB RAM, 2GB swap. Performance was not very
good (~10%). CPU is too fast for this amount of memory. :-)
Intermittent peaks are probably garbage collections.
All in all, I expect much better performance with very concurrent workloads,
hundreds of threads. When a processing thread hits a page which is not yet in
physical RAM it blocks. Request for that page from storage is queued. Other
threads continue to work and after some time they will block too waiting for
some of their pages to load. Storage layer will collect multiple requests and
deliver data in batches, introducing latency. That's why when they benchmark
SSDs there are graphs for various queue depths. Deeper queue, better throughput.
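The queue-depth argument above can be put in numbers with Little's law (requests in flight = throughput x latency). This is a steady-state sketch only. The latency and the device limit below are made-up values, not measurements from these experiments.

```python
def page_fault_throughput(in_flight, latency_s, device_max_per_s):
    """Little's law: completions/s = requests in flight / latency, capped by the device."""
    return min(in_flight / latency_s, device_max_per_s)


# Hypothetical storage: 100 microseconds per page load, at most 200,000 loads/s.
LATENCY_S = 100e-6
DEVICE_MAX = 200_000

for blocked_threads in (1, 8, 64, 512):
    rate = page_fault_throughput(blocked_threads, LATENCY_S, DEVICE_MAX)
    print(blocked_threads, "threads waiting ->", round(rate), "page loads/s")
```

With a single thread the storage sits idle between requests. More concurrently blocked threads raise the page-load rate until the device limit is reached, after which extra threads only add waiting time.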
Query-loop benchmark is single-threaded. I would like to run more concurrent
workload with bigger datasets. Any suggestions?
> All in all, I expect much better performance with very concurrent workloads,
> hundreds of threads. When a processing thread hits a page which is not yet in
> physical RAM it blocks. Request for that page from storage is queued. Other
> threads continue to work and after some time they will block too waiting for
> some of their pages to load. Storage layer will collect multiple requests and
> deliver data in batches, introducing latency. That's why when they benchmark
> SSDs there are graphs for various queue depths. Deeper queue, better throughput.
OK, so these tests are "easily" parallelized, with appropriate definition of "easy". Each search is conducted on each gene separately, so these can be run in parallel. That's the good news.
The other good news is that Atomese has several actual Atoms that run multiple threads -- one called ParallelLink, the other called JoinThreadLink. I've never-ever tried them with the pattern matcher before. I will try now ...
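The per-gene independence described above can be sketched as a simple fan-out. This uses Python's `concurrent.futures` as a stand-in and is not ParallelLink or JoinThreadLink. `search_gene`, its placeholder result and the gene names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_gene(gene):
    """Hypothetical stand-in for one pattern-matcher query about a single gene."""
    return gene, len(gene)  # placeholder result, not a real query


def search_all(genes, workers=4):
    """Run one independent search per gene on a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(search_gene, genes))


results = search_all(["gene-A", "gene-H", "gene-K"])  # hypothetical gene names
print(results)
```

Because no search depends on another's result, the only coordination needed is collecting the results at the end.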
Xabush, what are your biggest datasets? How do you load them?
Meanwhile, I'll explore concurrency with the ParallelLink ... see if I can make that practical.
How do you run them?
