Yesterday, I put the finishing touches on the FileStorageNode. This uses the StorageNode API to read/write Atomese s-expressions to a flat file. It's fast, it's compact. It's 10x faster than using plain scheme (guile) to dump Atoms: this is thanks to code originally written by Alexey Potapov and Anatoly Belikov -- I wrote a wrapper around it to use the StorageNode API.
Some stats: I tested two datasets, a MOZI biology dataset and a natural language dataset, of 7 million and 20 million Atoms, respectively. When these are loaded into the AtomSpace (in RAM), they take up 632 and 775 bytes/Atom of RSS (operating system resident set size). This is very typical for Atoms in the AtomSpace. (I put these two datasets up at https://linas.org/datasets/ for Amirouche.)
Dumped to a file, this becomes 55 and 154 bytes/Atom, for plain, uncompressed Atomese s-expressions. When compressed with bzip2, it shrinks to 4 and 6 bytes/Atom! Tiny! Clearly, keeping searchable indexes in the AtomSpace costs a huge amount of RAM. The actual data content in typical Atoms is ... tiny.
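
For a rough sense of scale, here is a back-of-the-envelope sketch in Python that multiplies the per-Atom figures above by the Atom counts. It assumes the per-Atom numbers are listed in the same order as the datasets (biology first, natural language second) and that the counts are exactly 7 and 20 million; the names DATASETS, totals and ratios are just made up for this illustration.

```python
# Per-Atom sizes in bytes, as quoted above. Assumption: each pair of
# numbers is in the same order as the datasets (biology, then language),
# and the Atom counts are the rounded 7 and 20 million.
DATASETS = {
    "MOZI biology": {"atoms": 7_000_000, "ram": 632, "file": 55, "bz2": 4},
    "natural language": {"atoms": 20_000_000, "ram": 775, "file": 154, "bz2": 6},
}

def totals(d):
    """Total bytes for each representation: Atom count times bytes/Atom."""
    return {k: d["atoms"] * d[k] for k in ("ram", "file", "bz2")}

def ratios(d):
    """How many times larger the in-RAM form is than each on-disk form."""
    return {"ram/file": d["ram"] / d["file"], "ram/bz2": d["ram"] / d["bz2"]}

for name, d in DATASETS.items():
    t, r = totals(d), ratios(d)
    print(f"{name}: RAM {t['ram'] / 1e6:.0f} MB, "
          f"file {t['file'] / 1e6:.0f} MB, bzip2 {t['bz2'] / 1e6:.0f} MB, "
          f"RAM is {r['ram/file']:.1f}x the file and {r['ram/bz2']:.0f}x the bzip2 size")
```

Under those assumptions the biology dataset works out to roughly 4.4 GB of RSS versus 385 MB as a flat file, and the language dataset to roughly 15.5 GB versus about 3.1 GB. These are estimates from rounded figures, not measurements.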
-- Linas
--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.