I have some points I would like to discuss/clarify regarding the discussions so far. See below,
> B-Trees with n keys have a depth of O(logn).
Nope. The depth/height of a B-Tree with N keys is roughly *O(log_b N)* (base b, not base 2), where b is the fanout (the number of children per node/page).
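To make the difference concrete, here is a quick back-of-the-envelope calculation (the `tree_height` helper is just for illustration) comparing a binary tree with a B-Tree that has a typical fanout of a few hundred:

```python
import math

# Height of a balanced tree holding n_keys with a given fanout
# is roughly log base fanout of n_keys.
def tree_height(n_keys, fanout):
    return math.ceil(math.log(n_keys, fanout))

print(tree_height(1_000_000_000, 2))    # binary tree: ~30 levels
print(tree_height(1_000_000_000, 500))  # B-Tree, fanout 500: ~4 levels
```

A billion keys fit in about four levels with a realistic page fanout, which is why B-Trees need so few disk seeks per lookup.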
> "popular structures for storing data"
popular data structures for storing and retrieving data on disk.
--------------------------
> "which has lead to OLAP (online analytic processing) systems, in contrast to the OLTP (online transaction processing) systems"
The OLAP/OLTP categories are very old and well established in the DB community. What is different now is the scale of data/processing, which forced the creation of new engines specialized in either OLTP or OLAP (e.g., VoltDB, Vertica, Hive, Presto, Cassandra, etc.). See the paper "One Size Fits All: An Idea Whose Time Has Come and Gone" by Michael Stonebraker for the context.
Before that era, DBs like Oracle/MS SQL Server were used for both OLTP and OLAP. (Curiously, the newest trend in DB research is the creation of HTAP (Hybrid Transaction/Analytics Processing) systems; that is, history repeats itself. :simple_smile:)
> "better performance due to data locality"
Data locality is a feature explored by many modern distributed DB systems (relational, NoSQL, NewSQL). It is not exclusive to document DBs.
> "Relational data models provide better support for relationships"
Relational data models:
* Have strong and consistent data modeling features (constraints, normal forms, ACID transactions, etc);
* Are backed by sound math theory (relational algebra, set theory, etc);
* Have a declarative, intuitive, and expressive query language that abstracts data layout on disk and access patterns;
> "sorted string tables (USED IN SEARCH ENGINES (ish))"
SST tables are used by many NoSQL systems and storage engines (Cassandra, RocksDB, LevelDB, HBase, etc.). The idea came from Google's Bigtable.
> "keep a memtable in memory (skip list) and write it to disk when it gets too big"*
The memtable flushes its data to disk (as an SST) when it hits a size threshold or at regular time intervals, whichever comes first.
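A minimal sketch of the size-threshold part of that flush policy (not any real engine's API; a dict stands in for the skip list, and `max_entries` is a made-up knob):

```python
class Memtable:
    """Toy memtable: buffers writes, flushes a sorted run when full."""

    def __init__(self, max_entries=4):
        self.data = {}           # stand-in for a skip list
        self.max_entries = max_entries
        self.flushed = []        # immutable sorted runs ("SSTs")

    def put(self, key, value):
        self.data[key] = value
        if len(self.data) >= self.max_entries:
            self.flush()

    def flush(self):
        # An SST is written sorted by key and never modified afterwards.
        self.flushed.append(sorted(self.data.items()))
        self.data = {}
```

A real engine would also flush on a timer and write the run to disk with an index and Bloom filter, but the trigger logic is the same shape.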
> "does a merge sort and expunge dead data"
LSMTs are append-only data structures, and SSTs are immutable once written to disk. Even deletes and updates preserve old entries (a delete inserts a special tombstone marking the deleted row in the newer SST). Compaction reclaims space by merging files and discarding old/dead data.
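The compaction step above can be sketched in a few lines (a simplification: `None` stands in for a tombstone, and real engines merge sorted files streamwise rather than via dicts):

```python
# Merge two sorted runs: the newest entry per key wins, and keys
# whose newest entry is a tombstone (None) are dropped entirely.
def compact(older, newer):
    merged = dict(older)   # older entries first...
    merged.update(newer)   # ...newer entries overwrite them
    return sorted((k, v) for k, v in merged.items() if v is not None)

old_sst = [("a", 1), ("b", 2), ("c", 3)]
new_sst = [("b", 20), ("c", None)]   # "c" was deleted via a tombstone
print(compact(old_sst, new_sst))     # [('a', 1), ('b', 20)]
```

Note that the tombstone for "c" and the stale value of "b" only disappear at compaction time; until then both versions sit on disk.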
> "reading?"
For each read, an LSMT request needs to merge data from the SSTs on disk (discarding dead/out-of-date rows) with the memtable in RAM. Bloom filters let it skip SST files that cannot contain the key.
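A sketch of that read path, assuming each SST carries a Bloom filter (simulated here by a plain set, which is exact rather than probabilistic):

```python
# Check the memtable first, then SSTs from newest to oldest,
# skipping any file whose filter says the key cannot be there.
def get(key, memtable, ssts):
    if key in memtable:
        return memtable[key]          # newest value (may be a tombstone)
    for sst in reversed(ssts):        # newest SST first
        if key not in sst["bloom"]:   # set stands in for a Bloom filter
            continue
        if key in sst["data"]:
            return sst["data"][key]
    return None
```

Because newer runs are consulted first, the first hit is authoritative and older versions of the key never need to be read.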
> "b-trees have to be defragmented"
Because B-Trees do in-place insertions/updates/deletions; LSMTs are append-only.
> "Search Posting lists are effectively mini-LSMTs"
Hum... sort of. Lucene's segments are similar to SSTs. Think of one posting list in RAM and one posting list on disk for every segment. Each query requires looking at both RAM and disk and merging the results. Lucene's segment merging is like LSMT compaction.
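The query side of that analogy can be sketched like so (hypothetical structure: one in-RAM term index plus a list of per-segment term indexes, each mapping term -> doc IDs):

```python
# Answer a term query by union-merging the in-RAM posting list with
# each segment's on-disk posting list, analogous to an LSMT read.
def postings(term, ram_index, segments):
    docs = set(ram_index.get(term, []))
    for seg in segments:
        docs |= set(seg.get(term, []))
    return sorted(docs)
```

Real posting lists are kept sorted and merged streamwise (and deletes are handled with per-segment bitmasks), but the two-tier merge is the shared idea.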
> "XML - bloated, complicated, yet lives on"
XML is verbose but simple and self-documenting, and it supports comments and schemas; the ecosystem around XML, however, is huge and very complex. JSON is more compact but lacks schemas and comments, for example.
Best,
Edward