I've been reading and watching some of the content about how MonetDB works and the general concepts make sense to me but the details aren't clicking. I'm still new to a lot of DB internal concepts so please feel free to correct me if I've stated something incorrectly :)
From what I understand:
Unlike some RDBMS's who send a tuple at a time through the query pipeline, MonetDB sends arrays of tuples into the pipeline that align with L1/L2 cache sizes so that all operations on the block of tuples happens while they are resident in the L1/L2 caches without having to swap them in and out etc. Doing this also offers the opportunity to use instructions like SIMD on the blocks of tuples. They also go to length to reduce code that introduces CPU branch miss predictions.
Someone directed me to this paper about "Implementing Operations Using SIMD Instructions" and I also found the following slide that breaks it down a bit better for me to understand:
It seems binary searches incur a lot of CPU branch miss predictions and sequential scans benefit more from branch prediction. In the slide above the Hybrid approach looks to be a pretty good in between.
So pipes in the pipeline receive arrays of tuples as input and output tuples to the next pipe in the pipeline and this is done in a column orientation? Where I start to get lost is sorting out what the tuples actually contain since they don't show any examples of what the data is that actually flows through the pipeline and the paper above shows arrays with sorted numbers but I'm not really clicking into what that really means.
Part of the paper mention a B-Tree index but for some reason some of the examples looks like sending blocks of an inverted index through the pipeline. One of the documents above show scanning for "4" using SIMD in a structure like:
1 3 4 5 7 8 10 13 14 17 19 20 23 24 25 32
What is that? Is element 3 a byte containing the value "4"? I know in Lucene (if I understood it correctly) they do some query operations simply by AND/OR huge bit sets constructed by the doc ids from the inverted index lookups.
If I was scanning an Inverted Index I would be going from a structure that looked like:
Term | Row
------------------------------
"1" 2
"3" 1, 3
"4" 1, 3
"5" 1
Where I would fetch the index for "3", and I would get row ids for rows 1 and 3. Then I might bunch these into an array of tuples and send them down the pipeline?
Can anyone connect the dots or tell me how I am completely wrong? I am completely missing something here :)