Vectorized Query Execution Pipeline

Kelly Sommers

unread,

Jan 26, 2013, 4:00:00 PM1/26/13

to distsys...@googlegroups.com

I've been reading and watching some of the content about how MonetDB works and the general concepts make sense to me but the details aren't clicking. I'm still new to a lot of DB internal concepts so please feel free to correct me if I've stated something incorrectly :)

From what I understand:

Unlike some RDBMS's who send a tuple at a time through the query pipeline, MonetDB sends arrays of tuples into the pipeline that align with L1/L2 cache sizes so that all operations on the block of tuples happens while they are resident in the L1/L2 caches without having to swap them in and out etc. Doing this also offers the opportunity to use instructions like SIMD on the blocks of tuples. They also go to length to reduce code that introduces CPU branch miss predictions.

Someone directed me to this paper about "Implementing Operations Using SIMD Instructions" and I also found the following slide that breaks it down a bit better for me to understand:

http://old.crystalnix.com/dlyfar/articles/10.1.1.6.6481.pdf

http://www.cs.toronto.edu/~ryanjohn/teaching/csc2531-f11/slides/Ioan-SIMD-DBMS.pdf

It seems binary searches incur a lot of CPU branch miss predictions and sequential scans benefit more from branch prediction. In the slide above the Hybrid approach looks to be a pretty good in between.

So pipes in the pipeline receive arrays of tuples as input and output tuples to the next pipe in the pipeline and this is done in a column orientation? Where I start to get lost is sorting out what the tuples actually contain since they don't show any examples of what the data is that actually flows through the pipeline and the paper above shows arrays with sorted numbers but I'm not really clicking into what that really means.

Part of the paper mention a B-Tree index but for some reason some of the examples looks like sending blocks of an inverted index through the pipeline. One of the documents above show scanning for "4" using SIMD in a structure like:

1 3 4 5 7 8 10 13 14 17 19 20 23 24 25 32

What is that? Is element 3 a byte containing the value "4"? I know in Lucene (if I understood it correctly) they do some query operations simply by AND/OR huge bit sets constructed by the doc ids from the inverted index lookups.

If I was scanning an Inverted Index I would be going from a structure that looked like:

Term | Row

------------------------------

"1" 2

"3" 1, 3

"4" 1, 3

"5" 1

Where I would fetch the index for "3", and I would get row ids for rows 1 and 3. Then I might bunch these into an array of tuples and send them down the pipeline?

Can anyone connect the dots or tell me how I am completely wrong? I am completely missing something here :)

Edward Ribeiro

unread,

Jan 26, 2013, 4:57:55 PM1/26/13

to distsys...@googlegroups.com

It has been a long time since I read the CWI Database Group that implements and advance MonetDB research, but the vectorized query implementation was a branch of research that gave birth to Vectorwise commercial database by Action (former Ingres). I say so because the PhD student directly in charge of this research was Marcin Zukowski and his webpage has plenty of papers and even his PhD thesis (chapter 5) on the subject:

http://marcinzukowski.name/

I myself will take a time early next week to read his paper or even his thesis. ;-)

CWI DB group papers: http://repository.cwi.nl/search/searchrepository.php?stheme=INS1&stypeoai=all

CWI DB group Homepage: http://www.cwi.nl/research-groups/Database-Architectures

Cheers,

Edward

Kelly Sommers

unread,

Jan 28, 2013, 1:50:10 AM1/28/13

to distsys...@googlegroups.com

Let us know what you get from the content because I'm still struggling to connect the dots :)

I decided to take a crack at a "SELECT * FROM Table WHERE ColumnA < ColumnB" like query with 10 million rows. I've got it executing in 114ms but I got a friend to do a similar query on MS SQL Server and it seems to be able to do it on non-indexed columns in 31ms-52ms. So I am still pretty far off from being equal let alone stomping.

Although I don't know if my SIMD operations are actually using SIMD yet and where I tried to avoid branch predictions ala the paper, I'm not sure if I am benefiting from branch prediction or not. Also not sure if I'm staying within L1/L2 cache like the goal is to be. I need to find tooling that helps figure that out.

Here's the code for the first pass I worked on this afternoon:

https://gist.github.com/4650779

What do you think? Am I on the right track?

--
You received this message because you are subscribed to the Google Groups "Distributed Systems" group.
To post to this group, send email to distsys...@googlegroups.com.
To unsubscribe from this group, send email to distsys-discu...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jeremiah Peschka

unread,

Jan 28, 2013, 12:19:39 PM1/28/13

to distsys...@googlegroups.com

To get an apples to apples, I think you'll want parallelize your code. SQL Server is a greedy beast and will steal an entire system if allowed.

Just so you know what's going on in the RDBMS...

I generated 10,000,000 rows and ran SELECT COUNT(*) FROM TestCompare WHERE cola < colb;

After the first run, the query ends up entirelyFrom an implementation standpoint, SQL Server goes off in the background and generates database statistics on both columns and then multi-threads the query. Obviously, this isn't a SQL Server benchmark, but when I run the query on my system I can see that SQL Server chooses to parallelize the clustered index scan (based on projected IO cost) and aggregation (projected CPU cost) before gathering the results for one final aggregate. Removing the aggregate just removes aggregation operator and SQL Server still parallelizes the scan across all available CPUs. The first time the query is run, unless your friend has disabled it, SQL Server also performs some background activity to auto generate column level stats to determine the best way to physically chunk up the data for the parallel scan. Once that auto create happens, SQL Server is limited by memory and CPU speed (for purposes of test).

SQL Server doesn't use SIMD instructions, at least not as of the 2008/2008R2 family. I suspect it hasn't been changed in SQL Server 2012, either, since it would be a major engine re-write.

HTH understand what SQL Server is doing in the back end so you can make changes on your end to match/beat the performance.

---

Jeremiah Peschka - Founder, Brent Ozar Unlimited

MCITP: SQL Server 2008, MVP

Cloudera Certified Developer for Apache Hadoop

To unsubscribe from this group and stop receiving emails from it, send an email to distsys-discu...@googlegroups.com.

To post to this group, send email to distsys...@googlegroups.com.

Reply all

Reply to author

Forward