BITOP is only fast with 16 keys or fewer


Pavel Anossov

Feb 5, 2015, 7:18:43 PM
to redi...@googlegroups.com
Hi!

Why is the BITOP fast path only taken with 16 keys or fewer? The massive slowdown on the 17th key was very surprising.

Relevant source:

/* BITOP op_name target_key src_key1 src_key2 src_key3 ... src_keyN */
void bitopCommand(redisClient *c) {
    char *opname = c->argv[1]->ptr;
    robj *o, *targetkey = c->argv[2];
    unsigned long op, j, numkeys;
    robj **objects;      /* Array of source objects. */
    unsigned char **src; /* Array of source strings pointers. */
    unsigned long *len, maxlen = 0; /* Array of length of src strings,
                                       and max len. */
    unsigned long minlen = 0;    /* Min len among the input keys. */
    unsigned char *res = NULL; /* Resulting string. */

    ...

    /* Lookup keys, and store pointers to the string objects into an array. */
    numkeys = c->argc - 3;
    src = zmalloc(sizeof(unsigned char*) * numkeys);

    ...

    /* Compute the bit operation, if at least one string is not empty. */
    if (maxlen) {
        res = (unsigned char*) sdsnewlen(NULL,maxlen);
        unsigned char output, byte;
        unsigned long i;

        /* Fast path: as far as we have data for all the input bitmaps we
         * can take a fast path that performs much better than the
         * vanilla algorithm. */
        j = 0;
        if (minlen && numkeys <= 16) {   // <----------------- Why 16?
            unsigned long *lp[16];
            unsigned long *lres = (unsigned long*) res;

            /* Note: sds pointer is always aligned to 8 byte boundary. */
            memcpy(lp,src,sizeof(unsigned long*)*numkeys);
            memcpy(res,src[0],minlen);

            /* Different branches per different operations for speed (sorry). */
            if (op == BITOP_AND) {
                while(minlen >= sizeof(unsigned long)*4) {
                    for (i = 1; i < numkeys; i++) {
                        lres[0] &= lp[i][0];
                        lres[1] &= lp[i][1];
                        lres[2] &= lp[i][2];
                        lres[3] &= lp[i][3];
                        lp[i]+=4;
                    }
                    lres+=4;
                    j += sizeof(unsigned long)*4;
                    minlen -= sizeof(unsigned long)*4;
                }
            } else if ...
        }

        /* j is set to the next byte to process by the previous loop. */
        for (; j < maxlen; j++) {
            output = (len[0] <= j) ? 0 : src[0][j];
            if (op == BITOP_NOT) output = ~output;
            for (i = 1; i < numkeys; i++) {
                byte = (len[i] <= j) ? 0 : src[i][j];
                switch(op) {
                case BITOP_AND: output &= byte; break;
                case BITOP_OR:  output |= byte; break;
                case BITOP_XOR: output ^= byte; break;
                }
            }
            res[j] = output;
        }
    }
    ...
}
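[Editor's note] To make the two code paths concrete, here is a small Python model (illustrative only, not the Redis source): the word-at-a-time fast path processes 4 machine words per step over the common prefix of the keys, then falls back to the byte-wise "vanilla" loop for the tail, and both compute the same AND result.

```python
# Toy model of BITOP AND (illustrative; not the Redis implementation).

def bitop_and_bytewise(srcs):
    """The 'vanilla' loop: one byte at a time, zero-padding short keys."""
    maxlen = max(len(s) for s in srcs)
    out = bytearray(maxlen)
    for j in range(maxlen):
        b = srcs[0][j] if j < len(srcs[0]) else 0
        for s in srcs[1:]:
            b &= s[j] if j < len(s) else 0
        out[j] = b
    return bytes(out)

def bitop_and_wordwise(srcs):
    """The 'fast path': 4 machine words (32 bytes on a 64-bit build) per
    step over the common prefix, then the byte loop for the leftovers."""
    minlen = min(len(s) for s in srcs)
    maxlen = max(len(s) for s in srcs)
    chunk = 32  # sizeof(unsigned long) * 4 on a 64-bit build
    # memcpy(res, src[0], minlen) analogue, zero-padded out to maxlen:
    out = bytearray(srcs[0][:minlen]) + bytearray(maxlen - minlen)
    j = 0
    while minlen - j >= chunk:
        w = int.from_bytes(out[j:j+chunk], 'little')
        for s in srcs[1:]:
            w &= int.from_bytes(s[j:j+chunk], 'little')
        out[j:j+chunk] = w.to_bytes(chunk, 'little')
        j += chunk
    for k in range(j, maxlen):  # leftover bytes, vanilla style
        b = srcs[0][k] if k < len(srcs[0]) else 0
        for s in srcs[1:]:
            b &= s[k] if k < len(s) else 0
        out[k] = b
    return bytes(out)

srcs = [bytes([0xFF]) * 40, bytes([0xF0]) * 37, bytes([0x3C]) * 40]
assert bitop_and_wordwise(srcs) == bitop_and_bytewise(srcs)
```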


Josiah Carlson

Feb 6, 2015, 1:19:41 PM
to redi...@googlegroups.com
I don't know why this is the case off the top of my head, but the 3 operations that can be chained together (AND, OR, XOR) are associative and commutative, so you can chunk your bitops to store in a temporary string along the way. The following Lua script should work pretty well (untested) as long as you use a temporary key that isn't holding other useful data...

-- KEYS -> {dest, temp_key, key1, key2, ...}
-- ARGV -> {operation}
local dest = table.remove(KEYS, 1)
local tkey = table.remove(KEYS, 1)
while #KEYS > 16 do
    redis.call('bitop', ARGV[1], tkey, unpack(KEYS, #KEYS-15))
    for i = 1, 16 do
        table.remove(KEYS)
    end
    table.insert(KEYS, math.max(1, #KEYS-14), tkey)
end
local result = redis.call('bitop', ARGV[1], dest, unpack(KEYS))
redis.call('del', tkey)
return result

It won't be quite as fast with the 1 extra copy for every 16 strings, but at least that copy will be from a key that should already be in the CPU cache, assuming small enough keys (which is why we insert it in the keylist where we do).
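[Editor's note] The associativity/commutativity claim is easy to check outside Redis. A pure-Python toy (illustrative; `and_all` stands in for a BITOP AND call): folding 40 inputs in chunks of at most 16, carrying an accumulator between chunks, gives the same answer as one flat AND over all 40.

```python
from functools import reduce

def and_all(strs):
    # Byte-wise AND across equal-length byte strings, like BITOP AND.
    return reduce(lambda a, b: bytes(x & y for x, y in zip(a, b)), strs)

srcs = [bytes((i * 37 + k) % 251 for k in range(64)) for i in range(40)]

# One flat AND over all 40 inputs...
flat = and_all(srcs)

# ...equals chunked folds of at most 16 inputs per call, carrying a
# running accumulator between chunks -- the same trick as the Lua script
# uses with its temporary key.
acc = srcs[0]
for i in range(1, len(srcs), 15):
    acc = and_all([acc] + srcs[i:i+15])
assert acc == flat
```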

 - Josiah



Pavel Anossov

Feb 7, 2015, 7:57:23 AM
to redi...@googlegroups.com
That's exactly what I'm doing. I was just wondering if there was a particular reason, or if this should be reported as a bug.

Josiah Carlson

Feb 9, 2015, 4:51:39 PM
to redi...@googlegroups.com
Looking at the changelog that introduced the optimization[1], this is intentional. Using my guessing hat, I'd attribute it to two points: most users are not going to do these operations on more than 16 strings at a time, and scanning more than 16 bitstrings of 1MB+ at the same time would put too much pressure on currently-sized CPU caches.
