Speeding up range (and probably unique) for ints and chars

Scott Vokes

Jan 12, 2014, 8:09:23 PM1/12/14
to kona...@googlegroups.com
I just pushed a commit to a branch, "fast_range", that significantly
improves the speed of unique for C arrays, and for I arrays when the
max-min delta is not too large. All tests pass, and interactive
testing has not indicated any problems. Please look it over, merge if
you don't find any issues. (I think I got the code style right...)

https://github.com/kevinlawler/kona/tree/fast_range

The existing implementation did two calls to grade-up and some other
loops, but for char arrays, all that is really necessary is a lookup
table for the bytes 0x00-0xFF indicating whether they are present. A
second sweep can copy the first instance of the values over, in the
order they're encountered, and clear the flag. Much cheaper.
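A minimal C sketch of that two-pass scheme (function and variable names here are illustrative, not kona's actual code):

```c
#include <stddef.h>
#include <string.h>

/* Two-pass unique for a char array: pass 1 marks which byte values
 * occur; pass 2 copies the first instance of each value in encounter
 * order, clearing its flag so later duplicates are skipped. */
static size_t unique_chars(const unsigned char *x, size_t n, unsigned char *out)
{
    unsigned char seen[256];          /* small, fixed size: stack-allocated */
    memset(seen, 0, sizeof seen);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)    /* first sweep: mark presence */
        if (!seen[x[i]]) { seen[x[i]] = 1; count++; }
    size_t j = 0;
    for (size_t i = 0; i < n && j < count; i++)  /* second sweep: gather */
        if (seen[x[i]]) { seen[x[i]] = 0; out[j++] = x[i]; }
    return count;
}
```

On "abracadabra" this yields 5 values, "abrcd", in first-encounter order.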

It works the same way with ints, except there is an additional pass to
determine the min and max values, then a lookup table with max-min+1
slots is allocated. (Since unique_C's table has a small, fixed size,
it can just be stack-allocated.) This is a lot faster, particularly
when max-min is small. If max-min is prohibitively large, instead of
allocating a huge, very sparse lookup table, it falls back on
grade-up. A hash table or something similar could be used, though.
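A hedged C sketch of the int variant; the fallback threshold and all names here are made up for illustration (the actual cutoff and grade-up fallback live in the branch):

```c
#include <stdlib.h>

/* Int unique via a max-min+1 flag table: one extra pass finds the
 * bounds, then first occurrences are gathered in encounter order.
 * Returns the unique count, or -1 when the range is too sparse and
 * the caller should fall back on the grade-up path. */
static long unique_ints(const int *x, long n, int *out)
{
    if (n == 0) return 0;
    int lo = x[0], hi = x[0];
    for (long i = 1; i < n; i++) {        /* extra pass: min and max */
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    long span = (long)hi - (long)lo + 1;
    if (span > (1L << 24)) return -1;     /* don't allocate a huge sparse table */
    unsigned char *seen = calloc((size_t)span, 1);
    if (!seen) return -1;
    long j = 0;
    for (long i = 0; i < n; i++)
        if (!seen[x[i] - lo]) { seen[x[i] - lo] = 1; out[j++] = x[i]; }
    free(seen);
    return j;
}
```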

A slightly more complicated variant of this could speed up group, by
sweeping through and building an array of offsets to the next
instances of array elements (or -1 for NONE), then walking the linked
lists and appending the offsets to the group arrays.

Here are some benchmarks. (Times are on my mid-2011 macbook air.)

==================================================
/ benchmark char fast path for unique/range
a:"abcdefghijklmnopqrstuvwxyz"
vc:{v:,//x#,a;(-#v)?v}10000 / vector of repeated alphabet
\t ?vc
/ 15 msec vs. 1 msec

/ benchmark integer fast path for unique/range
vi:{(#v)?v:!x}100000 / large delta
\t ?vi
/ 10 msec vs. 2 msec

vi2:{(#x)?x}100000#!100 / small delta
\t ?vi2
/ 6 msec vs. 0 msec
==================================================

Also, unrelated, but would anybody be interested in running kona on
Travis CI? I can set it up. (https://travis-ci.org/)

take care,
Scott / @silentbicycle

Kevin Lawler

Jan 12, 2014, 8:50:53 PM1/12/14
to kona...@googlegroups.com
Good idea, nice improvement. The double sort always stuck out as
something that could be improved.

I would be interested in running Travis. Is there a cost I need to cover?

Kevin

Scott Vokes

Jan 12, 2014, 8:57:52 PM1/12/14
to kona...@googlegroups.com
Travis CI appears to be free for public github projects. We may need to
make minor adjustments (e.g. having the test suite runner exit with
nonzero status if any tests fail). You may need to do part of the
account setup, since kona is under your personal github account and
not an organization account, but it's a fairly quick process.
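For a C project the configuration is short; a minimal sketch of what a .travis.yml might look like, assuming `make` builds kona and a `make test` target exits nonzero on failure (exactly the adjustment mentioned above):

```yaml
language: c
compiler:
  - gcc
  - clang
script:
  - make
  - make test
```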

Scott

Bakul Shah

Jan 12, 2014, 9:35:25 PM1/12/14
to kona...@googlegroups.com
IIRC, ? for draw and deal were added in k4 so all of your tests fail in k3.

You could maybe use something like the following (only showing the first test):

cv: _ci' 97 + 1000000 _draw 26 / allocate a vector of 10^6 random chars
\t do[100;?cv]

This gives 128ms with k, 4390ms with kona (latest).

On Jan 12, 2014, at 5:09 PM, Scott Vokes <vok...@gmail.com> wrote:

> I just pushed a commit to a branch, "fast_range", that significantly
> improves the speed of unique for C arrays, and for I arrays when the
> max-min delta is not too large. All tests pass, and interactive
> testing has not indicated any problems. Please look it over, merge if
> you don't find any issues. (I think I got the code style right...)
>
> https://github.com/kevinlawler/kona/tree/fast_range
>
> The existing implementation did two calls to grade-up and some other
> loops, but for char arrays, all that is really necessary is a lookup
> table for the bytes 0x00-0xFF indicating whether they are present. A
> second sweep can copy the first instance of the values over, in the
> order they're encountered, and clear the flag. Much cheaper.

I think just one loop on the input vector is needed. The loop body:

v = x[i]; if (!seen[v]) { u[j++] = v; seen[v] = 1; }

u is the same size as x but only [0..j) values are filled. You
don't even have to initialize u to all zeroes! seen[256] is
zeroed at the start.
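Fleshed out into a compilable sketch (again with illustrative names), the one-loop version looks like:

```c
#include <stddef.h>

/* One sweep over the input: u has the same capacity as x, but only
 * [0..j) is filled. Only seen[] needs zeroing up front; u does not. */
static size_t unique_onepass(const unsigned char *x, size_t n, unsigned char *u)
{
    unsigned char seen[256] = {0};
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char v = x[i];
        if (!seen[v]) { u[j++] = v; seen[v] = 1; }
    }
    return j;
}
```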

But is this case worth optimizing?

> It works the same way with ints, except there is an additional pass to
> determine the min and max values, then a lookup table with max-min+1
> slots is allocated. (Since unique_C's table has a small, fixed size,
> it can just be stack-allocated.) This is a lot faster, particulary
> when max-min is small. If max-min is prohibitively large, instead of
> allocating a huge and a very sparse lookup table, it falls back on
> grade-up. A hash table or something similar could be used, though.

The seen vector above becomes a hash table when the input domain is
large. We would need a proper hash table for that, though.

> A slightly more complicated variant of this could speed up group, by
> sweeping through and building an array of offsets to the next
> instances of array elements (or -1 for NONE), then walking the linked
> lists and appending the offsets to the group arrays.

A linked list requires allocation, but you can use a vector. Something
like the following loop body:

v = x[i];
if (!count[v]) { start[j++] = v; }
count[v]++; // group element vector size
index[i] = link[v];
link[v] = i;

At the end j = size of the group vector. To finish up you loop on it:

group = allocate j entry vector of type 0
for (k = 0; k < j; k++) {
    v = start[k];
    group[k] = allocate count[v] element int vector
    /* now fill it *from the end* since link[v] points to the last index of x containing v */
}
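Put together as a compilable sketch, restricted here to values in 0..255 so the count/start/link vectors can be plain arrays (a real version would size them from max-min, or hash, as discussed):

```c
#include <stdlib.h>

#define DOMAIN 256

/* Group positions of equal values: count[v] sizes each group, vals[]
 * records values in first-encounter order, link[v] holds the last
 * index seen for v, and index[i] chains back to the previous index
 * with the same value. Groups are then filled *from the end* by
 * following the chains. Returns the number of groups. */
static long group_ints(const int *x, long n, int *vals, long **groups, long *sizes)
{
    long count[DOMAIN] = {0}, link[DOMAIN];
    long *index = malloc((size_t)n * sizeof *index);
    long j = 0;
    for (long i = 0; i < n; i++) {
        int v = x[i];
        if (!count[v]) vals[j++] = v;            /* first sighting */
        count[v]++;
        index[i] = count[v] == 1 ? -1 : link[v]; /* chain to previous i with v */
        link[v] = i;
    }
    for (long k = 0; k < j; k++) {
        int v = vals[k];
        sizes[k] = count[v];
        groups[k] = malloc((size_t)count[v] * sizeof(long));
        long p = link[v];
        for (long m = count[v]; m-- > 0; ) {     /* fill from the end */
            groups[k][m] = p;
            p = index[p];
        }
    }
    free(index);
    return j;
}
```

For x = {5,3,5,5,3} this produces groups {0,2,3} for 5 and {1,4} for 3, in first-encounter order.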

As in the unique case, vectors indexed by v can be replaced with
a hash table. [I think we can generalize dicts to allow any key,
and I think we'd need another vector for hash indices.]


Scott Vokes

Jan 13, 2014, 12:28:39 AM1/13/14
to kona...@googlegroups.com
> IIRC, ? for draw and deal were added in k4 so all of your tests fail in k3.
> You could maybe use something like the following (only showing the first test):
> cv: _ci' 97 + 1000000 _draw 26 / allocate a vector of 10^6 random chars
> \t do[100;?cv]
> This gives 128ms with k, 4390ms with kona (latest).
Well, I'm trying to speed up kona's existing interface. I don't have k3
or k4 executables on hand. (And, for IP reasons, maybe I shouldn't...)

> I think just one loop on the input vector is needed. The loop body:
>
> v = x[i]; if (!seen[v]) { u[j++] = v; seen[v] = 1; }
>
> u is the same size as x but only [0..j) values are filled. You
> don't even have to initialize u to all zeroes! seen[256] is
> zeroed at the start.
>
> But is this case worth optimizing?
I did it in multiple passes because I figured that there was more
benefit in having a short, appropriately sized vector of unique values
rather than having an equally long one artificially truncated. It seems
to me that unique/range is typically used to do things like grab the
small set of unique keys that appear many times throughout a database
column. Feel free to convince me otherwise, though.
But seen[256] _does_ need to be zeroed: it's stack-allocated and
likely to contain remnants of other stack frames. For the int
variant, it looks like memory is zeroed when returned to the pool,
but not when initially allocated. Am I misinterpreting unpool and
the like in km.c?
And, while char vectors are probably low priority to optimize, it's
not making anything else slower, and adding that code allowed some
to be deleted elsewhere. The int optimization seems more clearly
beneficial, due to the many ways k uses ints.

> The seen vector above becomes a hash table when the input domain is large.
> Though we need a proper hashtable.
Sure. Are you talking about behavior in k3/k4? I guess I don't
have enough context. I could set up a temporary hash table later,
at least for the sparse integer case.

> Linked list requires allocation but you can use a vector. Something
> like the following loop body:
>
> v = x[i];
> if (!count[v]) { start[j++] = v; }
> count[v]++; // group element vector size
> index[i] = link[v];
> link[v] = i;
Building linked lists in a vector is what I had in mind, yeah.

Scott

Bakul Shah

Jan 13, 2014, 1:44:17 AM1/13/14
to kona...@googlegroups.com

On Jan 12, 2014, at 9:28 PM, Scott Vokes <vok...@gmail.com> wrote:

>> IIRC, ? for draw and deal were added in k4 so all of your tests fail in k3.
>> You could maybe use something like the following (only showing the first test):
>> cv: _ci' 97 + 1000000 _draw 26 / allocate a vector of 10^6 random chars
>> \t do[100;?cv]
>> This gives 128ms with k, 4390ms with kona (latest).
> Well, I'm trying to speed up kona's existing interface. I don't have k3
> or k4 executables on hand. (And, for IP reasons, maybe I shouldn't...)

What I meant is _draw (or _deal) is more portable.

>> I think just one loop on the input vector is needed. The loop body:
>>
>> v = x[i]; if (!seen[v]) { u[j++] = v; seen[v] = 1; }
>>
>> u is the same size as x but only [0..j) values are filled. You
>> don't even have to initialize u to all zeroes! seen[256] is
>> zeroed at the start.
>>
>> But is this case worth optimizing?

Sorry, I was talking about just the char array case.

> I did it in multiple passes because I figured that there was more
> benefit in having a short, appropriately sized vector of unique values
> rather than having an equally long one artificially truncated. It seems
> to me that unique/range is typically used to do things like grab the
> small set of unique keys that appear many times throughout a database
> column. Feel free to convince me otherwise, though.

My logic is, for short arrays it doesn't matter. For very large arrays,
it is better to walk them *once* rather than twice. And the output will always
be shorter or the same (and can never exceed the domain of the element
type -- so 256 for chars). For ints etc. you can allocate in smaller chunks
if space is a concern, but my view is that assuming an array of the same
size can be allocated simplifies things (and if you don't have that space,
you run into other issues very quickly).

> But, seen[256] _does_ need to be zeroed, it's stack allocated and
> likely to contain remnants of other stack frames. For the int
> variant, it looks like memory is zeroed when returned to the pool,
> but not when initially allocated. Am I misinterpreting unpool and
> the like in km.c?

I think one should not rely on the allocator zeroing memory, as often
the array will be immediately filled with something more useful.

> And, while char vectors are probably low priority to optimize, it's
> not making anything else slower, and adding that code allowed some
> to be deleted elsewhere. The int optimization seems more clearly
> beneficial, due to the many ways k uses ints.
>
>> The seen vector above becomes a hash table when the input domain is large.
>> Though we need a proper hashtable.
> Sure. Are you talking about behavior in k3/k4/? I guess I don't
> have enough context. I could set up a temporary hash table later,
> at least for the sparse integer case.

No, just Kona. Currently dictionary lookup does a linear search.
This is fine for short dictionaries but not for large dictionaries.
As an example, I create a dictionary of 10K symbols mapping to
10K random numbers.

Kona:
\t x:10000#?`$'100000 5#_ci'97+500000 _draw 26
1328
\t d:.+(x;10000 _draw 100)
54
\t d[x]
2109

K3:
\t x:10000#?`$'100000 5#_ci'97+500000 _draw 26
85
\t d:.+(x;10000 _draw 100)
187
\t d[x]
115

Now the same for 20K symbols (K3 only):

\t x:20000#?`$'100000 5#_ci'97+500000 _draw 26
82
\t d:.+(x;20000 _draw 100)
827
\t d[x]
520

As you can see, the lookup time quadrupled (the dictionary
creation time grows even faster! Interesting...).

For general group and unique operations, search should dominate.
So I was saying two things:

1. Generalize dictionaries -- allow any type as key, not just syms
2. For large dictionaries add something more to create a proper
hash table (so that lookup becomes approx O(1)).

This may be worth it if you are doing lots of database-type
operations. But 2. has to maintain current semantics.
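As a rough illustration of point 2, an open-addressing table keyed on interned symbol pointers would make each lookup approximately O(1). This toy sketch is not kona's dict layout; every name in it is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

#define NSLOT 32768                   /* power of two, comfortably > entry count */

typedef struct { const char *key; int val; } Slot;   /* key: interned symbol */

/* Hash the symbol pointer itself (valid if symbols are interned),
 * then probe linearly until we hit the key or an empty slot. */
static size_t slot_for(const Slot *t, const char *k)
{
    uintptr_t h = (uintptr_t)k;
    h ^= h >> 16; h *= 0x45d9f3bu; h ^= h >> 16;
    size_t i = (size_t)h & (NSLOT - 1);
    while (t[i].key && t[i].key != k)          /* linear probe on collision */
        i = (i + 1) & (NSLOT - 1);
    return i;
}

static void dict_put(Slot *t, const char *k, int v)
{
    size_t i = slot_for(t, k);
    t[i].key = k;
    t[i].val = v;
}

static int dict_get(const Slot *t, const char *k)
{
    size_t i = slot_for(t, k);
    return t[i].key ? t[i].val : -1;           /* -1 stands in for "not found" */
}
```

The table (calloc'd so all keys start NULL) replaces the linear scan over all entries with a handful of probes, which is where the 10K-symbol lookup above loses its quadratic behavior.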