small challenge for the q gods


k.os.tao

Dec 10, 2010, 10:52:26 PM
to Kdb+ Personal Developers
given a large list of string phrases, /hundreds of thousands

k:("how fast";" ";"perform this";"job";...)

and given another, much larger, list of string phrases, /millions

v:("how fast";"can q";"perform this task";"?";"!";...)

how does one, with maximum performance, calculate the sum of all
occurrences of each k within all of v?

for example, in the above case, given k[0 1 2 3] and v[0 1 2 3 4], it
should yield:

"how fast" 1
" " 4
"perform this" 1
"job" 0

alternatively, v could be viewed as a single delimited string

to complicate things, the chars "*[]" are found within the elements of
both k,v

have tried something like the following:

v0:" "sv v
t:flip`phrase`num!(k;{count ss[v0;x]}peach k)

using various roundabout methods for dealing with "*[]" in k,v

need to significantly improve performance

Attila Vrabecz

Dec 13, 2010, 6:41:43 AM
to personal...@googlegroups.com
i think this is not a nice problem for q if the builtin ss is not good enough for you
you should write this directly in C
see the multi-pattern search part at http://en.wikipedia.org/wiki/Rabin-karp
  Attila



--
You received this message because you are subscribed to the Google Groups "Kdb+ Personal Developers" group.
To post to this group, send email to personal...@googlegroups.com.
To unsubscribe from this group, send email to personal-kdbpl...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/personal-kdbplus?hl=en.


Jack Andrews

Dec 13, 2010, 11:22:56 PM
to personal...@googlegroups.com
hi attila, k.os.tao,

if q is not so strong here, lex (or flex) will do a good
job at matching these sort of patterns. from q, you can
generate a lex file (call it n.l) that looks like:

%{
long n=0;
int main()
{for(;yylex()>0;);
 printf("counted %ld matches\n",n);
 return 0;
}
%}
%%
"how fast" n++;
"perform this" n++;
"job" n++;
.|\n ;
/**** EOF *****/

and build your scanner and run with test input:
$ flex -o n.c n.l && gcc n.c -o n -lfl && echo "how fast does it job
question?" |./n

counted 2 matches


ta, jack.

k.os.tao

Dec 14, 2010, 4:15:11 PM
to Kdb+ Personal Developers
attila, jack - thank you for your suggestions

i am enjoying exploring this task in q and am learning a lot in the
process.
i have defined various methods that work, with big speed improvements,
but which fail on larger lists due to the 32-bit address space limit.

a couple additional questions:

1. is there a faster way to get the following result?
{(key g)!count each value g:group x} where x is a large list

2. what is the best way to serialize a large dictionary of
dictionaries to disk?
it is too large to generate in memory all at once.
i am hoping to do something analogous to splaying and/or partitioning
with tables.


Attila Vrabecz

Dec 14, 2010, 4:23:17 PM
to personal...@googlegroups.com
q)count each group x

is not really faster, but less code
and by just looking at this it should be relatively obvious
that it won't be easy to find a faster expression if any

an idea would be doing something like this
q)l:(`int$())!`int$()
q)l[x]+:1

but that is actually slower


you cannot splay dictionaries
but you could just write down a two-column table and use that

Attila Vrabecz

Dec 15, 2010, 5:30:31 AM
to personal...@googlegroups.com
correction: depending on the density of x
l[x]+:1 might be actually faster
but not by much

ie
x:1000000?100
vs
x:1000000?10000
  Attila

k.os.tao

Dec 15, 2010, 12:30:05 PM
to Kdb+ Personal Developers
attila, thanks for the code-shortening suggestion...it's amazing how
much one coding change can simplify the entire routine once factored
in properly

Aaron Davies

Dec 15, 2010, 9:09:46 PM
to personal...@googlegroups.com
yes, getting a good handle on the looping behavior of dictionaries and tables is very helpful in reducing code size

for a function f and a dictionary d, "f each d" matches "(key d)!f each value d" -- a dictionary of the keys and the function applied to the values

for a table t, "f each t" matches "f each(cols t)!/:flip get flip t" -- a table of the same column names with the function applied to each of the "row dictionaries"

("f each't" thus matches "(cols t)!/:f each'flip get flip t", etc.)

keys#dictionary is another handy one most people find quite late -- for a dictionary d and a list of keys k, "k#d" matches "k!d k"
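the two dictionary identities paraphrase naturally into Python (plain dicts standing in for q dictionaries; all names here are my own illustration, and note that q's k#d would fill absent keys with nulls, which this sketch does not reproduce):

```python
# Python paraphrase of the q dictionary idioms quoted above.
# Plain dicts stand in for q dictionaries; names are illustrative.

def f_each(f, d):
    # q: "f each d" matches "(key d)!f each value d"
    return {k: f(v) for k, v in d.items()}

def take(ks, d):
    # q: "k#d" matches "k!d k" (q fills absent keys with nulls;
    # this sketch assumes every key in ks is present in d)
    return {k: d[k] for k in ks}

d = {"a": 1, "b": 2, "c": 3}
```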
-- 
Aaron Davies

k.os.tao

Dec 16, 2010, 3:09:32 PM
to Kdb+ Personal Developers
aaron, thanks for the tips, they are greatly appreciated.

the more i code in q, the more i want to learn k. i found an old k
reference manual (v2, from 1998!), is there a more recent version? if
not, any idea if the old one is still useful, or has the syntax
changed too much since then?

Attila Vrabecz

Dec 16, 2010, 3:17:52 PM
to personal...@googlegroups.com, Kdb+ Personal Developers
It is close enough, there is nothing else - and it is very well written
k is much more fun and beautiful than q
You will have fun

Attila

Aaron Davies

Dec 16, 2010, 9:17:17 PM
to personal...@googlegroups.com
> the more i code in q, the more i want to learn k. i found an old k
> reference manual (v2, from 1998!), is there a more recent version? if
> not, any idea if the old one is still useful, or has the syntax
> changed too much since then?

it's vaguely useful, but a lot of things have changed in k4

your best bet is to start reading q.k and working out k equivalents to
everything you write in q

(and conversely whenever you break inside a k function stop and try to
figure out exactly how it works)

some things convert very easily, e.g. "count each group x" from above is "#:'=x"

some things are harder :)

k.os.tao

Dec 31, 2010, 12:45:33 PM
to Kdb+ Personal Developers
attila, jack, aaron:

thanks for your comments and suggestions. i hope everyone enjoyed the
winter solstice festivities.

regarding this task, after restructuring the approach i have been able
to cut the run-time down from 5 1/2 hours (when using brute-force ss
in parallel in q) to about 50 seconds (using the function f below),
and without the dreaded wsfull decapitation. the original brute-force
attempt, made by a friend in .net, was estimated to take 2 weeks to
complete.

in the above benchmark numbers i am calculating all starting indices
for exact matches of approximately 300,000 keyphrases (word sequences
with counts 1-30) within a 100MB xml document which, after stripping
out the xml values, is equivalent to a single string (or symbol
vector) containing about 20 million words.

find the function below:

f:{((key d)i)!r i:where not(0#0N)~/:r:{(1+neg count x)+{y where y in 1+x}over x}peach value d:(x where all each x in key gk)#gk:(k where(k:distinct raze x)in key gv)#gv:group y};

/f produces a dictionary of (keyphrases from x where exact matches
exist in y) -> (lists of starting indices for the matches), where
/x is the lookup keyphrase list, as a list of symbol (word) lists, and
/y is the lookup value list, as a single list of symbols (words)

<< i believe the above result to be correct, although i have not
rigorously checked its correctness. >>
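the shape of the algorithm in f can be paraphrased in Python (my own names, a sketch of the logic rather than the author's code): group each word of y to its positions, chain the position lists so each word must sit one index past the previous word, then shift the surviving last-word indices back to start indices:

```python
from collections import defaultdict

# Python paraphrase of what f appears to do:
#   gv:group y maps each word to its positions; a phrase match is a
#   chain of positions where each word's index is one past the previous
#   word's, and surviving last-word indices are shifted back by
#   (phrase length - 1) to yield start indices.

def phrase_starts(phrases, words):
    positions = defaultdict(list)          # ~ gv:group y
    for i, w in enumerate(words):
        positions[w].append(i)
    out = {}
    for phrase in phrases:
        if not all(w in positions for w in phrase):
            continue                       # ~ pruning phrases via "in key gv"
        idx = positions[phrase[0]]
        for w in phrase[1:]:               # ~ {y where y in 1+x} over
            nxt = {p + 1 for p in idx}
            idx = [p for p in positions[w] if p in nxt]
        if idx:                            # ~ dropping phrases with no match
            out[tuple(phrase)] = [p - (len(phrase) - 1) for p in idx]
    return out
```

the pruning and the single pre-grouping pass over y are what avoid rescanning the whole value list per keyphrase, which is where the brute-force ss approach spends its time.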

the original brute force method in q looked like:

{v ss x}peach k;
/k is a list of strings of words separated by spaces (keyphrases)
/v is a single string of words separated by spaces (lookup values)

i am sure the function f could be optimized further with code
improvements, logical refinements, further use of parallel
calculation, etc.

running a similar routine over more cores (here 4 were used) and where
parallel dictionary lookup does not lead to wsfull decapitation, it
could run much, much faster.

<< a tangent: how cool would it be if one could natively run k
operations over nvidia's massively multicore gpu cards? gpeach anyone?
other (inferior, in my opinion) interpreters (e.g. mathematica,
matlab) have some gpu functionality in place. >>

my conclusion thus far: the k language is pure genius. it allows one
to perform abstract mathematical operations naturally and concisely
with tremendous performance and scalability and a near-zero system
footprint. while its primary usage seems to be in finance for high
performance datafeed and database operations, i see no reason why it
could not be used for anything - data mining, scientific research,
etc. modifying the above function f to do regular expressions and
general functional string operations should be fairly straightforward.

Happy New Year everyone,

`k.os