Question: what do the array language q and shen have in common?


fuzzy wozzy

Sep 24, 2016, 4:01:10 AM
to Shen

according to the overview chapter of the book q for mortals (free preview on amazon),
q uses the list as its fundamental data construct...

some might say that shen doesn't have, and doesn't use, arrays...
maybe the professional version of shen has built-in arrays, but not the free version,

qi had built-in arrays, invoked with a command like "(make-array)" or some such,
but even the free shen has lists, being of the lisp family, so even the free shen can
do array programming with or without built-in arrays,

shen may not have this or that yet, but in a sense shen is a shapeshifter as well as
a paradigm-shifter: it may not have ABC features yet, but that doesn't mean it won't have them,
so it lacks nothing... the prime example I can think of is the day someone lamented, or inquired,
whether shen does object-oriented programming; at the time it didn't, and dr. tarver wrote a few lines of shen
to introduce the object-oriented paradigm to shen, in one afternoon I believe, so...

willi thought doing array programming with lists was the most ridiculous thing he'd heard at the time it was suggested,
but now it turns out that q, one of the most advanced array languages out there,
has been happily using lists for its array programming all along...


Neal Alexander

Sep 24, 2016, 1:19:40 PM
to Shen
Lists can be implemented in a number of ways, all with different random-access complexity. Q doesn't use cons lists with O(n) index operations, right?

fuzzy wozzy

Sep 25, 2016, 4:52:06 AM
to Shen

I am quite clueless when it comes to determining the O(n) of an indexing operation for a given language;
does it have to do with randomly performing queries on a database to see how much time they take?

fuzzy wozzy

Sep 25, 2016, 4:52:07 AM
to Shen

once a shenturian by the screen name of "jdp" asserted that using shen lists for arrays is not very efficient,
giving an example comparing shen lists vs. shen vectors using the function "from", as in,
(from 2 [a b c d]) = c,

at first, it appeared as though the shen vector was faster when querying with (from) on a database of 10000 or so elements,
and it was claimed that the list-array did not even match O(n) while the shen vector did, but when stripped of
all the extraneous functions prepended to the vector example, which the list-array did not need, the list-array's performance exceeded
all expectations -- for example, fetching the 19 millionth element from a list of 20 million elements took less than 4 seconds on an
old and clunky computer with no concurrency... that may or may not be up to par with q's version of list-array performance,

dr. tarver himself admitted after the above comparison that tail recursion in shen can be very fast; I don't know all the exact
reasons for such speed, but one gets the impression that it might be even faster than regular common lisp, by far...

so, yes, even with a powerful platform such as shen, one can force it to give slower performance, by refusing to use
tail recursion, or by using it in an ineffective way,

q is not an open-source language; its source code may only be available to its most trusted clients,
and even that doesn't seem very likely. there was a project called kona, started to emulate k, q's predecessor,
in an open-source setting, but the project doesn't seem very active for the time being...
but if q's consing and tail-recursion method is faster than shen's, kudos to arthur whitney




fuzzy wozzy

Sep 25, 2016, 4:52:07 AM
to Shen

https://github.com/kevinlawler/kona

does the fib example give negative numbers if you give it a number higher than 45, such as "fib 46"?
I don't know what the code is doing, but it's very strange... (my guess: 32-bit integer overflow, since
fibonacci values around there exceed 2^31 - 1 and wrap to negative)... or maybe I need to upgrade my computer...

shen can give the fibonacci value for the nth number with (fibo n) in something like 2 seconds or less
when n < 50000, on the repl; otherwise it goes into stack overflow...
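
if the overflow comes from the recursion not being a tail call, a tail-recursive version might avoid it; a minimal sketch (hypothetical names, not necessarily the (fibo) I've been timing):

(define fibo
  \\ (fibo N) returns the Nth fibonacci number, with (fibo 0) = 0
  N -> (fibo-h N 0 1))

(define fibo-h
  \\ A and B carry consecutive fibonacci numbers; the recursive call is a tail call
  0 A _ -> A
  N A B -> (fibo-h (- N 1) B (+ A B)))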


Samuel Falvo II

Sep 25, 2016, 1:11:19 PM
to Shen
On Sunday, September 25, 2016 at 1:52:07 AM UTC-7, fuzzy wozzy wrote:
at first, it appeared as though the shen vector was faster when querying with (from) on a database of 10000 or so elements,
and it was claimed that the list-array did not even match O(n) while the shen vector did, but when stripped of
all the extraneous functions prepended to the vector example, which the list-array did not need, the list-array's performance exceeded
all expectations -- for example, fetching the 19 millionth element from a list of 20 million elements took less than 4 seconds on an
old and clunky computer with no concurrency... that may or may not be up to par with q's version of list-array performance,

If you're just hitting element 19000000, then you're taking 4 seconds too long.  Most likely due to cache thrashing:

  • Assume a perfect CPU that is able to walk a linear list one node every clock cycle.  To reach the end of your list, you'll need 20e6 clock cycles.
  • At 2.5GHz, this process would take 8ms to complete.  That's milliseconds.
  • You stated it takes 4 seconds to walk 19e6 elements.  Close enough to 20e6 for my purposes.  Dividing 4.000 s by 0.008 s gives a factor-of-500 disparity from the expected latency.
In other words, Shen is taking about 500 times longer than expected, compared against a perfect memory subsystem at 2.5GHz, which is a reasonable clock speed for today's CPUs, as most run between 2.4GHz and 2.6GHz.  Recall that access to L3 cache can take as long as 400 cycles per cache miss.  This is well within the ballpark, and is the most likely culprit.  I'll attribute the excess factor of 100 to the sum of Shen runtime overhead and operating system overhead (comparisons and loop bookkeeping take non-zero time, as do task switches in a multitasking OS).  I'm going to wager your program's performance is being destroyed by cache thrashing.

dr. tarver himself admitted after the above comparison that tail recursion in shen can be very fast; I don't know all the exact
reasons for such speed, but one gets the impression that it might be even faster than regular common lisp, by far...

Tail-call optimization does two things:

1.  It converts a subroutine call into an unconditional jump, thus not consuming any stack in the process.
2.  It re-uses the memory that accepted the current parameters to pass new parameters to the next iteration.  This is good because it means no hits to external memory due to cache misses (well, not quite; let's just say cache misses become highly improbable).  Many good-quality compilers can even convert TCO code into SSA or CPS form, identify common registers, and avoid passing parameters in memory altogether in some restricted cases.

In other words, TCO code and imperative code are exactly identical.  It's just that TCO code is somewhat easier to reason about because it preserves functional semantics of the language it's used in.

Thus, it can never be faster than the equivalent imperative code (given identical code generators for both), since the instruction sequences generated by the compiler will be the same.  But it definitely can be a whole lot more convenient to code and debug.
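
As a concrete Shen sketch (hypothetical names, not anyone's posted code): the first definition below consumes a stack frame per element, because the addition happens after the recursive call returns; the second carries an accumulator, so the recursive call is a tail call that the compiler can turn into a jump.

(define len
  \\ not tail-recursive: (+ 1 ...) runs after the recursive call returns
  [] -> 0
  [_ | Xs] -> (+ 1 (len Xs)))

(define len-t
  \\ tail-recursive: the accumulator A carries the partial count
  Xs -> (len-h Xs 0))

(define len-h
  [] A -> A
  [_ | Xs] A -> (len-h Xs (+ A 1)))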

 
q is not an open-source language; its source code may only be available to its most trusted clients,

J, another array-based language, is much closer to APL, and thus has true arrays and no lists as Shen programmers would define them.  However, it does support boxing things up, so you can (manually) create an array of boxes, which is what Q and K use natively under the hood.  J is also open source.

While Arthur Whitney is a well-respected programmer, and for good reason, we also need to give credit where it's really due: Kenneth E. Iverson, inventor of APL and of J (APL's heir), and a close working associate of Arthur's.

Samuel Falvo II

Sep 25, 2016, 1:11:50 PM
to Shen
On Sunday, September 25, 2016 at 1:52:06 AM UTC-7, fuzzy wozzy wrote:

I am quite clueless when it comes to determining the O(n) of an indexing operation for a given language;
does it have to do with randomly performing queries on a database to see how much time they take?

Big-O Notation is what you need to research to learn more.  Wikipedia has a decent explanation.  I'll only provide an executive summary here.

Let's say you implement a dialect of Shen that stores lists in a user-configurable data structure (pretend this dialect of Shen is used for a comp-sci project and this reconfigurability is used for benchmarking purposes in your research).

If you want to find an element in a traditional cons-list, that is O(n), because (regardless of how complex the software backing it is) you visit each item in the list one by one, linearly.

If you use a vector, that is O(1), again regardless of how many CPU instructions are used, because it works by indexing (if you know the ordinal position already) or hashing (if you don't).  You calculate the index into the vector, then directly dereference the structure to get at the data.  There is no faster means of accessing data, because this translates directly to the underlying CPU architecture's memory-reference instructions.  For example, on x86 architectures, you can use a CPU instruction like MOV EAX,[EBX+ECX*4] to grab any 32-bit element indexed by ECX, starting at base address EBX in memory, and put it into EAX.  If this is in cache, it is a 4-cycle operation.  With CPUs running at 3GHz to 6GHz (in the case of IBM mainframes), you can easily see why nothing beats an array for access speed!
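
You can see the same idea from the Shen side with the standard vector built-ins; a minimal sketch:

(set v (vector 3))         \\ make a 3-element vector; Shen vectors index from 1
(vector-> (value v) 1 a)   \\ destructively store the symbol a at index 1
(vector-> (value v) 2 b)
(vector-> (value v) 3 c)
(<-vector (value v) 2)     \\ O(1) direct indexing; returns b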

There are algorithms which are O(log_k n) too.  We could, for example, decide to store an association list internally as a skip-list or B-tree.  When looking up an association, the runtime might start in the middle of the data set and, based on a comparison there, decide to exclude some (large!) portion of the data set, choosing to focus only on what's left.  This happens recursively until either all the data is exhausted or a match is found.
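
A Shen sketch of that divide-and-conquer idea (hypothetical representation: a node is [Key Value Left Right], an empty tree is the symbol leaf; assuming numeric keys):

(define bst-lookup
  \\ each comparison discards a whole subtree, so lookup is O(log n) when the tree is balanced
  _ leaf -> not-found
  K [K2 V _ _] -> V where (= K K2)
  K [K2 _ L _] -> (bst-lookup K L) where (< K K2)
  K [_ _ _ R] -> (bst-lookup K R))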

There are some algorithms which are O(n^2), such as Bubble Sort, because you end up needing to visit data elements with sufficient frequency that as your data set doubles (say), you'll find you need four times the visits to arrive at a solution.  At work not long ago, a coworker had to fix an algorithm that inadvertently turned out to be O(n^n)!

Finally, note that big-O notation applies independently to run-time and to resource consumption.  An algorithm can be very simple, using O(1) memory but taking O(n^2) run-time (e.g., bubble sort), or the reverse can happen (e.g., CPU page tables and other forms of tries).  Then there are "constant factors" to worry about: given two O(1) algorithms, which is faster?  (E.g., arrays and hash tables are both O(1), but hash tables have the additional hashing overhead.)

And then there are hidden hardware factors too.

If your working set is below 32KiB in size, for example, and is "hot" in L1 cache, you may find that a bubble sort or linear search is actually faster than a qsort or binary-tree walk for your needs.  This is because when data is in L1 cache, access costs only 4 CPU clock cycles per datum, versus about 60 to 400 (plus 4 per datum) when it's pushed out to L2/L3.  If the data is in SDRAM, the cost goes up astronomically from there, and latency becomes variable due to other devices competing for access to memory; I've heard estimates ranging from 800 cycles to over 2000, depending on the hardware configuration.  This is especially relevant if you intend to run data-heavy applications on a machine with a lot of network I/O, such as a database server.  The effect actually gets worse as CPU clock speeds and core counts increase: SDRAM access periods seem fairly well stuck in the 35 to 70ns range, so the faster the CPU, the more clocks are wasted waiting for SDRAM to deliver its first byte of data.

Q, being a derived work from APL, implements its "lists" as arrays of boxed objects under the hood.  That is, a "list" like [1 2 3 4] will take up 4 words of memory, plus some fixed overhead for garbage collection and other metadata.  Access to the vector contents is done by direct indexing, both for access performance as well as superior cache utilization.  This is true as well for [1 2 "HELLO" ["wor" "ld!"]].

The runtime is smart enough to optimize a vector's representation as well.  For example, that same list [1 2 3 4] can easily be seen to be a vector of integers, so Q stores it in memory as four unboxed integers, one after another.  This is why Q's killer application, financial database queries, is so much faster than any of its competitors that I'm aware of.  Storing tables in a column-major manner allows for queries to run at CPU/memory bandwidth speeds, since it fully exploits cache locality.
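
A toy Shen illustration of the row-major versus column-major idea (hypothetical names, nothing Q-specific):

(set rows [[ibm 100 2016] [aapl 200 2016]])  \\ row-major: one record per row

(set syms   (@v ibm aapl <>))                \\ column-major: one vector per column,
(set prices (@v 100 200 <>))                 \\ so a scan over prices touches only
(set years  (@v 2016 2016 <>))               \\ the price column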

As long as you store information as cons cells in memory, you'll never get to these levels of performance.  Following a lengthy list can thrash your L1 cache if the list nodes are not stored efficiently.  Even if they're placed adjacent in memory, you'll still pack only half as much data into a cache line, leading to increased thrashing.  Thankfully, Shen's memory model doesn't care about the underlying representation; that concern is delegated to KLambda.  I remember a few papers documenting how to conveniently abstract cons-lists as vectors with no loss of generality, a nice benefit coming out of the Lisp machine R&D of the mid-80s.

fuzzy wozzy

Sep 25, 2016, 4:36:41 PM
to Shen

thank you for the thoughtful answer, I'll have to read it several times, I think...
I would have to agree that 4 seconds to find the 19000000th element in a list of 20 million elements is too long;
it was a while ago that I timed it, so I timed it again just now, and it's actually 2.247 seconds,

(time (find 19000000 (iter0 20000000)))
2.247 secs

(iter0), formerly called (it0), creates the list of the iteration from 0 to 19999999, which takes 1.731 seconds, so the
actual time (find) spends getting to the 19000000th element is 0.515 seconds at most; that's probably 0.515 seconds too long as well,
but of more concern, to me at least, is that the shen list goes into stack overflow once the list size grows beyond 25 million or so,

so this might sound silly, but one way to avoid frequent stack overflows, while eliminating the 4-second barrier at the same time,
might be to break the list of 20 million elements into a list of 20 lists of 1 million elements each, or some such, with the first element of each of the
20 lists being a flag of some sort, to show that the nth element one is searching for is in that particular list...

so, instead of,
list a = [0 1 2 3 4 5 ... 19999997 19999998 19999999]

it would be something like,

list a = [[1 0 1 2 3 4 5 ... 999999]
          [2 1000000 1000001 1000002 ... 1999999]
          [3 2000000 2000001 ... 2999999]
          ...
          [20 19000000 19000001 ... 19999999]]

then the first element of each list can be used to flag the correct list in which to find the nth element,
without having to walk through every element of each mini list; this might not help much with stack overflow,
but it might make the 4-second barrier obsolete... I haven't tried it yet, just a quick thought...

how large a database does q have to work with before it runs into a 4-second barrier?





fuzzy wozzy

Sep 25, 2016, 4:36:43 PM
to Shen

actually, indexing each sublist is not necessary; for instance, take the above list a, but without the index values,

list a = [[0 1 2 3 4 5 ... 999999]
          [1000000 1000001 ... 1999999]
          ...
          [19000000 19000001 ... 19999999]]

to query,

(find 19000324 (a))

first you take the 19 from 19000324 and call the 19th element of list a,

(find 19 (a)) \\ this gives the sublist [19000000 19000001 ... 19999999]

then take the rest of the number, which is 000324, or 324, and call the 324th element of the sublist given by (find 19 (a)),

(find 324 (find 19 (a))) \\ this gives the answer without having to walk through non-relevant prior sublists...

this makes it easy to estimate the required clock-cycle time, since 19 + 324 = 343 consings;
if 20000000 clock cycles take 8 ms, then 343 clock cycles might take 1.372e-4 ms, or 0.0001372 ms,
plus the shen overhead and other things... (is that how it's calculated? it's probably easy just to time it, but...)

this was all discussed in detail some years ago, but no one seemed too interested; the only thing it did was make some people
uneasy that lists, not vectors, should be used for array programming...
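
here is a minimal sketch of that two-level lookup as a single function (hypothetical names from2 and from2-h, assuming million-element sublists and the (from) accessor discussed earlier):

(define from2
  \\ (from2 N Ls) finds element N in a list Ls of million-element sublists
  N Ls -> (from2-h N 1000000 Ls))

(define from2-h
  \\ skip whole sublists until N falls inside the head sublist
  N S [L | _] -> (from N L) where (< N S)
  N S [_ | Ls] -> (from2-h (- N S) S Ls))

so (from2 19000324 (a)) walks 19 sublist cells plus 324 cons cells, matching the 343-consing estimate above.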




fuzzy wozzy

Sep 26, 2016, 8:02:27 AM
to Shen

well, I tried out the list of sublists... very little difference, clocking in at 2.05 seconds...

(time (from 999998 (from 19 (a)))) where list a is a list of 20 sublists, each with 1 million elements...
so I guess the list in free shen is about 64.3 times slower than it needs to be for querying a list of 20 million elements;
shen professional is supposed to be lightning fast, using vectors, hundreds of thousands of array calculations per second,
so no one probably cares about shen lists for arrays, probably never did, rightfully so, perhaps...

fuzzy wozzy

Sep 26, 2016, 8:02:27 AM
to Shen

the absence of a time difference in the previous post is due to all 20 sublists being generated:
even when calling the last sublist with (from 19 (a)), in the course of going from 0 to 19, all the
sublists are created, hence no time difference. but here is a way, if one must, to create only the sublist
called for, in this case the 19th sublist; these are things I forgot and am re-remembering...

(define i -> (iter0 1000000))  \\ (i) builds a list of 1 million elements each time it's called

\\ each element in list a names a zero-place function; calling, say, (a0) activates
\\ its list of 1 million elements...
(define a -> [a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19])

\\ now define each sublist (unless they're already defined)
(define a0 -> (i))
(define a1 -> (i))
(define a2 -> (i)) \\ let's define the rest the same all the way to a18
...................
(define a18 -> (i))
(define a19 -> (reverse (i))) \\ 999998th element of (a19) is "1"

\\ then, to query the 999998th element of sublist 19 without generating or activating all the other sublists along the way,

(time (from 999998 ( (from 19 (a)) )))  \\ (from 19 (a)) returns the 19th element of (a), which is the symbol a19;
                                        \\ the extra "()" calls (a19), building only that one sublist
run time: 0.343 secs 
1   \\ the answer is as expected but took a lot less time than before.

you could define a0 through a100 (or a1000?), each with 20 million elements; that's like having a list of 2 billion elements,
and you could still query a few times before stack overflow kicks in...



fuzzy wozzy

Sep 26, 2016, 8:02:27 AM
to Shen

somehow I got stuck writing (find) when I meant (from); I don't know of any (find) function,
but (from) was posted here a couple of years ago, and I could post it here if anyone wants
to try out the example; same goes for (iter0)...
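
for reference, minimal sketches of what they might look like (hedged; the definitions posted back then may differ):

(define from
  \\ 0-indexed list access: (from 2 [a b c d]) = c
  0 [X | _] -> X
  N [_ | Xs] -> (from (- N 1) Xs))

(define iter0
  \\ (iter0 N) builds the list [0 1 ... (- N 1)]
  N -> (iter0-h (- N 1) []))

(define iter0-h
  \\ tail-recursive: counts down, consing onto the accumulator
  N Acc -> Acc where (< N 0)
  N Acc -> (iter0-h (- N 1) [N | Acc]))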

fuzzy wozzy

Sep 26, 2016, 8:02:28 AM
to Shen

I forgot to multiply the above answer by 64.3, since (from) took 0.515 seconds instead of 8 ms: (/ 0.515 0.008) = 64.3,
so multiplying 0.0001372 ms by 64.3 gives 0.00882196 ms,

I think that, to fully calculate the time, one should measure the time it takes to reach the next-to-last element in a list of 20 million elements,
then compare that time to 8 ms; and when you break the large list into a list of sublists, again the next-to-last element of the
19th sublist should be queried...
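
something like this, for instance (assuming from and iter0 as sketched earlier, and building the list first so construction isn't timed):

(set big (iter0 20000000))          \\ build the 20-million-element list once
(time (from 19999998 (value big)))  \\ time only the walk to the next-to-last element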



Mark Tarver

Sep 26, 2016, 8:07:07 AM
to Shen
Well, I think we've covered this enough.  Using lists for DBs is not bonkers, but flat lists are bonkers for large DBs.  Trying to access the nth element quickly is what vectors were made for.

Mark

fuzzy wozzy

Sep 26, 2016, 11:06:58 AM
to Shen

old news to most shenturians, but with the above method for querying, the search time remains constant, at about 0.34 seconds
in the above example, whether you're querying a (combined) list of 20 million elements or 20 trillion...