* Paul Dietz | The point you are missing is that the car (and caar) you load for the | alist are not for the current cell, but for previous cells. There's no | dependency between this iteration's cdr, car, and caar. You can delay | the car and caar because they're not on the critical path.
Who do you think you are kidding?
| In contrast, there *is* a dependency between the cdr and the cddr for the | plist, and this dependency cannot be gotten rid of.
But that is just like the dependency between the car and the caar!
Please work this out for plists, too. You _will_ be suprrised.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
Erik Naggum <e...@naggum.net> writes: > * "Paul F. Dietz" > | No, exactly the same thing cannot be done with plists.
> Nonsense. Have you worked this out for plists? Yes or no?
> | In the plist, there is only one 'off the critical path' car > | per *two* cdrs. There simply isn't enough extra work to fit > | into the dead space after each load on the critical path.
> Nonsense.
Can't someone just write a simple test and time it and then publish the various code sequences in question for review? Most things in the world are too big and too abstract for such an exercise, but this is a rare case that is small enough and concrete enough that you could just test it by brute force. It might waste less time than a "did not" "did so" "did not" "did so infinity" kind of back and forth interchange.
Erik Naggum wrote: > | The point you are missing is that the car (and caar) you load for the > | alist are not for the current cell, but for previous cells. There's no > | dependency between this iteration's cdr, car, and caar. You can delay > | the car and caar because they're not on the critical path.
> Who do you think you are kidding?
I am not kidding. I admit I am failing to penetrate your confusion.
I suggest you look up the compiler optimization called 'software pipelining'. What I'm suggesting it just an application of that well-known technique, which stretches out and overlaps iterations so as to avoid stalls. In this case the alist algorithm can be more effectively pipelined because it has more work off the critical path than the plist algorithm.
> | In contrast, there *is* a dependency between the cdr and the cddr for the > | plist, and this dependency cannot be gotten rid of.
> But that is just like the dependency between the car and the caar!
They are not the same, Mr. Naggum. The dependency between the CAR and the CAAR is not on the critical path. We can wait several iterations between computing (CAR X) (for some cell X) before we compute (CAAR X). We *cannot* do the same thing for the CDR and the CDDR.
> Can't someone just write a simple test and time it and then publish > the various code sequences in question for review? Most things in the > world are too big and too abstract for such an exercise, but this is a > rare case that is small enough and concrete enough that you could just > test it by brute force. It might waste less time than a "did not" > "did so" "did not" "did so infinity" kind of back and forth interchange.
I wasted far too much time today doing just that. On my Pentium Pro 199Mhz (if the label is to be believed) I put in Erik's `alist-get' and `plist-get' implementation. I tested them on an 11 element alist or plist (as appropriate) and timed how long it took to fetch the 9th and 10th elements. I subtracted these two to get an estimate of how long it takes for one iteration. (safety 0) (speed 3)
alist-get: 40ns per iteration plist-get: 40ns per iteration
It appears that reality agrees with Erik.
This makes sense to me: Every iteration on an alist does 2 CARs and 1 CDR, while every iteration on a plist does 2 CDRs and 1 CAR. There is no reason to think a call to CAR is more or less expensive than a call to CDR. Given that this process is probably bound by the memory bandwith, 3 memory cycles is the probably the best one can do, no matter what order you do them it. And since the pentium pro can do out-of-order execution (I believe the pro can, I know the later ones can) seems likely that a fairly optimal execution sequence is happening in either case.
* Paul Dietz | I am not kidding. I admit I am failing to penetrate your confusion.
Oh, I see. You already know the one true answer, so bothering to read what I post is not necessary to you.
| I suggest you look up the compiler optimization called 'software | pipelining'.
Arrogant dipshit.
| What I'm suggesting it just an application of that well-known technique, | which stretches out and overlaps iterations so as to avoid stalls.
CAN YOU WORK IT OUT FOR PLISTS, TOO? YES OR NO?
| In this case the alist algorithm can be more effectively pipelined | because it has more work off the critical path than the plist algorithm.
No, this is simply _wrong_. Just get it, will you?
| They are not the same, Mr. Naggum. The dependency between the CAR and | the CAAR is not on the critical path.
Really?
| We can wait several iterations between computing (CAR X) (for some cell | X) before we compute (CAAR X). We *cannot* do the same thing for the CDR | and the CDDR.
Yes, it is _exactly_ the same thing. Why do you keep claiming this only for alists? You have not worked it out for plists, so you have no clue what you are talking about. What you have done for alists, can be done for plists. Just try it out, and you will see how simple this really is!
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
Erik Naggum <e...@naggum.net> writes: > * Paul Dietz > | The point you are missing is that the car (and caar) you load for the > | alist are not for the current cell, but for previous cells. There's no > | dependency between this iteration's cdr, car, and caar. You can delay > | the car and caar because they're not on the critical path.
> Who do you think you are kidding?
> | In contrast, there *is* a dependency between the cdr and the cddr for the > | plist, and this dependency cannot be gotten rid of.
> But that is just like the dependency between the car and the caar!
> Please work this out for plists, too. You _will_ be suprrised.
Simple example that does NOT agree with real hardware, but which should illustrate the _potential_ for more parallelism in the alist case:
Suppose you can "spawn" as many sub-threads as you like at any point, but following a pointer has a fixed cost. In the plist case, you need at least 2n time steps to traverse n key-value pairs because the fetch operations (mostly CDRs) are all data-dependend on their respective predecessors. In the alist case, one thread could "run down" the spine of the alist in roughly n steps, spawning a new thread at every CAR. The subthreads then independently check to see if their respective key is the one that we are looking for. This latter work can effectively overlap with the guy running down the spine, so the total time can be less than in the plist case.
On real hardware, _if_ the key comparison is cheap (e.g., EQ), then each of those subthreads will live only very shortly, so the amount of effective parallelism that needs to be supported by the hardware is bounded (and rather small).
Obviously, as Joe's tests demonstrate, getting any real advantage out of this on a real system is hard. It would require tight coding, hardware that can actually do something like the above, at least in some sense, and would probably still not matter for relatively short lists.
By the way, this is all pretty silly anyway. If I had to implement a dictionary data structure where the number of keys is large enough so that the above would matter, then I would *definitely* use something else that has better theoretical (and practical) complexity. (If there is some natural ordering on keys, then for example, red-black trees come to mind.)
Erik Naggum wrote: > | What I'm suggesting it just an application of that well-known technique, > | which stretches out and overlaps iterations so as to avoid stalls.
> CAN YOU WORK IT OUT FOR PLISTS, TOO? YES OR NO?
Yes I can. In fact, I've already told you the result. For plists you can't pipeline the loads as effectively.
This is a *trivial* consequence of the structure of the dependency graph of the loads. Both graphs have ~3 N nodes, but the length of the critical path in the alist search is N vs. 2N for the plist.
> | They are not the same, Mr. Naggum. The dependency between the CAR and > | the CAAR is not on the critical path.
> Really?
Yes, really (except for the final cell).
> Yes, it is _exactly_ the same thing. Why do you keep claiming this only > for alists? You have not worked it out for plists, so you have no clue > what you are talking about.
As Mr. Pitman requested, let's see the code. Here's a software pipelined implementation of (assoc ... :test #'eq), concentrating on the case of a non-null key. You will notice that in the main loop (not counting epilogue code), between any load and the first use of its result you will find at least two other loads. You can't do as well with plist search -- there aren't as many other loads to fit between the cdrs.
If we're on a machine in which (1) load latency dominates, and (2) three loads can be in progress at once, then this algorithm will take time N+O(1) (in units of load latency). *Any* plist algorithm would take time 2 N in this model of computation -- you can't start one critical path load until the previous one is completed, and there are 2 N such loads.
(defun assoc-eq (key alist) (if (null key) (assoc-eq-nil alist) ;; Software pipelined version of assoc :test #'eq on a non-nil key (prog (k1 k2 p x1 x2 y1 y2)
;; Loop prologue -- fill the pipeline ;; I haven't optimized this prologue
;; ;; At this point, the following invariants hold: for some i, ;; ;; (eq p (nth i alist)) ;; (eq k (car p)) ;; (eq y1 (nth (1+ i) alist)) ;; (eq x1 (nthcdr (+ i 2) alist)) ;; (consp x1) ;;
(setq x2 (cdr x1) ; critical path load y2 (car x1) k2 (car y1)) (when (eq k1 key) (return p)) (when (atom x2) ;; Fell off end of list -- flush the pipeline (return (cond ((eq k2 key) y1) ((eq (car y2) key) y2) (t nil)))) (setq p y1)
;; The loop has been unrolled once so that the first ;; use of y2 is delayed. ;; The following is identical to the previous ;; except that ?1 & ?2 variables have been swapped (setq x1 (cdr x2) ; critical path load y1 (car x2) k1 (car y2)) (when (eq k2 key) (return p)) (when (atom x1) ;; Fell off end of list -- flush the pipeline (return (cond ((eq k1 key) y2) ((eq (car y1) key) y1) (t nil)))) (setq p y2)
(go loop))))
(defun assoc-eq-nil (alist) ;; This can be implemented similarly, but we also must check ;; that the element of alist is not nil. ;; For now, this is just a placeholder. (assoc nil alist :test #'eq))
> If the machine is such that memory bandwidth is more > important than memory latency, then of course the point > I was making does not apply.
As I've said, my understanding of this area is weak, but perhaps then Joe would have had better luck if he were searching a circular list, since then the memory in question might be in a cache and would not dominate the time spent? This, in turn, might allow you to loop out of control more efficiently with circular alists than with circular plists, if I understand the claim. ... Just an idea.
Perhaps not as useless as it sounds. The C crowd often makes sales over the Lisp crowd when they claim that although they can't make some of the kinds of tools we regularly do, they can at least do it faster... (Maybe they don't put it quite that way, though.)
<pit...@world.std.com> wrote: > Erik Naggum <e...@naggum.net> writes:
> > * "Paul F. Dietz" > > | No, exactly the same thing cannot be done with plists.
> > Nonsense. Have you worked this out for plists? Yes or no?
> > | In the plist, there is only one 'off the critical path' car > > | per *two* cdrs. There simply isn't enough extra work to fit > > | into the dead space after each load on the critical path.
> > Nonsense.
> Can't someone just write a simple test and time it and then publish > the various code sequences in question for review? Most things in the > world are too big and too abstract for such an exercise, but this is a > rare case that is small enough and concrete enough that you could just > test it by brute force. It might waste less time than a "did not" > "did so" "did not" "did so infinity" kind of back and forth interchange.
Using MCL 4.3.1:
? (setf l1 '(a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8)) (A 1 B 2 C 3 D 4 E 5 F 6 G 7 H 8) ? (setf l2 '((a . 1) (b . 2) (c . 3) (d . 4) (e . 5) (f . 6) (g . 7) (h . 8))) ((A . 1) (B . 2) (C . 3) (D . 4) (E . 5) (F . 6) (G . 7) (H . 8)) ? (getf l1 'h) 8 ? (assoc 'h l2) (H . 8) ? (time (dotimes (i 1000000) (assoc 'h l2))) ;Compiler warnings : ; Undeclared free variable L2, in an anonymous lambda form. (DOTIMES (I 1000000) (ASSOC 'H L2)) took 192 milliseconds (0.192 seconds) to run. NIL ? (time (dotimes (i 1000000) (getf l1 'h))) ;Compiler warnings : ; Undeclared free variable L1, in an anonymous lambda form. (DOTIMES (I 1000000) (GETF L1 'H)) took 740 milliseconds (0.740 seconds) to run. Of that, 10 milliseconds (0.010 seconds) were spent in The Cooperative Multitasking Experience. 200 bytes of memory allocated. NIL ?
In article <3C9A8338.DCD81...@interaccess.com>, "Paul F. Dietz"
<di...@interaccess.com> wrote: > Joe Marshall wrote:
> > It appears that reality agrees with Erik.
> If the machine is such that memory bandwidth is more > important than memory latency, then of course the point > I was making does not apply.
Right. Was your benchmark walking over a long a/p-list or iterating multiple times over a short one? When doing the latter on a G4 so everything is in cache, plists were much slower (740 ms vs 191 ms for 1000000 iterations over an 8-element a/p-list). (Transcript was posted in the previous thread.)
> One reason is that memory utilization is no longer the tight issue it > used to be; another thing the mention is that CDR-coding might > actually slow down certain common pointer traversal idioms.
But with the ever increasing gap in speed between cpus and memory, it might become an issue again. I'm almost sure someone will reinvent CDR-coding, and think of it as the best thing since sliced bread.
g...@jpl.nasa.gov (Erann Gat) writes: > Using MCL 4.3.1:
> ? (setf l1 '(a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8)) > (A 1 B 2 C 3 D 4 E 5 F 6 G 7 H 8) > ? (setf l2 '((a . 1) (b . 2) (c . 3) (d . 4) (e . 5) (f . 6) (g . 7) (h . 8))) > ((A . 1) (B . 2) (C . 3) (D . 4) (E . 5) (F . 6) (G . 7) (H . 8)) > ? (getf l1 'h) > 8 > ? (assoc 'h l2) > (H . 8) > ? (time (dotimes (i 1000000) (assoc 'h l2))) > ;Compiler warnings : > ; Undeclared free variable L2, in an anonymous lambda form. > (DOTIMES (I 1000000) (ASSOC 'H L2)) took 192 milliseconds (0.192 seconds) > to run. > NIL > ? (time (dotimes (i 1000000) (getf l1 'h))) > ;Compiler warnings : > ; Undeclared free variable L1, in an anonymous lambda form. > (DOTIMES (I 1000000) (GETF L1 'H)) took 740 milliseconds (0.740 seconds) to run. > Of that, 10 milliseconds (0.010 seconds) were spent in The Cooperative > Multitasking Experience. > 200 bytes of memory allocated. > NIL
We should look at this carefully, Erann. It is possible you are measuring the wrong thing here. Let me suggest the possible problem:
First, I assume that MCL is compiling the forms that it is running. If not, then that is the first place to start, since we shouldn't want to see the interpreter running. However, I am assuming that no interpreter is running here, otherwise I would have expected the &key processing in the assoc call to vastly outweigh the &optional processing of the getf.
However, I would still recommend that you put each of these dotimes forms into a defun and compile it (if necessary) and then disassemble it, to see what is _really_ being run.
I suspect that assoc is being optimized, by being rewritten to call some internal function (let's call it assoc-eql), because it is being called with no keyword arguments. It is even likely that there is no function call at all, and the assoc is simply being expanded in open code.
But the getf is likely not being optimized. I supect this because of the nature of the Stanford Benchmarks (i.e. the Gabriel benchmarks), which do contain calls to assoc but which do not contain any calls to getf. Getf, therefore was never one that any of us vendors cared to optimize, because it was never under microoptimization scrutiny.
If my guesses are correct, then what you have is a highly optimized version of assoc (possibly even open-coded) and a non-optimized getf, which at least has optional processing.
-- Duane Rettig Franz Inc. http://www.franz.com/ (www) 1995 University Ave Suite 275 Berkeley, CA 94704 Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)
* Matthias Blume | Suppose you can "spawn" as many sub-threads as you like at any point, | but following a pointer has a fixed cost.
Are you trying to make me pretend that spawning a sub-thread is free and following pointer has costs? This is the part I simply reject.
| In the plist case, you need at least 2n time steps to traverse n | key-value pairs because the fetch operations (mostly CDRs) are all | data-dependend on their respective predecessors. In the alist case, one | thread could "run down" the spine of the alist in roughly n steps, | spawning a new thread at every CAR. The subthreads then independently | check to see if their respective key is the one that we are looking for. | This latter work can effectively overlap with the guy running down the | spine, so the total time can be less than in the plist case.
I see that several people have been working this out on a mythical computer, not extant hardware. I am frankly completely unimpressed with the arguments that have been made so far. One does not optimize theoretical code for theoretical computers.
| On real hardware, _if_ the key comparison is cheap (e.g., EQ), then each | of those subthreads will live only very shortly, so the amount of | effective parallelism that needs to be supported by the hardware is | bounded (and rather small).
Well, at least they have to live long enough to grab the cons cell pointed to by the car and the contents o the car of that cell. Now, it is all well and good to have a CPU that can pipeline anything you want, but we are actually trying to talk to the same _memory_, right? And that is not the memory in either L1 or L2 cache, right? How many requests can the _memory_ pipeline from data that are going to be very close? I mean, my CPU has 32-byte cache lines, meaning it will usually not do any less than 32-byte fetches, but the crucial thing is to prefetch _both_ car and cdr of all cells visited in order to obtain maximal memory performance. To make this work, one would have to do massive analysis of the addresses that are actually fetched. This is actually possible as long as one has to fetch from real memory instead of L1 or L2 caches, or even L3 caches.
In order to make this a testable condition, there is no point in scanning some short list. The list has to be at least 1 million entries in length for the test conditions to ensure that all data cache lines are flushed. To make that work, there would be some savings in cdr'ing ahead to fill half the L1 cache so the cons cells, which I assumme will occupy 2 adjacent words, will be in the L1 cache when the car visits it. However, in total, there will be exactly as many memory references with a plist and an alist, because they use exactly as many cons cells.
I maintain that on, e.g., the IA32, which I have spent some time learning how to optimize the living daylight out of, you will never see any difference in performance between an alist and a plist. There is no magic in this, just the willingness not to "suppose" one can do something that no processor actually does.
| Obviously, as Joe's tests demonstrate, getting any real advantage out of | this on a real system is hard. It would require tight coding, hardware | that can actually do something like the above, at least in some sense, | and would probably still not matter for relatively short lists.
None of the compiled code I have seen has used prefetch instructions, and the pipeline on, e.g., IA32, is not long enough to see these actualy memory fetch coming. Therefore, current tests exhibit nothing of the kind we would like to prove in this exercise.
In order to measure the impact of any savings, the best approach is probably to have a forerunner that visits both cars and cdrs in an alist and the cdrs in a plist. It should probably be running at _least_ 8 cons cells ahead of their use, but something like 100 will probably make more sense, in order to give the memory time to fill up the cache lines. This also indicates that there is very significant savings to be had from starting the storage of any object on cache line boundaries.
But suppose your list is short enough that all cons cells can fit in L1 cache. The best optimization approach is simply to call length on it to move all the cons cells into L1 cache, and then to make sure that they stay there between every time you need to scan the list. If you manage to expire the cache lines between calls to either assoc or getf, there is no point in this exercise at all, since the next time it is needed, you will have to go out and prefetch it, anyway, and since both assoc and getf must return the first occurrence, the likelihood of prefetching memory that you will never use is pretty high.
| By the way, this is all pretty silly anyway. If I had to implement a | dictionary data structure where the number of keys is large enough so | that the above would matter, then I would *definitely* use something else | that has better theoretical (and practical) complexity.
Yup. I have been trying to argue that assoc is more expensive because it is more way more general and may do a _lot_ more work than getf, which could be as simple as my plist-get. To get a user call to assoc be as tight as my alist-get requires a lot of effort in both compiler and user code.
I think, however, that I have demonstrated exhaustively that there is no performance difference between alist and plist and that performance should never enter the picture when deciding which to use. For instance, performance is likely to be good for get-properties and destructuring using keyword arguments, neither of which have a parallell for alists.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
* "Paul F. Dietz" | This is a *trivial* consequence of the structure of the dependency graph | of the loads. Both graphs have ~3 N nodes, but the length of the | critical path in the alist search is N vs. 2N for the plist.
This is just plain wrong. Please understand that we are doing assoc, not member, on alists, OK? The car of an alist is just as much on the critical path as the cdr of a plist. This is frighteningly obvious if you spend the time to __study it rather than make up your mind and then spend your time to "prove" it.
| As Mr. Pitman requested, let's see the code. Here's a software | pipelined implementation of (assoc ... :test #'eq), concentrating | on the case of a non-null key.
What does "concentrating on the case of a non-null key" mean?
| You will notice that in the main loop (not counting epilogue code), | between any load and the first use of its result you will find at least | two other loads. You can't do as well with plist search -- there aren't | as many other loads to fit between the cdrs.
I have asked you to work this out for plists, too, but you keep posting only code relevant to alists. I am not interested in what you have to say about alists as long as you make _no_ effort to show what you can do similarly with plists. Can you please understand this?
| If we're on a machine in which (1) load latency dominates, and (2) three | loads can be in progress at once, then this algorithm will take time | N+O(1) (in units of load latency). *Any* plist algorithm would take time | 2 N in this model of computation -- you can't start one critical path | load until the previous one is completed, and there are 2 N such loads.
*sigh*. Yes, that is precisely what you can do. There is no difference.
Work it out similarly for plists. CAN YOU DO THIS, PLEASE!? Christ!
Your code is extremely unlikely to have any effect, by the way. If you have to grab data from external memory, you need _way_ more than two intervening loads. An L1 cache hit is basically free, so the point is to get stuff into the L1 cache. This does not happen any differently in your code than in a straight version of alist-get and plist-get that I posted.
I wonder why you are so stubborn about something that is clearly wrong.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
* Duane Rettig <du...@franz.com> | We should look at this carefully, Erann. It is possible you are | measuring the wrong thing here. [...] : | If my guesses are correct, then what you have is a highly optimized | version of assoc (possibly even open-coded) and a non-optimized getf, | which at least has optional processing.
This is why I wrote my own alist-get and plist-get and even wrote the tagbody out in full and tweaked it considerably. The overhead of using assoc and getf with all their (hugely different) generality completely wipes out any savings in memory access unless the lists are _really_ long, as in: _way_ beyond the threshold at which to switch to hashtables.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
> ? (setf l1 '(a 1 b 2 c 3 d 4 e 5 f 6 g 7 h 8)) > (A 1 B 2 C 3 D 4 E 5 F 6 G 7 H 8) > ? (setf l2 '((a . 1) (b . 2) (c . 3) (d . 4) (e . 5) (f . 6) (g . 7) (h . 8))) > ((A . 1) (B . 2) (C . 3) (D . 4) (E . 5) (F . 6) (G . 7) (H . 8)) > ? (getf l1 'h) > 8 > ? (assoc 'h l2) > (H . 8) > ? (time (dotimes (i 1000000) (assoc 'h l2))) > ;Compiler warnings : > ; Undeclared free variable L2, in an anonymous lambda form. > (DOTIMES (I 1000000) (ASSOC 'H L2)) took 192 milliseconds (0.192 seconds) > to run. > NIL > ? (time (dotimes (i 1000000) (getf l1 'h))) > ;Compiler warnings : > ; Undeclared free variable L1, in an anonymous lambda form. > (DOTIMES (I 1000000) (GETF L1 'H)) took 740 milliseconds (0.740 seconds) to run. > Of that, 10 milliseconds (0.010 seconds) were spent in The Cooperative > Multitasking Experience. > 200 bytes of memory allocated. > NIL
* Duane Rettig | We should look at this carefully, Erann. It is possible you are | measuring the wrong thing here.
Not only that, but the "Cooperative Multitasking Experience" seems like a fairly suspicious thing to include in such a test. One would hope that at least a sufficient number of samples to produce statistically significant results would have been included. (I seem to recall a lot of noise about statistical significance.) The above really says nothing -- for all we know, the system could have been swapping like mad, and that would be system time instead of user time, but such is not distinguished. A suffiently large sample would have removed such unintended 'spikes" from the data set.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
> If we're on a machine in which (1) load latency dominates, and > (2) three loads can be in progress at once, then this algorithm > will take time N+O(1) (in units of load latency). *Any* plist > algorithm would take time 2 N in this model of computation -- you > can't start one critical path load until the previous one is > completed, and there are 2 N such loads.
OK, I see it now (although I think that the alist time is N+2). The 2N load latency for the plist was the critical piece that I was missing. I even had to draw myself a picture in order to see it well. Perhaps it will help others to see my picture also.
What is critical is that the machine has the ability to do three memory references at once.
Note that the picture has the same number of cons cells in each list type, so you would think that parallel loads could be done one both equally efficiently. However, given tN where N is the ordinal of a best-case memory cycle, the best we would be able to do go all the way through the 4-element plist (with a failure to find anything) is the time it takes to get to key3 (k3) which occurs at t7 (or 8 cycles). The alist could get k3 by t5 (or 6 cycles). Note also that the alist is using all three loads at once, whereas the plist is only doing two loads max at any one time. This explains the experimental data that I and others got (although I never posted any of my data). If the machine can only handle two loads at once, there will be no difference in timings, because one of the three tN loads for the alist will always be stalled.
t0 t1 t2 t3 ----- ----- ----- ----- | | |->| | |->| | |->| | | ----- ----- ----- ----- | | | | v t1 v t2 v t3 v t4 ----- ----- ----- ----- |k0|v0| |k1|v1| |k2|v2| |k3|v3| ----- ----- ----- ----- t2 t3 t4 t5
-- Duane Rettig Franz Inc. http://www.franz.com/ (www) 1995 University Ave Suite 275 Berkeley, CA 94704 Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)
In article <41yedf67z....@beta.franz.com>, Duane Rettig <du...@franz.com> wrote: > We should look at this carefully, Erann. It is possible you are > measuring the wrong thing here. Let me suggest the possible problem:
> First, I assume that MCL is compiling the forms that it is running.
Yes, it does.
[Snip]
These are good points nonetheless. Here's the disassembly. I'll leave it as an excercise for the reader to figure out which is which :-)
It's interesting, though, that the builting assoc function being called is named *assq, which historically might imply :test #'eq instead of the default :test #'eql. If you have supplied the list to it as a constant alist, it could have implied that :test #'eq was good enough, since all of the keys are symbols. I'm surprised, though, that it didn't just unroll the loop inline. But perhaps it would have if you had compiled with higher speed/safety ratio.
> (DOTIMES (I 1000000) (FOO)) took 782 milliseconds (0.782 seconds) to run. > Of that, 11 milliseconds (0.011 seconds) were spent in The Cooperative > Multitasking Experience. > 40 bytes of memory allocated. > NIL > ? (time (dotimes (i 1000000) (baz))) > (DOTIMES (I 1000000) (BAZ)) took 241 milliseconds (0.241 seconds) to run. > NIL
So it seems like your test isn't checking against the potential speed of assoc vs getf, but only the current default speed of the implementation, probably due to optimizations for the Gabriel Benchmarks.
-- Duane Rettig Franz Inc. http://www.franz.com/ (www) 1995 University Ave Suite 275 Berkeley, CA 94704 Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)
> > Load car and cdr in parallel, load caar when car is available,
> [ plist ]
> > Load car and cdr in parallel, load cddr when cdr is available.
> > Am I missing something? These seem identical to me.
> The point you are missing is that the car (and caar) you > load for the alist are not for the current cell, but for > previous cells. There's no dependency between this iteration's > cdr, car, and caar. You can delay the car and caar because > they're not on the critical path.
(ian ponders for some hours) So you're saying, for X being "here", the memory pattern of a-list can go like:
* Duane Rettig | What is critical is that the machine has the ability to do three memory | references at once.
That is indeed a critical piece of information. I have reasoned based on explicit prefetch instructions because the pipeline is not long enough to _sustain_ three outstanding requests, especially considering the destructive effect of taken branches on the pipeline, and considering the amount of time it takes to fetch from external memory compared to the on-chip L1 cache, which will hold the entire cons cell in the same cache line, unles allocation is really screwed up. Unrolling loops has the significant disadvantage that execution needs more memory references.
I have severe problems seeing how you can possibly achieve code with three outstanding requests at all times. Not to mention the severe shortage of registers on IA32. Are there any CPUs that can _actually_ achieve this? If none, it is an interesting exercise in How Things Should Be, but not in optimizing code for real processors. Sadly, I am far more interested in actual CPUs than imaginary ones, but since acquiring a working knowledge of a processor family is pretty much depth-first, I have only working knowledge of one current architecture, and I am probably not up to speed on what IA32 can do, either. From what I can find in the documentation, however, I am unable to write the code for this architecture that will _actually_ achieve more than two concurrent accesses to implement assoc (or my simplified alist-get) even though the CPU is supposed to be able to handle four.
There are two obvious reasons for this: (1) memory accesses that end up with a L1 cache hit are not worth optimizing further, and (2) if you have grabbed one of the car and the cdr of a cons cell, the other access is free. If you have to hit the L2 cache or worse, have to request it from real memory, you still get a 64-bit unit back from the memory subsystem. In other words, if you keep moving two cdrs ahead in a plist, the car is there for free in a plist, whereas the alist must access both car and cdr of the given cell. Yes, there is savings to be had in pipelining this, but it has no practical edge over that in plists, because we are not optimizing just the processor pipeline, we have to request the memory and have something that takes some time to get, before this matters.
I sort of regret having to be incredibly concrete and low-level, but that is where these things actually happen. It is very convenient to optimize some abstract access model, but it has to be implemented and actually be possible to generate instructions for it such that it actually matters. I have yet to see that be true.
I guess what I keep saying is that optimizing for a real processor is not some theoretical task that can ignore the actual processor.
You alluded to measurements that indicated a material difference, Duane, and I would just love to see that and the code that produced it on various processors.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
* Duane Rettig | It's interesting, though, that the builting assoc function being | called is named *assq, which historically might imply :test #'eq | instead of the default :test #'eql.
But (eql <whatever> <symbol>) is identical to (eq <whatever> <symbol>), so there is no point in using assoc with eql when you know that you search for something for which eql and eq are identical. I note that Allegro CL in all versions I have cared to look at, collapse eql to eq when the type of one of the operands is not of type number or character.
| If you have supplied the list to it as a constant alist, it could have | implied that :test #'eq was good enough, since all of the keys are | symbols. I'm surprised, though, that it didn't just unroll the loop | inline. But perhaps it would have if you had compiled with higher | speed/safety ratio.
Lispworks generally does not inline even car or cdr as I have seen it, either, while Allegro CL makes them a single instruction. (Good job!)
| So it seems like your test isn't checking against the potential speed of | assoc vs getf, but only the current default speed of the implementation, | probably due to optimizations for the Gabriel Benchmarks.
It is instructive to see the cost of benchmarks affect program design by almost "misrepresenting" the case for a particular design choice.
/// -- In a fight against something, the fight has value, victory has none. In a fight for something, the fight is a loss, victory merely relief.
Duane Rettig wrote: > It's interesting, though, that the builting assoc function being > called is named *assq, which historically might imply :test #'eq > instead of the default :test #'eql. If you have supplied the list > to it as a constant alist, it could have implied that :test #'eq > was good enough, since all of the keys are symbols.
The key he was searching for was a symbol, so :test #'eq was acceptable regardless of the keys in the alist.
Erik Naggum wrote: > | This is a *trivial* consequence of the structure of the dependency graph > | of the loads. Both graphs have ~3 N nodes, but the length of the > | critical path in the alist search is N vs. 2N for the plist.
> This is just plain wrong. Please understand that we are doing assoc, not > member, on alists, OK? The car of an alist is just as much on the > critical path as the cdr of a plist. This is frighteningly obvious if > you spend the time to __study it rather than make up your mind and then > spend your time to "prove" it.
Mr. Naggum, no, it is not just plain wrong. *You* are just plain wrong. *Study* that algorithm I posted. Build the dependency graph of the loads. You will see that the maximum path length in the alist graph is N+O(1) vs. 2N+O(1) for the plist.
> | As Mr. Pitman requested, let's see the code. Here's a software > | pipelined implementation of (assoc ... :test #'eq), concentrating > | on the case of a non-null key.
> What does "concentrating on the case of a non-null key" mean?
NIL is a special case, since an alist can contain NILs as well as conses. The test (eq (car pair) key) would incorrectly succeed on such elements if key were nil. The algorithm therefore handles the null key case separately, and I did not optimize that case.
> | *Any* plist algorithm would take time > | 2 N in this model of computation -- you can't start one critical path > | load until the previous one is completed, and there are 2 N such loads.
> *sigh*. Yes, that is precisely what you can do. There is no difference.
I've presented a trivial proof that you cannot (it's right in front of you there.)
How about you show me an algorithm that does plist search in better than 2N+O(1) load latency?
> I wonder why you are so stubborn about something that is clearly wrong.