David Bakhash <ca...@mit.edu> writes:

> Is there a way to get sizeof information about an object?  i.e. how
> would I find out how many bits it takes to store something?
Lisp is not about this. It is not a language for describing memory usage--it is a language for describing what you want to do, while allowing the implementation to manage memory. Most commercial implementations give you access to other languages if you want the other behavior, but Lisp is defined in ways that allow the implementation to optimize or rearrange memory behind your back. Hash tables are an excellent example: in some implementations they change shape internally and even use different lookup tactics depending on what you store. For example, they might observe that you've stored integers very densely and substitute an array. Or they might use a simple alist rather than a hashing mechanism if the cross-over point for using the hash function at all has not been exceeded. In light of this, what does it mean to ask the size of an object? Suppose I said it's called sizeof--show me a program you'd want to write that uses the answer that would come back...
> > Is there a way to get sizeof information about an object?  i.e. how
> > would I find out how many bits it takes to store something?
> exceeded.  In light of this, what does it mean to ask the size of an
> object?  Suppose I said it's called sizeof--show me a program you'd
> want to write that uses the answer that would come back...
okay, Kent. You asked for it...
first off, I'm writing a compression algorithm whose optimizing cost function is nothing more than actual memory storage. Thus, at the end of the day, I want it to find an encoding for a hunk of data which minimizes _literally_ the amount of memory used. But this is not so simple. An example...
suppose I were trying to compress some text, and I noticed that it looked like:

"abcabcabd"
if I used the length of the encodings as the only contributor to cost, then I'd probably end up with a symbol set like:
'("a" "b" "c" "d" "abc")
in which case, the encoding of the above string would be:
'(4 4 0 1 3)
But, in building the encoding table, if I knew the _overhead_ of creating a whole new string (i.e. (sizeof "")) and used that in the cost function, then who knows? The symbol set might end up looking like:
'("a" "b" "c" "d" "ab")
and, the encoding for the above string would be:
'(4 2 4 2 4 3)
So, I'm not actually using it for the coding -- i.e. it's probably not gonna be in the code; it's just there to give me an idea of the overhead of certain low-level data structures.
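To make the idea concrete, here is a sketch (mine, not from the post) of the kind of cost function being described; STRING-OVERHEAD stands in for whatever number a hypothetical (size-of "") would report, and 16 is just a guess:

```lisp
;; Sketch of a table-plus-encoding cost function.  Each table entry
;; pays a fixed per-string overhead plus one unit per character; the
;; encoded text pays one code per emitted symbol.
(defun symbol-set-cost (symbol-set encoding &key (string-overhead 16))
  (+ (loop for s in symbol-set
           sum (+ string-overhead (length s)))
     (length encoding)))

;; With overhead 0 the longer entry "abc" looks free to add; with a
;; large overhead a smaller table can win even though the resulting
;; encoding is longer:
;;   (symbol-set-cost '("a" "b" "c" "d" "abc") '(4 4 0 1 3))
;;   (symbol-set-cost '("a" "b" "c" "d" "ab")  '(4 2 4 2 4 3))
```

The point is only that the overhead term changes which table the optimizer prefers, which is exactly why the poster wants a number for it.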
David Bakhash <ca...@mit.edu> writes:

> first off, I'm writing a compression algorithm whose optimizing cost
> function is nothing more than actual memory storage.  Thus, at the end
> of the day, I want it to find an encoding for a hunk of data which
> minimizes _literally_ the amount of memory used.  But this is not so
> simple.  An example...
And you're assuming, I guess, that (a) the system isn't also optimizing memory storage for you and that (b) there isn't sharing involved. To understand MY point of view, you have to understand that the language I have worked to create is not one that you have to give up as compilers get smarter. It might make sense in a string context to assume that strings are pretty much laid out as you expect, although I doubt that, even if you tried hard, you could find any place in the spec where it said they were. If I do (make-array (1+ most-positive-fixnum)) you might assume it will blow out for lack of space, but what if it just makes a sparse array? Is the implementation forbidden from this? No. I think it's purely an issue of commercial competition. Things people want will succeed and things people don't will fail, but the language is not trying to force an implementation not to provide services to the user, so the language doesn't say "if you ask for a big array, it will take correspondingly much space". Rather, it says "if you ask for an array of a given size, you'll get back an object which, when you use AREF, will get you the elements". But AREF is not obliged to do that as a memory access, and indeed implementations can differ on what they do storage-wise if you ask for element type (unsigned-byte 7).
On the topic of commercial implementation, btw, one implementation a while back chose to implement various potentially destructive operations as non-destructive (i.e., NREVERSE just used REVERSE, etc.) since the side-effect is not required. That was conforming. It also wasn't a popular design choice. I'd be surprised if they didn't end up changing it because the market demanded it. But if they didn't, then presumably it was a good commercial choice for their chosen market base. Indeed, whether an implementation even offers a compiler (or an interpreter) is not defined. Some offer both. Some only one or the other. The semantics of the operations are defined, not how the implementation must work. The market sorts out what the good and bad choices are.
Anyway, what you say makes some sort of vague sense if you ask for size-of (I like it with a hyphen, btw) of some string, but then you can pretty much use LENGTH. And to the extent that you can't, you have to ask what the stuff is that is not accessible to LENGTH. What if I have an implementation that uses a certain size header but compiles new code periodically and then updates the header size to be more compact in a case where a particular kind of header is being used a lot? You're assuming that now and for all time, headers are decided at compile time, but need they be? And if they're not, what happens if you store the objects in an object and then the system changes something behind your back?
You probably think I'm talking science fiction, but am I? Let's look at cdr-coding. Someone made an observation a long time ago that lists have a lot of NIL's in them, and on the lisp machine a lot of storage is saved by, instead of storing things in symmetric words, storing things in a word with 2 bits of info saying what the cdr is. Four bit configurations are possible, and they decode to: "the next word is the cdr", "the next word points to the cdr", "the cdr is nil" (with no "other word" needed anywhere), and "you're in error if you think this thing has a cdr". To the user who doesn't know any of this, these lists behave exactly as they would in any other lisp... except that generally I bet storage comes out AUTOMATICALLY more compact than in other implementations. Yet I bet if you ran your portable storage optimizer on it, you'd risk pessimizing the storage. For example, in a cdr-coded implementation, doing a RPLACD sometimes conses, because sometimes there isn't room enough in the cdr to store the thing you need and words have to get shuffled around to make room. You'd think this was fatal, but the statistical claim made is that even though some programs do RPLACD and that sometimes causes consing, it still doesn't matter, because most programs don't, and the benefits outweigh the risk.
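For the curious, the four cdr codes can be modeled in a few lines (a toy of my own devising, not any actual Lisp Machine's layout):

```lisp
;; Toy model of cdr-coding.  Each cell is a word plus a 2-bit code:
;;   :next     - the immediately following word is the cdr
;;   :indirect - the following word POINTS to the cdr
;;   :nil      - the cdr is NIL; no extra word is stored at all
;;   :error    - this word is not a list cell
(defstruct word
  payload     ; the car stored in this word
  cdr-code)   ; one of :next, :indirect, :nil, :error

(defun cdr-code-list (list)
  "Encode LIST as a vector of cdr-coded words: every cell but the
last gets :next, the last gets :nil, so an n-element proper list
takes n words instead of the 2n a cons-pair representation needs."
  (coerce (loop for (item . rest) on list
                collect (make-word :payload item
                                   :cdr-code (if rest :next :nil)))
          'vector))
```

The :indirect code is what makes RPLACD possible at all in such a scheme, and is exactly the case where a destructive operation can cons.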
Cdr coding is just one of many mechanisms that I think a system can use to optimize things IF the user doesn't get it in his head that the precious moments of his life are best spent trying to accidentally work against the system's ability to do this. You can't have two non-cooperating parties "optimizing" something, because they will destroy each other's good work. And the idea in Lisp is that the user is supposed to work on the domain things and leave the memory layout to the system. That may be a bad claim, but I think it is fundamental to Lisp's design and a key reason why Lisp programmers stay "above the fray" and focused on their domain problems, while other attacks on the same domain problems in other languages get bogged down in memory leaks, etc.
It's not that you can't model memory optimization. It's not that you can't contact a vendor and find out how they lay out memory and assure yourself that you're not working against THAT vendor. But the language as structured portably is not about telling vendors how to do their job, and so when you get done, you should NOT assume that other implementations will lay out stuff similarly.
By the way, memory layout is not the only thing to optimize, and this is another reason you should consider not trying to second-guess it. Paging behavior may be another thing to optimize, and you have no idea how that is done. EVEN IF I told you that an object took up 2 bytes, you still wouldn't know if those two bytes are contiguous, split into separate spaces to take advantage of some parallel architecture, broken over a page boundary, etc. Sometimes these matter. But never portably. Because architectures vary. Also, certain size objects might work well for certain operations because they are tuned to a particular internal machine instruction--getting a more compact data layout may get you worse performance in some cases. Again, Lisp makes no claims about any of this, so trying to write solutions to these imagined problems in a portable way is meaningless. Implementation-specific answers may make sense, but then all you need is an implementation-specific tool that matches the need of that implementation. Whether said tool should be portably available, I'm doubtful.
Suppose you and I share a townhouse where we have a wall in common. If the board to make walls is sold in whole-wall sizes, and I ask you how much board it takes to make a housing unit, is the answer 4 [walls] because that's what it would take to start a new house, 3 because that's what it takes to add a unit, 3.5 because that's what it costs to share expenses, or...? If you're used to free-standing houses, you might assume the answer is simple, but it isn't always.
I hope this helps you see my point of view. Not trying to say it makes your problem any less real, and not trying to say there aren't vendors out there who can help you on a per-vendor basis, but from the point of view of the Language, I just don't understand how it could offer you better than what it does without crippling implementations in their attempt to become better over time. C, by contrast, pretty much can't become better over time (without making you change your programs) and so is happy to offer you what you want in a clear way.
* David Bakhash <ca...@mit.edu>
| I'm using ACL 5.0, on MS Windows.  But all I need is an estimate on this:
|
| (size-of "")
|
| and any implementation of Lisp on any system will probably give me a good
| enough idea.  It's just a parameter for an algorithm of mine to work.
| it's not the crux, by any means.
* Nick Levine <n...@harlequin.co.uk>
| Ugh, looks like hard work.
sure. SIZE-OF should be hard work. we don't want to encourage people to do simple stuff like (time (make-string n)) for increasing values of n when what they are doing is entirely wrong to begin with. optimizing bad behavior is not a good idea. moreover, I'm sure he's happier trusting machine addresses than the output from the TIME macro. incidentally, TIME reports 32 "other bytes" too many in Allegro CL unless tuned with
(setq excl::time-other-base 32)
(thanks to Duane Rettig <du...@franz.com> for this tidbit.)
(1) Other techniques?

Apart from cdr-coding, could someone give another example of similar techniques (that would make the hypothetical (SIZE-OF FOO) return different values for the same FOO at different times)? (Asking just out of curiosity.) I can also imagine an implementation that always returns the same (EQ) pointer for the one and only empty string (""), and/or one that is optimised for short strings, so that the representation (and `size of') of strings of length 1 is different from that of strings of length n.
(2) sizeof makes sense, SIZE-OF doesn't
Anyway, C is so low level with respect to memory and memory representation of data objects that sizeof makes perfect sense there---but not with Lisp. (The C standard goes to great lengths in describing a specific storage model while avoiding actual machine dependence---seen anything like that in ANSI CL?)
Besides, if we were to have SIZE-OF in Lisp (I am not saying we should), should it return just the `payload' or also count the system bits? E.g., with fixnums stored in 32 bits with 3 tag bits, would the `size' of a fixnum be 29 bits or 32 bits? We could have a holy war on this issue... And this question simply is not applicable to languages that have sizeof, like C and C++.
(3) ROOM might also help
Also, if you really must have an idea about the (current!) `size' of an object without resorting to implementation-specific^1 means (which every implementation provides, or at least documents how things are represented internally), you could also use ROOM, but you'd better allocate a sufficient number of similar objects before the second^2 call to ROOM, as it may also count memory occupied by system objects that get allocated and freed behind the curtain.
_____________________________
^1 ROOM is standard (but its results are of course implementation-specific)
^2 I mean, of course, to: call ROOM; do the allocation; call ROOM
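The ROOM-differencing recipe above might be sketched like this (my sketch; ROOM's output format is implementation-specific, so the two printouts have to be diffed by eye and divided by N by hand):

```lisp
;; Estimate the per-object cost of CONSTRUCTOR's result by allocating
;; N of them between two calls to ROOM.  The objects are kept in an
;; array so the GC can't reclaim them between the two snapshots.
(defun estimate-size (constructor &optional (n 100000))
  (let ((keep (make-array n)))   ; keep all N objects live
    (room)                       ; first snapshot
    (dotimes (i n)
      (setf (aref keep i) (funcall constructor)))
    (room)                       ; second snapshot; diff by eye, / n
    keep))

;; e.g. (estimate-size (lambda () (make-string 0)))
```

A large N amortizes away the system objects that get allocated behind the curtain, which is exactly the caveat in the footnote.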
Good luck, Vassil.
Vassil Nikolov <vniko...@poboxes.com> www.poboxes.com/vnikolov (You may want to cc your posting to me if I _have_ to see it.) LEGEMANVALEMFVTVTVM (Ancient Roman programmers' adage.)
Replying to vassil's message, but not to him per se. He just gave me a nice place to attach some thoughts...
vniko...@poboxes.com writes:

> (1) Other techniques?
> Apart from cdr-coding, could someone give another example of similar
> techniques (that would make the hypothetical (SIZE-OF FOO) return
> different values for the same FOO at different times)?
The future is a long time. Taking surveys of existing implementations is not always a good measure of what could come, and if you abuse the results, you can risk changing the future. That said, I'll indulge you...
I think it's possible that self-adaptive tables would reimplement themselves based on a usage count or if they survive a certain level of gc. But I can't point to a specific implementation that does.
Another situation is the implementation of integers and floats in a number of implementations. It is possible that a number can have a size and then disappear. Uh, this is hard to explain, because it can't show itself in any test such as you ask about, and yet it is a point of philosophy that troubles me. Consider:

  (let* ((x 1)
         (y 1)
         (z (+ x y)))
    z)

What is (size-of x)? 1? What about (size-of y)? 1? Would it surprise you if I told you the size of the constants area of this program can be 1, not 2? (I hope not.) If you were to measure the size, of course, it would change. But if you try hard to see into my head and what it is that troubles ME about size-of, it's captured in this example. SIZE-OF is about a representation presupposing a specific implementation. But I don't ask LISP for a specific implementation--I ask it to solve my problem. I want it to have every aspect that I have not pinned down available for optimization, to the point of translating this program to

  (let ((z (+ 1 1))) z)

to the point of translating this program further to

  (let ((z 2)) z)

to the point of translating this program to

  2

Consequently, even asking whether there IS an X is troubling.
It can even recede to zero size if it's something like (+ x 1) and can be reduced to an instruction like "move r, x(1)" in some instruction set where the "addition" can be handled by clever indexing.
You could say that the issue is only what size-of can get its hands on, but not every size-of computation is just about numbers you can get your hands on. Someone might be using size-of in a meta-program like a macro to count distinct constants by recursive descent of a program at static-analysis time, and you'd have to know that this was wrong...
This question reminds me of the age-old philosophical question of whether people know that 2+2=4. Well, they probably do. If they die and you cut open their brain, you can probably, with enough technology, find the place where 2+2=4 is manifestly represented as a fact. But can you find 237+118=355? That's not at all so clear. What does it mean to "know" this? Sometimes it means it can be computed on demand. And so when you ask how much storage it takes for these facts, you might find that some facts are not stored at all. When you teach someone a high-level concept that can substitute for memory, I like to believe there is at least the potential to gc the many random facts that it supports, since the high-level concept will more compactly access a more general result. If not, people are ill architected. And with computers, where we don't rely on chance but on design for the architecture, I like to believe that we design in the right to drop irrelevant detail and I like to believe we've dealt with specific storage issues as irrelevant.
There was endless fussing in the design of CL over my original choice to refer to symbol-function and symbol-value as "slots" in a symbol. A certain vocal and influential set of people on the committee eventually prevailed in getting me not to say this, because they felt it would imply that the storage must be allocated, even if not needed, as a contiguous part of the object. Heck, I don't assume that just because CLOS objects have slots that they are allocated, or contiguous. That's up to an implementation. If it wants to grow a slot on demand and no operator can discern this, that's fine by me. At least from the language design point of view. A user might say it's not ok performance-wise, and that's a market issue. But I was forced to change the wording to a more neutral term to avoid confusion of a kind that I morally object to ever being considered possible. (It's not that I don't think people get confused in the way these people expected; it's that I don't believe in catering to people who have that confusion. If you think Lisp memory management is about dictating storage layout, I think you're just confused.)
What it is to be a slot is that the slot accessor gets the value and the slot setter changes what the accessor gets. What it is to be an object of a type is that you seamlessly integrate as an appropriate argument to all operations on your type. Everything else is fiction. Creating new operators like SIZE-OF does not just test what you did; it changes reality and requires reimplementation. The memory you want to measure is the Schroedinger's Cat of the Lisp world, and you're killing the poor fuzzy little thing by peeking inside. (You meaning whoever wants this operator; not necessarily you, Vassil.)
> I can also imagine an implementation that always returns the same
> (EQ) pointer for the one and only empty string (""),
Actually, of course, (make-array 10 :element-type 'character :fill-pointer 0) is an empty string, but I guess you mean the only empty string of fixed length and element-size that is not actually adjustable, does not have a fill pointer, etc. Heh. But I know what you mean. I think this is a good example in spite of the caveats one has to make in order to make it work. I just didn't want anyone walking away thinking there was only one empty string, lest they break their programs.
> and/or that could be optimised for short strings so that the
> representation (and `size of') strings of length 1 is different
> from that of strings of length n.
Well, indeed. Or you could gamble that most strings were all-lowercase or all-uppercase alphabetic, and use a specialized implementation of them that you had to morph if you ever did a setf of them. Indeed, tricks like this may well be used in high-quality commercial implementations trying to manage unicode, since you don't want to pay the price of a double-byte character if the person is not using them.
> (2) sizeof makes sense, SIZE-OF doesn't
> Anyway, C is so low level with respect to memory and memory
> representation of data objects that sizeof makes perfect sense
> there---but not with Lisp.  (The C standard goes to great
> lengths in describing a specific storage model while avoiding
> actual machine dependence---seen anything like that in ANSI CL?)
Yes, exactly. C is the language you use to lay out memory. It is useful to have some language that does control memory as an implementation substrate for others, but it's also useful to rise above that level, caring more about what's above than what's below.
> Besides, if we were to have SIZE-OF in Lisp (I am not
> saying we should), should its value return just the `payload'
> or also count the system bits?  E.g., with fixnums in 32 bits
> with 3 tag bits, would the `size' of a fixnum be 29 bits or
> 32 bits?  We could have a holy war on this issue...  And this
> question simply is not applicable to the languages that have
> sizeof like C and C++.
> (3) ROOM might also help
> Also, if you really must have an idea about the (current!) `size'
> of an object without resorting to implementation-specific^1 means
> (which every implementation provides, or at least documents
> how things are represented internally), you could also use
> ROOM, but you'd better allocate a sufficient number of
> similar objects before the second^2 call to ROOM as it may also
> count memory occupied by system objects that get allocated
> and freed behind the curtain.
yes, but don't forget to count sharing. Remember that ROOM also captures the myriad cases of '(#1=(A B) #1#), which if written to a file and re-read would take a different size. ROOM is very coarse-grained.
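To make the sharing point concrete (a sketch; the actual byte counts are of course implementation-specific):

```lisp
;; Two lists with the same printed form (absent *print-circle*) but
;; different amounts of actual cons storage: SHARED reuses one (A B)
;; chain for both elements, COPIED allocates two.
(let ((shared (let ((x (list 'a 'b))) (list x x)))
      (copied (list (list 'a 'b) (list 'a 'b))))
  (values (eq (first shared) (second shared))    ; => T,   one (A B)
          (eq (first copied) (second copied)))) ; => NIL, two (A B)'s
;; Print SHARED to a file without #1#-style labels and re-read it,
;; and you get COPIED's storage profile back, not SHARED's.
```

Any ROOM-based size estimate silently averages over whichever of these shapes your data happens to have.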
This is a good place to say "what are you trying to do and why are you trying to do it this way". Saying "I'm trying to optimize memory use" begs the question. For whom? What does your spec look like? Are you building a memory optimization tool that takes no parameters? Oughtn't it take parameters that allow the user to tune it for each of the myriad possible design decisions that the implementation might have taken, and if you (again, the generic "you") don't offer args like :cdr-coded-implementation-p, then aren't you assuring that your "portable tool" will be meaningless even if you can "port" it?
* vnikolov wrote:

> Apart from cdr-coding, could someone give another example of similar
> techniques (that would make the hypothetical (SIZE-OF FOO) return
> different values for the same FOO at different times)?  (Asking just
> out of curiosity.)  I can also imagine an implementation that always
> returns the same (EQ) pointer for the one and only empty string (""),
> and/or that could be optimised for short strings so that the
> representation (and `size of') strings of length 1 is different
> from that of strings of length n.
Hash tables with rehashing on GC. CLOS objects that have had their class definitions changed but that the system is doing some cleverness on that only reclaims some of the slot storage lazily, perhaps on GC.
The HP48 calculator, which runs a language that is lisp-related though not actually a lisp really, has two representations for things like small real numbers which use different amounts of storage! I'm not sure if it ever fully-automatically converts between representations, but it may do.