Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

data size

409 views
Skip to first unread message

harry

unread,
Nov 8, 2001, 7:07:22 PM11/8/01
to
if you put data into dictionary, how do we calculate the data size?

key:data

example:

dict = {"key1":1,2,3,4; "key2": "hello", [1,2,3]}

how much is the data structure will cost? how to calculate it?

if in c++ : char array1[8] will cost 8 x 8 bytes = 64 bytes


harry

unread,
Nov 8, 2001, 7:28:18 PM11/8/01
to
actually, i think char data type in c is 1 byte (8bit)

"harry" <hart...@telusplanet.net> wrote in message
news:_6FG7.33159$i4.51...@news0.telusplanet.net...

Michael Hudson

unread,
Nov 9, 2001, 6:56:11 AM11/9/01
to
"harry" <hart...@telusplanet.net> writes:

> if you put data into dictionary, how do we calculate the data size?

Why do you want to know? It's not usually an interesting question to
ask in Python.

> key:data
>
> example:
>
> dict = {"key1":1,2,3,4; "key2": "hello", [1,2,3]}

> how much is the data structure will cost?

Err, lets see, a (small) dict is about 9*4 + 8*(3*4) = 124 bytes
the four character strings will be about 5*4 + 4 = 24 bytes
the list object will be about 30 bytes I think, similar for the tuple.

Small integers are preallocated and so cost nothing.

So the total for the above object is probably a couple of hundred
bytes.

I'm not sure how accurate the above is -- and I don't really care --
but it should give you an idea.

> how to calculate it?

Don't :) Memory is cheap.

Cheers,
M.

--
Never meddle in the affairs of NT. It is slow to boot and quick to
crash. -- Stephen Harris
-- http://home.xnet.com/~raven/Sysadmin/ASR.Quotes.html

Martin von Loewis

unread,
Nov 9, 2001, 8:14:21 AM11/9/01
to
"harry" <hart...@telusplanet.net> writes:

> if you put data into dictionary, how do we calculate the data size?

This is very difficult, and I suggest you don't really want to know.

> example:
>
> dict = {"key1":1,2,3,4; "key2": "hello", [1,2,3]}

This is invalid syntax, I'll assume that you meant

{"key1":[1,2,3,4], "key2": "hello", "key3": [1,2,3]}

>
> how much is the data structure will cost? how to calculate it?

Notice that dict is now a collection of 11 objects:
- four strings
- four integers: 1,2,3,4
- two lists: [1,2,3,4] and [1,2,3]
- one dictionary

You may think that the integers should be counted twice, but you
actually have the same integer objects in each list.

Do you want the siye of just the dictionary, or of all the objects
together? What Python version? What operating system/microprocessor?
What C library (to account for the overhead of malloc/free).

Regards,
Martin

harry

unread,
Nov 9, 2001, 1:50:54 PM11/9/01
to

"Martin von Loewis" <loe...@informatik.hu-berlin.de> wrote in message
news:j4k7x08...@informatik.hu-berlin.de...

> This is invalid syntax, I'll assume that you meant
>
> {"key1":[1,2,3,4], "key2": "hello", "key3": [1,2,3]}
>

> Notice that dict is now a collection of 11 objects:
> - four strings
> - four integers: 1,2,3,4
> - two lists: [1,2,3,4] and [1,2,3]
> - one dictionary
>
> You may think that the integers should be counted twice, but you
> actually have the same integer objects in each list.
>
> Do you want the siye of just the dictionary, or of all the objects
> together? What Python version? What operating system/microprocessor?
> What C library (to account for the overhead of malloc/free).

I would like to know the size of all objects together (the dictionary and
the contents). I am using Python 2.0 on Windows98 SE.
And I don't think I use any C library. (or you're refering to something that
i'm not aware of using)

Thanks!


Steve Holden

unread,
Nov 9, 2001, 4:37:28 PM11/9/01
to
"harry" <hart...@telusplanet.net> wrote in message
news:iAVG7.35100$i4.57...@news0.telusplanet.net...

He is probably asking you which C library your version of Python was
compiled with. But you don't need to know that, either.

All Python dictionaries are a standard 2.5 cm by 3.6cm. Integers have no
width and are all 1.2 cm in length. Strings are all 2 mm times the number of
characters, except Unicode strings, which are 4 mm times the number of
characters.

regards
Steve
--
http://www.holdenweb.com/

harry

unread,
Nov 9, 2001, 6:16:07 PM11/9/01
to

> He is probably asking you which C library your version of Python was
> compiled with. But you don't need to know that, either.
>
> All Python dictionaries are a standard 2.5 cm by 3.6cm. Integers have no
> width and are all 1.2 cm in length. Strings are all 2 mm times the number
of
> characters, except Unicode strings, which are 4 mm times the number of
> characters.

could you explain further about the metric standard you're using. this is
the first time a size of data structure is measured using meters instead of
byte/bit.
i need the information for my post-mortem of my assignment to explain why
using python data structure would be efficient. yes, i'm only a studemt who
is still need to learn lots of stuffs.
thanks.


Cliff Wells

unread,
Nov 9, 2001, 6:22:54 PM11/9/01
to

Steve's statement is not entirely correct: it depends what font you use for
your data. I often optimize memory usage by selecting a smaller font for my
data structures. It can take a little longer for the Python interpreter to
render the data into memory, but a speed/size tradeoff is fairly common.

I hope this helps answer your question.

Regards,


--
Cliff Wells
Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308
(800) 735-0555 x308

Jeff Shannon

unread,
Nov 9, 2001, 7:26:28 PM11/9/01
to

harry wrote:

Aha! We finally have the reason that you need to know the size. :) Steve's
metric measurements were (unless I'm seriously mistaken) a sarcastic attempt to
point out that 99.9% of the time, the amount of space taken up by a dict is
totally irrelevant. And indeed, in your case, it makes *no* difference to the
program, it's only important in justifying it to an outside individual (in this
case, your instructor). The point that everyone was trying to make is, there
are *very* few instances when it will matter whether a given dict takes up 100
bytes, or 1000 bytes. In a day when computer memory typically measures in the
hundreds of millions of bytes (and if that runs out, there's always virtual
memory...), a few hundred bytes here and there are insignificant. The real
efficiency gain in using a python data structure, is that you've got a very
flexible and reliable component that took you all of about 5 minutes to use
(compared with how many hours to code a C++ object that'll do the same?).
Unless you're working in embedded systems and the like, developer time is *far*
more important than memory size or even (in most cases) execution speed.

Jeff Shannon
Technician/Programmer
Credit International

Ken Seehof

unread,
Nov 10, 2001, 6:50:13 AM11/10/01
to

Okay, so everyone's explained why you don't care what the answer to
your question is :-). Actually, the size of a data structure does sometimes
matter, specifically when you are dealing with particularly huge quantities
of data. For example, python genetic molecular simulators usually store
the entire human genome in a dictionary in memory. Don't they? :-)

It is generally more difficult to analytically figure out memory usage in
python than in c, so what I do in this kind of situation is do the empirical
thing.

>>> def makedict(x):
... d = {}
... for i in xrange(x):
... d[random.randint(100,1000000000)] =
random.randint(100,1000000000)
... return d

>>> a = makedict(1024*1024)
>>> b = makedict(1024*1024)
>>> del a
>>> del b
>>> ... etc....

By watching my memory monitor, I determined that the dictionary costs
about 32 bytes per entry (give or take a byte). It doesn't really matter
much what the bytes are used for, but if you are in the mood to get
analitical...

#define PyObject_HEAD \
int ob_refcnt; \
struct _typeobject *ob_type;

A python object is 8 bytes plus data. That's 12 bytes per integer
(note that integers are 0 bytes for -3 < n < 100). I'd expect the hash
table to cost about 3 pointers per entry for a well-balanced hash table.
That's 12 bytes. So an entry in our dictionary (2 integers and a hash
index entry) should be 36 bytes. So, I'm wondering, where'd the
extra four bytes go? (well maybe I'm just being sloppy...)

Remember to multiply all of your results by 2.7 mg/farad.

- Ken Seehof
kse...@neuralintegrator.com

Martin von Loewis

unread,
Nov 10, 2001, 8:08:57 AM11/10/01
to
"harry" <hart...@telusplanet.net> writes:

> > {"key1":[1,2,3,4], "key2": "hello", "key3": [1,2,3]}
>
> > Notice that dict is now a collection of 11 objects:
> > - four strings
> > - four integers: 1,2,3,4
> > - two lists: [1,2,3,4] and [1,2,3]
> > - one dictionary
> >
>

> I would like to know the size of all objects together (the dictionary and
> the contents). I am using Python 2.0 on Windows98 SE.
> And I don't think I use any C library. (or you're refering to something that
> i'm not aware of using)

Ok, let's take it step-by-step. In 2.0, a string is defined as

typedef struct {


int ob_refcnt;
struct _typeobject *ob_type;

int ob_size;
long ob_shash;
PyObject *ob_sinterned;
char ob_sval[1];
} PyStringObject;

Given that int, long, and pointer types are all 4 bytes on your
machine, and given that the strings are null-terminated in memory, the
string sizes are 25 bytes each. However, you cannot add them together,
because you have to account for the malloc overhead.

Assuming you use the Microsoft Visual C Runtime (msvcrt.dll), malloc
will add 8 bytes and round-up to a paragraph (16 bytes). This will
give 48 bytes per string, or 192 bytes for all strings together. Of
course, if the strings appear in your source code, these 192 bytes are
consumed only once: the dictionary will share them with the source
code.

Next, let's look at the integers. They are defined as

typedef struct {


int ob_refcnt;
struct _typeobject *ob_type;

long ob_ival;
} PyIntObject;

giving 12 bytes per int object. In 2.0, ints are allocated in blocks
of 1000 bytes, to avoid the malloc overhead. So the four ints together
consume 49.15 bytes (assuming that the overhead of 24 bytes per
integer block is equally distributed over each of the 83 integers). Of
course, if these are the *only* integers, they consume 1024 bytes
together, since the entire integer block must be accounted (with 79
wasted integers).

Next, the two lists: It is defined as

typedef struct {
struct _gc_head *gc_next;
struct _gc_head *gc_prev;
int gc_refs;


int ob_refcnt;
struct _typeobject *ob_type;

int ob_size;
PyObject **ob_item;
} PyListObject;

since lists are garbage-collected. Since a list consumes two memory
blocks, the list object itself is 28 bytes. Accounting for the malloc
overhead on your system, it is 48 bytes. The lists themselves consume
as many pointers as they have elements (atleast initially), so the two
lists consume 12 and 16 bytes for the elements. Considering the malloc
overhead, each consumes 32 bytes. Together, the to lists consume 180
bytes.

Finally, the dictionary. It is

struct dictobject {
struct _gc_head *gc_next;
struct _gc_head *gc_prev;
int gc_refs;


int ob_refcnt;
struct _typeobject *ob_type;

int ma_fill;
int ma_used;
int ma_size;
int ma_poly;
dictentry *ma_table;
dictentry *(*ma_lookup)(dictobject *mp, PyObject *key, long hash);
};

So the dictionary itself will consume 44 bytes; considering the malloc
overhead, that will be 64 bytes. The ma_table consists of structures

typedef struct {
long me_hash;
PyObject *me_key;
PyObject *me_value;
} dictentry;

i.e. 12 bytes per entry. With three keys, the dictionary will have
eight entries in 2.0, giving 96 bytes of entries. With malloc
overhead, this comes to 112 bytes.

Counting this all together, we get 372 bytes without allocation
overhead, 597.15 bytes with malloc overhead, and 1572 bytes with both
malloc and integer-preallocation overhead; somebody correct me if I'm
wrong.

Hope this helps,
Martin

P.S. Of course, assuming that these are the only allocations done, you
have to account the overhead of the Microsoft small block allocator;
it obtains, via VirtualAlloc, always a page for small blocks. So these
objects together consume 4096 bytes, minimum. That is probably the
number you get for any other language, as well, since it is the
minimum amount of memory that a Windows process can allocate from the
operating system.

Michael Hudson

unread,
Nov 10, 2001, 10:47:57 AM11/10/01
to
Martin von Loewis <loe...@informatik.hu-berlin.de> writes:

> i.e. 12 bytes per entry. With three keys, the dictionary will have
> eight entries in 2.0, giving 96 bytes of entries.

Actually, I think in 2.0 the dict will have four entries. I realise
this isn't what the code claim{s,ed}, but for the sake of pedantry...

(I only remember this because it led to obscure bugs in the
compiler in 2.1 betas...)

generally-in-awe-at-your-thoroughness-ly y'rs,
M.

--
I never realized it before, but having looked that over I'm certain
I'd rather have my eyes burned out by zombies with flaming dung
sticks than work on a conscientious Unicode regex engine.
-- Tim Peters, 3 Dec 1998

Martin von Loewis

unread,
Nov 11, 2001, 7:47:59 AM11/11/01
to
Michael Hudson <m...@python.net> writes:

> generally-in-awe-at-your-thoroughness-ly y'rs,

Well, thanks for reading it. After an hour of researching all these
pieces, I felt like an idiot doing it just so somebody can report a
number to his instructor... I probably could have picked any number,
as long as the unit is bytes, not meters...

Regards,
Martin

0 new messages