How can I create customized classes that have similar properties to 'str'?


Licheng Fang

Nov 24, 2007, 5:31:02 AM
I mean, all the class instances that compare equal should be reduced
into only one instance, which means for instances of this class
there's no difference between a is b and a == b.

Thank you.

Licheng Fang

Nov 24, 2007, 5:36:25 AM
I find myself frequently in need of classes like this for two reasons.
First, it's efficient in memory. Second, when two instances are
compared for equality, only their pointers are compared. (I think
that's how Python compares 'str's.)

Bjoern Schliessmann

Nov 24, 2007, 6:00:25 AM

If you only want that if "a == b" is True also "a is b" is True,
overload the is_ attribute of your class. Personally, I don't see
any advantage in this.

Regards,


Björn

--
BOFH excuse #352:

The cables are not the same length.

Bjoern Schliessmann

Nov 24, 2007, 6:05:28 AM
Licheng Fang wrote:
> I find myself frequently in need of classes like this for two
> reasons. First, it's efficient in memory.

Are you using millions of objects, or MB size objects? Otherwise,
this is no argument.

BTW, what happens if you, by some operation, make a == b, and
afterwards change b so another object instance must be created?
This instance management is quite a runtime overhead.

> Second, when two instances are compared for equality only their
> pointers are compared.

I state that the object management will often eat more performance
than equality testing. Except you have a huge number of equal
objects. If the latter was the case you should rethink your program
design.

> (I think that's how Python compares 'str's.

Generally not. In CPython, just very short strings are created only
once.

>>> a=" "
>>> b=" "
>>> a is b
True
>>> a=" "
>>> b=" "
>>> a is b
False

Regards,


Björn

--
BOFH excuse #430:

Mouse has out-of-cheese-error

Licheng Fang

Nov 24, 2007, 6:44:59 AM
On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-mail-0306.20.chr0n...@spamgourmet.com> wrote:
> Licheng Fang wrote:
> > I find myself frequently in need of classes like this for two
> > reasons. First, it's efficient in memory.
>
> Are you using millions of objects, or MB size objects? Otherwise,
> this is no argument.

Yes, millions. In my natural language processing tasks, I almost
always need to define patterns, identify their occurrences in huge
data, and count them. Say, I have a big text file, consisting of
millions of words, and I want to count the frequency of trigrams:

trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

I can save the counts in a dict D1. Later, I may want to recount the
trigrams, with some minor modifications, say, doing it on every other
line of the input file, and the counts are saved in dict D2. Problem
is, D1 and D2 have almost the same set of keys (trigrams of the text),
yet the keys in D2 are new instances, even though these keys probably
have already been inserted into D1. So I end up with unnecessary
duplicates of keys. And this can be a great waste of memory with huge
input data.
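A minimal sketch of the counting loop described above (the function and variable names here are illustrative, not from the original program):

```python
from collections import defaultdict

def trigrams(seq):
    """Return all consecutive triples of seq as tuples."""
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]

def count_trigrams(token_lists):
    """Tally trigram frequencies over an iterable of token lists."""
    counts = defaultdict(int)
    for tokens in token_lists:
        for tri in trigrams(tokens):
            counts[tri] += 1
    return counts

# Counting the same text twice builds two dicts whose keys are equal
# tuples but distinct objects -- the duplication described above.
d1 = count_trigrams([[1, 2, 3, 4, 5]])
d2 = count_trigrams([[1, 2, 3, 4, 5]])
```

Here trigrams([1,2,3,4,5]) gives [(1,2,3), (2,3,4), (3,4,5)] as in the example, and d1 and d2 hold equal keys that are nevertheless separate tuple objects.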

>
> BTW, what happens if you, by some operation, make a == b, and
> afterwards change b so another object instance must be created?
> This instance management is quite a runtime overhead.
>

I probably need this class to be immutable.

> > Second, when two instances are compared for equality only their
> > pointers are compared.
>
> I state that the object management will often eat more performance
> than equality testing. Except you have a huge number of equal
> objects. If the latter was the case you should rethink your program
> design.
>

Yeah, counting is all about equal or not.

> > (I think that's how Python compares 'str's.
>
> Generally not. In CPython, just very short strings are created only
> once.
>
> >>> a=" "
> >>> b=" "
> >>> a is b
> True
> >>> a=" "
> >>> b=" "
> >>> a is b
>

Wow, I didn't know this. But exactly how does Python manage these
strings? My interpreter gave me these results:

>>> a = 'this'
>>> b = 'this'
>>> a is b
True
>>> a = 'this is confusing'
>>> b = 'this is confusing'
>>> a is b
False

Bjoern Schliessmann

Nov 24, 2007, 7:40:40 AM
Licheng Fang wrote:
> On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-

>> BTW, what happens if you, by some operation, make a == b, and


>> afterwards change b so another object instance must be created?
>> This instance management is quite a runtime overhead.
>
> I probably need this class to be immutable.

IMHO you don't need a change of Python, but simply a special
implementation (probably using metaclasses/singletons).
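One way to read the metaclass suggestion, sketched in modern Python (the class names here are invented for illustration): the metaclass's __call__ consults a per-class cache, so constructing with equal arguments hands back the identical instance.

```python
class InternMeta(type):
    """Metaclass: equal constructor arguments yield the identical instance."""
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._cache = {}          # args tuple => instance

    def __call__(cls, *args):
        if args not in cls._cache:
            # Only build (and __init__) a new instance on a cache miss.
            cls._cache[args] = super().__call__(*args)
        return cls._cache[args]

class Trigram(metaclass=InternMeta):
    def __init__(self, a, b, c):
        self.items = (a, b, c)

a = Trigram(1, 2, 3)
b = Trigram(1, 2, 3)
assert a is b                    # 'is' and '==' now coincide
```

Note this keeps every instance alive forever via the cache; a weak-reference cache would avoid that.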



> Wow, I didn't know this. But exactly how does Python manage these
> strings?

I don't know (use the source, Luke). :) Or perhaps there is a Python
Elder here that knows?

Regards,


Björn

--
BOFH excuse #165:

Backbone Scoliosis

samwyse

Nov 24, 2007, 7:54:43 AM
On Nov 24, 5:44 am, Licheng Fang <fanglich...@gmail.com> wrote:
> Yes, millions. In my natural language processing tasks, I almost
> always need to define patterns, identify their occurrences in a huge
> data, and count them. [...] So I end up with unnecessary

> duplicates of keys. And this can be a great waste of memory with huge
> input data.

Create a hash that maps your keys to themselves, then use the values
of that hash as your keys.

>>> store = {}
>>> def atom(str):
...     global store
...     if str not in store:
...         store[str] = str
...     return store[str]

>>> a='this is confusing'
>>> b='this is confusing'
>>> a == b
True
>>> a is b
False
>>> atom(a) is atom(b)
True

Marc 'BlackJack' Rintsch

Nov 24, 2007, 8:42:48 AM
On Sat, 24 Nov 2007 13:40:40 +0100, Bjoern Schliessmann wrote:

> Licheng Fang wrote:
>> On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
>

>> Wow, I didn't know this. But exactly how does Python manage these
>> strings?
>
> I don't know (use the source, Luke). :) Or perhaps there is a Python
> Elder here that knows?

AFAIK strings of length 1 and strings that would be valid Python
identifiers are treated this way.

Ciao,
Marc 'BlackJack' Rintsch

Licheng Fang

Nov 24, 2007, 11:35:01 AM

Thanks. Then, is there a way to make python treat all strings this
way, or any other kind of immutable objects?

Bruno Desthuilliers

Nov 24, 2007, 12:15:36 PM
Licheng Fang a écrit :

> I mean, all the class instances that equal to each other should be
> reduced into only one instance, which means for instances of this
> class there's no difference between a is b and a==b.

Here's a Q&D attempt - without any guarantee, and to be tailored to your
needs.

_values = {}     # id(instance) => value mapping
_instances = {}  # hash(value) => instance mapping
                 # (NB: keyed on hash(value), so distinct values with
                 # colliding hashes would be conflated)

class Value(object):
    def __new__(cls, value):
        try:
            return _instances[hash(value)]
        except KeyError:
            instance = object.__new__(cls)
            _values[id(instance)] = value
            _instances[hash(value)] = instance
            return instance

    @apply
    def value():
        def fget(self):
            return _values[id(self)]
        def fset(self, ignore):
            raise AttributeError("%s.value is read only" % type(self))
        def fdel(self):
            raise AttributeError("%s.value is read only" % type(self))
        return property(**locals())


HTH
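A variation on the same idea, keying the cache on the value itself rather than on hash(value) (so distinct values whose hashes collide cannot be conflated) and using a WeakValueDictionary so instances nobody references can still be garbage collected. This is a sketch in modern Python, not part of Bruno's post:

```python
import weakref

class Value:
    # value => instance; entries vanish once an instance is unreferenced
    _instances = weakref.WeakValueDictionary()

    def __new__(cls, value):
        try:
            return cls._instances[value]
        except KeyError:
            instance = super().__new__(cls)
            instance._value = value
            cls._instances[value] = instance
            return instance

    @property
    def value(self):
        # read-only, like Bruno's version
        return self._value

a = Value((1, 2, 3))
b = Value((1, 2, 3))
assert a is b
assert a.value == (1, 2, 3)
```

The value must be hashable (and should be immutable) for this to make sense, just as with dict keys.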

samwyse

Nov 24, 2007, 12:23:25 PM
On Nov 24, 10:35 am, Licheng Fang <fanglich...@gmail.com> wrote:
> Thanks. Then, is there a way to make python treat all strings this
> way, or any other kind of immutable objects?

The word generally used is 'atom' when referring to strings that are
set up such that 'a == b' implies 'a is b'. This is usually an
expensive process, since you don't want to do it to strings that are,
e.g., read from a file. Yes, it could be done only for string
constants, and some languages (starting with LISP) do this, but that
isn't what you (or most people) want. Whether you realize it or not,
you want control over the process; in your example, you don't want to
do it for the lines read from your file, just the trigrams.

The example that I gave does exactly that. It adds a fixed amount of
storage for each string that you 'intern' (the usual name given to the
process of generating such a string). Let's look at my code again:

>>> store = {}
>>> def atom(str):
...     global store
...     if str not in store:
...         store[str] = str
...     return store[str]

Each string passed to 'atom' already exists. We look to see if a copy
already exists; if so, we can discard the latest instance and use that
copy henceforth. If a copy does not exist, we save the string inside
'store'. Since the string already exists, we're just increasing its
reference count by two (so it won't be garbage collected) and
increasing the size of 'store' by (an amortized) pair of pointers to
that same string.

samwyse

Nov 24, 2007, 12:38:58 PM
On Nov 24, 5:44 am, Licheng Fang <fanglich...@gmail.com> wrote:
> Yes, millions. In my natural language processing tasks, I almost
> always need to define patterns, identify their occurrences in a huge
> data, and count them. Say, I have a big text file, consisting of
> millions of words, and I want to count the frequency of trigrams:
>
> trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]

BTW, if the components of your trigrams are never larger than a byte,
then encode the tuples as integers and don't worry about pointer
comparisons.

>>> def encode(s):
...     return (ord(s[0])*256+ord(s[1]))*256+ord(s[2])

>>> def trigram(s):
...     return [encode(s[i:i+3]) for i in range(0, len(s)-2)]

>>> trigram('abcde')
[6382179, 6447972, 6513765]

Steven D'Aprano

Nov 24, 2007, 4:59:50 PM
On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:

> On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> mail-0306.20.chr0n...@spamgourmet.com> wrote:
>> Licheng Fang wrote:
>> > I find myself frequently in need of classes like this for two
>> > reasons. First, it's efficient in memory.
>>
>> Are you using millions of objects, or MB size objects? Otherwise, this
>> is no argument.
>
> Yes, millions.


Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Tens of megabytes is not a lot of data.

If the average word size is ten characters, then one million words takes
ten million bytes, or a little shy of ten megabytes. Even if you are
using four-byte characters, you've got 40 MB, still a moderate amount of
data on a modern system.


> In my natural language processing tasks, I almost always
> need to define patterns, identify their occurrences in a huge data, and
> count them. Say, I have a big text file, consisting of millions of
> words, and I want to count the frequency of trigrams:
>
> trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
>
> I can save the counts in a dict D1. Later, I may want to recount the
> trigrams, with some minor modifications, say, doing it on every other
> line of the input file, and the counts are saved in dict D2. Problem is,
> D1 and D2 have almost the same set of keys (trigrams of the text), yet
> the keys in D2 are new instances, even though these keys probably have
> already been inserted into D1. So I end up with unnecessary duplicates
> of keys. And this can be a great waste of memory with huge input data.

All these keys will almost certainly add up to only a few hundred
megabytes, which is a reasonable size of data but not excessive. This
really sounds to me like a case of premature optimization. I think you
are wasting your time solving a non-problem.

[snip]


> Wow, I didn't know this. But exactly how does Python manage these strings? My
> interpreter gave me these results:
>
>>>> a = 'this'
>>>> b = 'this'
>>>> a is b
> True
>>>> a = 'this is confusing'
>>>> b = 'this is confusing'
>>>> a is b
> False


It's an implementation detail. You shouldn't use identity testing unless
you actually care that two names refer to the same object, not because
you want to save a few bytes. That's poor design: it's fragile,
complicated, and defeats the purpose of using a high-level language like
Python.


--
Steven.

Steven D'Aprano

Nov 24, 2007, 5:17:56 PM
On Sat, 24 Nov 2007 04:54:43 -0800, samwyse wrote:

> create a hash that maps your keys to themselves, then use the values of
> that hash as your keys.
>
> >>> store = {}
> >>> def atom(str):
> ...     global store
> ...     if str not in store:
> ...         store[str] = str
> ...     return store[str]

Oh lordy, that's really made my day! That's the funniest piece of code
I've seen for a long time! Worthy of being submitted to the DailyWTF.

Samwyse, while I applaud your willingness to help, I think you actually
need to get some programming skills before doing so. Here's a hint to get
you started: can you think of a way to optimize that function so it does
less work?
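Presumably the hint is the double dictionary lookup: 'str not in store' followed by two indexing operations on the same key. dict.setdefault does the membership test, the insertion and the retrieval in a single lookup — a sketch:

```python
store = {}

def atom(s):
    # Insert s if absent, then return the one canonical stored copy.
    return store.setdefault(s, s)

# Equal strings built at runtime funnel down to one shared object.
a = atom(''.join(['tri', 'gram']))
b = atom(''.join(['tri', 'gram']))
assert a == b and a is b
```

Renaming the parameter from str to s also stops the function shadowing the built-in str type.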

--
Steven.

Steven D'Aprano

Nov 24, 2007, 5:19:38 PM
On Sat, 24 Nov 2007 12:00:25 +0100, Bjoern Schliessmann wrote:

> Licheng Fang wrote:
>> I mean, all the class instances that equal to each other should be
>> reduced into only one instance, which means for instances of this class
>> there's no difference between a is b and a==b.
>
> If you only want that if "a == b" is True also "a is b" is True,
> overload the is_ attribute of your class. Personally, I don't see any
> advantage in this.

No advantage? That's for sure. There is no is_ attribute of generic
classes, and even if there was, it would have no special meaning.

Identity testing can't be overloaded. If it could, it would no longer be
identity testing.


--
Steven.

George Sakkis

Nov 24, 2007, 5:58:50 PM
On Nov 24, 4:59 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
> > On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> > mail-0306.20.chr0n...@spamgourmet.com> wrote:
> >> Licheng Fang wrote:
> >> > I find myself frequently in need of classes like this for two
> >> > reasons. First, it's efficient in memory.
>
> >> Are you using millions of objects, or MB size objects? Otherwise, this
> >> is no argument.
>
> > Yes, millions.
>
> Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
> megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????
>

Comments like these make one wonder if your real-life experience with
massive data matches even one tenth of your self-importance and
need to be snarky in most of your posts.

To the OP: yes, your use case is quite valid; the keyword you are
looking for is "memoize". You can find around a dozen of recipes in
the Cookbook and posted in this list; here's one starting point:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.
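The linked recipe is a memoizing decorator roughly along these lines (a generic sketch, not the recipe's exact code). Applied to a factory function, calls with equal arguments return the one cached object, which gives exactly the "a == b implies a is b" behaviour the OP asked for:

```python
import functools

def memoize(func):
    """Cache func's results, keyed by its positional arguments."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@memoize
def make_pattern(*tokens):
    # Stand-in for whatever pattern object construction is needed.
    return tuple(tokens)

assert make_pattern(1, 2, 3) is make_pattern(1, 2, 3)
```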

HTH,
George

Bjoern Schliessmann

Nov 24, 2007, 6:02:02 PM
Steven D'Aprano wrote:

> No advantage? That's for sure. There is no is_ attribute of
> generic classes, and even if there was, it would have no special
> meaning.

Argl, I confused the operator module's attributes with objects ;)

Regards,


Björn

--
BOFH excuse #378:

Operators killed by year 2000 bug bite.

Steven D'Aprano

Nov 24, 2007, 7:42:46 PM
On Sat, 24 Nov 2007 14:58:50 -0800, George Sakkis wrote:

> On Nov 24, 4:59 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>> On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
>> > On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
>> > mail-0306.20.chr0n...@spamgourmet.com> wrote:
>> >> Licheng Fang wrote:
>> >> > I find myself frequently in need of classes like this for two
>> >> > reasons. First, it's efficient in memory.
>>
>> >> Are you using millions of objects, or MB size objects? Otherwise,
>> >> this is no argument.
>>
>> > Yes, millions.
>>
>> Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
>> megabytes!!!!1! How will a PC with one or two gigabytes of RAM
>> cope?????
>>
>>
> Comments like these make one wonder if your real life experience with
> massive data matches even the one tenth of your self-importance and need
> to be snarky in most of your posts.

I cheerfully admit to never needing to deal with "massive data".

However, I have often needed to deal with tens and hundreds of megabytes
of data, which IS NOT MASSIVE amounts of data to deal with on modern
systems. Which was my point.


> To the OP: yes, your use case is quite valid; the keyword you are
> looking for is "memoize". You can find around a dozen of recipes in the
> Cookbook and posted in this list; here's one starting point:
> http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.

This has nothing, absolutely NOTHING, to do with memoization. Memoization
trades off memory for time, allowing slow functions to return results
faster at the cost of using more memory. The OP wants to save memory, not
use more of it.

--
Steven.

Hrvoje Niksic

Nov 24, 2007, 7:38:51 PM
samwyse <sam...@gmail.com> writes:

> create a hash that maps your keys to themselves, then use the values
> of that hash as your keys.

The "atom" function you describe already exists under the name
"intern".

Steven D'Aprano

Nov 24, 2007, 8:12:32 PM

Not really. intern() works very differently, because it can tie itself to
the Python internals. Samwyse's atom() function doesn't, and so has no
purpose.


In any case, I'm not sure that intern() actually will solve the OP's
problem, even assuming it is a real and not imaginary problem. According
to the docs, intern()'s purpose is to speed up dictionary lookups, not to
save memory. I suspect that if it does save memory, it will be by
accident.

From the docs:
http://docs.python.org/lib/non-essential-built-in-funcs.html

intern( string)
Enter string in the table of ``interned'' strings and return the interned
string - which is string itself or a copy. Interning strings is useful to
gain a little performance on dictionary lookup - if the keys in a
dictionary are interned, and the lookup key is interned, the key
comparisons (after hashing) can be done by a pointer compare instead of a
string compare. Normally, the names used in Python programs are
automatically interned, and the dictionaries used to hold module, class
or instance attributes have interned keys. Changed in version 2.3:
Interned strings are not immortal (like they used to be in Python 2.2 and
before); you must keep a reference to the return value of intern() around
to benefit from it.


Note the words "which is string itself or a copy". It would be ironic if
the OP uses intern to avoid having copies of strings, and ends up with
even more copies than if he didn't bother.

I guess he'll actually need to measure his memory consumption and see
whether he actually has a memory problem or not, right?


--
Steven.

Hrvoje Niksic

Nov 24, 2007, 8:48:30 PM
Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> writes:

> On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>
>> samwyse <sam...@gmail.com> writes:
>>
>>> create a hash that maps your keys to themselves, then use the values of
>>> that hash as your keys.
>>
>> The "atom" function you describe already exists under the name "intern".
>
> Not really. intern() works very differently, because it can tie itself to
> the Python internals.

The exact implementation mechanism is subtly different, but
functionally intern is equivalent to the "atom" function.

> In any case, I'm not sure that intern() actually will solve the OP's
> problem, even assuming it is a real and not imaginary
> problem. According to the docs, intern()'s purpose is to speed up
> dictionary lookups, not to save memory. I suspect that if it does
> save memory, it will be by accident.

It's not by accident, it follows from what interning does. Interning
speeds up comparisons by returning the same string object for the same
string contents. If the strings you're working with tend to repeat,
interning will save some memory simply by preventing storage of
multiple copies of the same string. Whether the savings would make
any difference for the OP is another question.

> From the docs:
> http://docs.python.org/lib/non-essential-built-in-funcs.html
>
> intern( string)
> Enter string in the table of ``interned'' strings and return the interned

> string - which is string itself or a copy. [...]


>
> Note the words "which is string itself or a copy". It would be ironic if
> the OP uses intern to avoid having copies of strings, and ends up with
> even more copies than if he didn't bother.

That's a frequently misunderstood sentence. It doesn't mean that
intern will make copies; it simply means that the string you get back
from intern can be either the string you passed it or another
(previously interned) string object that is guaranteed to have the
same contents as your string (which makes it technically a "copy" of
the string you passed to intern).

George Sakkis

Nov 25, 2007, 12:55:55 AM
On Nov 24, 7:42 pm, Steven D'Aprano

> > To the OP: yes, your use case is quite valid; the keyword you are
> > looking for is "memoize". You can find around a dozen of recipes in the
> > Cookbook and posted in this list; here's one starting point:
> >http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413717.
>
> This has nothing, absolutely NOTHING, to do with memoization. Memoization
> trades off memory for time, allowing slow functions to return results
> faster at the cost of using more memory. The OP wants to save memory, not
> use more of it.

If you bothered to click on that link you would learn that memoization
can be used to save space too and matches OP's case exactly; even the
identity tests work. Self-importance is bad enough by itself, even
without the ignorance, but you seem to do great in both.

George

Peter Otten

Nov 25, 2007, 4:39:38 AM
Steven D'Aprano wrote:

>> >>> store = {}
>> >>> def atom(str):
>> ...     global store
>> ...     if str not in store:
>> ...         store[str] = str
>> ...     return store[str]
>
> Oh lordy, that's really made my day! That's the funniest piece of code
> I've seen for a long time! Worthy of being submitted to the DailyWTF.

Here's a script to show atom()'s effect on memory footprint:

$ cat atom.py
import sys
data = [1]*1000
items = []
cache = {}
if "-a" in sys.argv:
    def atom(obj):
        try:
            return cache[obj]
        except KeyError:
            cache[obj] = obj
            return obj
else:
    def atom(obj):
        return obj
try:
    while 1:
        items.append(atom(tuple(data)))
except MemoryError:
    print len(items)
$ ulimit -v 5000
$ python atom.py
226
$ python atom.py -a
185742

So if you are going to submit Sam's function make sure to bundle it with
this little demo...

Peter

Licheng Fang

Nov 25, 2007, 5:42:36 AM
On Nov 25, 5:59 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
> > On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> > mail-0306.20.chr0n...@spamgourmet.com> wrote:
> >> Licheng Fang wrote:
> >> > I find myself frequently in need of classes like this for two
> >> > reasons. First, it's efficient in memory.
>
> >> Are you using millions of objects, or MB size objects? Otherwise, this
> >> is no argument.
>
> > Yes, millions.
>
> Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
> megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????
>
> Tens of megabytes is not a lot of data.
>
> If the average word size is ten characters, then one million words takes
> ten million bytes, or a little shy of ten megabytes. Even if you are
> using four-byte characters, you've got 40 MB, still a moderate amount of
> data on a modern system.

I mentioned trigram counting as an illustrative case. In fact, you'll
often need to define patterns more complex than that, and tens of
megabytes of text may generate millions of them, and I've observed
they quickly ate up the 8G memory of a workstation in a few minutes.
Manipulating these patterns can be tricky, you can easily waste a lot
of memory without taking extra care. I just thought if I define my
pattern class with this 'atom' property, coding efforts could be
easier later.

Steven D'Aprano

Nov 25, 2007, 5:53:50 AM
On Sun, 25 Nov 2007 10:39:38 +0100, Peter Otten wrote:

> So if you are going to submit Sam's function make sure to bundle it with
> this little demo...

Well Peter, I was going to reply with a comment about not changing the
problem domain (tuples of ints to trigrams from a text file for natural
language processing, that is, three character alphanumeric strings), and
that if you re-did your test with strings (as I did) you would see
absolutely no difference. What I was going to say was "Tuples aren't
interned. Short strings that look like identifiers are. Jumping through
hoops to cache things which are already cached is not productive
programming."

But then I dug a little deeper, and disassembled the code I was running,
and discovered that I was being fooled by the Python compiler's constant-
folding, and if I took steps to defeat the optimizer, the effect I was
seeing disappeared, and I got the same results as you.

Well. So I've learned something new: Python doesn't intern strings in the
way I thought it did. I don't quite know *how* it decides which strings
to intern and which ones not to, but at least I've learnt that what I
thought was true is not true.

So I offer my apology to Samwyse, the caching code isn't as redundant and
silly as it appears, and humbly tuck into this nice steaming plate of
crow.

Somebody pass the salt please.

--
Steven.

MonkeeSage

Nov 25, 2007, 8:08:36 AM
On Nov 24, 6:42 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

> This has nothing, absolutely NOTHING, to do with memoization. Memoization
> trades off memory for time, allowing slow functions to return results
> faster at the cost of using more memory. The OP wants to save memory, not
> use more of it.

Not to beat a dead horse, but memoization can significantly minimize
memory usage, given a large data set with redundant elements (as the
OP seems to suggest [e.g., calculating the deltas of trigrams in a
natural language]). For example, if the data set has 1/3 redundant
elements, then the un-memoized version requires 1/3 more memory,
because it stores each redundant occurrence as a separate copy,
whereas the memoized version has only to store unique elements once.
Every instance of an element which is already in the cache requires
only the cache lookup (speed), rather than the creation of a new
object (memory). So the trade-off is actually speed for memory rather
than the other way around. Of course, it all depends on the nature of
the data; but for large, redundant data sets, memoization is
definitely a win when it comes to memory optimization.

Regards,
Jordan

Colin J. Williams

Nov 25, 2007, 8:30:17 AM
to Steven D'Aprano, pytho...@python.org
Steven D'Aprano wrote:
> On Sun, 25 Nov 2007 01:38:51 +0100, Hrvoje Niksic wrote:
>
>> samwyse <sam...@gmail.com> writes:
>>
>>> create a hash that maps your keys to themselves, then use the values of
>>> that hash as your keys.
>> The "atom" function you describe already exists under the name "intern".
>
> Not really. intern() works very differently, because it can tie itself to
> the Python internals. Samwyse's atom() function doesn't, and so has no
> purpose.
>
>
> In any case, I'm not sure that intern() actually will solve the OP's
> problem, even assuming it is a real and not imaginary problem. According
> to the docs, intern()'s purpose is to speed up dictionary lookups, not to
> save memory. I suspect that if it does save memory, it will be by
> accident.
>
> From the docs:
> http://docs.python.org/lib/non-essential-built-in-funcs.html

Yes, but it seems that buffer remains
with Python 3000.

Colin W.

Steven D'Aprano

Nov 26, 2007, 9:45:51 PM
On Sun, 25 Nov 2007 02:42:36 -0800, Licheng Fang wrote:

> I mentioned trigram counting as an illustrative case. In fact, you'll
> often need to define patterns more complex than that, and tens of
> megabytes of text may generate millions of them, and I've observed they
> quickly ate up the 8G memory of a workstation in a few minutes.
> Manipulating these patterns can be tricky, you can easily waste a lot of
> memory without taking extra care. I just thought if I define my pattern
> class with this 'atom' property, coding efforts could be easier later.

I'm just not getting the same results as you when I try this. I'm finding
that with no extra effort at all, it just works.

The size of your corpus is not important. Neither is the complexity of
how you generate the patterns. What's important is the number of patterns
you produce, and "millions" isn't that huge a number, even without
interning the strings.

Obviously I'm not running your code, but I can build a dict with millions
of patterns, from hundreds of megabytes of text, on a PC with just 1GB of
memory and not run into any serious problems.

I've just processed roughly 240MB of random emails, generating n-grams up
to length 5. The emails include binary attachments and HTML etc., so I'm
getting lots of patterns that don't normally exist in natural languages
(e.g. 71 occurrences of 'qqq', and 5 of 'qqqq'). As I said, my PC has
only 1GB, and that's being shared with about a dozen other apps (including
such memory hogs as Firefox).

Results? 64939962 patterns found, of which 17031467 are unique. There's
paging, yes, and my PC runs a little slow when I try to switch from one
running application to another, but nothing unusable. Opening a dozen
YouTube videos at once impacts performance worse.

I can't think what you're doing to use up 8GB of RAM for merely
"millions" of strings, even if you are keeping two, three, ten redundant
copies. Assuming an average length of twenty bytes per pattern (you said
trigrams, but I'd rather over-estimate than under), and even assuming
that only half the 8GB are available to Python, you should be able to
store something of the order of one hundred million patterns easily:

4 bytes for a pointer plus 20 bytes for the string = 24 bytes

4*1024**3 / 24 = 178,956,970

(This is a ballpark figure. Real Python strings will have additional
overhead.)

If you're running into problems with merely millions of patterns, then
you're doing worse by probably two orders of magnitude.

I don't think that the problem lies where you think it does. If you have
a dict with millions of keys, it doesn't matter how many times each
pattern exists in the corpus because the key only exists *once* in the
dict. Duplicate the dict, or generate it from scratch even, and at worse
you double the memory used by the keys. And probably not even that.

The only thing I can think of that might explain why you're using so much
memory is if you are generating *all* the patterns up front, say in a
list, before adding them to the dict:

# Generate one massive list of patterns containing many duplicates
patterns = make_patterns(text)
# returns a massive list like ['fre', 'req', 'equ', 'que' ...]
d = {}
for pattern in patterns:
    d[pattern] = d.get(pattern, 0) + 1


Notice that the real killer in the above algorithm is that you need
enough storage, not just for the unique patterns, but for EVERY separate
occurrence of each pattern. Nothing to do with how dicts operate, and
everything to do with the algorithm you (hypothetically) are using.

If that's what you're doing, then no wonder you're running out of memory.
With 200MB of text, you have 209715198 trigrams in your list. The
pointers alone will take almost a gigabyte, assuming 32-bit pointers.

If this is your approach, interning the strings won't save you. You
almost certainly should change to a lazy approach, using a generator to
make each pattern as needed and then throw it away:

def make_patterns(s, n=3):
    """Yield n-grams."""
    if len(s) <= n:
        yield s
    else:
        for i in range(len(s)-n+1):
            yield s[i:i+n]

d = {}
fp = open('corpus', 'r')
for line in fp:
    for word in line.split():
        for pattern in make_patterns(word.strip()):
            d[pattern] = d.get(pattern, 0) + 1

Obviously I haven't seen your code and don't actually know what you are
doing. If my 1GB machine can deal with a dict of 17 million unique keys
from a corpus of 240MB with no special memory management, your 8GB
machine should too.

--
Steven.

Licheng Fang

unread,
Nov 27, 2007, 8:16:28 AM11/27/07
to
On Nov 27, 10:45 am, Steven D'Aprano

My task is identifying sequences of tokens (phrases) that are possible
translations of each other in a bilingual corpus. I need to check all
the subsequences of a sentence to get the possible phrase pairs. This
makes the problem different from n-gram counting in that the number of
possible phrases doesn't grow linearly with n, but approximately with
n^2 (n being the sentence length). My first version of the program
consumed almost twice as much memory as the current one because I
discovered that, in computing different statistics, I was regenerating
the patterns, and the counting dictionaries ended up with duplicated
pattern keys (a == b, yet a is not b). Wouldn't it be convenient if I
could restrict the pattern class to never generate identical instances?
Then I could avoid such subtle but significant bugs.
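Something like this sketch is what I have in mind (the class and attribute
names here are just illustrative, not from my actual code): intern instances
in `__new__` so that constructing the same pattern twice returns the same
object, with a `WeakValueDictionary` so unused patterns can still be
collected.

```python
import weakref

class Pattern(object):
    """Interned pattern: equal token sequences share one instance."""
    _cache = weakref.WeakValueDictionary()

    def __new__(cls, tokens):
        tokens = tuple(tokens)
        obj = cls._cache.get(tokens)
        if obj is None:
            # First time we see this token sequence: build and cache it.
            obj = object.__new__(cls)
            obj.tokens = tokens
            cls._cache[tokens] = obj
        return obj
```

With this, `Pattern(['a', 'b']) is Pattern(('a', 'b'))` holds, so a dict
keyed on patterns can never accumulate duplicate keys.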

>
> 4 bytes for a pointer plus 20 bytes for the string = 24 bytes
>
> 4*1024**3 / 24 = 178,956,970
>
> (This is a ballpark figure. Real Python strings will have additional
> overhead.)

>
> If you're running into problems with merely millions of patterns, then
> you're doing worse by probably two orders of magnitude.
>
> I don't think that the problem lies where you think it does. If you have
> a dict with millions of keys, it doesn't matter how many times each
> pattern exists in the corpus because the key only exists *once* in the
> dict. Duplicate the dict, or generate it from scratch even, and at worst
> you double the memory used by the keys. And probably not even that.
>
> The only thing I can think of that might explain why you're using so much
> memory is if you are generating *all* the patterns up front, say in a
> list, before adding them to the dict:
>
> # Generate one massive list of patterns containing many duplicates
> patterns = make_patterns(text)
> # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> d = {}
> for pattern in patterns:
>     d[pattern] = d.get(pattern, 0) + 1
>

No, I wasn't doing that.
BTW, do you think the pattern counting code can avoid hashing the
pattern twice? Is there a way to do that when the dictionary values
are of a primitive type?

Chris Mellon

unread,
Nov 27, 2007, 12:18:49 PM11/27/07
to pytho...@python.org

Implement __hash__ and __eq__ on your pattern class. If the same
pattern compares equal and hashes the same then it will be a "matching
key" as far as the dict is concerned and will only be stored once.
This is probably cheaper than explicit interning anyway (you don't
need to search an intern table).
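A minimal sketch of that (the `tokens` attribute is illustrative; use
whatever state defines equality for your patterns):

```python
class Pattern(object):
    """Pattern that hashes and compares by its token tuple."""
    def __init__(self, tokens):
        self.tokens = tuple(tokens)

    def __eq__(self, other):
        return isinstance(other, Pattern) and self.tokens == other.tokens

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash(self.tokens)

d = {}
for tokens in (['a', 'b'], ['a', 'b'], ['c', 'd']):
    p = Pattern(tokens)
    d[p] = d.get(p, 0) + 1
# The two distinct Pattern(['a', 'b']) instances share one dict entry.
```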


> > The only thing I can think of that might explain why you're using so much
> > memory is if you are generating *all* the patterns up front, say in a
> > list, before adding them to the dict:
> >
> > # Generate one massive list of patterns containing many duplicates
> > patterns = make_patterns(text)
> > # returns a massive list like ['fre', 'req', 'equ', 'que' ...]
> > d = {}
> > for pattern in patterns:
> >     d[pattern] = d.get(pattern, 0) + 1
> >
>
> No, I wasn't doing that.
> BTW, do you think the pattern counting code can avoid hashing the
> pattern twice? Is there a way to do that when the dictionary values
> are of a primitive type?
>

Hashing isn't really an expensive operation. On strings it's even
cached on the object. If you implement your own __hash__ method you
can do the same, but I wouldn't bother unless you benchmark it and
discover that hashing is a bottleneck.
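For what it's worth, `collections.defaultdict` (available since 2.5) at
least tidies the counting loop; note that `d[p] += 1` still performs a
lookup plus a store under the hood, so it's a readability win more than a
hashing win:

```python
from collections import defaultdict

d = defaultdict(int)  # missing keys start at 0
for pattern in ['fre', 'req', 'equ', 'fre']:
    d[pattern] += 1
```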

zooko

unread,
Dec 3, 2007, 9:51:57 AM12/3/07
to
On Nov 24, 4:44 am, Licheng Fang <fanglich...@gmail.com> wrote:
>
> Yes, millions. In my natural language processing tasks, I almost
> always need to define patterns, identify their occurrences in a huge
> data, and count them. Say, I have a big text file, consisting of
> millions of words, and I want to count the frequency of trigrams:

I have some experience with this, helping my wife do computational
linguistics.

(I also have quite a lot of experience with similar things in my day
job, which is a decentralized storage grid written in Python.)

Unfortunately, Python is not a perfect tool for the job because, as
you've learned, Python isn't overly concerned about conserving
memory. Each object has substantial overhead associated with it
(including each integer, each string, each tuple, ...), and dicts add
overhead due to being sparsely filled. You should do measurements
yourself to get results for your local CPU and OS, but I found, for
example, that storing 20-byte keys and 8-byte values as a Python dict
of Python strings took about 100 bytes per entry.

Try "tokenizing" your trigrams by defining a dict from three unigrams
to a sequentially allocated integer "trigram id" (also called a
"trigram token"), and a reverse dict which goes from a trigram id to
the three unigrams. Whenever you create a new set of three Python
objects representing unigrams, you can pass them through the first
mapping to get the trigram id and then free up the original three
Python objects. If you do this multiple times, you get multiple
references to the same integer object for the trigram id.
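A sketch of the two tables (names are illustrative): a dict from the
unigram triple to a sequentially allocated id, and a list serving as the
reverse mapping.

```python
trigram_to_id = {}  # (u1, u2, u3) -> int token id
id_to_trigram = []  # token id -> (u1, u2, u3)

def trigram_id(u1, u2, u3):
    """Return the token id for a trigram, allocating a new id if unseen."""
    key = (u1, u2, u3)
    tid = trigram_to_id.get(key)
    if tid is None:
        tid = len(id_to_trigram)
        trigram_to_id[key] = tid
        id_to_trigram.append(key)
    return tid
```

After this, your counting structures hold small integers instead of
triples of Python objects.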

My wife and I tried this, but it still wasn't compact enough to
process her datasets in a mere 4 GiB of RAM.

One tool that might help is PyJudy:

http://www.dalkescientific.com/Python/PyJudy.html

Judy is a delightfully memory-efficient, fast, and flexible data
structure. In the specific example of trigram counting (which is also
what my wife was doing), you can, for example, assign to each
unigram an integer, and assuming that you have fewer than two million
unigrams you can pack three unigrams into a 64-bit integer... Hm,
actually at this point my wife and I stopped using Python and rewrote
it in C using JudyTrees. (At the time, PyJudy didn't exist.)
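The packing step looks something like this (assuming 21-bit unigram ids,
which covers up to about two million unigrams; three of them fit in 63
bits):

```python
BITS = 21
MASK = (1 << BITS) - 1  # low 21 bits

def pack(u1, u2, u3):
    """Pack three unigram ids into one 64-bit-sized integer key."""
    return (u1 << (2 * BITS)) | (u2 << BITS) | u3

def unpack(t):
    """Recover the three unigram ids from a packed key."""
    return (t >> (2 * BITS)) & MASK, (t >> BITS) & MASK, t & MASK
```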

If you are interested, please e-mail my wife, Amber Wilcox-O'Hearn and
perhaps she'll share the resulting C code with you.

Regards,

Zooko Wilcox-O'Hearn

samwyse

unread,
Dec 6, 2007, 3:35:04 PM12/6/07
to
On Nov 24, 6:38 pm, Hrvoje Niksic <hnik...@xemacs.org> wrote:

D'oh! That's what I get for not memorizing "Non-essential Built-in
Functions".

In my defense, however, my function will work with anything that can
be used as a dictionary key (strings, tuples, frozen sets, etc), not
just character strings; thus we return to the original:

>>> a=(1,2,3)
>>> b=(1,1+1,1+1+1)
>>> a == b
True
>>> a is b
False
>>> atom(a) is atom(b)
True
>>> intern(a) is intern(b)
Traceback (most recent call last):
File "<pyshell#33>", line 1, in ?
intern(a) is intern(b)
TypeError: intern() argument 1 must be string, not tuple
