Re: Strategy for determing difference between 2 very large dictionaries

Roger Binns

unread,

Dec 24, 2008, 2:26:49 AM12/24/08

to pytho...@python.org

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

pyt...@bdurham.com wrote:
> Feedback on my proposed strategies (or better strategies) would be
> greatly appreciated.

Both strategies will work but I'd recommend the second approach since it
uses already tested code written by other people - the chances of it
being wrong are far lower than new code.

You also neglected to mention what your concerns are or even what "very
large" is. Example concerns are memory consumption, cpu consumption,
testability, utility of output (eg as a generator getting each result on
demand or a single list with complete results). Some people will think
a few hundred entries is large. My idea of large is a working set
larger than my workstation's 6GB of memory :-)

In general the Pythonic approach is:

1 - Get the correct result
2 - Simple code (developer time is precious)
3 - Optimise for your data and environment

Step 3 is usually not needed.

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAklR5DUACgkQmOOfHg372QSuWACgp0xrdpW+NSB6qqCM3oBY2e/I
LIEAn080VgNvmEYj47Mm7BtV69J1GwXN
=MKLl
-----END PGP SIGNATURE-----

Gabriel Genellina

unread,

Dec 24, 2008, 2:46:04 AM12/24/08

to pytho...@python.org

En Wed, 24 Dec 2008 05:16:36 -0200, <pyt...@bdurham.com> escribió:

> I'm looking for suggestions on the best ('Pythonic') way to
> determine the difference between 2 very large dictionaries
> containing simple key/value pairs.
> By difference, I mean a list of keys that are present in the
> first dictionary, but not the second. And vice versa. And a list
> of keys in common between the 2 dictionaries whose values are
> different.
> The 2 strategies I'm considering are:
> 1. Brute force: Iterate through first dictionary's keys and
> determine which keys it has that are missing from the second
> dictionary. If keys match, then verify that the 2 dictionaries
> have identical values for the same key. Repeat this process for
> the second dictionary.
> 2. Use sets: Create sets from each dictionary's list of keys and
> use Python's set methods to generate a list of keys present in
> one dictionary but not the other (for both dictionaries) as well
> as a set of keys the 2 dictionaries have in common.

I cannot think of any advantage of the first approach - so I'd use sets.

k1 = set(dict1.iterkeys())
k2 = set(dict2.iterkeys())
k1 - k2 # keys in dict1 not in dict2
k2 - k1 # keys in dict2 not in dict1
k1 & k2 # keys in both

> Using the set
> of keys in common, compare values across dictionaries to
> determine which keys have different values (can this last step be
> done via a simple list comprehension?)

Yes; but isn't a dict comprehension more adequate?

[key: (dict1[key], dict2[key]) for key in common_keys if
dict1[key]!=dict2[key]}

(where common_keys=k1&k2 as above)

--
Gabriel Genellina

Chris Rebert

unread,

Dec 24, 2008, 2:53:38 AM12/24/08

to Gabriel Genellina, pytho...@python.org

On Tue, Dec 23, 2008 at 11:46 PM, Gabriel Genellina
<gags...@yahoo.com.ar> wrote:
<snip>

>
> Yes; but isn't a dict comprehension more adequate?
>
> [key: (dict1[key], dict2[key]) for key in common_keys if
> dict1[key]!=dict2[key]}

That initial [ should be a { of course.

Cheers,
Chris

--
Follow the path of the Iguana...
http://rebertia.com

Malcolm Greene

unread,

Dec 24, 2008, 3:04:57 AM12/24/08

to pytho...@python.org

Hi Roger,

By very large dictionary, I mean about 25M items per dictionary. Each
item is a simple integer whose value will never exceed 2^15.

I populate these dictionaries by parsing very large ASCII text files
containing detailed manufacturing events. From each line in my log file
I construct one or more keys and increment the numeric values associated
with these keys using timing info also extracted from each line.

Some of our log files are generated by separate monitoring equipment
measuring the same process. In theory, these log files should be
identical, but of course they are not. I'm looking for a way to
determine the differences between the 2 dictionaries I will create from
so-called matching sets of log files.

At this point in time, I don't have concerns about memory as I'm running
my scripts on a dedicated 64-bit server with 32Gb of RAM (but with
budget approval to raise our total RAM to 64Gb if necessary).

My main concern is am I applying a reasonably pythonic approach to my
problem, eg. am I using appropriate python techniques and data
structures? I am also interested in using reasonable techniques that
will provide me with the fastest execution time.

Thank you for sharing your thoughts with me.

Regards,
Malcolm

----- Original message -----
From: "Roger Binns" <rog...@rogerbinns.com>
To: pytho...@python.org
Date: Tue, 23 Dec 2008 23:26:49 -0800
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

--
http://mail.python.org/mailman/listinfo/python-list

pyt...@bdurham.com

unread,

Dec 24, 2008, 3:23:00 AM12/24/08

to Gabriel Genellina, pytho...@python.org

Hi Gabriel,

Thank you very much for your feedback!

> k1 = set(dict1.iterkeys())

I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
to using an iterator vs. a list as the basis for creating a set? I
understand that an iterator makes sense if you're working with a large
set of items one at a time, but if you're creating a non-filtered
collection, I don't see the advantage of using an iterator or a list.
I'm sure I'm missing a subtle point here :)

>> can this last step be done via a simple list comprehension?

> Yes; but isn't a dict comprehension more adequate?

>
> [key: (dict1[key], dict2[key]) for key in common_keys if
> dict1[key]!=dict2[key]}

Cool!! I'm relatively new to Python and totally missed the ability to
work with dictionary comprehensions. Yes, your dictionary comprehension
technique is much better than the list comprehension approach I was
struggling with. Your dictionary comprehension statement describes
exactly what I wanted to write.

Regards,
Malcolm

----- Original message -----
From: "Gabriel Genellina" <gags...@yahoo.com.ar>
To: pytho...@python.org
Date: Wed, 24 Dec 2008 05:46:04 -0200
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

Yes; but isn't a dict comprehension more adequate?

[key: (dict1[key], dict2[key]) for key in common_keys if
dict1[key]!=dict2[key]}

(where common_keys=k1&k2 as above)

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list

Marc 'BlackJack' Rintsch

unread,

Dec 24, 2008, 3:30:41 AM12/24/08

to

On Wed, 24 Dec 2008 03:23:00 -0500, python wrote:

> Hi Gabriel,
>
> Thank you very much for your feedback!
>
>> k1 = set(dict1.iterkeys())
>
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I
> understand that an iterator makes sense if you're working with a large
> set of items one at a time, but if you're creating a non-filtered
> collection, I don't see the advantage of using an iterator or a list.
> I'm sure I'm missing a subtle point here :)

`keys()` creates a list in memory, `iterkeys()` does not. With
``set(dict.keys())`` there is a point in time where the dictionary, the
list, and the set co-exist in memory. With ``set(dict.iterkeys())`` only
the set and the dictionary exist in memory.

Ciao,
Marc 'BlackJack' Rintsch

pyt...@bdurham.com

unread,

Dec 24, 2008, 3:39:56 AM12/24/08

to Marc 'BlackJack' Rintsch, pytho...@python.org

Hi Marc,

> `keys()` creates a list in memory, `iterkeys()` does not. With
> ``set(dict.keys())`` there is a point in time where the dictionary, the
> list, and the set co-exist in memory. With ``set(dict.iterkeys())``
> only the set and the dictionary exist in memory.

Perfect explanation.

Thank you!
Malcolm

----- Original message -----
From: "Marc 'BlackJack' Rintsch" <bj_...@gmx.net>
To: pytho...@python.org
Date: 24 Dec 2008 08:30:41 GMT
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

Ciao,
Marc 'BlackJack' Rintsch
--
http://mail.python.org/mailman/listinfo/python-list

Peter Otten

unread,

Dec 24, 2008, 3:46:51 AM12/24/08

to

Gabriel Genellina wrote:

> En Wed, 24 Dec 2008 05:16:36 -0200, <pyt...@bdurham.com> escribió:

[I didn't see the original post]

If you are not interested in the intermediate results and the dictionary
values are hashable you can get the difference by

>>> a = dict(a=1, b=2, c=3)
>>> b = dict(b=2, c=30, d=4)
>>> dict(set(a.iteritems()) ^ set(b.iteritems()))
{'a': 1, 'c': 3, 'd': 4}

Peter

James Stroud

unread,

Dec 24, 2008, 3:57:35 AM12/24/08

to

For the purpose of perpetuating the annoying pedantry that has made
usenet great:

http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists

James

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com

pyt...@bdurham.com

unread,

Dec 24, 2008, 4:08:36 AM12/24/08

to pytho...@python.org, James Stroud

Hi James,

> For the purpose of perpetuating the annoying pedantry that has made
> usenet great:
>
> http://docs.python.org/dev/3.0/whatsnew/3.0.html#views-and-iterators-instead-of-lists

Great tip! Thank you!

Malcolm

Gabriel Genellina

unread,

Dec 24, 2008, 4:10:16 AM12/24/08

to pytho...@python.org

En Wed, 24 Dec 2008 06:23:00 -0200, <pyt...@bdurham.com> escribió:

> Hi Gabriel,
>
> Thank you very much for your feedback!
>
>> k1 = set(dict1.iterkeys())
>
> I noticed you suggested .iterkeys() vs. .keys(). Is there any advantage
> to using an iterator vs. a list as the basis for creating a set? I

You've got an excelent explanation from Marc Rintsch. (Note that in Python
3.0 keys() behaves as iterkeys() in previous versions, so the above code
is supposed to be written in Python 2.x)

>>> can this last step be done via a simple list comprehension?
>
>> Yes; but isn't a dict comprehension more adequate?
>>
>> [key: (dict1[key], dict2[key]) for key in common_keys if
>> dict1[key]!=dict2[key]}
>
> Cool!! I'm relatively new to Python and totally missed the ability to
> work with dictionary comprehensions. Yes, your dictionary comprehension
> technique is much better than the list comprehension approach I was
> struggling with. Your dictionary comprehension statement describes
> exactly what I wanted to write.

This time, note that dict comprehensions require Python 3.0 -- so the code
above won't work in Python 2.x.
(It's not a good idea to mix both versions in the same post, sorry!)
You might use instead:

dict((key,(dict1[key],dict2[key])) for key in ...)

but it's not as readable.

--
Gabriel Genellina

pyt...@bdurham.com

unread,

Dec 24, 2008, 4:23:21 AM12/24/08

to Peter Otten, pytho...@python.org

Hi Peter,

<quote>

If you are not interested in the intermediate results and the dictionary
values are hashable you can get the difference by

>>> a = dict(a=1, b=2, c=3)
>>> b = dict(b=2, c=30, d=4)
>>> dict(set(a.iteritems()) ^ set(b.iteritems()))
{'a': 1, 'c': 3, 'd': 4}

</quote>

That's very cool! Thanks for sharing that technique.

Regards,
Malcolm

----- Original message -----
From: "Peter Otten" <__pet...@web.de>
To: pytho...@python.org

Gabriel Genellina wrote:

Peter
--
http://mail.python.org/mailman/listinfo/python-list

Malcolm Greene

unread,

Dec 24, 2008, 4:31:21 AM12/24/08

to pytho...@python.org, Gabriel Genellina

Hi Gabriel,

> in Python 3.0 keys() behaves as iterkeys() in previous versions, so the above code is supposed to be written in Python 2.x)

I understand. Thank you.

> note that dict comprehensions require Python 3.0

I'm relieved to know that I didn't miss that feature in my reading of
Python's 2.5/2.6 documentation :)

> You might use instead:
>
> dict((key,(dict1[key],dict2[key])) for key in ...)

Excellent. Thank you.

Regards,
Malcolm

----- Original message -----
From: "Gabriel Genellina" <gags...@yahoo.com.ar>
To: pytho...@python.org
Date: Wed, 24 Dec 2008 07:10:16 -0200
Subject: Re: Strategy for determing difference between 2 very large
dictionaries

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list

bearoph...@lycos.com

unread,

Dec 24, 2008, 6:24:02 AM12/24/08

to

Peter Otten:

> >>> a = dict(a=1, b=2, c=3)
> >>> b = dict(b=2, c=30, d=4)
> >>> dict(set(a.iteritems()) ^ set(b.iteritems()))

For larger sets this may be better, may avoid the creation of the
second set:

dict(set(a.iteritems()).symmetric_difference(b.iteritems()))

Bye,
bearophile

Scott David Daniels

unread,

Dec 24, 2008, 2:19:03 PM12/24/08

to

Gabriel Genellina wrote:
> En Wed, 24 Dec 2008 06:23:00 -0200, <pyt...@bdurham.com> escribió:

>> ... k1 = set(dict1.iterkeys())

> You've got an excelent explanation from Marc Rintsch. (Note that in
> Python 3.0 keys() behaves as iterkeys() in previous versions, so the
> above code is supposed to be written in Python 2.x)

And, in fact, a dictionary iterates its keys, so:
k1 = set(dict1)
works in 2.4, 2.5, 2.6, and 3.0

--Scott David Daniels
Scott....@Acm.Org

Terry Reedy

unread,

Dec 24, 2008, 6:38:12 PM12/24/08

to pytho...@python.org

If you can, consider using 3.0 in which d.keys() is a set-like view of
the keys of d. Same for d.values and d.items. The time and space to
create such is O(1), I believe. since they are just alternate read-only
interfaces to the internal dict storage. There is not even a separate
set until you do something like intersect d1.keys() with d2.keys() or
d1.keys() - d2.keys(). I think this is an under-appreciated feature of
3.0 which should be really useful with large dicts.

tjr

John Machin

unread,

Dec 24, 2008, 8:48:43 PM12/24/08

to

On Dec 24, 7:04 pm, "Malcolm Greene" <mgre...@bdurham.com> wrote:
> Hi Roger,
>
> By very large dictionary, I mean about 25M items per dictionary. Each
> item is a simple integer whose value will never exceed 2^15.

In Python-speak about dictionaries, an "item" is a tuple (key, value).
I presume from the gross difference between 25M and 2**15
-- more pedantry: 2^15 is 13 :-) -- that you mean that the values
satisfy 0 <= value <= 2**15.

>
> I populate these dictionaries by parsing very large ASCII text files
> containing detailed manufacturing events. From each line in my log file
> I construct one or more keys and increment the numeric values associated
> with these keys using timing info also extracted from each line.
>
> Some of our log files are generated by separate monitoring equipment
> measuring the same process. In theory, these log files should be
> identical, but of course they are not. I'm looking for a way to
> determine the differences between the 2 dictionaries I will create from
> so-called matching sets of log files.

You seem to refer to the dictionaries as part of your requirements,
not as part of the solution. Do you *need* the two dictionaries after
you have obtained the differences?

Let's start from the beginning: You have two log files, each of which
can be abstracted as containing lots of (k, v) data, where each k
describes an event and each v is a compressed 15-bit timestamp. You
want to know for each datum whether it is (a) in both files (b) only
in file1 (c) only in file2. Is that a fair summary?

If so, there's an alternative that doesn't need to use dictionaries.
Each file can be represented by a *list* of 32768 sets. Each set
contains the ks that happened at the corresponding time... sets[fileno]
[v].add(k). Once you've built that, you trundle through the pairs of
sets doing set differences etc.

Bear in mind that Python dicts and sets grow as required by doubling
(IIRC) in size and rehashing from old to new. Consequently you get
periodical disturbances which are not usually noticed but may be
noticeable with large amounts of data.

HTH,
John