multiset_partitions does not seem to be generating all possible partitions.
In particular, if I use this function to enumerate factorings of 672 it
does not find 2*2*8*21
Also, some of the partitions of [2,2,2,2,2,2] include the empty set as an
element.
I am aware of bug 2485 for multiset_partitions, but that seems to be
describing different behavior. (If this seems close enough, please go
ahead and merge this into 2485.)
Found this on 0.7.1, and it seems unchanged with a top-of-tree version I
just pulled with git. (Both tests on Linux, with 64 bit Python 2.7.3)
I am attaching a code snippet which shows the behavior.
This is very pertinent issue since I just committed kbins which depends on
this. I think this is different, because an empty set rather than None is
being returned, and as you point out, there are some partitions missing. We
probably just need to check the Knuth reference to see where we went wrong
in implementing this. That, or we might be able to file a bug report to
Knuth!
The relevant Knuth reference is Algorithm M, (Multipartitions in decreasing
lexicographic order), page 429 of volume 4, TAOCP, section 7.2.1.5. While
it is of course *possible* that there is an error in Knuth's presentation
of the algorithm, that wouldn't be my first bet.
WRT the list of factorings in comment 3 -- yes, it looks like an
improvement, although my feel for multiset partitions is not good enough
that I can just eyeball it. It is still visiting the same factoring
multiple times, but that is a known issue.
It looks like Sage implements multiset partitions
(sage/combinat/subset.py). I'll see if I can use that to generate some
test cases.
Oops - scratch the Sage reference I gave at the end of comment #4. The
class subset.py in sage enumerates the *subsets* of a multiset, not the
partitions.
I was going to point you to the oracle himself. Knuth still has the
fasicles on his web page. See <
www-cs-faculty.stanford.edu/~uno/fasc3b.ps.gz >.
The discussion of multiset partitions is on pages 38-40 of this file
(corresponding to pp 428-430 in the real book).
But ... there are some differences between the fascicle and the printed
version. First of all, algorithm M goes from being "simple" in the fasicle
to "fairly simple" in print :). More pertinently, he made some changes in
the code itself. Perhaps saptman (GSOC 11) was working from the fascile
and lead astray?
WRT post 9. Great! I will take a look, once I figure out how to use GIT a
bit better.
Meanwhile, for testing, here are some possibilities:
1) Multiset partitions are generalizations of integer partitions and set
partitions. As a sanity check we can verify that the multiset
implementation gives the same results on these simpler cases.
2) Multiset partitions can be implemented as a wrapper on top of set
partitions, and then filtering out the duplicates. I have attached a file
which does the wrapping, but not the filtering (yet). I can try it against
pull 1658 to verify that it gets the same answers.
3) Counting multiset partitions (comment 5) is possible, but I didn't see
an easy-to-understand formula or anything.
I haven't done an exhaustive regression but it is providing the correct
answer on everything I have checked so far.
Using your implementation to enumerate all the possible factorings of numbers up to 12000 takes 624 seconds on my laptop. (Well, not counting
the powers of primes. I had code which used integer partitions for that
special case.) This is a *lot* faster than the filter method I posted in
comment 11. Enumerating factoring of numbers up to 1000 takes 292 seconds
with my code, and only 2 seconds with yours. The large discrepancy is
surprising, since in each case we look at all the set partitions and
collapse them down as needed. If the performance difference is real, it
might be worth exposing the core of your code as a set partitions
implementation.
Some specific comments (and sorry, if I should be commenting on the pull
request itself, let me know).
1) You still have the second parameter, m, but the comment explaining what
it is for has gone away.
2) Integer partitions could be special cased to use
combinatorics.partitions. The difference is pretty dramatic. For example,
for n=14 there are 190 million set partitions, but only 135 integer
partitions.
3) Integer versus set partitions is the extreme case, but any time the
multiplicity of elements is high, a proper implementation of Knuth's
7.1.2.5m is going to be much faster than finding equivalence classes of set
partitions. (And to be fair, Knuth himself would not refer to 7.1.2.5m as
his algorithm. He traces it back to work by Wallis in 1685.)
4) After much procrastination, I think I now have my head around 7.1.2.5m. I will see if I can produce an implementation of it, if only for my own
edification. One thing that might have to be sacrificed would be the
second parameter. Knuth provides, in an exercise, a version which produces
partitions of at most r parts, but not exactly r parts.
Can you explain what you mean by "exposing the core of your code as a set
partitions implementation"? I think I might get it but let me explain the
current situation:
iterables contains partitions and multiset_partitions. `partitions` is
actually a very fast integer partition algorithm. multiset_partition
handles both set and multiset partitions. So perhaps you are suggesting
that partitions should be renamed integer_partitions, I expose the core of
multiset_partitions as set_partitions and then use that in
multiset_partitions. Yes? Perhaps a better thing to do would be to make
partitions itself the doorway to all the functions. 1) if input is an
integer you get integer partitions; 2) if you pass a list (or set) you get
set partitions; 3) if you pass a dictionary you get multiset partitions.
> Some specific comments (and sorry, if I should be commenting on the pull
> request itself, let me know).
On the PR is a the preferred place, but the two refer to each other so this
is ok, too.
> 1) You still have the second parameter, m, but the comment explaining
> what it is for has gone away.
I'll add more description.
> 2) Integer partitions could be special cased to use
> combinatorics.partitions. The difference is pretty dramatic. For
> example, for n=14 there are 190 million set partitions, but only 135
> integer partitions.
I'm not sure this is necessary since Partitions only accepts set-like input:
> 3) Integer versus set partitions is the extreme case, but any time the
> multiplicity of elements is high, a proper implementation of Knuth's
> 7.1.2.5m is going to be much faster than finding equivalence classes of
> set partitions. (And to be fair, Knuth himself would not refer to
> 7.1.2.5m as his algorithm. He traces it back to work by Wallis in 1685.)
Agreed. If you can get the routine running again it would be great to see
it working. I think Knuth said something about the enduring quality of an
algorithm like "you just have to see it". Well the nextequ has that quality
for me...it's so simple!
> 4) After much procrastination, I think I now have my head around
> 7.1.2.5m. I will see if I can produce an implementation of it, if only
> for my own edification. One thing that might have to be sacrificed would
> be the second parameter. Knuth provides, in an exercise, a version which
> produces partitions of at most r parts, but not exactly r parts.
Sorry for any confusion. I am not using the terminology consistently and am
probably (still) a bit confused by the relationship betweeen the code in
utilities.iterables and combinatorics.partitions.
For the "expose the core of" remark, I didn't realize that yes, your
multiset_partitions code is a perfectly good implementation of set
partitions. It even has a provision for avoiding creation of the cache in
this case. BTW, I think there is a bug in this bit -- you set canon = None
(if no dups are found) but later test against canon == 0. Shouldn't one be
changed?
What I meant with my point number 2 is that if I feed multiset_partitions
input in which is just the same element repeated n times, the output is
isomorphic (except for the m parameter) to the partitions of the integer n.
(And in my earlier comment, I was confusing combinatorics and utilities
implementations of integer partitions.)
In code:
from sympy.utilities.iterables import partitions
for m in xrange(1, 6):
print [p for p in multiset_partitions('aaaaa', m)]
for p in partitions(5):
print p
Except that for large n, the second expression is using a much faster
algorithm.
Thanks for your clarifying comments. multiset_partitions is now a clearing
house for any enumerations of sets (with or without repeats). It uses the
partitions() routine if there is only 1 element in the multiset. I also
exposed (as a private method) the nexequ routine under the name
_set_partitions.
(Changing the routine to be 1-based was a delicate procedure! I left the
original version in case some bug is discovered in what I did.)
btw, the old multiset code didn't look at the values of the multiset.
Unless a routine does, I don't know how anything could be faster than the
replacement code that I put it when you consider the simple output that it
is generating. Here is the output of _set_partitions for n=5
It's all very orderly and I suspect the 20+ lines of code that are
generating it are about as compact as you can get. So, again, unless any
replacement code actually considers the multiplicity of the elements I
think this is as good as we can get. The challenge for any code that
considers the multiplicity is to produce a pattern like the following:
>>> L=[1,1,1,2,2]
>>> s=set()
>>> for n, p in _set_partitions(len(L)):
Re comment 17 - Great! With a general dispatcher routine, if/when
7.2.1.5m working, can be later integrated, for those cases where it
provides a speedup.
re: comment 18 / checking in pull request 1658.
I agree that the existing code isn't functioning as anything better
than a (buggy) set partitions routine, since it doesn't look at its
elements. I have to admit that I have not been able to understand the
code in that is currently checked in -- is the original author
available to shed some light on the intent? It doesn't look anything
like algorithm M (the data structures are different), and it also
doesn't look like restricted growth strings, which is the main
algorithm Knuth presents for set partitions. So, I am puzzled.
At any rate, the choice between broken and mysterious and working and
understood is pretty clearcut, so checking this in will be a big
improvement.
I should have some time tomorrow to review your code more carefully.
(Of course, any feedback I provide comes with the proviso that I am a
complete newbie wrt sympy.)
One thing I noticed when reading it and trying examples -- if the
input multiset has no dups, you carefully set canon to None. But, in
this case, can't you also skip creating canonical, and checking the
cache?
As a test, I switched the code at the tail end of multiset_partitions to
if canon:
canonical = tuple(
sorted([tuple([canon[i] for i in j]) for j in rv]))
if canonical not in cache:
cache.add(canonical)
yield [[multiset[j] for j in i] for i in rv]
else:
yield [[multiset[j] for j in i] for i in rv]
and noticed a pleasant speedup. (624 to 550 seconds on my factorings
to 12000 code.) Of course, I am not claiming that this is *the* fix,
but you get the idea.
I have had a chance to look at the pull request, and perform some
testing. (I see that you are adding more commits, so this is only
up-to-date as of 2c11617.)
Thanks for the commit which responds to comment 20 (and incorporates the
optimization).
First, since I am new to sympy and git, here is what I did to get
access to your changed files:
If this is wrong, that might explain inconsistencies in the comments
below (and please let me know, so I can fix my process).
Overall, the code looks very clean.
I paid particular attention to _set_partitions. I agree that this
implementation is about as good as possible for set
partitions in the m=None case. After all, it is amortized constant
time per partition vector returned, and it is hard to do better than
constant time! As used in multiset_partitions, in the the m=<number>
case, there may well be a faster approach than generating all the
patitions and filtering to get only those of the desired size. But, I
am certainly not suggesting we try for that case -- this code is
already quite elaborate, and a huge improvement over what is currently
in master.
The code for _set_partitions is short, and well commented, but the the
algorithm (and especially the data structure) is going to be
non-obvious to someone looking at it for the first time.
You *might* consider adding something like the following to the
comments/doc:
This algorithm is similar to, and solves the same problem as,
Algorithm 7.2.1.5H, from volume 4A of Knuth's The Art of Computer
Programming. Knuth uses the term "restricted growth string" where
this code refers a "partition vector". In each case, the meaning is
the same: the value in the ith element of the vector specifies to
which part the ith set element is to be assigned.
At the lowest level, this code implements an n-digit big-endian
counter (stored in the array q) which is incremented (with carries) to
get the next partition in the sequence. A special twist is that a
digit is constrained to be at most one greater than the maximum of all
the digits to the left of it. The array b maintains this maximum, so
that the code can efficiently decide when a digit can be incremented
in place or whether it needs to be reset to 0 and trigger a carry to
the next digit. The enumeration starts with all the digits 0 (which
corresponds to all the set elements being assigned to the same 0th
part), and ends with 0123...n, which corresponds to each set element
being assigned to a different, singleton, part.
---------------------------
The helper function _partition provides similar, but lighter-weight
functionality to that provided by the classmethod
combinatorics.partitions.Partition.from_rgs(). Perhaps worth a
cross-reference?
I also wrote a tester to sanity check. I am attaching it to this
post. Perhaps something like test_partitions() is worth incorporating
into the tests? The one-off tests at the bottom of this most
likely duplicate tests you already have.
-----------------
multiset_partitions
--> docstring
The analytical discussion under "Counting"
These formulas only apply in the set case. Multisets are more
complicated... perhaps:
If the input is a set (no repeated elements), the number of partitions
returned is given by the bell number ...
----------------
I looked briefly at the new multiset_combinations and
multiset_permutations and didn't see anything to argue with. (Except
of course the TODO comment in multiset_combinations).
Misc - combinatorics.RGS_enum() seems to be computing the same thing
as bell()? Worth a cleanup or cross reference?
Re your Git workflow: I don't know if git checkout sets up the branch to
track by default. See if git pull updates the branch.
Usually, I just do "git checkout smichr/branch", which doesn't create a
local branch. To update with that model, just do git fetch and git checkout
again.
> You *might* consider adding something like the following to the
> comments/doc:
I added the excellent comments (with very minor modification) to the notes
of `_set_partitions`.
> The helper function _partition provides similar, but lighter-weight
> functionality to that provided by the classmethod
> combinatorics.partitions.**Partition.from_rgs(). Perhaps worth a
> cross-reference?
> I added it as a See Also
> I also wrote a tester to sanity check. I am attaching it to this
> post. Perhaps something like test_partitions() is worth incorporating
> into the tests? The one-off tests at the bottom of this most
> likely duplicate tests you already have.
> -----------------
> multiset_partitions
> --> docstring
> The analytical discussion under "Counting"
> These formulas only apply in the set case. Multisets are more
> complicated... perhaps:
done
> If the input is a set (no repeated elements), the number of partitions
> returned is given by the bell number ...
> ----------------
> I looked briefly at the new multiset_combinations and
> multiset_permutations and didn't see anything to argue with. (Except
> of course the TODO comment in multiset_combinations).
The TODO is done, too, and now my brain is mush.
> Misc - combinatorics.RGS_enum() seems to be computing the same thing
> as bell()? Worth a cleanup or cross reference?
> Other than the calculation for numbers < 1, it appears to be the same and