Gradient descent method


Vidar

Jul 8, 2019, 4:41:21 PM
to Superpermutators
Hello everyone,

I want to start by saying that I haven't had any success with this method, but maybe the basic idea can be useful in some way.

So I have tried a simple gradient descent approach, which I will try to outline with an example where a length 9 superpermutation of the three objects A, B and C is sought:

Consider a string of length 9, where each position is associated with three non-negative variables representing the "A-ness", "B-ness" and "C-ness" of that position. We impose the constraint that the sum of these variables is 1 for each position, since there should be a total of one character per position. So all in all there are 27 non-negative variables which sum to 1 in triples, if that makes sense.
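
In code, the relaxed representation could look something like this minimal numpy sketch (the variable names are just illustrative):

    import numpy as np

    m, n = 9, 3                                   # string length, number of objects A, B, C

    rng = np.random.default_rng(0)
    soft_string = rng.random((m, n))              # 27 non-negative variables
    soft_string /= soft_string.sum(axis=1, keepdims=True)   # each triple now sums to 1

    # Row i holds the "A-ness", "B-ness" and "C-ness" of position i.
    print(soft_string.round(2))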

The goal is to define an objective function to optimize subject to these constraints, such that it attains its extreme values at points corresponding to superpermutations. Ideally a solution will look like this:

Pos 1: (1, 0, 0) = A
Pos 2: (0, 1, 0) = B
Pos 3: (0, 0, 1) = C
Pos 4: (1, 0, 0) = A
Pos 5: (0, 1, 0) = B
Pos 6: (1, 0, 0) = A
Pos 7: (0, 0, 1) = C
Pos 8: (0, 1, 0) = B
Pos 9: (1, 0, 0) = A

But values between 0 and 1 are also okay since each triple can be rounded to the closest object. Something like this would also be a solution then:

Pos 1: (0.9, 0.1, 0.0) ~ A
Pos 2: (0.1, 0.8, 0.1) ~ B
Pos 3: (0.2, 0.0, 0.8) ~ C
...

The objective functions that I have tested take as input the number of times each permutation occurs in the string, but how do we count permutations when we have fuzzy characters like this? I suggest two ways, which I will describe by counting the numbers of ABC's, BCA's and CBA's in the first three positions of the above example.

1. Multiplying variables:

The number of ABC's in the first three positions is 0.9 * 0.8 * 0.8 = 0.576, since there are 0.9 A's in the first position, 0.8 B's in the second position and 0.8 C's in the third position. The number of BCA's is 0.1 * 0.1 * 0.2 = 0.002, since there are 0.1 B's in the first position, 0.1 C's in the second position and 0.2 A's in the third position. The number of CBA's is 0 since there are 0.0 C's in the first position.

2. Raising some number to a sum of variables:

The number of ABC's is b^(0.9 + 0.8 + 0.8) / b^3 = b^-0.5, the number of BCA's is b^(0.1 + 0.1 + 0.2) / b^3 = b^-2.6 and the number of CBA's is b^(0.0 + 0.8 + 0.2) / b^3 = b^-2.0. If we choose b = 3 for example we get 0.577 ABC's, 0.057 BCA's and 0.111 CBA's.

The total permutation counts for the whole string are the sums of these numbers for all length 3 substrings.
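
In code, the two counting rules could be implemented like this (a minimal numpy sketch; soft_counts and soft_counts_exp are just names I made up):

    from itertools import permutations
    import numpy as np

    def soft_counts(P):
        """Counting rule 1: for each permutation, sum over all length-n windows of
        the product of the corresponding per-position variables."""
        m, n = P.shape
        counts = {}
        for perm in permutations(range(n)):
            counts[perm] = sum(np.prod([P[s + j, perm[j]] for j in range(n)])
                               for s in range(m - n + 1))
        return counts

    def soft_counts_exp(P, b=3.0):
        """Counting rule 2: b raised to the sum of the variables, divided by b**n."""
        m, n = P.shape
        counts = {}
        for perm in permutations(range(n)):
            counts[perm] = sum(b ** sum(P[s + j, perm[j]] for j in range(n)) / b ** n
                               for s in range(m - n + 1))
        return counts

    # The worked example above: the first three positions are roughly A, B, C.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.0, 0.8]])
    print(soft_counts(P)[(0, 1, 2)])       # 0.576 ABC's by rule 1
    print(soft_counts_exp(P)[(0, 1, 2)])   # ~0.577 ABC's by rule 2 with b = 3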

Now we want to put this all together in an objective function and optimize it. This is what I have tried:

1. Maximizing (the logarithm of) the product of the permutation counts. This product will be close to zero if any single permutation count is close to zero, so maximizing this objective function should produce strings where no permutation counts are close to zero.

2. Maximizing the sum of the square roots of the permutation counts. The square root grows the fastest close to zero, so this objective function might also result in strings where no permutation counts are close to zero.

3. Minimizing the permutation counts' deviations from 1 in the least squares sense. This function also penalizes strings where permutation counts grow unnecessarily large.
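
In code, the three candidates could be written as functions of the dictionary of soft permutation counts from the sketch above (names are mine):

    import numpy as np

    def log_product_objective(counts):        # 1. maximize
        return sum(np.log(c) for c in counts.values())

    def sqrt_sum_objective(counts):           # 2. maximize
        return sum(np.sqrt(c) for c in counts.values())

    def least_squares_objective(counts):      # 3. minimize
        return sum((c - 1.0) ** 2 for c in counts.values())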

There's a jungle of methods for solving this type of problem and I have little knowledge of the subject, but I'm aware that it's generally very difficult to find a global minimum of a non-convex function (or, equivalently, a global maximum of a non-concave one). I experimented with a simple gradient descent method with logarithmic barrier functions and the results were a bit discouraging. I think I found the length 9 superpermutation in about half of the runs when I started from random values. This could probably be improved by tuning various parameters.
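
For concreteness, here is a minimal, self-contained sketch of this kind of scheme: objective 1 on the rule-1 counts, a logarithmic barrier term keeping every variable positive, a crude finite-difference gradient standing in for the analytic one, and the sum-to-one constraints re-imposed by renormalising each row after every step. The barrier weight, step size and iteration count are arbitrary placeholders rather than tuned values, and this is a reconstruction, not the code I actually ran.

    from itertools import permutations
    import numpy as np

    def soft_counts(P):
        """Counting rule 1: sum over windows of the product of per-position variables."""
        m, n = P.shape
        return {perm: sum(np.prod([P[s + j, perm[j]] for j in range(n)])
                          for s in range(m - n + 1))
                for perm in permutations(range(n))}

    def objective(P, mu=1e-3):
        """Objective 1 (log of the product of the counts) plus a logarithmic
        barrier term that keeps every variable strictly positive."""
        counts = soft_counts(P)
        return sum(np.log(c) for c in counts.values()) + mu * np.log(P).sum()

    def grad(f, P, eps=1e-6):
        """Crude finite-difference gradient of f at P."""
        G = np.zeros_like(P)
        base = f(P)
        for idx in np.ndindex(*P.shape):
            Q = P.copy()
            Q[idx] += eps
            G[idx] = (f(Q) - base) / eps
        return G

    m, n = 9, 3
    rng = np.random.default_rng(1)
    P = rng.random((m, n)) + 0.1
    P /= P.sum(axis=1, keepdims=True)

    for _ in range(500):
        P = P + 0.01 * grad(objective, P)        # gradient *ascent* on the objective
        P = np.clip(P, 1e-9, None)               # stay strictly positive
        P /= P.sum(axis=1, keepdims=True)        # re-impose "each triple sums to 1"

    print("".join("ABC"[i] for i in P.argmax(axis=1)))   # round each position to a character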

This was a while back and I don't remember all the details, so I'm sorry if my explanation is unclear.

Kind regards

Vidar

Miles Gould

Jul 9, 2019, 6:18:44 AM
to Vidar, Superpermutators
On Mon, 8 Jul 2019 at 21:41, Vidar <vid...@gmail.com> wrote:
Consider a string of length 9, where each position is associated with three non-negative variables representing the "A-ness", "B-ness" and "C-ness" of that position. We impose the constraint that the sum of these variables is 1 for each position, since there should be a total of one character per position. So all in all there are 27 non-negative variables which sum to 1 in triples, if that makes sense.

You may know this already, but this trick is known as "one-hot encoding" in machine-learning: https://en.wikipedia.org/wiki/One-hot. In statistics such variables are apparently known as "dummy variables" (https://en.wikipedia.org/wiki/Dummy_variable_(statistics) ), and in linear programming I think they're called "switch variables".

A common way of building a classifier is to encode the classification of each training example using one-hot encoding, then train a neural network (or other system with continuous outputs) to reproduce these. When the trained system is then run on new data, it outputs its estimate of the probability that a datapoint belongs to each class.

[This is not a criticism! Reinventing a standard notion or technique is often a sign that you're thinking along promising lines :-)]
 
The number of ABC's in the first three positions is 0.9 * 0.8 * 0.8 = 0.576, since there are 0.9 A's in the first position, 0.8 B's in the second position and 0.8 C's in the third position. The number of BCA's is 0.1 * 0.1 * 0.2 = 0.002, since there are 0.1 B's in the first position, 0.1 C's in the second position and 0.2 A's in the third position. The number of CBA's is 0 since there are 0.0 C's in the first position.

This makes sense in terms of probabilities: the probability that the first three positions spell out ABC is 0.576.

2. Raising some number to a sum of variables:

The number of ABC's is b^(0.9 + 0.8 + 0.8) / b^3 = b^-0.5, the number of BCA's is b^(0.1 + 0.1 + 0.2) / b^3 = b^-2.6 and the number of CBA's is b^(0.0 + 0.8 + 0.2) / b^3 = b^-2.0. If we choose b = 3 for example we get 0.577 ABC's, 0.057 BCA's and 0.111 CBA's.

Wait, what? I don't understand why you're doing this. The expected total number of ABCs is P(ABC starting at position 1) + P(ABC starting at position 2) + ... + P(ABC starting at position 7). Where does the parameter b come from?

I did wonder for a while if you were calculating the generating function of the number of permutations (https://en.wikipedia.org/wiki/Generating_function), but in that case I'd expect the character-probabilities to be coefficients, not powers.
 
1. Maximizing (the logarithm of) the product of the permutation counts.

OK, so b falls out as a constant factor here.

3. Minimizing the permutation counts' deviations from 1 in the least squares sense. This function also penalizes strings where permutation counts grow unnecessarily large.

This sounds promising!

There's a jungle of methods for solving this type of problem and I have little knowledge of the subject, but I'm aware that it's generally very difficult to find a global minimum of a non-convex function and vice versa.

Also, we're ultimately trying to solve the related problem in integers, which is NP-complete. However, "relax the problem to real numbers, solve that, then use the solution to the relaxed problem as a starting point for your search" is a standard and powerful technique for many NP-complete problems, so this is definitely worth a shot.

I experimented with a simple gradient descent method with logarithmic barrier functions and the results were a bit discouraging. I think I found the length 9 superpermutation in about half of the runs when I started with random values. This could probably be improved by tuning various parameters.

Is your code available anywhere? There are a bunch of tricks for making gradient descent less prone to getting stuck in local minima, and there are also other algorithms like simulated annealing (https://en.wikipedia.org/wiki/Simulated_annealing).

This was a while back and I don't remember all the details, so I'm sorry if my explanation is unclear.

It makes sense to me apart from the b thing :-)

HTH,
Miles

Miles Gould

Jul 9, 2019, 9:09:42 AM
to Vidar, Superpermutators
On Tue, 9 Jul 2019, 11:18 Miles Gould, <mi...@assyrian.org.uk> wrote:
On Mon, 8 Jul 2019 at 21:41, Vidar <vid...@gmail.com> wrote:
I experimented with a simple gradient descent method with logarithmic barrier functions and the results were a bit discouraging.

What constraints are you encoding in your barrier functions? From your first message, it sounds like
  • all probabilities lie between 0 and 1
  • each position's probabilities sum to 1
This should be enough to rule out the trivial solution: set each probability to (m-n+1)^(-1/n), where n is the number of possible characters and m is the length of the string, so the expected number of occurrences of each permutation is (m-n+1)*((m-n+1)^(-1/n))^n = 1.

I think you could get some mileage out of adding additional constraints, if you're not already doing so:
  • the first characters are always ABC...
  • each character is different from the one before it
You could maybe also add some terms in your objective function to penalise higher-weight edges (indicated by a character occurring less than n positions apart, with smaller gaps between occurrences being penalised more heavily).
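
One possible concrete form for such a penalty, purely as a sketch (the weighting is an arbitrary choice of mine, not something I've tested):

    import numpy as np

    def repeat_penalty(P, n):
        """Penalty for the same character recurring within n positions.

        P is an (m, n) array of per-position character weights.  For every pair
        of positions at distance d (1 <= d < n), add (n - d) * sum_c P[i, c] * P[i+d, c],
        so that smaller gaps between repeated characters cost more."""
        m, _ = P.shape
        total = 0.0
        for d in range(1, n):
            for i in range(m - d):
                total += (n - d) * np.dot(P[i], P[i + d])
        return total

    # Example: a repeat at distance 1 in a hard assignment costs n - 1 = 2.
    P = np.array([[1.0, 0, 0], [1.0, 0, 0], [0, 1.0, 0]])
    print(repeat_penalty(P, 3))   # 2.0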

Also, and sorry for not saying this in my first reply: hi, and welcome! It's great to have you in the group, and I for one am intrigued by your approach: I'm not aware of anyone having tried it before. I'm really interested to see your code!

Miles

Miles Gould

Jul 10, 2019, 2:56:06 PM
to Vidar, Superpermutators
I had a go at implementing this myself, using PyTorch for the gradient descent:


This... did not go brilliantly. In particular, I switched to using quadratic loss functions rather than logarithmic barrier functions because the latter gave me NaNs everywhere.

So I tried again using scipy.optimize:


This, if anything, went worse - if I leave out the "must start with abc" constraint it quickly converges to all probabilities being 1/3, but if I add it in then the optimizer quickly converges to all probabilities after the first three positions being 0, in defiance of the "probabilities for a position must sum to 1" constraint. Does anyone have any experience with either of these toolkits, or any insight into what I'm doing wrong?

Thanks!
Miles

Cory Braker Scott

Jul 10, 2019, 3:18:07 PM
to Miles Gould, Vidar, Superpermutators
Hi all,

I've just been passively reading along on all of these threads until now, but numeric optimization is much more in my wheelhouse.

In general, I am quite pessimistic about using gradient descent for combinatorial optimization - there aren't really clear gradients to navigate for this type of problem (not sure I can make this statement more rigorous). However, I do think the idea of optimizing a relaxed version of the problem using some kind of numeric method is really promising!

I don't have time to implement either of these myself, but here are some ideas:
1) Start by minimizing a version of the objective function with no barriers/no constraints, and then gradually "warm up" the strength of the barrier function in relation to the objective.
2) Generally to get a list of probabilities, you use the softmax activation function - so the model's outputs can be unconstrained logits which are then converted into probabilities by the softmax.
(granted, I know you're not using a neural network, but the idea should still work)
3) There are neural network models which are specifically designed to analyze/output sequential data - LSTMs and GRUs in particular. Perhaps one of those models could be modified to output a list of characters, starting from the seed "abc" or some permutation, with the objective function being related to how close the output is to a superpermutation - for example this semi-recent paper using reinforcement learning to solve TSP. https://arxiv.org/pdf/1611.09940.pdf

Hope these are helpful!

C.B. Scott


Vidar

Jul 10, 2019, 6:48:22 PM
to Superpermutators
On Tuesday, 9 July 2019 at 12:18:44 UTC+2, Miles Gould wrote:
You may know this already, but this trick is known as "one-hot encoding" in machine-learning: https://en.wikipedia.org/wiki/One-hot. In statistics such variables are apparently known as "dummy variables" (https://en.wikipedia.org/wiki/Dummy_variable_(statistics) ), and in linear programming I think they're called "switch variables".

A common way of building a classifier is to encode the classification of each training example using one-hot encoding, then train a neural network (or other system with continuous outputs) to reproduce these. When the trained system is then run on new data, it outputs its estimate of the probability that a datapoint belongs to each class.

Interesting! I didn't know about this.

2. Raising some number to a sum of variables:

The number of ABC's is b^(0.9 + 0.8 + 0.8) / b^3 = b^-0.5, the number of BCA's is b^(0.1 + 0.1 + 0.2) / b^3 = b^-2.6 and the number of CBA's is b^(0.0 + 0.8 + 0.2) / b^3 = b^-2.0. If we choose b = 3 for example we get 0.577 ABC's, 0.057 BCA's and 0.111 CBA's.

Wait, what? I don't understand why you're doing this. The expected total number of ABCs is P(ABC starting at position 1) + P(ABC starting at position 2) + ... + P(ABC starting at position 7). Where does the parameter b come from?

I should have motivated this. The product formula has the disadvantage (?) that for instance ABCDDD counts as zero ABCDEF's, and the gradient of the objective function doesn't even indicate the direction toward ABCDEF. The exponential formula was just an attempt to fix this, but it certainly doesn't represent a probability.
 
3. Minimizing the permutation counts' deviations from 1 in the least squares sense. This function also penalizes strings where permutation counts grow unnecessarily large.

This sounds promising!

If I remember correctly, objective function 1 actually worked better (together with rule 1 for counting permutations).

What constraints are you encoding in your barrier functions? From your first message, it sounds like
  • all probabilities lie between 0 and 1
  • each position's probabilities sum to 1
This should be enough to rule out the trivial solution: set each probability to (m-n+1)^(-1/n), where n is the number of possible characters and m is the length of the string, so the expected number of occurrences of each permutation is (m-n+1)*((m-n+1)^(-1/n))^n = 1.

Yes, those are the constraints, but I'm not sure we're talking about the same probabilities. What do you mean by the trivial solution?

I think you could get some mileage out of adding additional constraints, if you're not already doing so:
  • the first characters are always ABC...
  • each character is different from the one before it
You could maybe also add some terms in your objective function to penalise higher-weight edges (indicated by a character occurring less than n positions apart, with smaller gaps between occurrences being penalised more heavily).

Great suggestions! I'll try this.

Also, and sorry for not saying this in my first reply: hi, and welcome! It's great to have you in the group, and I for one am intrigued by your approach: I'm not aware of anyone having tried it before. I'm really interested to see your code!

Thanks! :) I'll share the code after I've fixed a few things.

Vidar

Vidar

Jul 10, 2019, 6:55:56 PM
to Superpermutators
In the examples I have looked at, no permutation of k distinct symbols occurs more than 0! + 1! + 2! + ... + (n - k)! times in a minimal superpermutation of n symbols. Maybe this is a useful constraint.
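
A quick sketch of how the counts can be tallied, assuming the superpermutation is given as a plain string over symbols like "123456" (the function names are mine):

    from itertools import permutations
    from math import factorial

    def max_occurrences(s, k):
        """Largest number of times any permutation of k distinct symbols of s
        occurs as a contiguous substring of s."""
        counts = {"".join(p): 0 for p in permutations(sorted(set(s)), k)}
        for i in range(len(s) - k + 1):
            window = s[i:i + k]
            if window in counts:
                counts[window] += 1
        return max(counts.values())

    def conjectured_bound(n, k):
        """0! + 1! + ... + (n - k)!"""
        return sum(factorial(i) for i in range(n - k + 1))

    # Usage, with s holding one of the known length-872 superpermutations on "123456":
    # for k in range(1, 7):
    #     print(k, max_occurrences(s, k), "<=", conjectured_bound(6, k))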

Miles Gould

Jul 11, 2019, 6:19:46 AM
to Cory Braker Scott, Vidar, Superpermutators
On Wed, 10 Jul 2019 at 20:18, Cory Braker Scott <sco...@uci.edu> wrote:
In general, I am quite pessimistic about using gradient descent for combinatorial optimization - there aren't really clear gradients to navigate for this type of problem (not sure I can make this statement more rigorous).

I think that makes sense - and the objective landscapes we have here are deeply non-convex.

However, I do think the idea of optimizing a relaxed version of the problem using some kind of numeric method is really promising!

Agreed! I'd previously tried out an off-the-shelf constraint solver, which (AIUI) compiles the constraints down into an integer linear program and solves that (see https://github.com/pozorvlak/superbracelets), but performance was much worse than that of a specialised TSP solver; however, I think some other idea in this space might well work out better.

I don't have time to implement either of these myself, but here are some ideas:
1) Start by minimizing a version of the objective function with no barriers/no constraints, and then gradually "warm up" the strength of the barrier function in relation to the objective.

Makes sense.
 
2) Generally to get a list of probabilities, you use the softmax activation function - so the model's outputs can be unconstrained logits which are then converted into probabilities by the softmax.
(granted, I know you're not using a neural network, but the idea should still work)

Yes, that makes a lot of sense - and PyTorch should have no problem backpropagating through the softmax layer (after all, backpropagating through multi-layer neural networks is what it's designed for).
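
Here is a rough, untested sketch of that idea - unconstrained logits, a softmax per position so the simplex constraints hold by construction, and Vidar's objective 3 as the loss. This is not the code linked above, and the hyperparameters are arbitrary.

    from itertools import permutations
    import torch

    m, n = 9, 3
    perms = list(permutations(range(n)))

    logits = torch.randn(m, n, requires_grad=True)   # unconstrained parameters
    opt = torch.optim.Adam([logits], lr=0.05)

    for _ in range(2000):
        probs = torch.softmax(logits, dim=1)         # each row is a probability triple
        loss = 0.0
        for perm in perms:
            # Soft count of this permutation: sum over windows of the product rule.
            count = sum(torch.prod(torch.stack([probs[s + j, perm[j]] for j in range(n)]))
                        for s in range(m - n + 1))
            loss = loss + (count - 1.0) ** 2         # objective 3: squared deviation from 1
        opt.zero_grad()
        loss.backward()
        opt.step()

    probs = torch.softmax(logits, dim=1)
    print("".join("ABC"[i] for i in probs.argmax(dim=1).tolist()))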

3) There are neural network models which are specifically designed to analyze/output sequential data - LSTMs and GRUs in particular. Perhaps one of those models could be modified to output a list of characters, starting from the seed "abc" or some permutation, with the objective function being related to how close the output is to a superpermutation - for example this semi-recent paper using reinforcement learning to solve TSP. https://arxiv.org/pdf/1611.09940.pdf

Oh, that's very cool. Thanks!

Miles

Miles Gould

Jul 11, 2019, 6:26:35 AM
to Vidar, Superpermutators
On Wed, 10 Jul 2019 at 23:48, Vidar <vid...@gmail.com> wrote:
I should have motivated this. The product formula has the disadvantage (?) that for instance ABCDDD counts as zero ABCDEF's, and the gradient of the objective function doesn't even indicate the direction toward ABCDEF. The exponential formula was just an attempt to fix this, but it certainly doesn't represent a probability.

Ah, cunning!
 
What constraints are you encoding in your barrier functions? From your first message, it sounds like
  • all probabilities lie between 0 and 1
  • each position's probabilities sum to 1
This should be enough to rule out the trivial solution: set each probability to (m-n+1)^(-1/n), where n is the number of possible characters and m is the length of the string, so the expected number of occurrences of each permutation is (m-n+1)*((m-n+1)^(-1/n))^n = 1.

Yes, those are the constraints, but I'm not sure we're talking about the same probabilities. What do you mean by the trivial solution?

Sorry, I was unclear: by "the trivial solution", I meant "all entries are equal, such that each permutation has one expected appearance spread evenly over the whole string". In general, this "solution" does not meet the "entries are probabilities" constraints.
 
You could maybe also add some terms in your objective function to penalise higher-weight edges (indicated by a character occurring less than n positions apart, with smaller gaps between occurrences being penalised more heavily).

Great suggestions! I'll try this.

Good luck! After Cory's post, I'm wondering if we could represent these penalties using a convolutional layer...

Also, and sorry for not saying this in my first reply: hi, and welcome! It's great to have you in the group, and I for one am intrigued by your approach: I'm not aware of anyone having tried it before. I'm really interested to see your code!

Thanks! :) I'll share the code after I've fixed a few things.

Great! Looking forward to seeing it.

In the examples I have looked at, no permutation of k distinct symbols occurs more than 0! + 1! + 2! + ... + (n - k)! times in a minimal superpermutation of n symbols. Maybe this is a useful constraint.

Interesting! Have you checked this against all the example superpermutations in the repo?

Miles

Vidar

Jul 14, 2019, 5:50:42 PM
to Superpermutators
On Thursday, 11 July 2019 at 12:26:35 UTC+2, Miles Gould wrote:
Sorry, I was unclear: by "the trivial solution", I meant "all entries are equal, such that each permutation has one expected appearance spread evenly over the whole string". In general, this "solution" does not meet the "entries are probabilities" constraints.

Alright, now I'm with you.
 
Good luck! After Cory's post, I'm wondering if we could represent these penalties using a convolutional layer...

I wish I knew more about machine learning so I could give it a try myself.
 
In the examples I have looked at, no permutation of k distinct symbols occurs more than 0! + 1! + 2! + ... + (n - k)! times in a minimal superpermutation of n symbols. Maybe this is a useful constraint.

Interesting! Have you checked this against all the example superpermutations in the repo?
 
I've checked all superpermutations of length 872 for n = 6 and these are the maximum occurrences:

k = 1: 152
k = 2: 34
k = 3: 10
k = 4: 4
k = 5: 2
k = 6: 1

Vidar

Vidar

Jul 14, 2019, 5:59:18 PM
to Superpermutators

It works worse than I remembered, but now I have a rough idea of how to proceed. Thanks for all the feedback!