WRMF (or general) hyperparameter optimization strategy


Djoels

Feb 12, 2017, 9:51:43 AM
to MyMediaLite
Dear all,

I'm wondering what the best method for hyperparameter tuning would look like.

I am currently attempting this with the WRMF item recommender on a "large" dataset (11,000,000 user-item interactions).

My strategy so far has been:
* find the right alpha value (alpha in 1, 5, 10, 20, 40, 50, 100), with 50 as a default number of factors and the regularization parameter at its default of 0.015, and look at the results of the first 10 iterations
 --> the chosen alpha would be the one that converges most rapidly in terms of prec@k and NDCG scores
* find the right number of factors, using the optimal alpha found previously
 --> choose a number that keeps computation time feasible and still seems to capture the essential tendencies of the source data (again noting what happens after 10 iterations)
* find the right regularization parameter value for the alpha and num_factors found previously
* add iterations with the optimal parameter settings found above, and see which metric scores it converges to (and how many iterations really add something to the fit).
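
Concretely, the first step currently looks roughly like this on the command line (just a sketch; the file name is a placeholder and the exact option names should be double-checked against item_recommendation --help):

  # Sweep alpha with num_factors and regularization fixed, 10 iterations each;
  # the small test ratio keeps evaluation time manageable on ~11M interactions.
  for ALPHA in 1 5 10 20 40 50 100; do
    item_recommendation \
      --training-file=interactions.txt \
      --test-ratio=0.0025 --random-seed=1 \
      --recommender=WRMF \
      --recommender-options="num_factors=50 regularization=0.015 alpha=$ALPHA num_iter=10"
  done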

Questions regarding my own approach are:
* Is it nonsensical to work in the order I mentioned? Should some steps be reordered, or skipped altogether?
* What would the natural order for optimization be? Or is it really necessary to try all combinations of all parameters?
* In the WRMF paper an alpha value of 40 is proposed by the authors, as it worked well for their case. Why is an alpha value of 1 taken as the default in this implementation?
* What is the acceptable range of "search" values for a parameter such as regularization?

Can anybody shed some light on this?

Kind regards,

Julien

Zeno Gantner

Feb 12, 2017, 10:10:47 AM
to mymed...@googlegroups.com
Hi Julien,

First of all, the nice thing about WRMF is that, in my experience, it is fairly robust with respect to the choice of hyperparameters, and does not need many iterations.

On Sun, Feb 12, 2017 at 3:51 PM, Djoels <juv...@gmail.com> wrote:

Questions regarding my own approach are:
* Is it nonsensical to work in the order I mentioned? Should some steps be reordered, or skipped altogether?

1. I would first pick the number of factors, based on what is feasible computationally. Maybe pick two different sizes, and play with the smaller one first before going to the larger one.
2. I'd watch convergence on a hold-out set, not on the training set itself. If you use the command-line tool, --find-iter=N comes in handy for this.
3. I would try to optimize alpha and regularization concurrently, i.e. with a kind of grid search: try 3x3 = 9 value combinations on a logarithmic scale (e.g. 0.01, 0.1, 1 or 0.25, 0.5, 1) and then refine. Maybe a simplex-like approach or random search will save you some computation, but I am not sure ...
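
As a rough sketch of what I mean by the grid (the values and file name are placeholders, and the option names should be checked against --help for your version of the command-line tool):

  # 3x3 grid over alpha and regularization on a roughly logarithmic scale;
  # --find-iter=1 evaluates on the hold-out split after every iteration,
  # so you can watch convergence for each setting and then refine the grid.
  for ALPHA in 1 10 100; do
    for REG in 0.01 0.1 1; do
      item_recommendation \
        --training-file=interactions.txt \
        --test-ratio=0.0025 --random-seed=1 \
        --recommender=WRMF \
        --recommender-options="num_factors=50 regularization=$REG alpha=$ALPHA" \
        --find-iter=1 --max-iter=10
    done
  done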

 
* What would the natural order for optimization be? Or is it really necessary to try all combinations of all parameters?

See above.
 
* In the WRMF paper an alpha value of 40 is proposed by the authors, as it worked well for their case. Why is an alpha value of 1 taken as the default in this implementation?

The choice was more or less arbitrary -- maybe because it works nicely for typical datasets in the literature. Different data, different optimal hyperparameters.
 
* What is the acceptable range of "search" values for a parameter such as regularization?


Negative values make no sense. Everything else depends on your data.
 
Can anybody shed some light on this?


I hope my answers help at least a little bit.
I am also curious what others on the list have to say.

Julien, let us know if you find out more.

Best regards,
   Z.

 

Djoels

Feb 12, 2017, 10:38:27 AM
to MyMediaLite
Hi Zeno,

Thank you very much for your (extremely) quick reply!

On Sunday, February 12, 2017 at 16:10:47 UTC+1, Zeno Gantner wrote:

1. I would first pick the number of factors, based on what is feasible computationally. Maybe pick two different sizes, and play with the smaller one first before going to the larger one.
2. I'd watch convergence on a hold-out set, not on the training set itself. If you use the command-line tool, --find-iter=N comes in handy for this.

I am using a (rather small) test-ratio (so that evaluation time stays reasonable) and specify the random seed.
Am I correct in expecting that whenever you use an equal test-ratio and (more importantly) the same random seed, the hold-out data "generated" from this test-ratio should be the same?
Furthermore: do you think a test-ratio of 0.0025 could be problematic? The reason I would use such a small ratio is that evaluation takes very long with this many interactions.
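
For reference, this is the kind of invocation I mean (parameter values are placeholders; whether the split really is reproducible like this is exactly my question):

  # Same --test-ratio and --random-seed on every run; my assumption is that
  # the generated hold-out split is then identical across runs.
  item_recommendation \
    --training-file=interactions.txt \
    --test-ratio=0.0025 --random-seed=1 \
    --recommender=WRMF \
    --recommender-options="num_factors=50 regularization=0.015 alpha=40"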
 

Julien, let us know if you find out more.

Will do!
 


 

Zeno Gantner

Feb 12, 2017, 10:50:22 AM
to mymed...@googlegroups.com
On Sun, Feb 12, 2017 at 4:38 PM, Djoels <juv...@gmail.com> wrote:

I am using a (rather small) test-ratio (so that evaluation time stays reasonable) and specify the random seed.
Am I correct in expecting that whenever you use an equal test-ratio and (more importantly) the same random seed, the hold-out data "generated" from this test-ratio should be the same?
Furthermore: do you think a test-ratio of 0.0025 could be problematic? The reason I would use such a small ratio is that evaluation takes very long with this many interactions.


If you limit the number of users used for testing to something that is computationally feasible with --num-test-users=N, you should be fine with any test ratio.
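
For example (again just a sketch; adapt the numbers to what is feasible for you):

  # Evaluate on at most 2000 randomly chosen test users, so test time stays
  # bounded even if the test ratio itself is larger.
  item_recommendation \
    --training-file=interactions.txt \
    --test-ratio=0.1 --random-seed=1 --num-test-users=2000 \
    --recommender=WRMF \
    --recommender-options="num_factors=50 regularization=0.015 alpha=40"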

Cheers,
   Z.