Question about regularization


Jiaji Zhou

Feb 4, 2013, 4:53:16 PM
to 10-701-spri...@googlegroups.com
Hi!
As taught in class, the SVM objective is a combination of the hinge loss and a squared L2 norm regularization term, and setting \lambda > 0 forces the components of w orthogonal to the span of the training features to zero.
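For concreteness, the objective I have in mind is roughly the following (my notation, with weights w, n training pairs (x_i, y_i), and the squared L2 norm as the regularizer):

\min_w \; \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, 1 - y_i \langle w, x_i \rangle\right) \;+\; \frac{\lambda}{2} \|w\|_2^2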
However, I find that I sometimes get better performance when setting \lambda = 0 (especially with a fairly large amount of data), and in fact the default value of \lambda in Vowpal Wabbit is zero.
Could someone explain this?

Thanks!

Krikamol Muandet

Feb 4, 2013, 5:14:33 PM
to Jiaji Zhou, 10-701-spri...@googlegroups.com
That's surprising. Could you please give us more detail on how you evaluate the accuracy? I guess the lambda is the regularization constant in the SVM primal form, isn't it?

Krik



--
Krikamol Muandet
PhD Student                                                   
Max Planck Institute for Intelligent Systems            
Spemannstrasse 38, 72076 Tübingen, Germany      
Telephone: +49-(0)7071 601 554
http://www.kyb.mpg.de/~krikamol

Jiaji Zhou

Feb 6, 2013, 5:51:28 PM
to 10-701-spri...@googlegroups.com, Jiaji Zhou
Hi, Krik!
Sorry for the late reply.
The problem is basically a linear SVM-Rank.
Yes, the \lambda is the regularization constant. 


Jiaji Zhou

Feb 6, 2013, 6:35:43 PM
to 10-701-spri...@googlegroups.com, Jiaji Zhou
I guess the reason might be that since it is only a linear SVM with a lot of training data, regularization might not be necessary, so we are effectively just minimizing the hinge loss (as a convex upper bound on the 0-1 classification error).

Also, Vowpal Wabbit does online gradient descent in the primal, and \lambda is effectively divided by (total number of data points N * #passes), so setting \lambda very small is natural.
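Just to make explicit where \lambda enters each update, here is a minimal sketch of primal SGD on the regularized hinge loss (my own toy version, not Vowpal Wabbit's actual update rule, which uses adaptive/normalized learning rates):

import numpy as np

def sgd_hinge_l2(X, y, lam=0.0, eta=0.1, passes=1):
    # Primal SGD on (1/n) * sum_i max(0, 1 - y_i <w, x_i>) + (lam/2) * ||w||^2
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(passes):
        for i in np.random.permutation(n):
            margin = y[i] * X[i].dot(w)
            # subgradient of the hinge term for this single example
            grad = -y[i] * X[i] if margin < 1 else np.zeros(d)
            # the regularizer contributes lam * w on every one of the
            # N * passes updates, so even a small lam adds up over a run
            w -= eta * (grad + lam * w)
    return w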

But the thing that bothers me is that it usually works best when \lambda is set to zero.
Further, we lose the logarithmic regret bound when \lambda = 0, since without the L2 norm term the objective is no longer strongly convex.
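(To spell out what I mean: with the regularizer, the per-round objective

f_t(w) = \max\left(0,\, 1 - y_t \langle w, x_t \rangle\right) + \frac{\lambda}{2} \|w\|_2^2

is \lambda-strongly convex, which is what lets online gradient descent with step sizes on the order of \eta_t = 1/(\lambda t) achieve O(\log T) regret; with \lambda = 0 the hinge loss alone is only convex, and the general bound is O(\sqrt{T}).)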

Thanks!

Alex Smola

Feb 6, 2013, 8:12:57 PM
to Jiaji Zhou, 10-701-spri...@googlegroups.com
ok, here's some clarification on the Vowpal Wabbit issue:

a) if you're doing stochastic gradient descent with ONE pass through the data, you do not need (much) regularization at all, since you're then minimizing a stochastic estimate of the expected risk directly.

b) ditto if you perform early stopping with sgd. there is actually quite a bit of research on early stopping. after all, this also ensures that the parameter w never gets too large (see e.g. bartlett & mendelson, jmlr 2001?). the trouble with early stopping is that it's hard to control.

c) there are plenty of cases where unregularized vw actually runs into lots of problems (tried & tested)

d) the objective function of vw is something like a huberized soft margin, that is, constant, then quadratic, then linear. but even then the updates are rather nonstandard, so it's hard to compare in many cases (maybe john fixed this recently).
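for concreteness, a sketch of one standard huberized hinge loss with that constant/quadratic/linear shape, as a function of the margin z = y * <w, x> (this is one common parametrization with smoothing width h, not necessarily vw's exact formula):

def huberized_hinge(margin, h=0.5):
    # zero (constant) once the margin exceeds 1, quadratic on [1 - h, 1],
    # linear below 1 - h; value and slope match at both joins
    if margin >= 1.0:
        return 0.0
    if margin >= 1.0 - h:
        return (1.0 - margin) ** 2 / (2.0 * h)
    return 1.0 - margin - h / 2.0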

cheers,

alex





--
                            ''~``
                           ( o o )
+----------------------.oooO--(_)--Oooo.---------------------------+
| Prof. Alexander J. Smola             http://alex.smola.org       |
| 5000 Forbes Ave                      phone: (+1) 408-759-1044    |
| Gates Hillman Center 8002            Carnegie Mellon University  |
| Pittsburgh 15213 PA    (   )   Oooo. Machine Learning Department |
+-------------------------\ (----(   )-----------------------------+                          
                          \_)    ) /
                                (_/