Switching between optimizers


burak...@gmail.com

Jun 16, 2021, 8:23:14 AM
to knet-users
Hi, 
I am experimenting with switching between optimizers but could not get Adam to work. Briefly:

  • function loss(w, x, dim) is defined, where w holds the learnable parameters
  • w is initialized (though not with Param())
  • I compute the gradient with "dw = grad(loss)(w, x, dim)"
  • I can update w with "update!(w, dw, SGD(lr=0.1))"
  • But I cannot update it with "update!(w, dw, Adam(lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, gclip=0))"; see the sketch below
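In code, roughly (a minimal sketch with a made-up loss; my real setup is more involved):

    using Knet                      # provides grad, update!, SGD, Adam
    dim = 10
    x = randn(dim, 20)
    loss(w, x, dim) = sum(abs2, w[1] * x .- w[2])  # toy stand-in for my real loss
    w = Any[0.1 * randn(5, dim), zeros(5)]         # plain arrays, not Param
    dw = grad(loss)(w, x, dim)      # old AutoGrad interface: gradient wrt first arg
    update!(w, dw, SGD(lr=0.1))     # this works
    update!(w, dw, Adam(lr=0.001))  # this is the call that fails for me
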
I must be missing something fundamental here. 
Thanks

Deniz Yuret

Jun 17, 2021, 5:44:51 AM
to burak...@gmail.com, knet-users
Can you try the new interface:

1. Wrap weight arrays in Param.  (e.g. w = Param(randn(20,10)))
2. Call the loss function under @diff. (e.g. J = @diff loss(w,x,dim))
3. Use grad to collect derivatives. (e.g. dw = grad(J, w))

You can see `@doc AutoGrad` and the tutorial for examples.
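Putting the three steps together (a minimal sketch with a toy loss):

    using Knet                     # re-exports AutoGrad's Param, @diff, grad
    w = Param(randn(20, 10))       # 1. wrap the weight array in Param
    x = randn(10, 5)
    loss(w, x) = sum(abs2, w * x)  # any scalar-valued differentiable function
    J = @diff loss(w, x)           # 2. record the computation
    dw = grad(J, w)                # 3. collect the derivative
    update!(w, dw, SGD(lr=0.1))    # SGD is stateless, so a fresh object is fine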




burak...@gmail.com

Jun 18, 2021, 2:24:19 AM
to knet-users
The problem is not in the gradient computation but in updating with Adam, or possibly with any other momentum-based scheme. The following loop with SGD works:

for epoch=1:epochs
    for (x, y) in dtrn
        fval = @diff loss(θ,ϕ,x) 
        for param in params(fval)  
            ∇param = grad(fval, param) 
            update!(param, ∇param) # SGD default - OK
            #update!(param, ∇param, SGD(lr=0.001))  # Also works
        end
    end
end

I could not get Adam updates to work. In fact, I wonder how a momentum-based update scheme, when used in the for loops above, would remember (keep track of) the previous updates. I can write the update equations explicitly in the inner loop, but I am curious whether the "update!" function or "adam()" can be used directly.
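
For reference, the explicit updates I mean would look something like this (plain-array sketch; m and v start at zero(w) and persist across iterations, t counts steps, and lr, β1, β2, ε are the usual hyperparameters):

    m .= β1 .* m .+ (1 - β1) .* dw                 # first-moment estimate
    v .= β2 .* v .+ (1 - β2) .* dw .^ 2            # second-moment estimate
    w .-= lr .* (m ./ (1 - β1^t)) ./ (sqrt.(v ./ (1 - β2^t)) .+ ε)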

burak

burak...@gmail.com

Jun 18, 2021, 2:31:35 AM
to knet-users
I got it to work following the first option in update!'s documentation. Namely, I specified the optimizer for each variable while defining and initializing it, such as

Param(xavier(nh, 28*28),Adam())

and then called 

update!(param, ∇param)
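
Slotting this into the earlier loop (a sketch; ϕ set up the same way, nh and dtrn as before):

    θ = Param(xavier(nh, 28*28), Adam())   # the Adam state is stored with the Param
    for epoch = 1:epochs
        for (x, y) in dtrn
            fval = @diff loss(θ, ϕ, x)
            for param in params(fval)
                # update! with no third argument uses the optimizer stored in
                # the parameter, so the Adam moments persist across iterations
                update!(param, grad(fval, param))
            end
        end
    end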

burak

Deniz Yuret

Jun 18, 2021, 11:38:45 AM
to burak...@gmail.com, knet-users
On Fri, Jun 18, 2021 at 9:24 AM burak...@gmail.com <burak...@gmail.com> wrote:
I wonder how a momentum-based update scheme, when used in the for loops above, would remember (keep track of) the previous updates.

I can answer that one: each Param object x has an x.opt field that stores the optimization state. For Adam this would be an object of type Adam (initialized either by calling adam/adam! directly, or by calling minimize/minimize! with Adam() as the optimization algorithm, etc.). The design principle here is that every parameter owns its optimization state, which carries the necessary statistics as well as specifying the algorithm. In principle you could have one array updated with SGD, another updated with Adam, use different learning rates, etc.
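
For example (a sketch, assuming θ and ϕ are single Param arrays):

    θ.opt = Adam(lr=0.001)           # θ is updated with Adam, with its own moments
    ϕ.opt = SGD(lr=0.1)              # ϕ is updated with plain SGD at a different lr
    for (x, y) in dtrn
        J = @diff loss(θ, ϕ, x)
        for p in params(J)
            update!(p, grad(J, p))   # each parameter is updated by its own p.opt
        end
    end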

 