Should Dropout Be Applied Before Or After Activation?


Evan Klitzke

Apr 20, 2017, 1:26:14 PM
to Discuss
Hi,

I've been getting into machine learning and neural networks using TensorFlow, and I'm a bit confused about the best practice for ordering dropout and activation. For instance, let's say I have a densely connected layer, and I want to use ReLU activation and dropout. Which is the recommended ordering?

    dense -> relu -> dropout -> (other layers)

or

    dense -> dropout -> relu -> (other layers)

I understand there is a difference, because the implementation of dropout in TensorFlow scales the output to compensate for the dropout rate. For instance, with keep_prob=0.8, the nodes that were kept have their activations increased by 25% (i.e. scaled by 1/0.8). This can affect whether an activation function like ReLU produces a non-zero output.
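
To make that scaling concrete, here is a rough NumPy sketch of the "inverted dropout" behaviour as I understand it (just an illustration of the idea, not the actual tf.nn.dropout implementation):

    import numpy as np

    def inverted_dropout(x, keep_prob=0.8, seed=0):
        # Keep each element with probability keep_prob, zero out the rest,
        # and scale the survivors by 1/keep_prob (a 25% boost for keep_prob=0.8)
        # so the expected value of the output matches the input.
        rng = np.random.default_rng(seed)
        mask = rng.random(x.shape) < keep_prob
        return np.where(mask, x / keep_prob, 0.0)

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    print(inverted_dropout(x))  # kept entries come out as 1.25 * their original value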

Intuitively I would expect to apply dropout *after* activation, since I would think that dropout should not affect whether or not a node is activated. This is also what I see in the TF tutorial at https://www.tensorflow.org/tutorials/layers . However, in the original dropout paper it looks like it's the other way around: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf .

Thanks,
Evan

Sebastian Raschka

Apr 20, 2017, 1:36:11 PM
to Evan Klitzke, TensorFlow Mailinglist
Hi, Evan,

> dense -> relu -> dropout -> (other layers)
>
> or
>
> dense -> dropout -> relu -> (other layers)

isn't that producing exactly the same results? You can think of dropout as a mask over your array, e.g., say your "dense" output is

[1, 2, 3, 4, 5]

then you apply dropout (with a scale of 1/keep_prob = 1.5) and it becomes, e.g.,

[1*1.5, 0, 3*1.5, 0, 5*1.5]

Then, if you apply relu, it is still

[1*1.5, 0, 3*1.5, 0, 5*1.5]

If you do it the other way round, dense -> relu -> dropout, you have

relu([1, 2, 3, 4, 5]) -> [1, 2, 3, 4, 5]
dropout([1, 2, 3, 4, 5]) -> [1*1.5, 0, 3*1.5, 0, 5*1.5]
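
If you want to convince yourself numerically, here is a small NumPy check using the same fixed mask and the 1.5 = 1/keep_prob scaling from the example above (a toy sketch, not TF code; I threw in a couple of negative values so relu actually does something):

    import numpy as np

    x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])           # pretend "dense" output
    mask = np.array([1.0, 0.0, 1.0, 0.0, 1.0]) * 1.5    # same dropout pattern, scaled by 1/keep_prob
    relu = lambda a: np.maximum(a, 0.0)

    print(relu(x * mask))   # dense -> dropout -> relu
    print(relu(x) * mask)   # dense -> relu -> dropout
    # both print the same array, [1.5, 0, 4.5, 0, 7.5]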

Best,
Sebastian

Stefan Meili

Jul 31, 2020, 8:08:19 PM
to Discuss
An old post, but I think it's worth adding that it depends on what your activation function is. Sebastian is correct that ReLU maps zero to zero (and the surviving values just get scaled by the positive 1/keep_prob factor), so the order doesn't matter.

If you're using an exponential activation function, it maps a zero input to 1, and the only input it maps to zero is -inf. So applying dropout before the activation doesn't actually silence a unit (its output becomes 1), whereas applying dropout after the activation zeroes the output, which is equivalent to a pre-activation of -inf. I'm not sure, but I suspect that in this case applying dropout after the activation might be closer to the intent of dropout (fully deactivating a few units). I'm not sure there's a universal rule that works in all cases. I'd probably put dropout before a sigmoid activation, as that's trying to classify something as 0 or 1, and it maps a zero input to 0.5 (no decision).
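
To see the difference concretely, here's a toy NumPy comparison (my own numbers, reusing the mask and 1.5 scale from Sebastian's example) with a sigmoid instead of relu:

    import numpy as np

    x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
    mask = np.array([1.0, 0.0, 1.0, 0.0, 1.0]) * 1.5    # dropout mask scaled by 1/keep_prob
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    print(sigmoid(x * mask))   # dropout -> sigmoid: dropped units output 0.5 ("no decision")
    print(sigmoid(x) * mask)   # sigmoid -> dropout: dropped units output exactly 0
    # the two results differ, so with sigmoid (or exp) the ordering matters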

On an unrelated note, I'd like to plug Sebastian's text for him (in this obscure backwater of the internet). It's clear, concise and very well written. Seriously, thanks!

Stefan