Leakyrelu Activation Function

0 views

Skip to first unread message

Marketta Filipovich

unread,

Aug 5, 2024, 12:48:20 AM8/5/24

to calkatamti

Artificialneural networks are inspired by the biological neurons within the human body which activate under certain circumstances resulting in a related action performed by the body in response. Artificial neural nets consist of various layers of interconnected artificial neurons powered by activation functions which help in switching them ON/OFF. Like traditional machine learning algorithms, here too, there are certain values that neural nets learn in the training phase.

Briefly, each neuron receives a multiplied version of inputs and random weights which is then added with static bias value (unique to each neuron layer), this is then passed to an appropriate activation function which decides the final value to be given out of the neuron. There are various activation functions available as per the nature of input values. Once the output is generated from the final neural net layer, loss function (input vs output)is calculated and backpropagation is performed where the weights are adjusted to make the loss minimum. Finding optimal values of weights is what the overall operation is focusing around.

So, an activation function is basically just a simple function that transforms its inputs into outputs that have a certain range. There are various types of activation functions that perform this task in a different manner, For example, the sigmoid activation function takes input and maps the resulting values in between 0 to 1.

One of the reasons that this function is added into an artificial neural network in order to help the network learn complex patterns in the data. These functions introduce nonlinear real-world properties to artificial neural networks. Basically, in a simple neural network, x is defined as inputs, w weights, and we pass f (x) that is the value passed to the output of the network. This will then be the final output or the input of another layer.

If the activation function is not applied, the output signal becomes a simple linear function. A neural network without activation function will act as a linear regression with limited learning power. But we also want our neural network to learn non-linear states as we give it complex real-world information such as image, video, text, and sound.

ReLU stands for rectified linear activation unit and is considered one of the few milestones in the deep learning revolution. It is simple yet really better than its predecessor activation functions such as sigmoid or tanh.

ReLU function is its derivative both are monotonic. The function returns 0 if it receives any negative input, but for any positive value x, it returns that value back. Thus it gives an output that has a range from 0 to infinity.

As we have seen above, the ReLU function is simple and it consists of no heavy computation as there is no complicated math. The model can, therefore, take less time to train or run. One more important property that we consider the advantage of using ReLU activation function is sparsity.

The activations functions that were used mostly before ReLU such as sigmoid or tanh activation function saturated. This means that large values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid respectively. Further, the functions are only really sensitive to changes around their mid-point of their input, such as 0.5 for sigmoid and 0.0 for tanh. This caused them to have a problem called vanishing gradient problem. Let us briefly see what vanishing gradient problem is.

Neural Networks are trained using the process gradient descent. The gradient descent consists of the backward propagation step which is basically chain rule to get the change in weights in order to reduce the loss after every epoch. It is important to note that the derivatives play an important role in updating of weights. Now when we use activation functions such as sigmoid or tanh, whose derivatives have only decent values from a range of -2 to 2 and are flat elsewhere, the gradient keeps decreasing with the increasing number of layers.

This reduces the value of the gradient for the initial layers and those layers are not able to learn properly. In other words, their gradients tend to vanish because of the depth of the network and the activation shifting the value to zero. This is called the vanishing gradient problem.

But there are some problems with ReLU activation function such as exploding gradient. The exploding gradient is opposite of vanishing gradient and occurs where large error gradients accumulate and result in very large updates to neural network model weights during training. Due to this, the model is unstable and unable to learn from your training data.

Leaky ReLU function is an improved version of the ReLU activation function. As for the ReLU activation function, the gradient is 0 for all the values of inputs that are less than zero, which would deactivate the neurons in that region and may cause dying ReLU problem.

Leaky ReLU is defined to address this problem. Instead of defining the ReLU activation function as 0 for negative values of inputs(x), we define it as an extremely small linear component of x. Here is the formula for this activation function

This function returns x if it receives any positive input, but for any negative value of x, it returns a really small value which is 0.01 times x. Thus it gives an output for negative values as well. By making this small modification, the gradient of the left side of the graph comes out to be a non zero value. Hence we would no longer encounter dead neurons in that region.

This brings us to the end of this article where we learned about ReLU activation function and Leaky ReLU activation function. You can click the banner below to get a free deep learning course and enhance your skills.

Leaky Rectified Linear Unit, or Leaky ReLU, is a type of activation function based on a ReLU, but it has a small slope for negative values instead of a flat slope. The slope coefficient is determined before training, i.e. it is not learnt during training. This type of activation function is popular in tasks where we may suffer from sparse gradients, for example training generative adversarial networks.

Are there any syntactic changes to the architecture I should make, or is there a deeper issue here? Any advice would be appreciated, as I had already tried setting some custom objects as described here:

There seem to be some issues when saving & loading models with such "non-standard" activations, as implied also in the SO thread keras.load_model() can't recognize Tensorflow's activation functions ; the safest way would seem to be to re-write your model with the LeakyReLU as a layer, and not as an activation:

We could specify the activation function in the dense layer itself, by using aliases like activation='relu', which would use the default keras parameters for relu. There is no such aliases available in keras, for LeakyRelu activation function. We have to use tf.keras.layers.LeakyRelu or tf.nn.leaky_relu.

We cannot set number of units in Relu layer, it just takes the previous output tensor and applies the relu activation function on it. You have specified the number of units for the Dense layer not the relu layer. When we specify Dense(1024, activation="relu") we multiply the inputs with weights, add biases and apply relu function on the output (all of this is mentioned on a single line). From the method mentioned on step 1, this process is done in 2 stages firstly to multiply weights, add biases and then to apply the LeakyRelu activation function (mentioned in 2 lines).

In the context of artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function[1][2] is an activation function defined as the positive part of its argument:

where x \displaystyle x is the input to a neuron. This is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. This activation function was introduced by Kunihiko Fukushima in 1969 in the context of visual feature extraction in hierarchical neural networks.[3][4][5] It was later argued that it has strong biological motivations and mathematical justifications.[6][7] In 2011 it was found to enable better training of deeper networks,[8] compared to the widely used activation functions prior to 2011, e.g., the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more practical[9] counterpart, the hyperbolic tangent. The rectifier is, as of 2017[update], the most popular activation function for deep neural networks.[10]

Rectifying activation functions were used to separate specific excitation and unspecific inhibition in the neural abstraction pyramid, which was trained in a supervised way to learn several computer vision tasks.[16] In 2011,[8] the use of the rectifier as a non-linearity has been shown to enable training deep supervised neural networks without requiring unsupervised pre-training. Rectified linear units, compared to sigmoid function or similar activation functions, allow faster and effective training of deep neural architectures on large and complex datasets.

Exponential Linear Unit or its widely known name ELU is a function that tend to converge cost to zero faster and produce more accurate results. Different to other activation functions, ELU has a extra alpha constant which should be positive number.

ELU is very similiar to RELU except negative inputs. They are both in identity function form for non-negative inputs. On the other hand, ELU becomes smooth slowly until its output equal to -α whereas RELU sharply smoothes.

I would like to use LeakyReLU as an activation function. However, this one seems to be different to implement and the keras documentation is not helping me that much compared to how others tend to do.

Although I understand why Relu is most common choice of activation function over sigmoid activation function, I want to know by example or by any model algorithm when LeakyRelu is preferred than Relu activation Function?