More questions about Sigmoid function

Jian-Zheng Zhou

Sep 4, 1996, 3:00:00 AM

Since I posted a question about using the sigmoid function, I have received
several responses. My original question was how to use a sigmoid function
when the output is not in the range between 0 and 1. Some people suggested
that I use a linear function instead of a sigmoid function; others said that
I can use a sigmoid function and scale my output back based on the minimum
and maximum values of the real output variables. However, some people
insisted that a sigmoid function can only be used for output variables whose
real ranges must be between 0 and 1, such as binary or percentage variables.

Here I would like to raise two more questions.

1. If we use a linear function instead of a sigmoid function, do we
still need to scale the input variables into the range between 0 and
1 based on the minimum and maximum values of the variables? (My guess
is no.)

2. I personally prefer the solution which uses a sigmoid function and
scales the output variables back to their original range. Can anyone
further confirm or refute this solution? I would like to know
some references that support or argue against it.

I would appreciate anyone's help.
--
Jian-Zheng Zhou
Research Associate Tel: 301-405-1381(O)
Department of Animal Sciences 301-590-0902(H)
University of Maryland email:jz...@umail.umd.edu
College Park, MD20742 co...@marlowe.umd.edu

Yaron Danon

Sep 7, 1996, 3:00:00 AM

to Jian-Zheng Zhou

Jian-Zheng Zhou wrote:
>
> Here I would like to raise two more questions.
>
> 1. If we use other linear function instead of sigmoid function, do we
> still need to scale input variables into the range between 0 and
> 1 based on minimum and maximum values of the variables? ( my guess
> is not).
>
> 2. I personally prefer the solution which uses sigmoid function and
> scale back the output variables to their original range. Is there any
> one can further confirm or deny this solution. I would like to know
> some references which support or against this solution.
>
>
> I am appreciated anyone's help.
> --

You can try WinNN, which has linear and nonlinear normalization on both
the inputs and the outputs. It is available by anonymous ftp from
ftp.cica.indiana.edu (or one of its mirror sites). Look for winnn97.zip
in pub/pc/programr

Yaron

Greg Heath

Sep 9, 1996, 3:00:00 AM

to co...@csc.umd.edu

On 4 Sep 96 in <50jrj3$k...@hecate.umd.edu>, co...@csc.umd.edu
(Jian-Zheng Zhou) wrote:

>Since I posted a question about using sigmoid function, I have got several
>responses. My original question was that how do we use sigmoid function if
>output is not in the range between 0 and 1. Some people suggested me use
>linear function instead of sigmoid function, ...

I agree.

> ... other people said that I can
>use sigmoid function, and scale back my output based on the minimum and
>maximum values of real output variables.

You could, but why bother with sigmoid complications if you don't need them
to improve learning? If scaling is the only problem, scale linear activation
outputs.

> ... However, some people insisted that
>sigmoid function can only be used for the output variables which real ranges
>must be between 0 and 1 ...

They are wrong if they really used the word "can". However, if they used the
word "should", then it's probably not bad advice.

> ... such as binary or percentage variables.

Percentages? ... Yes. Binary? ... Only in the learning mode. In the operational
mode replace the sigmoids with step functions.
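
A minimal Python sketch of that advice, assuming a single logistic output unit whose net input is already available. The function names and the sample net inputs are hypothetical and chosen only for illustration.

```python
import math


def logistic(z):
    """Logistic sigmoid: the differentiable output used during learning."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Hard threshold: the output used once the network is in operation."""
    return 1 if z >= 0.0 else 0


# Hypothetical net inputs arriving at the output unit of a trained network.
net_inputs = [-3.0, -0.2, 0.4, 5.0]

soft = [logistic(z) for z in net_inputs]  # graded values strictly inside (0, 1)
hard = [step(z) for z in net_inputs]      # binary decisions

# Thresholding the logistic output at 0.5 gives the same decisions,
# because logistic(0) == 0.5 and the logistic function is increasing.
same = [1 if s >= 0.5 else 0 for s in soft]
```

The sigmoid is only needed while gradients are being computed; in operation the step function returns the binary decision directly.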

> Here I would like to raise two more questions.
>
> 1. If we use other linear function instead of sigmoid function, do we
> still need to scale input variables into the range between 0 and
> 1 based on minimum and maximum values of the variables? ( my guess
> is not).

Delete the word "still". Inputs "should" be scaled (and the interval doesn't
have to be [0,1]) if they have drastically different dynamic ranges or if the
size of the values affects arithmetic accuracy. Otherwise, it is just a
convenience. I'm sure this is adequately covered in the FAQ.
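
One common way to do this, sketched here in Python, is to standardize each input variable to zero mean and unit variance. This is an assumed choice for illustration (min-max scaling to any fixed interval would serve the same purpose), and the variable names and data values are made up.

```python
def standardize(column):
    """Shift and scale one input variable to zero mean and unit variance.

    Assumes the column is not constant (otherwise std would be zero).
    Returns the scaled values plus the mean and std needed to scale
    new inputs the same way later.
    """
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column], mean, std


# Hypothetical inputs with drastically different dynamic ranges.
body_weight_g = [41000.0, 52000.0, 47500.0, 60500.0]
feed_ratio = [0.012, 0.019, 0.015, 0.021]

scaled_weight, w_mean, w_std = standardize(body_weight_g)
scaled_ratio, r_mean, r_std = standardize(feed_ratio)

# After scaling, both variables occupy a comparable range, so neither
# dominates the weighted sums purely because of its units.
```

The mean and standard deviation found on the training data have to be kept and applied unchanged to any new inputs.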

> 2. I personally prefer the solution which uses sigmoid function and
> scale back the output variables to their original range. Is there any
> one can further confirm or deny this solution. I would like to know
> some references which support or against this solution.

Why do you want to use sigmoids when you don't have to? Isn't your life
complicated enough?

> I am appreciated anyone's help.
>--
>Jian-Zheng Zhou

Greg
--
Gregory E. Heath he...@ll.mit.edu The views expressed here are
M.I.T. Lincoln Lab (617) 981-2815 not necessarily shared by
Lexington, MA (617) 981-0908(FAX) M.I.T./LL or its sponsors
02173-9185, USA

E. Robert Tisdale

Sep 9, 1996, 3:00:00 AM

Greg Heath <he...@ll.mit.edu> writes:

>On 4 Sep 96 in <50jrj3$k...@hecate.umd.edu>,
>co...@csc.umd.edu (Jian-Zheng Zhou) wrote:

[among other things]

>>... However, some people insisted that sigmoid function can only be used
>>for the output variables which real ranges must be between 0 and 1 ...

>They are wrong if they really used the word "can". However,
>If they used the word "should", then it's probably not bad advice.

I think Jian-Zheng Zhou is referring to my assertion that sigmoidal functions
are only VALID for binary ({0, 1}, {-1, +1}, etc.) outputs. Warren Sarle
took exception to my remark and seemed to imply that he thought that sigmoidal
output functions were valid estimators of probability!

>>... such as binary or percentage variables.

>Percentages? ... Yes. Binary? ... Only in the learning mode.
>In the operational mode replace the sigmoids with step functions.

Does this mean that you believe that sigmoidal output functions are valid
estimators of percentages? If so, please explain what that means to you.

>> Here I would like to raise two more questions.
>>
>> 1. If we use other linear function instead of sigmoid function, do we
>> still need to scale input variables into the range between 0 and 1
>> based on minimum and maximum values of the variables? ( my guess is
>> not).

>Delete the word "still". Inputs "should" be scaled (and the interval doesn't
>have to be [0,1]) if they have drastically different dynamic ranges or the
>size of the values affects arithmetic accuracy. Otherwise, it is just a
>convenience. I'm sure this is adequately covered in the FAQ.

I agree that the inputs should be scaled and the interval doesn't matter.
But I doubt very much that it will have much effect upon the arithmetic
accuracy. The reason for scaling the inputs is to improve the stability
and rate of convergence in the learning algorithm.

>> 2. I personally prefer the solution which uses sigmoid function and
>> scale back the output variables to their original range. Is there
>> any one can further confirm or deny this solution. I would like
>> to know some references which support or against this solution.

>Why do you want to use sigmoids when you don't have to?
>Isn't your life complicated enough?

I agree. Jian-Zheng should think very carefully about what his preferred
solution implies. He must multiply every output by the inverse of its scale
factor to get the correct output when the network is placed in operation.
This means that the network must support some of the computational burden
of an extra layer without benefit of the extra learning capacity that that
layer could provide. But he should also consider the fact that he changes
the relative importance of the outputs when he multiplies each one by a
different scale factor. He should scale them into the same range only if
he believes that they all have the same importance.

Hope this clears things up, Bob Tisdale.

Greg Heath

Sep 9, 1996, 3:00:00 AM

to ed...@cs.ucla.edu

On 9 Sep 96, in <5101ai$l...@delphi.cs.ucla.edu>, ed...@cs.ucla.edu (E. Robert
Tisdale) wrote:

> Greg Heath <he...@ll.mit.edu> writes:

" On 4 Sep 96 in <50jrj3$k...@hecate.umd.edu>,
co...@csc.umd.edu (Jian-Zheng Zhou) wrote:"

>> ... However, some people insisted that sigmoid function can only be used
>> for the output variables which real ranges must be between 0 and 1 ...

" They are wrong if they really used the word "can". However,
If they used the word "should", then it's probably not bad advice."

> I think Jian-Zheng Zhou is referring to my assertion that sigmoidal functions
> are only VALID for binary ({0, 1}, {-1, +1}, etc.) outputs. ...

Absolutely not. Where did you get that idea?

> ... Warren Sarle
> took exception to my remark and seemed to imply that he thought that sigmoidal
> output functions were valid estimators of probability!

Yes, provided certain restrictions apply. I think the restrictions (which
include the type of objective function and the probability distribution of
errors) are covered in the FAQ. If not, search for papers by Jordan and by
Lippmann.

>>... such as binary or percentage variables.

" Percentages? ... Yes. Binary? ... Only in the learning mode.
In the operational mode replace the sigmoids with step functions."

> Does this mean that you believe that sigmoidal output functions are valid
> estimators of percentages? If so, please explain what that means to you.

Properly scaled, sigmoids can be used to represent *any* real-valued outputs
that are known to be *constrained* between finite upper and lower bounds.
Obviously, percentages satisfy this requirement and binary outputs don't.
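
A minimal Python sketch of "properly scaled", assuming the bounds are known in advance. The helper names `to_unit` and `from_unit` and the bounds 20 and 80 are hypothetical.

```python
import math


def logistic(z):
    """Logistic sigmoid with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def to_unit(y, lo, hi):
    """Map a training target from [lo, hi] to [0, 1]."""
    return (y - lo) / (hi - lo)


def from_unit(s, lo, hi):
    """Map a sigmoid output from (0, 1) back to the original range."""
    return lo + (hi - lo) * s


lo, hi = 20.0, 80.0           # hypothetical known bounds on the output
target = 65.0

t = to_unit(target, lo, hi)   # value the sigmoid output is trained towards
restored = from_unit(t, lo, hi)

# Whatever the net input, the rescaled sigmoid output cannot leave [lo, hi].
outputs = [from_unit(logistic(z), lo, hi) for z in (-30.0, 0.0, 30.0)]
```

The rescaling is a fixed linear map applied outside the network, so it adds no trainable parameters.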

>> Here I would like to raise two more questions.
>>
>> 1. If we use other linear function instead of sigmoid function, do we
>> still need to scale input variables in to the range between 0 and 1
>> based on minimum and maximum values of the variables? ( my guess is
>> not).

" Delete the word "still". Inputs "should" be scaled(and the interval doesn't
have to be [0,1]) if they have drastically different dynamic ranges or the
size of the values affects arithmetic accuracy. Otherwise, it is just a
convenience. I'm sure this is adequately covered in the FAQ."

> I agree that the inputs should be scaled and the interval doesn't matter.
> But I doubt very much that it will have much effect upon the arithmetic
> accuracy. ...

On the contrary. It's more basic than NNs: it's a consequence of finite-precision
arithmetic. See a text on numerical analysis. Errors from subtracting or dividing
large numbers that arise from large inputs can overwhelm the results of calculations
that combine small inputs. Also, the disparity in input values will cause large
disparities in the size of weight vector components. This can significantly slow
learning and cause naively coded optimization algorithms to fail.
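
The finite-precision effect can be seen in a few lines of Python. The magnitudes are deliberately exaggerated because Python floats are IEEE double precision; in single precision the same loss sets in above 2**24. The input values and scale factors are made up for illustration.

```python
# Two hypothetical inputs with wildly different magnitudes.
inputs = [1.0e16, 1.0]

# Near 1e16 adjacent doubles are 2 apart, so adding 1.0 is rounded away
# and the small input's contribution is lost from the sum.
lost = (inputs[0] + inputs[1]) - inputs[0]

# The same arithmetic in a different order keeps the small value.
kept = (inputs[0] - inputs[0]) + inputs[1]

# Dividing each input by its own (assumed known) scale first puts both
# on the same footing, and the contribution survives.
scales = [1.0e16, 1.0]
scaled = [x / s for x, s in zip(inputs, scales)]
scaled_sum = (scaled[0] + scaled[1]) - scaled[0]
```

Here `lost` comes out as 0.0 while `kept` and `scaled_sum` come out as 1.0, which is the kind of error that scaling the inputs avoids.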

> ... The reason for scaling the inputs is to improve the stability
> and rate of convergence in the learning algorithm.

Yes. That is the result. But the *causes* of instability and/or slow convergence
are as I have stated.

>> 2. I personally prefer the solution which uses sigmoid function and
>> scale back the output variables to their original range. Is there
>> any one can further confirm or deny this solution. I would like
>> to know some references which support or against this solution.

" Why do you want to use sigmoids when you don't have to?
Isn't your life complicated enough? "

> I agree. Jian-Zheng should think very carefully about what his preferred
> solution implies. He must multiply every output by the inverse of its scale
> factor to get the correct output when the network is placed in operation.
> This means that the network must support some of the computational burden
> of an extra layer without benefit of the extra learning capacity that that
> layer could provide. ...

That's inefficient but not critical.

> ... But he should also consider the fact that he changes
> the relative importance of the outputs when he multiplies each one by a
> different scale factor. ...

Yes. It also affects the relative sizes of the weight vector components.

> ... He should scale them into the same range only if
> he believes that they all have the same importance.

Not necessarily. He could use the *weighted* least squares objective function.
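
A minimal Python sketch of a weighted least squares objective, assuming one fixed weight per output. The function name `weighted_sse` and all numbers are hypothetical.

```python
def weighted_sse(targets, outputs, weights):
    """Sum over the outputs of weight * (target - output) ** 2."""
    return sum(w * (t - o) ** 2 for t, o, w in zip(targets, outputs, weights))


# Two outputs already scaled into the same range.
targets = [0.50, 0.20]
outputs = [0.40, 0.50]

# Equal weights: each output's squared error counts the same.
equal = weighted_sse(targets, outputs, [1.0, 1.0])

# Weighting the first output four times as heavily restores its
# importance without changing how the outputs themselves are scaled.
emphasis = weighted_sse(targets, outputs, [4.0, 1.0])
```

This separates the two decisions: the scaling can be chosen for numerical convenience and the weights for importance.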

Hope this helps.

Greg.

Anna Zueva

Sep 11, 1996, 3:00:00 AM

ed...@cs.ucla.edu (E. Robert Tisdale) writes:
[...]

>I think Jian-Zheng Zhou is referring to my assertion that sigmoidal functions
>are only VALID for binary ({0, 1}, {-1, +1}, etc.) outputs. Warren Sarle
>took exception to my remark and seemed to imply that he thought that sigmoidal
>output functions were valid estimators of probability!

Leaving aside the question of scaling inputs, I see no fundamental
difference between using sigmoidal functions for binary outputs and
probabilistic outputs. The sigmoidal output can be interpreted as a
probabilistic mixture of the two binary modes. Also, it is interesting
to note that the common tanh sigmoidal function is very close to a
Gaussian error function (erf) or the integral of a Gaussian, properly
scaled.
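
The closeness can be checked numerically. The Python sketch below assumes one particular reading of "properly scaled": the argument of erf is multiplied by sqrt(pi)/2 so that both curves have slope 1 at the origin. Other scalings are possible.

```python
import math

# With this factor, erf(scale * x) has slope 1 at x = 0, the same as tanh(x).
scale = math.sqrt(math.pi) / 2.0

# Compare the two curves on a grid from -5 to 5 in steps of 0.01.
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(math.tanh(x) - math.erf(scale * x)) for x in xs)
```

Under this scaling the largest gap is a few hundredths, on curves that span the range from -1 to 1.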

Warren Sarle

Sep 11, 1996, 3:00:00 AM

In article <5101ai$l...@delphi.cs.ucla.edu>, ed...@cs.ucla.edu (E. Robert Tisdale) writes:
|> ...
|> I think Jian-Zheng Zhou is referring to my assertion that sigmoidal functions
|> are only VALID for binary ({0, 1}, {-1, +1}, etc.) outputs. Warren Sarle
|> took exception to my remark and seemed to imply that he thought that sigmoidal
|> output functions were valid estimators of probability!

Yes, sigmoid output activation functions are routinely used by
statisticians to estimate probabilities, and I am frankly baffled
by Bob's objection to such usage.

In logistic regression and similar methods with other types of sigmoids,
the form of the sigmoid function is part of the model specification and
should be validated, just as other parts of the model specification such
as linearity and independence are validated. The logistic function is a
particularly useful variety of sigmoid function, since it is the form
taken by the posterior probability in a discriminant analysis of two
multivariate normal populations with equal covariance matrices, and it
has useful interpretations in terms of log-odds. The inverse of the
logistic function is also the canonical link function for a binomial
distribution. So from a statistical point of view, a logistic function
is the most obvious output activation function to use for estimating
probabilities.
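
The discriminant-analysis claim can be checked numerically. The Python sketch below assumes the one-dimensional special case (two normal populations with a common variance), and the means, standard deviation, and priors are made-up values.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the logistic function: the log-odds of p."""
    return math.log(p / (1.0 - p))


# Hypothetical populations sharing one standard deviation.
mu0, mu1, sigma = 0.0, 2.0, 1.5
prior0, prior1 = 0.7, 0.3

# Weight and bias obtained by simplifying the log-odds of Bayes' rule.
w = (mu1 - mu0) / sigma ** 2
b = (mu0 ** 2 - mu1 ** 2) / (2.0 * sigma ** 2) + math.log(prior1 / prior0)

gaps = []
for x in (-3.0, 0.0, 1.0, 4.0):
    num = prior1 * normal_pdf(x, mu1, sigma)
    bayes = num / (prior0 * normal_pdf(x, mu0, sigma) + num)  # posterior of class 1
    gaps.append(abs(bayes - logistic(w * x + b)))             # logistic of a linear function
    gaps.append(abs(logit(bayes) - (w * x + b)))              # log-odds is linear in x
```

The posterior computed from Bayes' rule agrees with a logistic function of a linear function of x, and its log-odds is that linear function.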

In MLPs, the form of the output activation function is much less
critical due to the universal approximation property of MLPs. It is often
convenient to use a logistic function for the log-odds interpretation,
but many other sigmoid functions will work just as well when you are not
dealing with such convenient situations as multivariate normal
distributions. Non-sigmoidal output activation functions, such as
Gaussians, can also be used to estimate probabilities. It is convenient
to use an output activation function with a range of (0,1) to keep the
log likelihood finite, but this is by no means necessary.

References:

McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models,
2nd ed., London: Chapman & Hall.

Jordan, M.I. (1995), "Why the logistic function? A tutorial
discussion on probabilities and neural networks",
ftp://psyche.mit.edu/pub/jordan/uai.ps.Z

--

Warren S. Sarle SAS Institute Inc. The opinions expressed here
sas...@unx.sas.com SAS Campus Drive are mine and not necessarily
(919) 677-8000 Cary, NC 27513, USA those of SAS Institute.