Vectorizing tanh?


Julien Cornebise

Jun 25, 2013, 6:05:17 AM6/25/13
to stan-...@googlegroups.com
Dear all,

It seems that few numerical functions are vectorized. In particular, I'm looking for a vectorization of tanh() -- ideally over a matrix, but over a vector would already be a great plus.
This is because a simple 1-8-2 neural net with 100 datapoints is *very* slow to fit [12 hours for 15K iterations], which I suppose can be blamed on the loop required by tanh, judging by the notes from Andrew.

Any idea how difficult it would be to vectorize tanh(), and where I should start? Or would I be better off by replacing it with its Taylor approximation, hence using only additions, pows, and multiplications, which I assume are vectorized already?
I confess that I haven't yet tried the suggestions from https://groups.google.com/forum/#!msg/stan-users/4gv3fNCqSNk/J6ZItL2ZJ-IJ, especially not Matt's trick nor profiling (details on profiling compiled code from R on OS X seem sparse).

Thanks again for the quick feedback of yesterday, Ben and Bob!
All the best,

Julien

Julien Cornebise

Jun 25, 2013, 6:34:40 AM6/25/13
to stan-...@googlegroups.com
Sorry, answering myself: no need to go for Taylor; it suffices to use the exponential formulation of tanh(), since exp() is vectorized over matrices! Not sure how much harder that makes things for the auto-diff, though; maybe for that reason a vectorized tanh() would be better, I do not know.

I will report on timings, and will also try to get Matt's trick to work. In the meantime, suggestions on how to profile and/or how to vectorize tanh() still welcome :)
Thanks !

Julien

Julien Cornebise

Jun 25, 2013, 7:00:18 AM6/25/13
to stan-...@googlegroups.com
it suffices to use the exponential formulation of tanh(), since exp() is vectorized over matrices! Not sure how much harder that makes things for the auto-diff, though; maybe for that reason a vectorized tanh() would be better, I do not know.
I will report on timings. 

Well, the limited timing test is not very encouraging: 100 iterations ran in roughly 60 seconds with
for (n in 1:n_cases) {
  for (k in 1:n_hidden) {
    hidden_nl[k,n] <- tanh(hidden[k,n]);
  }
}


where n_cases = 300 and n_hidden = 8, while the same 100 iterations took longer, about 70 seconds, with the poor man's vectorization:

  hidden_nl <- exp(hidden*2.0);
  hidden_nl <- (hidden_nl-1)./(hidden_nl+1);

I'm rather surprised and disappointed. It goes to show the need for profiling, though that is not a trivial thing to do.

Julien


Ben Goodrich

Jun 25, 2013, 10:16:38 AM6/25/13
to stan-...@googlegroups.com
On Tuesday, June 25, 2013 7:00:18 AM UTC-4, Julien Cornebise wrote:
it suffices to use the exponential formulation of tanh(), since exp() is vectorized over matrices! Not sure how much harder that makes things for the auto-diff, though; maybe for that reason a vectorized tanh() would be better, I do not know.
I will report on timings. 

Well, the limited timing test is not very encouraging: 100 iterations actually ran in roughly 60 seconds with 
for (n in 1:n_cases) {
  for (k in 1:n_hidden) {
    hidden_nl[k,n] <- tanh(hidden[k,n]);
  }
}
There are at least three senses of "vectorizing" in Stan.

  1. Does the function, tanh() in this case, accept a vector as an argument and apply itself elementwise? Apparently not, but that doesn't really affect speed. If there were a tanh() function that took a vector or matrix argument, it would just do what you did above: loop over the rows and columns and call the corresponding scalar function. It would be nice if the user didn't need to type as much, but either way it gets compiled to a C++ loop.
  2. In addition to 1), does the function do something clever with the derivatives? For a lot of the probability distributions, the answer is yes, but not so much for tanh().
  3. Does the function utilize hardware vectorization? This is a major concept for Eigen but doesn't get used much in Stan because the hardware can't vectorize the custom scalar types that Stan uses and Stan currently doesn't do that much with plain doubles.
So, I wouldn't really expect anything to speed tanh() up except by writing a more efficient implementation of tanh(). The existing one

https://raw.github.com/stan-dev/stan/develop/src/stan/agrad/rev/tanh.hpp

is not great because it calculates std::tanh() and then std::cosh() to get the derivative. It might be faster (and should be more stable) if we calculated

a = log(1 - exp(-2x)) == log_diff_exp(0, -2 * x)
b = log(1 + exp(-2x)) == log_sum_exp(0, -2 * x)
tanh(x) == exp(a - b)
cosh(x) == exp(b - LOG2 + x)

Ben

Julien Cornebise

Jun 25, 2013, 11:11:50 AM6/25/13
to stan-...@googlegroups.com
Thanks a lot, Ben, for the answer. It confirms that the problem might lie elsewhere than in tanh. I keep overlooking the fact that the loop is compiled and hence not very costly, and that the difficulties lie in the auto-differentiation.
I have a somewhat stupid follow-up: is there a way to give Stan closed-form derivatives to use (i.e. use NUTS with manual derivatives)? They are easy-ish in neural net models via the so-called "backpropagation". Would that be likely to speed anything up, or does the auto-diff always use the most efficient way of computing the derivatives?

Otherwise, I might just have a posterior with a very weird shape that is simply slow for NUTS to sample from, at least in some regions: I haven't dug enough into NUTS iterations to remember whether the sampling time can vary. In particular, I was surprised that the 15,000 iterations so far took much longer than a simple proportionality rule based on 100 iterations led me to expect.

Michael Betancourt

Jun 25, 2013, 11:18:50 AM6/25/13
to stan-...@googlegroups.com
Backprop is exactly (reverse-mode) autodiff, so there wouldn't be any benefit there. I'm guessing you're running into the same problem that Neal ran into with Bayesian neural networks in his thesis. Because the net is highly correlated, a small change in parameters will yield a huge change in density; but because on average HMC transitions can only vary the log density by (dimension) / 2, the parameter movement will be small and you'll basically have a random walk. Riemannian HMC fixes this automatically, but the quickest solution is to employ the "Matt trick" to remove the correlations in the first place.
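For what it's worth, the "Matt trick" (non-centered parameterization) usually looks something like the following sketch; the names (sigma_w, w_raw, n_hidden) are hypothetical, not from this thread's model:

```stan
parameters {
  real<lower=0> sigma_w;   // hierarchical scale for the weights
  vector[n_hidden] w_raw;  // standardized weights
}
transformed parameters {
  vector[n_hidden] w;
  w <- sigma_w * w_raw;    // implies w ~ normal(0, sigma_w)
}
model {
  w_raw ~ normal(0, 1);    // unit-scale prior decorrelates w and sigma_w
}
```

Sampling the standardized w_raw and scaling deterministically removes the funnel-shaped correlation between the weights and their scale that defeats plain HMC.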



Bob Carpenter

Jun 25, 2013, 11:35:50 AM6/25/13
to stan-...@googlegroups.com


On 6/25/13 11:11 AM, Julien Cornebise wrote:
> ...
> I have a somewhat stupid follow-up: is there a way to give Stan closed-form derivatives to use (i.e. use NUTS with
> manual derivatives), since they are easy-ish in neural net models via the so-called "backpropagation", and would that
> be likely to speed anything up, or does the auto-diff always use the most efficient way of computing the derivatives?

To elaborate a bit on Michael's answer, it depends what the model is.
Our auto-diff is pretty efficient, but there are certainly cases
where it can be made more efficient and more arithmetically stable.

For example, rather than doing poisson(exp(alpha)), we have a
distribution poisson_log(alpha) =def= poisson(exp(alpha)), and we
have customized the derivatives for poisson_log.

The auto-diff needs to be defined in C++ along the same lines
as our other functions.

In the particular case of logistic regression, even if we had
a custom auto-diffed GLM distribution

y ~ logistic_regression(x, beta);

where y is an N-vector, x is an (N x K) predictor matrix,
and beta is a K-vector, even a careful implementation probably
wouldn't be much more efficient than the current idiom:

y ~ bernoulli_logit(x * beta);

The main saving comes from eliminating intermediate expressions by reducing them analytically (partially evaluating them, in computational terms).

- Bob

