Simple Finite Difference Methods


Christoph Ortner

unread,
Nov 20, 2014, 10:08:30 AM11/20/14
to julia...@googlegroups.com
I am trying to port 
to Julia for a quick Julia intro for some grad students at my department. Has anybody ported `numgrid` and `delsq` ?

Thanks,
   Christoph

Christoph Ortner

unread,
Nov 20, 2014, 11:54:32 AM11/20/14
to julia...@googlegroups.com
Turns out this was *very* quick to translate from the Matlab codes. If there is any interest I can post my IJulia notebook somewhere.

Stefan Karpinski

unread,
Nov 20, 2014, 12:33:25 PM11/20/14
to Julia Users
Yes, translation from Matlab to Julia is generally pretty easy. Would be cool to take a look at the code you ended up with.

Christoph Ortner

unread,
Nov 20, 2014, 2:23:20 PM11/20/14
to julia...@googlegroups.com
I've put the current version at

Would there be any interest in porting a larger set of Matlab examples to Julia? I wouldn't mind porting a subset of the "Mathematics" section on

This might be a small way I could contribute here, as so far I've mostly been enjoying the benefits of everyone else's hard work :).

    Christoph

Stefan Karpinski

unread,
Nov 20, 2014, 2:28:38 PM11/20/14
to julia...@googlegroups.com
I would be quite wary of material that may be copyright MathWorks. You may not be legally allowed to port it or even download it.

Mauro

unread,
Nov 20, 2014, 2:29:34 PM11/20/14
to julia...@googlegroups.com
The usual copyright issues apply here though: numgrid and delsq are
copyrighted Matlab functions which you are not allowed to copy, much
less post copied code from. Best to re-implement them without looking
at the source code!

Same goes for all of http://uk.mathworks.com/examples/matlab

Christoph Ortner

unread,
Nov 20, 2014, 3:15:51 PM11/20/14
to julia...@googlegroups.com
Too bad. Anyhow, if I continue to develop a similar set of elementary tutorials, I will make them available.
   Christoph

Christoph Ortner

unread,
Nov 20, 2014, 3:17:08 PM11/20/14
to julia...@googlegroups.com
I've also removed the notebook.

Mauro

unread,
Nov 20, 2014, 3:45:01 PM11/20/14
to julia...@googlegroups.com
The whole copyright stuff is a pain! IANAL, but as far as I understand
it:
- it's ok to copy examples which have some open-source licence, probably
needing to keep that licence.
- it's ok to re-do someone else's examples as long as you don't look at
the source code. However, that is probably a bit tricky as examples are
usually mixed with code.
- if unsure, ask the original author. This should work fine for
academics and the like but probably less well for MathWorks.

This could be a PDE example to look into:
http://wiki.scipy.org/PerformancePython

Tutorials/examples would definitely be useful!

Christoph Ortner

unread,
Nov 20, 2014, 4:18:00 PM11/20/14
to julia...@googlegroups.com
Good idea to start from the Python tutorials - I will look into that.
    Christoph

Pontus Stenetorp

unread,
Nov 20, 2014, 11:06:12 PM11/20/14
to julia...@googlegroups.com
On 21 November 2014 05:35, Mauro <maur...@runbox.com> wrote:
>
> - it's ok to re-do someone elses examples as long as you don't look at
> source code. However, that is probably a bit tricky as examples are
> usually mixed with code.

Having been involved with WiFi drivers back in the day, it is insane what
kind of hacks you can pull off to dodge copyright. The Wikipedia
article on "Clean room design" [1] is a pretty good read.

Pontus

[1]: http://en.wikipedia.org/wiki/Clean_room_design

Steven G. Johnson

unread,
Nov 21, 2014, 10:28:40 AM11/21/14
to julia...@googlegroups.com
I prefer to construct multidimensional Laplacian matrices from the 1d ones via Kronecker products, and the 1d ones from -D'*D where D is a 1d first-derivative matrix, to make the structure (symmetric-definite!) and origin of the matrices clearer.   I've posted a notebook showing some examples from my 18.303 class at MIT:


(It would be nicer to construct the sdiff1 function so that the matrix is sparse from the beginning, rather than converting from a dense matrix.  But I was lazy and dense matrices on 1d grids are cheap.)
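This construction can be sketched as follows; the name sdiff1 is taken from the post, while the grid size N, spacing h, and Dirichlet boundary conditions are my assumptions (written for a recent Julia, where sparse-matrix functionality lives in SparseArrays):

```julia
using LinearAlgebra, SparseArrays

# 1d first-difference matrix for N interior grid points with spacing h,
# built densely and then sparsified (the "lazy" route described above)
function sdiff1(N, h)
    D = zeros(N + 1, N)
    for i = 1:N
        D[i, i]     = -1.0 / h
        D[i + 1, i] =  1.0 / h
    end
    return sparse(D)
end

# 1d Dirichlet Laplacian: -D'D is symmetric and negative definite by construction
function laplacian1d(N, h)
    D = sdiff1(N, h)
    return -D' * D
end

# 2d Laplacian on an N x N grid via Kronecker products with the identity
function laplacian2d(N, h)
    L1 = laplacian1d(N, h)
    Id = sparse(1.0I, N, N)
    return kron(Id, L1) + kron(L1, Id)
end
```

For h = 1 the 1d matrix is the familiar tridiagonal (1, -2, 1) stencil, and the 2d matrix inherits symmetry and definiteness from it.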

Christoph Ortner

unread,
Nov 30, 2014, 6:51:31 AM11/30/14
to julia...@googlegroups.com
Belated update to this thread:

I have now finished a first draft of three tutorial-like numerical PDE notebooks; they can be viewed at 
I have two more coming up in the near future, one on spectral methods, the other on an optimisation problem. For the moment, I am using them primarily for my research group to learn Julia, and to show it to colleagues when they are interested.  

Q1: May I use the Julia logo on that website, as well as for any tutorials / courses that I teach based on Julia?

Q2: Eventually I think it would be good to have a "Julia Examples" page such as

Q3: I'd of course be interested in feedback.

Valentin Churavy

unread,
Nov 30, 2014, 2:54:58 PM11/30/14
to julia...@googlegroups.com
Nice work!

Regarding the pretty Julia version of Lennard-Jones MD:

You can shave off another second (on my machine) by not passing in the lj method as a parameter, but calling it directly.

I tried to write an optimized version of your lj_pretty function by analysing it with @profile and rewriting the slow parts. You can see my results here: https://gist.github.com/vchuravy/f42f458717a7a49395a5
I went through it step by step and applied one optimization at a time. You can also see the computation time spent at each line as a comment. Mostly I just removed temporary array allocation and then applied your math optimization.

One question though. In lj_cstyle(x) you calculate dJ = -12.*(t*t - t) * s , shouldn't it be dJ = -12.*(t*t - t) / s?

Kind regards,
Valentin

Valentin Churavy

unread,
Nov 30, 2014, 3:54:40 PM11/30/14
to julia...@googlegroups.com
I found a second error in lj_cstyle

t is calculated wrongly: `1./s*s*s` parses as `((1./s)*s)*s`, which is not the same as `1/s^3`.

It probably should be:

t = 1.0 / (s*s*s)
E += t*t - 2.0*t
dJ = -12.0 * (t*t - t) / s

cstyle is on my machine still two times faster than my optimized variant of lj_pretty.

Best Valentin

Christoph Ortner

unread,
Dec 1, 2014, 8:09:22 AM12/1/14
to julia...@googlegroups.com
Hi Valentin,

This is extremely helpful - thank you for providing these tweaks. I hope you don't mind if I incorporate this.

(And sorry for the typos - I am normally better at testing.)

    Christoph

Steven G. Johnson

unread,
Dec 1, 2014, 10:49:10 AM12/1/14
to julia...@googlegroups.com
I notice that in your finite-element notebook you use:

    T = zeros(Integer, 3, 2*N^2)

You really want to use Int in cases like this (Int = Int64 on 64-bit machines and Int32 on 32-bit machines).  Using Integer means that you have an array of pointers to generic integer containers, whose type must be checked at runtime (e.g. T[i] could be an Int32, Int64, BigInt, etc.).

(You also have very Matlab-like code that allocates zillions of little temporary arrays in your inner loops.  I assume this is intentional, but it is definitely suboptimal performance, possibly by an order of magnitude.)
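A minimal sketch of the Int-vs-Integer distinction (on a recent Julia, where isconcretetype reports whether a type can be stored inline; the variable names are mine):

```julia
# Integer is an abstract type: a Matrix{Integer} holds pointers to boxed
# values whose concrete type (Int32, Int64, BigInt, ...) must be checked
# at run time. Int is a concrete machine type, stored inline in the array.
T_abstract = zeros(Integer, 3, 4)
T_concrete = zeros(Int, 3, 4)

println(isconcretetype(eltype(T_abstract)))   # false: elements are boxed
println(isconcretetype(eltype(T_concrete)))   # true: contiguous machine ints
```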

Valentin Churavy

unread,
Dec 1, 2014, 11:27:33 AM12/1/14
to julia...@googlegroups.com
You are very welcome. I hope that your collection of notebooks continues to grow.

Christoph Ortner

unread,
Dec 1, 2014, 12:14:09 PM12/1/14
to julia...@googlegroups.com
Hi Steven,
Thanks for the feedback - yes it was intentional, but I am definitely planning to extend the notebook with performance enhancements.
I was not yet aware of this distinction between Int and Integer. Thanks for pointing it out.
    Christoph

Jiahao Chen

unread,
Dec 1, 2014, 9:40:54 PM12/1/14
to julia...@googlegroups.com
I took a look at the Lennard-Jones example. On top of the memory allocation from array slicing, there is also a lot of overhead in passing lj(r) as an input function phi. There is also some overhead from computing size(x) in the range for the inner loop.

Consider this version of the code, which is "C style" in that the loops are written out explicitly, but otherwise is not that far from the original lj_pretty:

# definition of the Lennard-Jones potential
function LJ(r)
    r⁻² = sumabs2(r)
    r⁻⁶ = 1.0/(r⁻²*r⁻²*r⁻²)
    J = r⁻⁶*r⁻⁶ - 2.0r⁻⁶
    ∇ᵣJ = -12. * (r⁻⁶*r⁻⁶ - r⁻⁶) / r⁻²
    J, ∇ᵣJ #<--
end

function lj_cstyle2(x)
    d, N = size(x)
    E = 0.0
    ∇E = zeros(x) #note that zeros supports this form too
    r = zeros(d)
    @inbounds for n = 1:(N-1), k = (n+1):N
        for i = 1:d
            r[i] = x[i,k]-x[i,n]
        end
        J, ∇ᵣJ = LJ(r) #<--
        E += J
        for i = 1:d
            ∇E[i, k] += ∇ᵣJ * r[i]
            ∇E[i, n] -= ∇ᵣJ * r[i]
        end
    end
    E, ∇E #return keyword not strictly required
end

For me, lj_cstyle2 runs in 83 ms, compared to 66 ms for lj_cstyle.

There is some overhead from constructing and unpacking the tuples in the two lines marked with  #<--, but I think it is unavoidable if you want to have a single external function that computes both the potential and its gradient.

Note that this version of LJ(r) computes two numbers, not a number and a vector. For any central potential the gradient along r, ∇ᵣJ,  is sufficient to characterize the entire force ∇E,  so there is not much loss of generality in not returning the force in the original coordinates. In this case, the mathematical structure of the problem is not entirely at odds with its implementation in code.

Thanks,

Jiahao Chen
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

Jiahao Chen

unread,
Dec 1, 2014, 9:57:48 PM12/1/14
to julia...@googlegroups.com
* r⁻² should be r², of course.

Note that the following very naive version of LJ(r) 

function LJ(r)
    r² = sumabs2(r)
    J = 1.0/(r²)^6 - 2.0/(r²)^3
    ∇ᵣJ = (-12.0/(r²)^6 + 12.0/(r²)^3) / r²
    return J, ∇ᵣJ
end

results in code that runs in 0.42s, which is not terrible (and such performance would be expected for more complicated Lennard-Jones potentials which are not 12-6).

Thanks,

Jiahao Chen
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

Christoph Ortner

unread,
Dec 2, 2014, 1:02:41 AM12/2/14
to julia...@googlegroups.com

Thank you for these ideas. I particularly like the use of unicode - I may need to learn how to do this.

Having to manually write out all these loops is a bit of a pain. Is @devec not doing the job?

You seem to say that I shouldn't use the return keyword - I like it for readability. But is there a Julia style guide I should start following?

Christoph


Christoph Ortner

unread,
Dec 2, 2014, 1:04:39 AM12/2/14
to julia...@googlegroups.com



> I took a look at the Lennard-Jones example. On top of the memory allocation from array slicing, there is also a lot of overhead in passing lj(r) as an input function phi.

This is counterintuitive to me. Why is this a problem?

 Christoph 

Christoph Ortner

unread,
Dec 2, 2014, 1:13:16 AM12/2/14
to julia...@googlegroups.com

How do you get timings from the Julia profiler, or, even better, percentages? I guess one can convert from the numbers one gets, but it is a bit painful?

Christoph

Tim Holy

unread,
Dec 2, 2014, 6:57:16 AM12/2/14
to julia...@googlegroups.com
By default, the profiler takes one sample per millisecond. In practice, the
timing is quite precise on Linux, seemingly within a factor of twoish on OSX,
and nowhere close on Windows. So at least on Linux you can simply read samples
as milliseconds.

If you want to visualize the relative contributions of each statement, I
highly recommend ProfileView. If you use LightTable, it's already built-in via
the profile() command. The combination of ProfileView and @profile is, in my
(extremely biased) opinion, quite powerful compared to tools I used previously
in other programming environments.

Finally, there's IProfile.jl, which works via a completely different mechanism
but does report raw timings (with some pretty big caveats).
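For reference, the basic sampling workflow looks something like this (on a recent Julia the profiler lives in the Profile standard library; in the 0.3 era it was available from Base; the function name is mine):

```julia
using Profile

function work()
    s = 0.0
    for i in 1:10^7
        s += sqrt(i)
    end
    return s
end

work()                         # run once so compilation is not profiled
Profile.clear()
@profile work()                # collect samples while work() runs
Profile.print(format = :flat)  # one line per call site, with sample counts
```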

Best,
--Tim

Valentin Churavy

unread,
Dec 2, 2014, 8:24:00 AM12/2/14
to julia...@googlegroups.com
Hi Christoph,

If you pass in a function via an argument, the compiler gets relatively little information about that function and cannot do any optimization on it. That is part of the reason why map(f, xs) is currently slower than a for-loop. There are ongoing discussions on how to solve that problem; as an example see https://github.com/JuliaLang/julia/issues/8450
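A sketch of the pattern being discussed (function names are mine, and the potential is a toy stand-in; note that since Julia 0.5, functions have their own types and this penalty has largely disappeared):

```julia
lj(t) = t*t - 2t    # toy potential

# direct call: the compiler sees exactly which function is invoked
function energy_direct(ts)
    E = 0.0
    for t in ts
        E += lj(t)
    end
    return E
end

# the function is a parameter; in Julia 0.3 `phi` was just a generic
# Function, so every call went through dynamic dispatch
function energy_hof(phi, ts)
    E = 0.0
    for t in ts
        E += phi(t)
    end
    return E
end
```

Both give identical results; only the compiled code differs.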

For profiling I used LightTable's ProfileView support and I have to agree with Tim that it is pretty amazing.

Valentin

Christoph Ortner

unread,
Dec 2, 2014, 10:22:46 AM12/2/14
to julia...@googlegroups.com
Dear Tim and Valentin,

Thanks for the feedback - it sounds like I need to give LightTable another try. So far I have even had difficulties setting it up to work properly with Julia.

But I will definitely try to play more with ProfileView in Julia.

   Christoph

Peter Simon

unread,
Dec 2, 2014, 12:57:28 PM12/2/14
to julia...@googlegroups.com
I have also experienced the inaccurate profile timings on Windows.  Is the reason for the bad profiler performance on Windows understood?  Are there plans for improvement?  

Thanks,
--Peter

Tim Holy

unread,
Dec 2, 2014, 12:59:42 PM12/2/14
to julia...@googlegroups.com
I think it's just that Windows is bad at scheduling tasks with short-latency,
high-precision timing, but I am not the right person to answer such questions.

--Tim

Jiahao Chen

unread,
Dec 2, 2014, 1:10:07 PM12/2/14
to julia...@googlegroups.com
> Thank you for these ideas. I particularly like how to use unicode - I may need to learn how to do this.

The manual has some sections on Unicode input using LaTeX-like tab completion syntax.


> Having to manually write out all these loops is a bit of a pain. Is @devec not doing the job?

I haven't really tried.

> You seem to say that I shouldn't use the return keyword - I like for readability. But is there a Julia style guide I should start following?

All I mean is that it's not strictly required if all you do is return the value of the last statement. We don't really have a style guide.

Jameson Nash

unread,
Dec 2, 2014, 1:20:25 PM12/2/14
to julia...@googlegroups.com
Correct. Windows imposes a much higher overhead on just about every aspect of doing profiling. Unfortunately, there isn't much we can do about this, other than to complain to Microsoft. (It doesn't have signals, so we must emulate them with a separate thread. The accuracy of Windows timers is somewhat questionable. And the stack walk library (for recording the backtrace) is apparently just badly written and therefore insanely slow and memory hungry.)

Jameson Nash

unread,
Dec 2, 2014, 1:22:41 PM12/2/14
to julia...@googlegroups.com
Although, over thanksgiving, I pushed a number of fixes which should improve the quality of backtraces on win32 (and make sys.dll usable there)

Tim Holy

unread,
Dec 2, 2014, 2:18:25 PM12/2/14
to julia...@googlegroups.com
On Windows, is there any chance that one could set up a separate thread for
profiling and use busy-wait to do the timing? (I don't even know whether one
thread can snoop on the execution state of another thread.)

--Tim

Jameson Nash

unread,
Dec 2, 2014, 4:54:24 PM12/2/14
to julia...@googlegroups.com
That's essentially what we do now. (Minus the busy wait part). The overhead is too high to run it any more frequently -- it already causes a significant performance penalty on the system, even at the much lower sample rate than linux. However, I suspect the truncated backtraces on win32 were exaggerating the effect somewhat -- that should not be as much of an issue now.

Sure, windows lets you snoop on (and modify) the address space of any process, you just need to find the right handle.

Tim Holy

unread,
Dec 2, 2014, 5:12:37 PM12/2/14
to julia...@googlegroups.com
If the work of walking the stack is done in the thread, why does it cause any
slowdown of the main process?

But of course the time it takes to complete the backtrace sets an upper limit
on how frequently you can take a snapshot. In that case, though, couldn't you
just have the thread always collecting backtraces?

--Tim

Jameson Nash

unread,
Dec 2, 2014, 5:24:45 PM12/2/14
to julia...@googlegroups.com
You can't profile a moving target. The thread must be frozen first to ensure the stack trace doesn't change while attempting to record it

Tim Holy

unread,
Dec 2, 2014, 6:23:09 PM12/2/14
to julia...@googlegroups.com
On Tuesday, December 02, 2014 10:24:43 PM Jameson Nash wrote:
> You can't profile a moving target. The thread must be frozen first to
> ensure the stack trace doesn't change while attempting to record it

Got it. I assume there's no good way to "make a copy and then discard if the
copy is bad"?

--Tim

Jameson Nash

unread,
Dec 2, 2014, 10:14:05 PM12/2/14
to julia...@googlegroups.com
You could copy the whole stack (typically only a few hundred KB, max of 8MB), then do the stack walk offline. If you could change the stack pages to copy-on-write, it may even not be too expensive.

but this is the real problem:

```
|__/                   |  x86_64-linux-gnu
julia> @time for i=1:10^4 backtrace() end
elapsed time: 2.789268693 seconds (3200320016 bytes allocated, 89.29% gc time)
```

```
|__/                   |  x86_64-apple-darwin14.0.0
julia> @time for i=1:10^4 backtrace() end
elapsed time: 2.586410216 seconds (6400480000 bytes allocated, 89.96% gc time)
```

```
jameson@julia:~/julia-win32$ ./usr/bin/julia.exe -E " @time for i=1:10^3 backtrace() end "
fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
fixme:winsock:WS_EnterSingleProtocolW unknown Protocol <0x00000000>
err:dbghelp_stabs:stabs_parse Unknown stab type 0x0a
elapsed time: 22.6314386 seconds (320032016 bytes allocated, 1.51% gc time)
```

```
|__/                   |  i686-w64-mingw32
julia> @time for i=1:10^4 backtrace() end
elapsed time: 69.243275608 seconds (3200320800 bytes allocated, 13.16% gc time)
```

And yes, those gc fractions are verifiably correct. With gc_disable(), they execute in 1/10 of the time. So, that pretty much means you must take 1/100 of the samples if you want to preserve roughly the same slow down. On linux, I find the slowdown to be in the range of 2-5x, and consider that to be pretty reasonable, especially for what you're getting. If you took the same number of samples on windows, it would cause a 200-500x slowdown (give or take a few percent). If you wanted to offload this work to other cores to get the same level of accuracy and no more slowdown than linux, you would need a machine with 200-500 processors (give or take 2-5)!

(I think I did those conversions correctly. However, since I just did them for the purposes of this email, sans calculator, and as I was typing, let me know if I made more than a factor of 2 error somewhere, or just have fun reading https://what-if.xkcd.com/84/ instead)

Tim Holy

unread,
Dec 2, 2014, 10:50:21 PM12/2/14
to julia...@googlegroups.com
Wow, those are pathetically-slow backtraces. Since most of us don't have
machines with 500 cores, I don't see anything we can do.

--Tim

Jameson Nash

unread,
Dec 2, 2014, 11:55:28 PM12/2/14
to julia...@googlegroups.com
(I forgot to mention, that, to be fair, the windows machine that was used to run this test was an underpowered dual-core hyperthreaded atom processor, whereas the linux and mac machines were pretty comparable Xeon and sandybridge machines, respectively. I only gave windows a factor of 2 advantage in the above computation in my accounting for this gap)

Stefan Karpinski

unread,
Dec 3, 2014, 9:11:29 AM12/3/14
to Julia Users
This seems nuts. There have to be good profilers on Windows – how do those work?

Jameson Nash

unread,
Dec 3, 2014, 9:25:26 AM12/3/14
to Julia Users

Tim Holy

unread,
Dec 3, 2014, 10:04:20 AM12/3/14
to julia...@googlegroups.com

Jameson Nash

unread,
Dec 3, 2014, 10:38:02 AM12/3/14
to julia...@googlegroups.com
The suggestion apparently is to use "Event Tracing for Windows" (aka ptrace/dtrace). Yes, that is faster (on linux too...), but it entirely misses the point of profiling user code.

The other offerings typically wrap StackWalk64 (as we do), and complain about how absurdly slow it is.

We used to use RtlCaptureStackBackTrace, but it often failed to give useful backtraces. I think it depends too heavily on walking the EBP pointer chain (which typically doesn't exist on x86_64). As it happens, the remaining suggestions fall into the category of "well, obviously you should just write your own EBP (32-bit base pointer register) pointer chain walk algorithm. here, I'll even write part of it for you" ... which would be very helpful, if RBP (64-bit base pointer register) was used to make stack frame chains (hint: it isn't). And these days, the EBP isn't used to make stack pointer chains on x86 either.

llvm3.5 contains the ability to interpret COFF files, so you could presumably write your own stack-walk algorithm. I don't recommend it, however. You might have to call out to StackWalk anyways to access the Microsoft symbol server (yes, off their network servers) to complete the stack walk symbol lookup correctly.

Stefan Karpinski

unread,
Dec 3, 2014, 10:48:44 AM12/3/14
to Julia Users
Inline image 1

Tim Holy

unread,
Dec 3, 2014, 10:57:28 AM12/3/14
to julia...@googlegroups.com
That's a pretty serious bummer. I can't believe anybody puts up with this.

Should we change the default initialization
https://github.com/JuliaLang/julia/blob/b1c99af9bdeef22e0999b28388597757541e2cc7/base/profile.jl#L44
so that, on Windows, it fires every 100ms or so? And add a note to the Profile
docs?

--Tim

Jameson Nash

unread,
Dec 3, 2014, 11:38:05 AM12/3/14
to julia...@googlegroups.com
Yes, probably (I thought we already had). Someone would need to do some comparison work though first.

Tim Holy

unread,
Dec 3, 2014, 11:46:44 AM12/3/14
to julia...@googlegroups.com
Can somebody on a Windows system report back with the output of
`Profile.init()`?

--Tim

Daniel Høegh

unread,
Dec 3, 2014, 11:53:50 AM12/3/14
to julia...@googlegroups.com
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.3 (2014-11-23 20:19 UTC)
 _/ |\__'_|_|_|\__'_|  |
|__/                   |  x86_64-w64-mingw32

julia> Profile.init()
(1000000,0.001)

julia>

Isaiah Norton

unread,
Dec 3, 2014, 12:01:45 PM12/3/14
to julia...@googlegroups.com
> The accuracy of windows timers is somewhat questionable.

I don't know if 5% is good enough for this purpose, but: one of our collaborators uses (a lightly modified version of) the code below in a real-time imaging application:

Tim Holy

unread,
Dec 3, 2014, 12:44:28 PM12/3/14
to julia...@googlegroups.com
Thanks. Looks like that needs to be changed.

--Tim