Using Julia for real-time astronomy


John leger

May 30, 2016, 6:00:13 AM
to julia-users, Matthew Ozon
Hi everyone,

I am working in astronomy and we are thinking of using Julia for a real-time, high-performance adaptive optics system on a solar telescope.

This is how the system is supposed to work:
   1) the image is read from the camera
   2) some corrections are applied
   3) the atmospheric turbulence is numerically estimated in order to calculate the command to be sent to the deformable mirror

The overall process should be executed in less than 1 ms so that it can be integrated into the chain (closed loop).

Do you think it is possible to do all the computation in Julia, or would it be better to code some parts in C/C++? What I fear the most is the GC, but in our case we can pre-allocate everything, so once we launch the system no more memory will be allocated during the experiment, and it will run for days.

So, what do you think? Considering the current state of Julia, will I be able to get the performance I need? Will the garbage collector be a hindrance?

Thank you.

Uwe Fechner

May 30, 2016, 8:10:39 AM
to julia-users, matthe...@gmail.com
I think that would be difficult.

As soon as you use any packages for image conversion or estimation you have to assume that they use dynamic memory allocation.

The garbage collector of Julia is fast, but not suitable for hard real-time requirements. Implementing a garbage collector for hard real-time applications is possible, but it is a lot of work and will probably not happen in the near future.

There was an issue on this topic that was closed as "won't fix":
https://github.com/JuliaLang/julia/issues/8543

Uwe

Leger Jonathan

May 30, 2016, 8:35:06 AM
to julia...@googlegroups.com
Thanks for the answer.

I don't intend to use any packages, only my own arrays, so I can confirm that I will not have dynamic memory allocation (let's hope I'm right ;) ).
But even in this case Julia itself may do allocations, so my question would rather be: if there is nearly nothing to do, is the GC fast?
I have already read many threads about the GC, and yes, even though there have been very good improvements, are they enough for my case?

In the worst case, Julia will be used for prototyping and will only call a main loop written in C++.

Tamas Papp

May 30, 2016, 8:41:36 AM
to julia...@googlegroups.com
You could test whether the GC is fast enough by implementing the
computational core (using simulated data or something similar), then
just running it. Then if you find it is not acceptable, you haven't
wasted time on writing the code for interfacing with the equipment.
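
For instance, a sketch of such a test (process_frame! here is a made-up stand-in for your computational core):

# Time the core on simulated data and track the worst case over many frames.
process_frame!(f) = (for i in eachindex(f); f[i] *= 2f0; end; nothing)  # stub core

frames = [rand(Float32, 400, 400) for _ in 1:1000]   # simulated camera frames
process_frame!(frames[1])                            # warm-up: force compilation
times = [@elapsed process_frame!(f) for f in frames]
println("worst frame time: ", maximum(times) * 1e3, " ms")   # compare to the 1 ms budget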

Also, you could think about the "cost" of an occasional longer GC run,
and what the acceptable failure rate is. For example, is it a great
concern if you have suboptimal quality or even total loss of every
1000th frame? or 10000th? Of course one would like to have all the data,
but equipment can be down for all sorts of reasons and maybe the GC
hiccups will not be your primary concern.

Best,

Tamas

Tobias Knopp

May 30, 2016, 4:19:34 PM
to julia-users, matthe...@gmail.com
If you are prepared to make your code not perform any heap allocations, I don't see a reason why there should be any issue. When I once worked on a very first multi-threading version of Julia, I wrote exactly such functions that won't trigger GC, since the latter was not thread-safe. This can be hard work, but I would assume that it's at least not more work than implementing the application in C/C++ (assuming that you have some Julia experience).

Tobi

Páll Haraldsson

May 31, 2016, 12:28:54 PM
to julia-users, matthe...@gmail.com
On Monday, May 30, 2016 at 12:10:39 PM UTC, Uwe Fechner wrote:
I think that would be difficult.

As soon as you use any packages for image conversion or estimation you have to assume that they use dynamic memory allocation.

The garbage collector of Julia is fast, but not suitable for hard real-time requirements. Implementing a garbage collector for hard real-time applications is possible, but it is a lot of work and will probably not happen in the near future.

There was an issue on this topic that was closed as "won't fix":
https://github.com/JuliaLang/julia/issues/8543

Well, the "won't fix" label was later taken off the issue.

Yes, the issue is still closed, but it's unclear to me what has changed with the GC, when. I know incremental GC was implemented at some point. No hard-real-time GC is available.

It would be cool to know of Julia in space so I gave this some thought..

I recall from MicroPython that they claimed hard-real-time GC (also available for Java with Metronome), that is, predictable pause times. I remember thinking, how can they do/claim that (and if I recall, they didn't change the GC)? MicroPython is meant for microcontrollers (at the time only one), which have a known amount of memory. I can't locate the information I read at the time; I think they were talking in the megabytes range. Then, in the worst case, you have to scan a fixed amount of memory, and the speed of the CPU is also known. Unlike with MicroPython, you will have an operating system (that is not real-time; Linux can be configured as such, but caches are a problem..). Maybe if you can limit the RAM, or just how much Julia will try to allocate, it helps in the same way.

Anyway, you may not strictly need hard real-time. I think, as always (in non-real-time, non-concurrent GC variants?), garbage collection only happens when you try to allocate memory and it is full. If you preallocate all memory and make sure no more is allocated, I can't see the GC being a problem (you can also disable it for some period of time).

Libc.malloc and Libc.free are also available in Julia..
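
For instance (a sketch; the buffer size is made up, and the wrapping call is spelled pointer_to_array in 0.4-era Julia, unsafe_wrap in later versions):

# A manually managed buffer that the GC never owns.
n = 400 * 400                                          # one 400x400 frame
p = convert(Ptr{Float32}, Libc.malloc(n * sizeof(Float32)))
buf = pointer_to_array(p, n)          # Julia array view over the raw memory
fill!(buf, 0f0)                       # use it like a normal array, in place
# ... the real-time loop works on buf without allocating ...
Libc.free(p)                          # manual deallocation when done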


[Possibly it helps to split your task into more than one process, having only one real-time? If you can have shared memory between two processes, would that help? Be careful with that.. I'm not sure it's a good idea or at least I need to explain it better..]



"Regarding RAM usage, MicroPython can start up with 2KB of heap. Adding stack and required static memory, a 4KB microcontroller could start a MicroController, but hardly could go further than interpreting simple expressions. Thus, 8KB is minimal amount to run simple scripts."

"today I painfully learned, that uPy's automatic garbage collection can really mess up your 500Hz feedback control loop, since it takes forever (>1ms  :o :shock: :cry: )."



Páll Haraldsson

May 31, 2016, 12:44:17 PM
to julia-users, matthe...@gmail.com
On Monday, May 30, 2016 at 8:19:34 PM UTC, Tobias Knopp wrote:
If you are prepared to make your code not perform any heap allocations, I don't see a reason why there should be any issue. When I once worked on a very first multi-threading version of Julia, I wrote exactly such functions that won't trigger GC, since the latter was not thread-safe. This can be hard work, but I would assume that it's at least not more work than implementing the application in C/C++ (assuming that you have some Julia experience).

I would really like to know why the work is hard: is it getting rid of the allocations, or being sure there are no more hidden in your code? I would also like to know whether you can then do the same as in the D language:


"The most reliable way to guarantee latency is to preallocate all data that will be needed by the time critical portion. If no calls to allocate memory are done, the GC will not run and so will not cause the maximum latency to be exceeded.

"It is possible to create a real-time thread by detaching it from the runtime, marking the thread function @nogc, and ensuring the real-time thread does not hold any GC roots. GC objects can still be used in the real-time thread, but they must be referenced from other threads to prevent them from being collected."

That is, would it be possible to make a macro @nogc and mark functions in a similar way? I'm not aware of such a macro being available to disallow allocations. There is a macro, @time, that shows GC activity, but it is not sufficient: seeing no GC activity in one run could be an accident; if you run your code again and memory fills up, you see a different result.

As with D, the GC in Julia is optional. The above @nogc is really the only difference I can think of that is better about their optional memory management. But I'm no expert on D, and I may not have looked too closely.

John leger

Jun 1, 2016, 5:40:54 AM
to julia-users, matthe...@gmail.com
So for now the best is to build a toy program that is equivalent in processing time to the original, and see for myself what I'm able to get.
We have many ideas and many theories about the nature of the GC, so the best is to try.

Páll -> Thanks for the links

Páll Haraldsson

Jun 1, 2016, 5:59:15 PM
to julia-users, matthe...@gmail.com
On Wednesday, June 1, 2016 at 9:40:54 AM UTC, John leger wrote:
So for now the best is to build a toy program that is equivalent in processing time to the original, and see for myself what I'm able to get.
We have many ideas and many theories about the nature of the GC, so the best is to try.

Páll -> Thanks for the links

No problem.

While I did say it would be cool to know of Julia in space, I would hate for the project to fail because of Julia (because of my advice).

I endorse Julia for all kinds of uses; hard real-time (and building operating systems) are where I have doubts.

A. I thought a little more about making a macro @nogc to mark functions, and it's probably not possible. You could, I guess, for one function, as the macro has access to its AST. But what you really want to disallow is that function calling other functions that are not similarly marked. I do not know about metadata on functions and whether a nogc bit could be put in, but even then, in theory, couldn't that function be changed at runtime..?

What you would want is for this nogc property to be statically checked, as I guess D does, but Julia isn't separately compiled by default. Note there is Julia2C, and see


for gory details on compiling Julia.

I haven't looked, but I guess Julia2C does not generate malloc and free, only some malloc substitute in the libjulia runtime. That substitute will allocate and run the GC when needed. These are the calls you want to avoid in your code, and you could maybe grep for them.. There is a Lint.jl tool, but as memory allocation isn't an error it would not flag it; maybe it could be an option..

B. One idea I just had (in the shower..): if @nogc, or just "gc_disable" (note it is deprecated*), disallowed allocations (throwing an exception if one is tried) instead of just postponing them, it would be much easier to test whether your code allocates or calls code that does. Still, you would have to check all code paths..

C. Ada, or the Spark subset, might be the go-to language for hard real-time. Rust seems also good, just not as tried. D could also be an option with @nogc. And then there are C and especially C++, which I try to avoid recommending.

D. Do tell if you only need soft real-time, it makes the matter so much simpler.. not just programming language choice..

*
help?> gc_enable
search: gc_enable

  gc_enable(on::Bool)

  Control whether garbage collection is enabled using a boolean argument (true for enabled, false for disabled). Returns previous GC state. Disabling
  garbage collection should be used only with extreme caution, as it can cause memory use to grow without bound.

Cedric St-Jean

Jun 1, 2016, 6:39:30 PM
to julia-users, matthe...@gmail.com
Apparently, ITA Software (Orbitz) was written nearly entirely in Lisp, with zero heap allocation during runtime to get performance guarantees. It's pretty inspiring, in an I-crossed-the-Himalayas-barefoot kind of way.

Leger Jonathan

Jun 2, 2016, 3:55:03 AM
to julia...@googlegroups.com, matthe...@gmail.com
Páll: don't worry about the project failing because of YOUUUUUU ;) In any case, we wanted to try Julia and see if we could get help/tips from the community.
About the nogc: I wonder if activating it will also prevent the core of Julia from being garbage collected? If so, for a long run it's a bad idea to disable it for too long.

For now the only options are C/C++ and Julia, sorry, no D or Lisp :) Why would you not recommend C for this kind of task?
And I said 1000 images/sec, but the camera may be able to go up to 10,000 images/sec, so I think we can define it as hard real time.

Thank you for all these ideas !

Cedric St-Jean

Jun 2, 2016, 7:53:39 AM
to julia...@googlegroups.com
John: Common Lisp and Julia have a lot in common. I didn't mean to suggest writing your software in Lisp; I meant that if ITA was able to run a hugely popular website involving a complicated optimization problem without triggering the GC, then you can do the same in Julia. Like others have suggested, you just preallocate everything (global const arrays) and make sure that every code path is run once (to force compilation) before the system goes online. @time will tell you if you've been successful at eliminating everything. You might run into issues with libraries allocating during their calls, and it might be easier all things considered in C, but it's certainly doable with enough effort in Julia. I might be up for helping out, if you're interested.
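
A minimal sketch of that pattern, with made-up names and sizes:

const IMG = zeros(Float32, 400, 400)    # preallocated input frame
const OUT = zeros(Float32, 400, 400)    # preallocated output buffer

function step!(out, img)                # one loop iteration, fully in place
    @inbounds for i in eachindex(img)
        out[i] = 2f0 * img[i]           # stand-in for the real correction
    end
    return nothing
end

step!(OUT, IMG)          # run every code path once: forces compilation
@time step!(OUT, IMG)    # should now report 0 allocations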

Páll Haraldsson

Jun 3, 2016, 12:28:58 PM
to julia-users, matthe...@gmail.com
On Thursday, June 2, 2016 at 7:55:03 AM UTC, John leger wrote:
Páll: don't worry about the project failing because of YOUUUUUU ;) In any case, we wanted to try Julia and see if we could get help/tips from the community.

Still, feel free to ask me anytime. I just do not want to give bad professional advice or oversell Julia.
 
About the nogc: I wonder if activating it will also prevent the core of Julia from being garbage collected? If so, for a long run it's a bad idea to disable it for too long.

Not really,* see below.
 
For now the only options are C/C++ and Julia, sorry, no D or Lisp :) Why would you not recommend C for this kind of task?
And I said 1000 images/sec, but the camera may be able to go up to 10,000 images/sec, so I think we can define it as hard real time.

Not really. There's a hard and fast definition of hard real-time (and real-time in general): it's not about speed, it's about timely actions. That said, 10,000 images/sec is a lot.. about 9 GB of uncompressed data per second, assuming gray-scale byte-per-pixel megapixel resolution. You would fill up the 2 TB SSDs I've seen advertised [I don't know about radiation-hardening those, I guess anything is possible, do you know anything about the potential hardware used?] in three and a half minutes.

How fast are the down-links on these satellites? Would you get all the [processed] data down to Earth? If you cannot, do you pick and choose the framerate and/or which period of time to "download"? Since I'm sure you want lossless compression, it seems http://flif.info/ might be of interest to you. [FLIF should really be wrapped as a Julia library.. There's also a native executable that could do, while maybe not suitable for you/real-time, by invoking a separate process.] FLIF was GPL-licensed; that shouldn't be a problem for government work, and should be even more of a non-issue now [for anybody].


You can see from here:

https://github.com/JuliaLang/julia/pull/12915#issuecomment-137114298

that soft real-time was proposed for the NEWS section, and even that proposal was shot down. That may have been overly cautious for the incremental GC, and I've seen audio (which is more latency-sensitive than video, at the usual frame rates..) being talked about as working in some thread, and software-defined radio being discussed as a Julia project.


* "About the nogc": if you meant the function to disable the GC, then it doesn't block allocations (but my proposal did), it only postpones deallocations. There is no @nogc macro; my proposal for @nogc to block allocations was only a proposal, and rethinking it, not really too helpful. It was a fail-fast debugging proposal, but as @time does show allocations (or none when there are none), not just GC activity, it should do for debugging. I did a test:

[Note, this has to be in a function, not in the global scope:]

julia> function test()
         @time a=[1 2 3]
         @time a[1] = 2
       end
test (generic function with 1 method)

julia> test()
  0.000001 seconds (1 allocation: 96 bytes)
  0.000000 seconds

You want to see something like the latter result, not the former, not even with "1 allocation". It seems innocent enough, as there is no GC activity (then there would be more text), but that is just an accident. When garbage accumulates, even one allocation can set off a GC and lots of deallocations, and take an unbounded amount of time in naive GC implementations. Incremental means it's not that bad, but still theoretically unbounded time, I think.

I've seen it recommended to disable the GC periodically, such as in games with Lua, collecting after each drawn frame ("vblank"). That scenario is superficially similar to yours. I'm however skeptical of that approach as a general idea, if you do not minimize allocations. Note that in games, the heavy lifting is done by game engines, almost exclusively written in C++. As they do not use GC (while GC IS optional in C++ and C), Lua handles the game logic with probably much less memory allocated, so it works OK there, postponing deallocations while [potentially] taking MORE cumulative time later, at a convenient time.

Why do I say more? The issue of running out of RAM because of garbage isn't the only one. NOT deallocating early prevents reusing memory that is currently in the cache, and reusing it would have helped for cache purposes.
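
For concreteness, here is that per-frame pattern sketched in Julia (stub workload, 0.4-era gc_enable/gc names; again, not an endorsement):

state = zeros(Float32, 10^6)
work!(s) = (for i in eachindex(s); s[i] += 1f0; end; nothing)   # per-frame logic stub

function run_frames!(s, n)
    gc_enable(false)                              # GC off while a frame is produced
    for frame in 1:n
        work!(s)                                  # must allocate little or nothing
        gc_enable(true); gc(); gc_enable(false)   # collect at the "vblank"
    end
    gc_enable(true)                               # leave the GC on afterwards
end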

By recommending FLIF, I've actually recommended using C++ indirectly, and reusing C++ (or C) code isn't bad, all things equal. It's just that for new code I recommend not using C, for many reasons such as safety, nor C++, as it's a complex language, too easy to "blow your leg off" with, to quote its designer.. and in both cases there are better languages, with some rare exceptions that do not apply here (except one reason, reusing code, that MAY apply here).

I believe lossy compression such as JPEG (even MPEG etc., at least on average) has a consistent performance profile. But you wouldn't want to use lossy. In general, lossless cannot guarantee any compression, while in practice you would always get some. That makes me wonder if any kind of lossless is compatible with [hard] real-time.. It's probably hard to know the (once in a million) worst-case (or just bad) time complexity..

If it is sufficient for you to be OK with missing some frames infrequently, the problem is no longer hard real-time; I understand that is called soft real-time. You should still be able to know if some frame is missed, such as by a timestamp. I haven't thought through whether missing some frame, and not knowing, would screw up some black-hole video analysis for Hawking. I'm only an amateur physicist; I've still not gotten QM to work with general relativity, so I'm not sure about the Hawking radiation theory, and what a missed frame could do.

Why D seems a better language than C and C++ is in part that you can avoid the GC (and still have a better language), but also that you can use the GC! Using @nogc ensures at compile time that no future maintenance of your code will add some accidental allocation and then a GC [pause]. It isn't really that you can't avoid the GC in Julia; it's the possibility that you add something, say logging, and forget the allocations it brings..


https://en.wikipedia.org/wiki/Ariane_5

The rocket blew up, in part because of failed maintenance of software, despite the "safe" language Ada: requirements changed and the software should have been changed, but wasn't.


Linus Torvalds, on his own Linux kernel (this may be outdated; a real-time kernel is now available, but it's not the default, just read the fine print there):

http://yarchive.net/comp/linux/rtlinux.html

"Can we make the whole kernel truly hard-RT? Sure, possible in theory. In
practice? No way, José. It's just not mainline enough."

Note what he says about CPUs with caches (all modern CPUs.. even some microcontrollers, those without wouldn't be fast enough anyway..). Silicon Graphics had real-time I/O capabilities in their filesystem:

https://en.wikipedia.org/wiki/XFS
"A feature unique to XFS is the pre-allocation of I/O bandwidth at a pre-determined rate, which is suitable for many real-time applications; however, this feature was supported only on IRIX, and only with specialized hardware."

This isn't, I guess, too much of a problem, as [XFS was for spinning disks and] you just do not do any concurrent I/O. SSDs could have some issues, do not trust them blindly.. Similarly, with the Linux kernel (or any kernel), you should NOT run many processes. Real-time operating systems are meant to solve that problem. You can't get down to one process, but it might be close enough.


While googling for XFS I found [might be interesting]:
http://moss.csc.ncsu.edu/~mueller/rt/rt05/readings/g7/


Mostly unread [in addition to the below, IBM's Metronome GC allows hard real-time without having to avoid the GC], but at least interesting (note that real-time Java dates back to 1998, though not quite to when Java was first public; I recall real-time use being disallowed in the license, if I recall for nuclear reactors..):

http://www.oracle.com/technetwork/articles/java/nilsen-realtime-pt1-2264405.html

"Learn why Java SE is a good choice for implementing real-time systems, especially those that are large, complex, and dynamic.

Published August 2014

[..]
The presented methods and techniques have been proven in many successfully deployed Java SE applications, including a variety of telecommunications infrastructure devices; automation of manufacturing processes, ocean-based oil drilling rigs, and fossil fuel power plants; multiple radar systems; and the modernization of the US Navy's Aegis Warship Weapons Control System with enhanced ballistic missile defense capabilities.

Note: The full source code for the sample application described in this article is available here.
[..]

Java SE Versus Other Languages

The use of Java SE APIs in the implementation of real-time systems is most appropriate for soft real-time development. Using Java SE for hard real-time development is also possible, but generally requires the use of more specialized techniques such as the use of NoHeapRealtimeThread abstractions, as described in the Real-Time Specification for Java (JSR 1), or the use of the somewhat simpler ManagedSchedulable abstractions of the Safety Critical Java Technology specification (JSR 302).

[..]

Projects that can be implemented entirely by one or two developers in a year's time are more likely to be implemented in a less powerful language such as C or C++

[..]

About the Author

As Chief Technology Officer over Java at Atego Systems—a mission- and safety-critical solutions provider—Dr. Kelvin Nilsen oversees the design and implementation of the Perc Ultra virtual machine and other Atego embedded and real-time oriented products. Prior to joining Atego, Dr. Nilsen served on the faculty of Iowa State University where he performed seminal research on real-time Java that led to the Perc family of virtual machine products."


John leger

Jun 6, 2016, 5:41:29 AM
to julia-users, matthe...@gmail.com
Since it seems you have a good overview of this domain, I will give more details:
We are working in signal processing, especially image processing. The goal here is just the adaptive optics: we just want to stabilize the image, not produce the final image.
The consequence is that we will not store anything on the hard drive: we read an image, process it and destroy it. We stay in RAM all the time.
The processing is done by our own algorithms, so for now there is no need for any external library (and I don't see any reason for that to change).

First I would like to apologize: just after posting my answer I went to Wikipedia to look up the difference between soft and hard real time.
I should have done it before, so that you didn't have to spend more time explaining.

In the end I still don't know if I need hard real time or soft real time: the timing is given by the camera speed, and the processing should be done between the acquisition of two images.
We don't want to miss an image or delay the processing; I still need to clarify the consequences of a delay or of a missed image.
For now let's just say that we can miss some images, so we want soft real time.

I'm making a benchmark that should match the system in terms of complexity. These are my first remarks:

When you say that one allocation is unacceptable, I can confirm it's shockingly true: in my case I had 2 allocations done by
    A += 1 where A is an array
and in 7 seconds I had 600k allocations.
Moral: in a closed loop you cannot accept any allocation, so you have to write out all loops explicitly.
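
Concretely, the explicit version looks like this (a sketch):

# A += 1 rebinds A to a freshly allocated array on every call.
# The explicit loop below does the same work with zero allocations:
function addone!(A)
    @inbounds @simd for i in 1:length(A)
        A[i] += 1f0
    end
    return A
end

A = zeros(Float32, 400, 400)
addone!(A)          # warm-up (compilation)
@time addone!(A)    # reports 0 allocations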

I have two problems now:

1/ Many times, the first run, which includes compilation, was the fastest, and then every other run was slower by a factor of 2.
2/ If I relaunch the main function (which is in a module) many times, some runs are very different (slower) from the previous ones.

About 1/: although I find it strange, I don't really care.
2/ is far more problematic: once the code is compiled, I want it to behave the same no matter how many times it is launched.
I have some ideas why, but no certainty. What bothers me the most is that all the runs in the benchmark will be slower; it's not a temporary slowdown, the whole current benchmark will be slower.
If I launch again, it will be back to the best performance.

Thank you for the links, they are very interesting and I will keep them in mind.

Note: I disabled hyperthreading and overclocking, so it should not be the CPU doing funky things.

Islam Badreldin

Jun 6, 2016, 12:45:35 PM
to julia-users, matthe...@gmail.com
Hi John,

I am currently pursuing a similar effort. I got a GPIO pin on the BeagleBone Black embedded board toggling in hard real time and verified the jitter with an oscilloscope. For that, I used a vanilla Linux 4.4.11 kernel with the PREEMPT_RT patch applied. I also released an initial version of a Julia package that wraps the clock_nanosleep() and clock_gettime() functions from the POSIX real-time extensions. Please see this other thread:
https://groups.google.com/forum/#!topic/julia-users/0Vr2rCRwJY4
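
For reference, the kind of raw ccall such a wrapper builds on, sketched here assuming Linux (CLOCK_MONOTONIC = 1, timespec = two C longs):

immutable TimeSpec          # `struct` in current Julia
    sec::Clong
    nsec::Clong
end

const CLOCK_MONOTONIC = Cint(1)   # Linux value

function monotonic_ns()
    ts = Ref(TimeSpec(0, 0))
    ret = ccall(:clock_gettime, Cint, (Cint, Ref{TimeSpec}), CLOCK_MONOTONIC, ts)
    ret == 0 || error("clock_gettime failed")
    return ts[].sec * 1_000_000_000 + ts[].nsec
end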

I tested that package both on an Intel-based laptop and on the BeagleBone Black. I am giving some of the relevant details below..


On Monday, June 6, 2016 at 5:41:29 AM UTC-4, John leger wrote:
Since it seems you have a good overview of this domain, I will give more details:
We are working in signal processing, especially image processing. The goal here is just the adaptive optics: we just want to stabilize the image, not produce the final image.
The consequence is that we will not store anything on the hard drive: we read an image, process it and destroy it. We stay in RAM all the time.
The processing is done by our own algorithms, so for now there is no need for any external library (and I don't see any reason for that to change).

First I would like to apologize: just after posting my answer I went to Wikipedia to look up the difference between soft and hard real time.
I should have done it before, so that you didn't have to spend more time explaining.

In the end I still don't know if I need hard real time or soft real time: the timing is given by the camera speed, and the processing should be done between the acquisition of two images.
We don't want to miss an image or delay the processing; I still need to clarify the consequences of a delay or of a missed image.
For now let's just say that we can miss some images, so we want soft real time.

The real-time performance you are after could be 95% hard real-time. See e.g. here: https://www.osadl.org/fileadmin/dam/rtlws/12/Brown.pdf
 

I'm making a benchmark that should match the system in terms of complexity. These are my first remarks:

When you say that one allocation is unacceptable, I can confirm it's shockingly true: in my case I had 2 allocations done by
    A += 1 where A is an array
and in 7 seconds I had 600k allocations.
Moral: in a closed loop you cannot accept any allocation, so you have to write out all loops explicitly.

Yes, try to completely avoid memory allocations while developing your own algorithms in Julia. Pre-allocation and in-place operations are your friends! The example script available in the POSIXClock package shows one way to do this (https://github.com/ibadr/POSIXClock.jl/blob/master/examples/rt_histogram.jl). The real-time section of the code is marked by a ccall to mlockall(), in order to cause immediate failure upon memory allocation in the real-time section. You can also use the --track-allocation option to hunt down memory allocations while developing your algorithm. See e.g. http://docs.julialang.org/en/release-0.4/manual/profile/#man-track-allocation
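
The guard itself is tiny; a sketch, assuming the Linux values MCL_CURRENT = 1 and MCL_FUTURE = 2:

const MCL_CURRENT = Cint(1)   # lock pages currently mapped
const MCL_FUTURE  = Cint(2)   # and any pages mapped from now on

# With a tight memlock rlimit, faulting in new pages after this call fails
# immediately instead of silently paging or growing the heap:
ret = ccall(:mlockall, Cint, (Cint,), MCL_CURRENT | MCL_FUTURE)
ret == 0 || error("mlockall failed (check ulimit -l)")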
 

I have two problems now:

1/ Many times, the first run, which includes compilation, was the fastest, and then every other run was slower by a factor of 2.
2/ If I relaunch the main function (which is in a module) many times, some runs are very different (slower) from the previous ones.

About 1/: although I find it strange, I don't really care.
2/ is far more problematic: once the code is compiled, I want it to behave the same no matter how many times it is launched.
I have some ideas why, but no certainty. What bothers me the most is that all the runs in the benchmark will be slower; it's not a temporary slowdown, the whole current benchmark will be slower.
If I launch again, it will be back to the best performance.

Thank you for the links, they are very interesting and I will keep them in mind.

Note: I disabled hyperthreading and overclocking, so it should not be the CPU doing funky things.



Regarding these two issues, I encountered similar ones. Are you running on an Intel-based computer? I had to do many tweaks to get acceptable real-time performance with Intel processors. Many factors could be at play. As you said, you have to make sure hyper-threading is disabled and not overclock the processor. Also, monitor the kernel dmesg log for any errors or warnings regarding RT throttling or local_softirq_pending.

Additionally, I had to use the following options on the Linux command line (passed from the bootloader):

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

Together with removing the intel_powerclamp kernel module (sudo rmmod intel_powerclamp). Caution: be extremely careful with such a configuration, as it disables many power-saving features in the processor and can potentially overheat it. Keep an eye on the kernel dmesg log and try to monitor the CPU temperature.

I also found it useful to isolate one CPU core using the isolcpus=1 kernel command-line option and then set the affinity of the real-time Julia process to run on that isolated CPU (using the taskset command). This way, you can almost guarantee that the Linux kernel and all other user-space processes will not run on that isolated CPU, so it becomes wholly dedicated to the real-time Julia process. I am planning to post more details in the POSIXClock package in the near future.

Best,
Islam

John leger

Jun 7, 2016, 5:26:32 AM
to julia-users, matthe...@gmail.com
Hi Islam,

I like the definition of 95% hard real time; it suits my needs. Thanks for this good paper.


On Monday, June 6, 2016 at 18:45:35 UTC+2, Islam Badreldin wrote:
Hi John,

I am currently pursuing a similar effort. I got a GPIO pin on the BeagleBone Black embedded board toggling in hard real time and verified the jitter with an oscilloscope. For that, I used a vanilla Linux 4.4.11 kernel with the PREEMPT_RT patch applied. I also released an initial version of a Julia package that wraps the clock_nanosleep() and clock_gettime() functions from the POSIX real-time extensions. Please see this other thread:
https://groups.google.com/forum/#!topic/julia-users/0Vr2rCRwJY4

I tested that package both on an Intel-based laptop and on the BeagleBone Black. I am giving some of the relevant details below..

On Monday, June 6, 2016 at 5:41:29 AM UTC-4, John leger wrote:
Since it seems you have a good overview of this domain, I will give more details:
We are working in signal processing, especially image processing. The goal here is just the adaptive optics: we just want to stabilize the image, not produce the final image.
The consequence is that we will not store anything on the hard drive: we read an image, process it and destroy it. We stay in RAM all the time.
The processing is done by our own algorithms, so for now there is no need for any external library (and I don't see any reason for that to change).

First I would like to apologize: just after posting my answer I went to Wikipedia to look up the difference between soft and hard real time.
I should have done it before, so that you didn't have to spend more time explaining.

In the end I still don't know if I need hard real time or soft real time: the timing is given by the camera speed, and the processing should be done between the acquisition of two images.
We don't want to miss an image or delay the processing; I still need to clarify the consequences of a delay or of a missed image.
For now let's just say that we can miss some images, so we want soft real time.

The real-time performance you are after could be 95% hard real-time. See e.g. here: https://www.osadl.org/fileadmin/dam/rtlws/12/Brown.pdf
 

I'm making a benchmark that should match the system in terms of complexity. These are my first remarks:

When you say that one allocation is unacceptable, I can confirm it's shockingly true: in my case I had 2 allocations done by
    A += 1 where A is an array
and in 7 seconds I had 600k allocations.
Moral: in a closed loop you cannot accept any allocation, so you have to write out all loops explicitly.

Yes, try to completely avoid memory allocations while developing your own algorithms in Julia. Pre-allocation and in-place operations are your friends! The example script available in the POSIXClock package shows one way to do this (https://github.com/ibadr/POSIXClock.jl/blob/master/examples/rt_histogram.jl). The real-time section of the code is marked by a ccall to mlockall(), in order to cause immediate failure upon memory allocation in the real-time section. You can also use the --track-allocation option to hunt down memory allocations while developing your algorithm. See e.g. http://docs.julialang.org/en/release-0.4/manual/profile/#man-track-allocation
 

I discovered --track-allocation not so long ago and it is a good tool. For now I think I will rely on tracking allocations manually. I am a little afraid of using mlockall(): in soft or hard real time, crashing (failure) is not a good option for me...
Since you are talking about --track-allocation, I have a question:


        - function deflat(v::globalVar)
        0     @simd for i in 1:v.len_sub
        0         @inbounds v.sub_imagef[i] = v.flat[i]*v.image[i]
        -     end
        -
        0     @simd for i in 1:v.len_ref
        0         @inbounds v.ref_imagef[i] = v.flat[i]*v.image[i]
        -     end
        0     return
        - end
        -
        - # get min max
        - # apply norm_coef
        - # MORE TO DO HERE
        - function normalization(v::globalVar)
        0     min::Float32 = Float32(4095)
        0     max::Float32 = Float32(0)
        0     tmp::Float32 = Float32(0)
        0     norm_fact::Float32 = Float32(0)
        0     norm_coef::Float32 = Float32(0)
        -     # find min max
        0     @simd for i in 1:v.nb_mat
        0         # Doing something with no allocs
        0     end
        0 end
        0
  1226415 # SAD[70] 16x16 of Ref_Image over Sub_Image[60]
        - function correlation_SAD(v::globalVar)
        0
        - end
        -

In the .mem output file I have this information: at the end of normalization I have no allocations, yet in front of the SAD comment, before the empty correlation function, I have 1,226,415 allocations.
It would be logical for these allocations to have happened in normalization, but why does the count appear here, between two functions?
 

I have two problems now:

1/ Many times, the first run, which includes compilation, was the fastest, and then every other run was slower by a factor of 2.
2/ If I relaunch the main function (which is in a module) many times, some runs are very different (slower) from the previous ones.

About 1/: although I find it strange, I don't really care.
2/ is far more problematic: once the code is compiled, I want it to behave the same no matter how many times it is launched.
I have some ideas why, but no certainty. What bothers me the most is that all the runs in the benchmark will be slower; it's not a temporary slowdown, the whole current benchmark will be slower.
If I launch again, it will be back to the best performance.

Thank you for the links, they are very interesting and I will keep them in mind.

Note: I disabled hyperthreading and overclocking, so it should not be the CPU doing funky things.



Regarding these two issues, I encountered similar ones. Are you running on an Intel-based computer? I had to do many tweaks to get acceptable real-time performance with Intel processors. Many factors could be at play. As you said, you have to make sure hyper-threading is disabled and not overclock the processor. Also, monitor the kernel dmesg log for any errors or warnings regarding RT throttling or local_softirq_pending.

Additionally, I had to use the following options on the Linux command line (passed from the bootloader):

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

Together with removing the intel_powerclamp kernel module (sudo rmmod intel_powerclamp). Caution: be extremely careful with such a configuration, as it disables many power-saving features in the processor and can potentially overheat it. Keep an eye on the kernel dmesg log and try to monitor the CPU temperature.

I also found it useful to isolate one CPU core using the isolcpus=1 kernel command-line option and then set the affinity of the real-time Julia process to run on that isolated CPU (using the taskset command). This way, you can almost guarantee that the Linux kernel and all other user-space processes will not run on that isolated CPU, so it becomes wholly dedicated to the real-time Julia process. I am planning to post more details in the POSIXClock package in the near future.


I do have an Intel processor indeed, and thanks for all the tips; I will first try to isolate a CPU, then disable the Intel options.
 
Best,
Islam


Again thanks a lot for all the help.
 

Islam Badreldin

Jun 7, 2016, 8:31:06 PM
to julia-users, matthe...@gmail.com
Hi John,

Please see below ..

Yes, I noticed the same thing when I used --track-allocation=user. The following lines from the manual solved the puzzle:
"In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself. More significantly, JIT-compilation also adds to allocation counts, because much of Julia’s compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters."
http://docs.julialang.org/en/release-0.4/manual/profile/#memory-allocation-analysis
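
So the workflow looks like this (a sketch; main_loop and v stand for your own function and data):

$ julia --track-allocation=user
julia> main_loop(v)                  # first run: triggers JIT compilation
julia> Profile.clear_malloc_data()   # reset all allocation counters
julia> main_loop(v)                  # only this run is counted
# quit, then inspect the generated *.mem files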


 
 

I have two problems now:

1/ Many times, the first run, which includes compilation, was the fastest, and then every other run was slower by a factor of 2.
2/ If I relaunch the main function (which is in a module) many times, some runs are very different (slower) from the previous ones.

About 1/: although I find it strange, I don't really care.
2/ is far more problematic: once the code is compiled, I want it to behave the same no matter how many times it is launched.
I have some ideas why, but no certainty. What bothers me the most is that all the runs in the benchmark will be slower; it's not a temporary slowdown, the whole current benchmark will be slower.
If I launch again, it will be back to the best performance.

Thank you for the links, they are very interesting and I will keep them in mind.

Note: I disabled hyperthreading and overclocking, so it should not be the CPU doing funky things.



Regarding these two issues, I encountered similar ones. Are you running on an Intel-based computer? I had to do many tweaks to get acceptable real-time performance with Intel processors. Many factors could be at play. As you said, you have to make sure hyper-threading is disabled and not overclock the processor. Also, monitor the kernel dmesg log for any errors or warnings regarding RT throttling or local_softirq_pending.

Additionally, I had to use the following options on the Linux command line (passed from the bootloader):

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

Together with removing the intel_powerclamp kernel module (sudo rmmod intel_powerclamp). Caution: be extremely careful with such a configuration, as it disables many power-saving features in the processor and can potentially overheat it. Keep an eye on the kernel dmesg log and try to monitor the CPU temperature.

I also found it useful to isolate one CPU core using the isolcpus=1 kernel command-line option and then set the affinity of the real-time Julia process to run on that isolated CPU (using the taskset command). This way, you can almost guarantee that the Linux kernel and all other user-space processes will not run on that isolated CPU, so it becomes wholly dedicated to the real-time Julia process. I am planning to post more details in the POSIXClock package in the near future.


I do have an Intel processor indeed, and thanks for all the tips; I will first try to isolate a CPU, then disable the Intel options.
 
Best,
Islam


Again thanks a lot for all the help.
 

You're welcome!

Cheers,
Islam

Páll Haraldsson

Jun 8, 2016, 11:33:18 AM
to julia-users, matthe...@gmail.com
On Monday, June 6, 2016 at 9:41:29 AM UTC, John leger wrote:
Since it seems you have a good overview of this domain, I will give more details:
We are working in signal processing, especially image processing. The goal here is just the adaptive optics: we just want to stabilize the image, not produce the final image.
The consequence is that we will not store anything on the hard drive: we read an image, process it and destroy it. We stay in RAM all the time.
The processing is done by our own algorithms, so for now there is no need for any external library (and I don't see any reason for that to change).

I completely misread/missed reading point 3) about the "deformable mirror"; I see now it's a down-to-earth project - literally.. :)

Still, glad to help, even if it doesn't get Julia into space. :)



First I would like to apologize: just after posting my answer I went to Wikipedia to look up the difference between soft and hard real time.
I should have done it before, so that you didn't have to spend more time explaining.

In the end I still don't know if I need hard real time or soft real time: the timing is given by the camera speed, and the processing should be done between the acquisition of two images.


From: https://en.wikipedia.org/wiki/Real-time_computing#Criteria_for_real-time_computing
  • Hard – missing a deadline is a total system failure.
  • Firm – infrequent deadline misses are tolerable, but may degrade the system's quality of service. The usefulness of a result is zero after its deadline.
  • Soft – the usefulness of a result degrades after its deadline, thereby degrading the system's quality of service.

[Note also, real-time also applies to doing stuff too early, not only to not doing stuff too late.. In some cases, say in games, that is not a [big] problem, getting a frame ready earlier isn't a big concern.]


Are you sure "the processing should be done between the acquisition of two images" is a strict requirement? I assume the "atmospheric turbulence" does not change extremely quickly, so you could have some latency, with your calculation applying for some time/at least a few/many frames after, and then your project seems not hard real-time at all. Maybe soft or firm, a category I had forgotten..


At least, is your timescale much longer than the time the camera takes to capture each frame of a video?


You also said "1000 images/sec but the camera may be able to go up to 10 000 images/sec". I'm aware of very high-speed photography, such as capturing a picture of a bullet from a gun, or seeing light literally spreading across a room. Still, do you need that many frames per second for capturing video (that seems not to be your job) or for the correction? Did you mix up camera speed with exposure time? Ordinary cameras go up to 1/1000 s shutter speed, but might only take video at up to 30, 60 or say 120 fps.



>I like the definition of 95% hard real time; it suits my needs. Thanks for this good paper.

The term/title sounds like firm real-time..

 
We don't want to miss an image or delay the processing; I still need to clarify the consequences of a delay or of a missed image.
For now let's just say that we can miss some images, so we want soft real time.

You could store with each frame a) how long since the mirror was corrected, based on b) a measurement from how long ago. Also, can't you [easily] see from a picture if the mirror is maladjusted? Does it then look blurred, with high-frequency content missing?

How many "mirrors" are adjusted, or points in the mirror[s]?


I'm making a benchmark that should match the system in terms of complexity. These are my first remarks:

When you say that one allocation is unacceptable, I can confirm it's shockingly true: in my case I had 2 allocations done by
    A += 1 where A is an array
and in 7 seconds I had 600k allocations.
Moral: in a closed loop you cannot accept any allocation, so you have to write out all loops explicitly.

I think you mean two (or even one) allocations are bad because they are in a loop. And that loop runs for each adjustment.

I meant that even just one allocation (per adjustment, or frame if you will) can be a problem. Well, not strictly, but say there have been many in the past; then it's the last one that is the problem.
 

I have two problems now:

1/ Many times, the first run, which includes compilation, was the fastest, and then every other run was slower by a factor of 2.
2/ If I relaunch the main function (which is in a module) many times, some runs are very different (slower) from the previous ones.

About 1/: although I find it strange, I don't really care.
2/ is far more problematic: once the code is compiled, I want it to behave the same no matter how many times it is launched.
I have some ideas why, but no certainty. What bothers me the most is that all the runs in the benchmark will be slower; it's not a temporary slowdown, the whole current benchmark will be slower.
If I launch again, it will be back to the best performance.

Thank you for the links, they are very interesting and I will keep them in mind.

Note: I disabled hyperthreading and overclocking, so it should not be the CPU doing funky things.

At least keep possible thermal throttling in mind.. The other guy, Islam, had something on it. I had my mind set on the coldness or hotness of space.. and radiation-hardening.

--
Palli.

Páll Haraldsson

Jun 8, 2016, 1:55:11 PM
to julia-users, matthe...@gmail.com
On Tuesday, May 31, 2016 at 4:44:17 PM UTC, Páll Haraldsson wrote:
On Monday, May 30, 2016 at 8:19:34 PM UTC, Tobias Knopp wrote:
If you are prepared to make your code not perform any heap allocations, I don't see a reason why there should be any issue. When I once worked on a very first multi-threading version of Julia, I wrote exactly such functions that won't trigger GC, since the latter was not thread-safe. This can be hard work, but I would assume that it's at least not more work than implementing the application in C/C++ (assuming that you have some Julia experience).

I would really like to know why the work is hard: is it getting rid of the allocations, or being sure there are no more hidden in your code? I would also like to know whether you can then do the same as in the D language:

 
That is, would it be possible to make a macro @nogc and mark functions in a similar way?

The @nogc macro was made a long time ago, I now see:

https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/Suspending$20Garbage$20Collection$20for$20Performance...good$20idea$20or$20bad$20idea$3F/julia-users/6_XvoLBzN60/nkB30SwmdHQJ

I'm not saying disabling the GC is preferred, just that a macro to do it had already been made.

Karpinski has his own exception-safe variant a little further down the thread, with "you really want to put a try-catch around it". I just changed that variant so it can be called recursively (and disabled the try-catch, as it was broken):

macro nogc(ex)
    quote
        #try
            local pref = gc_enable(false)   # remember previous GC state, turn the GC off
            local val = $(esc(ex))          # run the expression with the GC disabled
        #finally
            gc_enable(pref)                 # restore the previous GC state
        #end
        val                                 # return the expression's value
    end
end
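
Hypothetical usage, pairing it with a preallocated buffer:

julia> buf = zeros(Float32, 1000);

julia> @nogc fill!(buf, 1f0);   # runs with the GC off; previous GC state restored after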


Islam Badreldin

Jun 8, 2016, 2:13:47 PM
to julia-users, matthe...@gmail.com
Hi Páll,
This is a very informative thread. Thank you for pointing it out!
 

I'm not saying disabling the GC is preferred, just that a macro to do it had already been made.

Karpinski has his own exception-safe variant a little further down the thread, with "you really want to put a try-catch around it". I just changed that variant so it can be called recursively (and disabled the try-catch, as it was broken):

macro nogc(ex)
    quote
        #try
            local pref = gc_enable(false)   # remember previous GC state, turn the GC off
            local val = $(esc(ex))          # run the expression with the GC disabled
        #finally
            gc_enable(pref)                 # restore the previous GC state
        #end
        val                                 # return the expression's value
    end
end



  -Islam 

John leger

Jun 9, 2016, 4:02:03 AM
to julia-users, matthe...@gmail.com


On Wednesday, June 8, 2016 at 17:33:18 UTC+2, Páll Haraldsson wrote:
On Monday, June 6, 2016 at 9:41:29 AM UTC, John leger wrote:
Since it seems you have a good overview of this domain, I will give more details:
We are working in signal processing, especially image processing. The goal here is just the adaptive optics: we just want to stabilize the image, not produce the final image.
The consequence is that we will not store anything on the hard drive: we read an image, process it and destroy it. We stay in RAM all the time.
The processing is done by our own algorithms, so for now there is no need for any external library (and I don't see any reason for that to change).

I completely misread/missed reading point 3) about the "deformable mirror"; I see now it's a down-to-earth project - literally.. :)

Still, glad to help, even if it doesn't get Julia into space. :)



First I would like to apologize: just after posting my answer I went to Wikipedia to look up the difference between soft and hard real time.
I should have done it before, so that you didn't have to spend more time explaining.

In the end I still don't know if I need hard real time or soft real time: the timing is given by the camera speed, and the processing should be done between the acquisition of two images.


From: https://en.wikipedia.org/wiki/Real-time_computing#Criteria_for_real-time_computing
  • Hard – missing a deadline is a total system failure.
  • Firm – infrequent deadline misses are tolerable, but may degrade the system's quality of service. The usefulness of a result is zero after its deadline.
  • Soft – the usefulness of a result degrades after its deadline, thereby degrading the system's quality of service.

[Note also, real-time also applies to doing stuff too early, not only to not doing stuff too late.. In some cases, say in games, that is not a [big] problem, getting a frame ready earlier isn't a big concern.]



That's why in the previous mail I said that for now we will consider the system as soft real-time. But even if we can tolerate some missed deadlines, we don't want that to happen too many times. So soft is not bad, but some 95% hard real time (firm) sounds better in our case.
 

Are you sure "the processing should be done between the acquisition of two images" is a strict requirement? I assume the "atmospheric turbulence" does not change extremely quickly, so you could have some latency, with your calculation applying for some time/at least a few/many frames after, and then your project seems not hard real-time at all. Maybe soft or firm, a category I had forgotten..



The system is a closed loop without threads. The closed loop does all the steps described before, one after the other, and restarts, so taking the flow of the camera as the reference timer is a good idea.
Your assumption is not correct: the turbulence is the main reason why we need the 1 kHz so much, and you can add the fact that we are working with the Sun in the visible spectrum (we want to observe fast things in hard conditions).

 

At least, is your timescale much longer than the time the camera takes to capture each frame of a video?


You also said "1000 images/sec but the camera may be able to go up to 10 000 images/sec". I'm aware of very high-speed photography, such as capturing a picture of a bullet from a gun, or seeing light literally spreading across a room. Still, do you need that many frames per second for capturing video (that seems not to be your job) or for the correction? Did you mix up camera speed with exposure time? Ordinary cameras go up to 1/1000 s shutter speed, but might only take video at up to 30, 60 or say 120 fps.



This is the kind of camera we will be using:
http://www.mikrotron.de/en/products/machine-vision-cameras/coaxpressr.html <- 4CXP
If you look at the datasheet and consider the fact that we will work at a resolution of ~400x400, 1000 fps is an easy thing to do.
 


>I like the definition of 95% hard real time; it suits my needs. Thanks for this good paper.

The term/title sounds like firm real-time..

 
We don't want to miss an image or delay the processing; I still need to clarify the consequences of a delay or of a missed image.
For now let's just say that we can miss some images, so we want soft real time.

You could store with each frame a) how long since the mirror was corrected, based on b) a measurement from how long ago. Also, can't you [easily] see from a picture if the mirror is maladjusted? Does it then look blurred, with high-frequency content missing?

How many "mirrors" are adjusted, or points in the mirror[s]?

We will use this DM, the 97-15, so 97 actuators:
http://www.alpao.com/Products/Deformable_mirrors.htm
All of the values I gave you come from the people currently working on the telescope, so even if I don't know whether we are soft, firm or hard (and I hope we will be able to find out), this is what is needed for the AO to work and the output image to be usable.
 


I'm making a benchmark that should match the system in terms of complexity. These are my first remarks:

When you say that one allocation is unacceptable, I can confirm it's shockingly true: in my case I had 2 allocations done by
    A += 1 where A is an array
and in 7 seconds I had 600k allocations.
Moral: in a closed loop you cannot accept any allocation, so you have to write out all loops explicitly.

I think you mean two (or even one) allocations are bad because they are in a loop. And that loop runs for each adjustment.

I meant that even just one allocation (per adjustment, or frame if you will) can be a problem. Well, not strictly, but say there have been many in the past; then it's the last one that is the problem.

Yes, one alloc in a closed loop is deadly.
 
 

I have two problems now:

1/ Many times, the first run, which includes compilation, was the fastest, and then every other run was slower by a factor of 2.
2/ If I relaunch the main function (which is in a module) many times, some runs are very different (slower) from the previous ones.

About 1/: although I find it strange, I don't really care.
2/ is far more problematic: once the code is compiled, I want it to behave the same no matter how many times it is launched.
I have some ideas why, but no certainty. What bothers me the most is that all the runs in the benchmark will be slower; it's not a temporary slowdown, the whole current benchmark will be slower.
If I launch again, it will be back to the best performance.

Thank you for the links, they are very interesting and I will keep them in mind.

Note: I disabled hyperthreading and overclocking, so it should not be the CPU doing funky things.

At least keep possible thermal throttling in mind.. The other guy, Islam, had something on it. I had my mind set on the coldness or hotness of space.. and radiation-hardening.
 

If you have any questions, just ask.
Maybe another time for Julia in space ^^
 
--
Palli.
