How can Fortran become more popular for Data Science and Machine Learning?

Beliavsky

Feb 1, 2018, 3:34:00 PM

Fortran has been used for statistical analysis for more than half a century. Linear algebra libraries such as LAPACK are used in many codes for statistical algorithms, and many codes in the journal Applied Statistics were written in Fortran https://jblevins.org/mirror/amiller/#apstat .

"Data Science" is partly just a cool new name for statistics, but beyond statistics it encompasses the acquisition and cleaning of data and the use of machine learning algorithms such as neural networks and Support Vector Machines.

According to a 2017 poll by KDnuggets https://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html , the most popular languages in data science are

Python, 52.6% usage (was 45.8% in 2016), 15% up
R language, 52.1% (was 49.0%), 6% up
SQL, 34.9% (was 35.5%), 2% down
Java, 13.8% (was 16.8%), 18% down
Unix shell/awk/gawk, 9.6% (was 10.4%), 7% down
C/C++, 6.3% (was 7.3%), 13% down
Perl, 1.7% (was 2.3%), 27% down
Julia, 1.1% (was 1.1%), no change
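
As a quick check on how the "up"/"down" figures relate to the two shares, here is a minimal sketch that recomputes the relative change from the rounded percentages listed above. The names `shares` and `relative_change` are made up for this illustration. Because the inputs are already rounded, a few results differ by about one point from the poll's figures (for example Perl comes out near 26% rather than 27%); presumably the poll worked from unrounded shares, but that is an assumption.

```python
# Usage shares in percent, copied from the list above as (2017, 2016).
shares = {
    "Python": (52.6, 45.8),
    "R": (52.1, 49.0),
    "SQL": (34.9, 35.5),
    "Java": (13.8, 16.8),
    "Unix shell/awk/gawk": (9.6, 10.4),
    "C/C++": (6.3, 7.3),
    "Perl": (1.7, 2.3),
    "Julia": (1.1, 1.1),
}


def relative_change(new, old):
    """Relative change in percent: 100 * (new - old) / old."""
    return 100.0 * (new - old) / old


# Positive values mean "up", negative values mean "down".
changes = {name: relative_change(new, old) for name, (new, old) in shares.items()}

for name, pct in changes.items():
    print(f"{name:22s} {pct:+6.1f}%")
```

Note that these are relative changes in each language's share, not differences in percentage points: Python's share rose by 6.8 points, which is roughly 15% of its 2016 share.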

Not listed, but a language I have often seen used in machine learning is Matlab. For example, the very popular MOOC on Machine Learning by Andrew Ng https://www.coursera.org/learn/machine-learning#syllabus suggests that students use Matlab or Octave.

One reason Fortran is not a primary language of data science is that with faster computers, interpreted languages such as Python (with NumPy), R, Julia, and Matlab are more convenient. But a lot of compiled code in C, C++ or Fortran is being used behind the scenes. It is not coincidental that the interpreted languages all have powerful array operations, and modern Fortran should be a good fit for developers needing more speed.

It would be interesting to see how the speed of machine learning algorithms such as deep neural networks compares in C++ and Fortran. Students of machine learning are often asked to code basic algorithms in a language such as Python (with NumPy) to better understand them. I wonder how the code size and speed of Fortran compare to Python for simple implementations. It is easy to access machine learning algorithms from R or Python (because of packages such as scikit-learn for the latter). Fortran-callable libraries for machine learning are less common. Maybe I will try my hand at these projects.

Fortran becoming even a secondary language in Data Science and Machine Learning would help assure its continued viability, since those fields are growing so rapidly.

Ron Shepard

Feb 1, 2018, 8:52:28 PM

On 2/1/18 2:33 PM, Beliavsky wrote:
> One reason Fortran is not a primary language of data science is that with faster computers, interpreted languages such as Python (with NumPy), R, Julia, and Matlab are more convenient.

I would guess that part of that "convenience" in the field of statistics
and data science is graphics, the ability to display and analyze data
visually. There is no graphics or display capability in standard
fortran, and there really isn't even a de facto standard that is in
common use. In contrast, graphics capability is either built-in or there
are de facto common standards in most of the other languages that are
mentioned.

I would not know how to go about changing that situation. Any attempt
would likely be met with strong opposition.

$.02 -Ron Shepard

Mohammad

Feb 1, 2018, 11:57:15 PM

I agree with Ron!
We all know it is not the job of the Fortran language to include a graphics library, nor of the standard to force compiler developers to provide such a capability. On this point there is a common consensus!

BUT the availability of a simple library for quick and easy plotting is essential!
Please do not recommend PLplot, PGPLOT and DISLIN, which take a lot of effort to use! If you would like to see Fortran in that list of languages for data science, the library should also be free and accessible.

Each time people talk about Fortran's graphics capabilities, someone answers that graphics is not an objective of the standard or of the Fortran language. As I explained above, we know this! But today, with many options available such as
Octave, Python, Julia, ... (R and even Matlab = $$), don't expect students and researchers to put a lot of time into linking PLplot or PGPLOT to their code!

Wolfgang Kilian

Feb 2, 2018, 3:31:38 AM

On 01.02.2018 21:33, Beliavsky wrote:
> Fortran has been used for statistical analysis for more than half a century. Linear algebra libraries such as LAPACK are used in many codes for statistical algorithms, and many codes in the journal Applied Statistics were written in Fortran https://jblevins.org/mirror/amiller/#apstat .
>
> "Data Science" is partly just a cool new name for statistics, but beyond statistics it encompasses the acquisition and cleaning of data and the use of machine learning algorithms such as neural networks and Support Vector Machines.
>
> According to a 2017 poll by KDnuggets https://www.kdnuggets.com/2017/05/poll-analytics-data-science-machine-learning-software-leaders.html , the most popular languages in data science are
>
> Python, 52.6% usage (was 45.8% in 2016), 15% up
> R language, 52.1% (was 49.0%), 6% up
> SQL, 34.9% (was 35.5%), 2% down
> Java, 13.8% (was 16.8%), 18% down
> Unix shell/awk/gawk, 9.6% (was 10.4%), 7% down
> C/C++, 6.3% (was 7.3%), 13% down
> Perl, 1.7% (was 2.3%), 27% down
> Julia, 1.1% (was 1.1%), no change
>
> Not listed, but a language I have often seen used in machine learning is Matlab. For example, the very popular MOOC on Machine Learning by Andrew Ng https://www.coursera.org/learn/machine-learning#syllabus suggests that students use Matlab or Octave.
>
> One reason Fortran is not a primary language of data science is that with faster computers, interpreted languages such as Python (with NumPy), R, Julia, and Matlab are more convenient. But a lot of compiled code in C, C++ or Fortran is being used behind the scenes. It is not coincidental that the interpreted languages all have powerful array operations, and modern Fortran should be a good fit for developers needing more speed.

I'm not an expert in that field, but isn't it true that most
applications merely use prebuilt packages for the actual number
crunching? Similar to the legacy linear algebra packages that are
popular in the Fortran (and Matlab) world. I could imagine that most of
the CPU time is spent in machine code that originates from some C or C++
source, regardless of the high-level scripting.

If you can write a convenient package as a library with BIND(C) in
Fortran, it might as well be accessed from Python, R, etc.

For the high-level part, users are lazy and will take string-handling,
generic data storage, sorting, and finally visualization off the shelf.
Fortran as a language, but also as an ecosystem, doesn't really offer that.

Fortran does offer efficiency and (most recently) parallel programming.
If somebody comes up with a convincing coarray implementation for some
low-level neural networks, together with a real-world application, it
would get its merits.

>
> It would be interesting to see how the speed of machine learning algorithms such as deep neural networks compares in C++ and Fortran. Students of machine learning are often asked to code basic algorithms in a language such as Python (with NumPy) to better understand them. I wonder how the code size and speed of Fortran compare to Python for simple implementations. It is easy to access machine learning algorithms from R or Python (because of packages such as scikit-learn for the latter). Fortran-callable libraries for machine learning are less common. Maybe I will try my hand at these projects.
>
> Fortran becoming even a secondary language in Data Science and Machine Learning would help assure its continued viability, since those fields are growing so rapidly.
>

-- Wolfgang

--
E-mail: firstnameini...@domain.de
Domain: yahoo

complex...@gmail.com

Feb 2, 2018, 3:37:29 AM

On Thursday, February 1, 2018 at 9:34:00 PM UTC+1, Beliavsky wrote:
> ...

> Fortran becoming even a secondary language in Data Science and Machine Learning would help assure its continued viability, since those fields are growing so rapidly.

As an enthusiastic Fortraner, I admit never having used Fortran for visualization. For sure, Fortran's relative "non-popularity" is rooted in myths and misconceptions. But "data science" is maybe not the best example, it is a field where you might need text processing, linear algebra, *interactive* visualization, then build up a web front-end, etc in a single solution. Fortran is just not competitive at that.

For most "plain science" computing (I mean, the typical work of modelers, mathematicians, chemists, etc) coupling Fortran to an external visualization tool (gnuplot, Python+matplotlib, whatever else) is a really good solution.

But the real issue is not with the language itself, I believe. Despite all the flaws of the "others" in terms of packaging, when will I be able to do "fpack install ddeabm StringiFor" and be ok?

For my personal needs, I am happy with git, git submodules and cmake to manage project-wise dependencies. Given the compiled nature of Fortran, this is close to ideal, as the instructions for any package go along the lines of:

git clone --recursive URL_TO_REPOSITORY/PROJECT
cd PROJECT
mkdir build
cd build
cmake ..
make

which is a bit verbose but consistent. Reeeaaally stable libraries could of course remain compiled system-wide or user-wide. I blogged about my use of git and cmake here: http://pdebuyl.be/blog/2018/fortran-cmake-git.html

A good install solution + a few reference libraries for handling input files, csv data, etc would go a long way in my opinion. Of course, some are more ambitious about what should go in a "Hypothetical Fortran Standard Library" :-) http://www.acorvid.com/2017/12/13/what-i-miss-when-writing-fortran/

Cheers,

Pierre

PS: woops, this ended up a bit longer than I expected

Gary Scott

Feb 2, 2018, 8:47:52 AM

On 2/2/2018 2:37 AM, complex...@gmail.com wrote:
> On Thursday, February 1, 2018 at 9:34:00 PM UTC+1, Beliavsky wrote:
>> ...
>> Fortran becoming even a secondary language in Data Science and Machine Learning would help assure its continued viability, since those fields are growing so rapidly.
>
> As an enthusiastic Fortraner, I admit never having used Fortran for visualization. For sure, Fortran's relative "non-popularity" is rooted in myths and misconceptions. But "data science" is maybe not the best example, it is a field where you might need text processing, linear algebra, *interactive* visualization, then build up a web front-end, etc in a single solution. Fortran is just not competitive at that.
>

There are, however, excellent although expensive options designed
specifically for Fortran visualization, with a Fortran-tailored API.
The upcoming version 9 of GINO (link below) has significant improvements
in interactive graphics editing features. You could fairly easily write
a 2D or 3D AutoCAD-like application with it. It offers not just
built-in, output-only graphs, charts, surface charts, etc., but also
interactive chart editing and manipulation (if you write your
application that way). I've helped fund many new features and anomaly
corrections over the decades (35 years) to help increase the
interactivity support. I think it is the most full-featured API
available that is tailored for Fortran.

http://www.gino-graphics.com

herrman...@gmail.com

Feb 2, 2018, 9:13:21 AM

On Thursday, February 1, 2018 at 8:57:15 PM UTC-8, Mohammad wrote:

(snip)

> Each time people talk about Fortran's graphics capabilities,
> someone answers that graphics is not an objective of the
> standard or of the Fortran language.

> As I explained above, we know this! But today, with many
> options available such as Octave, Python, Julia, ...
> (R and even Matlab = $$), don't expect students and researchers
> to put a lot of time into linking PLplot or PGPLOT to their code!

Note that all the languages being compared are interpreted.

Interpreted languages are good for quick tests of algorithms,
and, as noted, often have built-in graphics.

Larger scale data science is run in batch mode, when it
might take days, or might run on a cluster or cloud multiple
processor system in hours.

It is reasonable to do the number crunching in Fortran batch,
and visualization of the results with an interpreted language
with graphics built-in.

There is one problem, though, and that is that it isn't as easy
as it could be to convert an algorithm developed and tested
with an interpreted language into Fortran.

FortranFan

Feb 2, 2018, 12:27:29 PM

On Thursday, February 1, 2018 at 3:34:00 PM UTC-5, Beliavsky wrote:

> ..
>
> One reason Fortran is not a primary language of data science is that with faster computers, interpreted languages such as Python (with NumPy), R, Julia, and Matlab are more convenient. .. modern Fortran should be a good fit for developers needing more speed.
>
> It would be interesting to see how the speed of machine learning algorithms such as deep neural networks compares in C++ and Fortran. ..

Yes, speed is very important in these disciplines, but for many practitioners in these fields what matters is the *overall time* it takes to get their tasks of interest completed. Computational speed is often only a small part of the total work process.

And so yes, things such as convenience with easy/free accessibility to libraries and tools, systems integration, data exchange, reduction, post-processing, visualization, etc. can all come into play in the programming languages that get used. And in industry, few who work in data analytics seem to have any interest in following any ISO/IEC standard-based software programming and development; what matters is what they can use 'today' to get their work done as quickly as possible.

The entire Fortran language evolution model with WG5, JTC1, etc. is so slow and so far behind the times that Fortran will struggle to play even a secondary role in these domains and may need to remain content with some back-end compute libraries being in legacy FORTRAN or whatever.

Lynn McGuire

Feb 2, 2018, 12:59:24 PM

A key problem is that Microsoft does not support the Fortran language
directly.

If gfortran had direct support for embedding itself in Visual Studio,
that would help immensely.

Lynn

Clive Page

Feb 2, 2018, 5:55:07 PM

On 02/02/2018 01:52, Ron Shepard wrote:
> On 2/1/18 2:33 PM, Beliavsky wrote:
>> One reason Fortran is not a primary language of data science is that with faster computers, interpreted languages such as Python (with NumPy), R, Julia, and Matlab are more convenient.
>
> I would guess that part of that "convenience" in the field of statistics and data science is graphics, the ability to display and analyze data visually. There is no graphics or display capability in standard fortran, and there really isn't even a de facto standard that is in common use. In contrast, graphics capability is either built-in or there are de facto common standards in most of the other languages that are mentioned.

I agree with you. But I think there are two factors which together make Fortran look a poor choice for data science, which is a pity.

The first is indeed graphics - if someone didn't say that a picture is worth a thousand numbers then they should have done. There are actually quite a lot of graphical packages with Fortran bindings, but as you say nothing has reached the state of being even a de facto standard. For astronomical data analysis over many years and many operating systems my colleagues and I have often used PGPLOT, which was invented by an astronomer so it suits us pretty well. But it's terribly out-of-date with Fortran77-style bindings, and it's also quite hard to port to new systems.

The second factor in my opinion is that Fortran and Microsoft Windows don't work all that well together. It's true that the great majority of super-computers run Linux but a lot of us developing Fortran programs start out running the code on our own personal computers which mostly run Windows for a variety of good or bad reasons. Of course gfortran can be installed on Windows, but the confusion over whether to use CygWin or MinGW doesn't make it entirely straightforward.

PGPLOT can just about be installed on Windows but it's a struggle especially to get real-time or interactive graphics working. The PLPLOT package was, I gather, designed as a modern replacement for PGPLOT but it's designed for Unix/Linux systems. If you want to install it on Windows you are pretty much on your own, and it requires the prior installation of about five other packages.

I'm not sure what can be done about this, but one can see why easily-installed languages like Python with easy to use graphics have captured a lot of the market.

--
Clive Page

Ron Shepard

Feb 2, 2018, 9:11:50 PM

On 2/2/18 11:59 AM, Lynn McGuire wrote:
> On 2/1/2018 2:33 PM, Beliavsky wrote:

[...]

>> Python, 52.6% usage (was 45.8% in 2016), 15% up
>> R language, 52.1% (was 49.0%), 6% up
>> SQL, 34.9% (was 35.5%), 2% down
>> Java, 13.8% (was 16.8%), 18% down
>> Unix shell/awk/gawk, 9.6% (was 10.4%), 7% down
>> C/C++, 6.3% (was 7.3%), 13% down
>> Perl, 1.7% (was 2.3%), 27% down
>> Julia, 1.1% (was 1.1%), no change

Just out of curiosity, where is MS Excel in this list? I would have
expected the number of Excel or Excel look-alike users to outnumber all
of these other languages put together for things like statistics and
data analysis.

And where are Mathematica and Matlab (which you also mentioned)? I know
those are expensive commercial products, but I also know many people who
use them daily.

> A key problem is that Microsoft does not support the Fortran language
> directly.

I think MS is satisfied with their captive audience. I use MS Office, I
write papers in MS Word, and it is a terrible environment for scientific
papers, but I use it nonetheless because everyone else also uses it.
Over the last 20 years or so, they have made barely a token effort to
improve their software for scientific users. I used to send comments to
their online feedback with various suggestions, but they were always
ignored, so I just gave up.

As for scientific programming in general, they have never made any
significant effort to support POSIX, or any of the de facto standard
file systems, or parallel programming libraries, or anything useful.
They just live in their own world, content to send out expensive
software updates that do nothing new that is useful to science.

They should have published libraries for scientific users that allow
them to read and write MS Excel files. They should have done that 30
years ago. But they didn't, and they probably never will.

My general feeling is that MS has done more to impede progress in
various ways in computing in general, and scientific computing in
particular, than they have done to advance it.

So just let them do their thing and the rest of the world will move on
without them, including fortran.

> If gfortran had direct support for embedding itself in Visual Studio,
> that would help immensely.

There is another possible solution to that problem.

$.02 -Ron Shepard

Ron Shepard

Feb 2, 2018, 9:14:58 PM

On 2/2/18 4:55 PM, Clive Page wrote:
> The second factor in my opinion is that Fortran and Microsoft Windows
> don't work all that well together.

As I said in a different reply, there is an obvious solution to that
problem.

$.02 -Ron Shepard

Thomas Koenig

Feb 3, 2018, 7:36:42 AM

Ron Shepard <nos...@nowhere.org> wrote:

What would be required for this?

Note I have hardly used Visual Studio. Emacs, make and
gdb constitute my preferred build environment :-)

Ayyy LMAO

Feb 3, 2018, 1:38:27 PM

On top of what has already been said, what many of the competitors in the OP's list have, and Fortran doesn't, is an effective interactive REPL mode. For the field in question, many times solutions have to be made in an ad-hoc fashion tailored for very specific projects. Because of this, the ability to write and evaluate chunks of code "on the fly" is very valuable.
Even the latest incarnations of C# and F# (which are compiled) support an interactive mode in Visual Studio.

Nasser M. Abbasi

Feb 3, 2018, 4:03:23 PM

On 2/3/2018 12:38 PM, Ayyy LMAO wrote:
> On top of what has already been said, what many of the competitors in the OP's list have, and Fortran doesn't, is an effective interactive REPL mode. For the field in question, many times solutions have to be made in an ad-hoc fashion tailored for very specific projects. Because of this, the ability to write and evaluate chunks of code "on the fly" is very valuable.
> Even the latest incarnations of C# and F# (which are compiled) support an interactive mode in Visual Studio.
>

Java 9 has an interactive environment also. (REPL)

http://www.baeldung.com/java-9-repl
"This article is about jshell, an interactive REPL (Read-Evaluate-Print-Loop)
console that is bundled with the JDK for the upcoming Java 9 release.
For those not familiar with the concept, a REPL allows to interactively
run arbitrary snippets of code and evaluate their results.

A REPL can be useful for things such as quickly checking the
viability of an idea or figuring out e.g. a formatted string for String
or SimpleDateFormat."

Here is one online to try

http://www.javarepl.com/term.html

For data science, a language without a REPL/interactive environment
and graphics is like a car without wheels. It can be a nice fast
car, but it is not useful without wheels. Even if graphics were added
to the Fortran standard, that would not be enough. The interactive environment
would still be missing, which I think is just as important for data exploration.

--Nasser

Lynn McGuire

Feb 4, 2018, 12:42:20 AM

What is the other solution ? Visual Studio 2015 is one of the finest
development environments that I have ever used. And that includes Turbo
Pascal, Turbo C, Visual C++ 6, and Visual Studio 2005.

Lynn

Lynn McGuire

Feb 4, 2018, 12:43:50 AM

But they are not integrated. In my experience, integrated build
environments give another level of ease of use.

Lynn

Lynn McGuire

Feb 4, 2018, 12:48:00 AM

Our 800K loc calculation engine of mostly Fortran 77 (maybe 15K loc C++)
requires about 3 to 4 minutes for a complete rebuild. Our 500K loc user
interface of all C++ requires almost an hour for a full rebuild.

Lynn

spectrum

Feb 4, 2018, 4:07:24 AM

#(note: my posts below are simply based on what I've read on the net,
so just guessing...)

On Friday, February 2, 2018 at 5:31:38 PM UTC+9, Wolfgang Kilian wrote:
> Fortran does offer efficiency and (most recently) parallel programming.
> If somebody comes up with a convincing coarray implementation for some
> low-level neural networks, together with a real-world application, it
> would get its merits.

I guess that if Fortran is to have some potential appeal or impact
for data science, it would not be through integrating graphics or a REPL into Fortran
(because it is almost impossible to "beat" Python for ease of use),
but rather through some backend toolkit that allows scalable, parallel
computation of vector-matrix based algorithms or
manipulation of large data over clusters and supercomputers.
I guess coarray Fortran may have an advantage for this purpose
if parallel array handling makes the development easier.
If such a backend library provides nice interfaces (bindings) for Python, R, etc.,
it may have some appeal (if the performance is good and it is easy to use!)

Indeed, although Python etc are very convenient for manipulating
data interactively, there seems to be strong need for scaling up
the computation for larger data. I've often seen web/blog articles that say
that "the challenge is to meet both user productivity
and high performance/scalability" [*1]. So from the Python side, something like

productivity/ease-of-use (already good)
------>
challenge: scalability/performance (various options available
but each option not entirely satisfactory at the moment)

Among the web pages I've seen, Intel's high performance scripting library
was interesting (which seems to be related to scalable treatment of "data frame"):

Intel high-performance analytics toolkit and dataframes
https://markhkim.com/foundtechnicalities/intel-high-performance-analytics-toolkit-and-dataframes/

The original articles are here:

HPAT: High Performance Analytics with Scripting Ease-of-Use
https://arxiv.org/abs/1611.04934v2

HiFrames: High Performance Data Frames in a Scripting Language
https://arxiv.org/abs/1704.02341v1
(from Abstract) "Data frames in scripting languages are essential abstractions for
processing structured data. However, existing data frame solutions are either not
distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are
not tightly integrated with array computations (e.g., Spark SQL). ..."

HPAT's page (called from Python)
https://github.com/IntelLabs/hpat

[*1] In this sense, it is interesting that many recent statically typed languages
are trying the opposite way (to get the ease of use of Python somehow).
Recent C++ also seems to be struggling to be more and more Python-like (to me)
via various new syntaxes and make the programming easier (at least than old C++ ;)

spectrum

Feb 4, 2018, 5:10:18 AM

If the purpose is to learn the basic algorithms of machine learning etc., I guess any
language would be fine in principle. If one is already familiar with modern Fortran,
I believe it should also be very useful (for array handling).
But in practice, I guess people do not want to spend much time learning
languages themselves (because of limited time), so they prefer simpler languages
to achieve the goal (analyze data and get useful info).
So in that sense, I think it is very reasonable that Matlab/Python are popular
for that purpose while Fortran is not (particularly if the calculation is not too heavy).
In my case, one group member keeps on using Octave even for computations
that take days to finish... It seems that Matlab/Octave is very appealing
to him. So the learning cost is another (very) important factor, I guess.

Some related articles...

Want to Make It as a Biologist? Better Learn to Code
https://www.wired.com/2017/03/biologists-teaching-code-survive/

https://www.quora.com/Do-people-still-use-Python/answer/Patrycja-Okowicka?share=8b0fdd61&srid=hjXef

xkcd: python
https://xkcd.com/353/

spectrum

Feb 4, 2018, 5:52:16 AM

# Just for clarification of my post above:

By calling Matlab/Octave "simpler", I mean not only the syntax
but also that the coding itself is simpler (takes less time) because
of rich library support (I'm not familiar with Matlab, but I believe it
supports linear algebra, signal processing, and some statistics built in, plus
toolboxes at additional cost if necessary). For example, one can diagonalize
a matrix by eig() (or something similar), but it takes much more coding
to do the same in Fortran. This is one thing that is really missing in
current Fortran, and it loses potential "customers" (new users) for the language.
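
To give a rough feel for the gap described above, here is a minimal sketch of what "doing it yourself" involves when no eig()-style routine is at hand. It is written in Python only for brevity (not Fortran), handles only real symmetric matrices, and uses cyclic Jacobi rotations as a stand-in algorithm; the function name `jacobi_eigenvalues` and the test matrix are made up for this illustration. In real Fortran work one would normally call an existing LAPACK routine (for the symmetric case, for example DSYEV) rather than write this by hand.

```python
import math


def jacobi_eigenvalues(a, tol=1e-12, max_sweeps=50):
    """Eigenvalues of a real symmetric matrix via cyclic Jacobi rotations."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy, leave the input untouched
    for _ in range(max_sweeps):
        # Frobenius norm of the off-diagonal part; stop once it is negligible.
        off = math.sqrt(sum(a[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Choose the small-angle rotation that zeroes a[p][q].
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                if tau >= 0.0:
                    t = 1.0 / (tau + math.hypot(1.0, tau))
                else:
                    t = -1.0 / (-tau + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
                # Apply A <- J^T A J: first update columns p and q ...
                for k in range(n):
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                # ... then rows p and q.
                for k in range(n):
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    # On convergence the diagonal holds the eigenvalues.
    return sorted(a[i][i] for i in range(n))


# Small test matrix whose eigenvalues are 2 - sqrt(2), 2 and 2 + sqrt(2).
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
eigs = jacobi_eigenvalues(A)
print(eigs)
```

Even this restricted version (no eigenvectors, no non-symmetric input) is a few dozen lines, compared with a single eig(A) call in Matlab/Octave.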

Thomas Koenig

Feb 4, 2018, 6:00:41 AM

Have you used emacs extensively, especially its gud and EDE modes?

herrman...@gmail.com

Feb 4, 2018, 9:45:11 AM

On Sunday, February 4, 2018 at 2:52:16 AM UTC-8, spectrum wrote:

(snip regarding Matlab)

> For example, one can diagonalize
> a matrix by eig() (or something similar), but it takes much more coding
> to do the same in Fortran. This is one thing that is really missing in
> current Fortran, and it loses potential "customers" (new users) for the language.

One claimed advantage of Fortran is compatibility with libraries
written years ago, maybe up to 50 or 60 years ago.

Linear algebra routines, including eigenvalues and eigenvectors,
are definitely popular from years ago. The calling sequence
might not be as convenient, as you need to pass array dimensions
and such. (For the older ones, at least.)

Gary Scott

Feb 4, 2018, 10:50:34 AM

:) we used to have an emacs fanatic that built all sorts of analysis
tools using emacs...I had a great time implementing the same
functionality in xedit/kedit and getting 100 fold increase in
performance. What would take his implementation hours to do, mine would
do in seconds.

michael siehl

Feb 4, 2018, 11:54:47 AM

| How can Fortran become more popular for Data Science and Machine Learning?

Simple answer: Focus on those things that can be done best with Fortran. And don't aim to (re)do things that are better left to SAS, Stata, R, etc.

Deep Learning through neural networks and Big Data using regression methods both focus on predictive data analysis. The other side of the Data Science coin is called causal analysis. In statistics both of these are addressed by the same (regression) methods, only the rules for applying them are very different.
Predictive data analysis through statistical regression methods can be a simple but efficient way of making forecasts / predictions. It is easy to make a qualitative comparison between the results from a neural network analysis and those from a regression analysis. To increase the quality of a predictive analysis it is usually good to include as much data as possible in the (regression) analysis, hence the name Big Data.

Nevertheless, usually we want to know not only what will happen, given a certain input, but also what the real causes behind the outcome of a process are. The results from simple predictive data analysis can't be used for making any causal inference. Causal analysis is not Big Data: only a small fraction of the big data (a medium-sized sample) can be used. There are two main sources the data may come from: 1. a randomized experiment (experimental data), and 2. observations (observational data). Experimental data is the highly preferred form of data for doing causal analysis. Traditionally, causal analysis based on observational data can be very poor, due to the problem of omitted variables. Nevertheless, in many situations a randomized experiment may not be applicable. Then the only (poor) choice might be to use observational data for doing causal analysis.

For more than a decade now, and for certain kinds of processes, the situation for using observational data has improved: if the outcome of a process is mainly determined by within-object (subject) variations and not by between-object (subject) variations, Fixed Effects Regression methods can be applied using the traditional (regression) methods of statistics (and thus just by using existing Fortran (77) library codes); they only require a different use of them.

So the traditional Fortran statistics library routines can be used for these (relatively) newly discovered Fixed Effects methods. But why should we then prefer Fortran libraries over, say, SAS or R? One problem, with Fixed Effects methods but not only with them, might be that we don't get important test statistics (simple example: an influential-cases diagnostic). To be more specific: we don't get the approximations of such test statistics. Then, to still get the required test statistics, we could calculate them by using (more or less) 'exact' values instead of the (missing) approximations.

Sure, this would require much more computation than using approximations would. In principle, one could compute an influential-case diagnostic based on such 'exact' values relatively easily, and make it run fast using threading. But what if we want to implement more sophisticated test statistics and other new classes of methods?
I could certainly see the requirement for new methods in parallel software development: distributed objects implemented through coarrays.
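
To make the 'exact' idea concrete, here is a rough sketch (Python for brevity, simple one-regressor case, invented data): refit the regression once per omitted case and record how far the slope moves. This is my own illustration of a leave-one-out influence measure, not a specific library routine; `fit_line` and `leave_one_out_slope_change` are hypothetical names.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def leave_one_out_slope_change(x, y):
    """Exact change in slope when each case is dropped and the model refit."""
    _, full_slope = fit_line(x, y)
    changes = []
    for i in range(len(x)):
        # Refit without case i; every refit is independent of the others.
        _, slope_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        changes.append(full_slope - slope_i)
    return changes

# Invented data: five points near y = x plus one far-out case at x = 10.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]

changes = leave_one_out_slope_change(x, y)
most_influential = max(range(len(x)), key=lambda i: abs(changes[i]))
print(most_influential, changes)
```

The cost is one full refit per case, which is why this is more expensive than an approximation; since the refits do not depend on each other, the loop is the natural place to apply threading.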

Python is hype, Fortran is cutting-edge.

Recommended reading:
https://en.wikipedia.org/wiki/Machine_learning
https://en.wikipedia.org/wiki/Deep_learning
https://en.wikipedia.org/wiki/Fixed_effects_model

dpb

Feb 4, 2018, 11:57:57 AM

Base MATLAB was originally built around the LINPACK and EISPACK libraries
(later releases moved to LAPACK); the "inventor" is, in fact, one of the
leading experts in the field. For an interesting read on the
history/background that is also pretty informative about just what MATLAB
entails as its basis, see Cleve's reminiscences:

<https://www.mathworks.com/company/newsletters/articles/the-origins-of-matlab.html>

<https://www.mathworks.com/company/newsletters/articles/the-gatlinburg-and-householder-symposia.html>

The third is more about the evolution of TMW as a company, so it may
not be of as much interest here, but just for completeness:

https://www.mathworks.com/content/dam/mathworks/tag-team/Objects/t/72887_92020v00Cleve_Growth_MATLAB_MathWorks_Two_Decades_Jan_2006.pdf

If one is into numerical methods, poking around Cleve's writings and
texts is quite enlightening sometimes. I don't follow the blog nearly
as much; I used to read his "Cleve's Corner" issues religiously back in
the day when sending out printed stuff was about the only way. I've not
evolved so much myself, plus having retired from the actual consulting
gig changed my focus significantly...hard to fathom that was 20+ yr
ago now.

As for the other comment re: the colleague using Octave and algorithms
taking hours; Octave is still open-source and hasn't benefited from the
continuing evolution in the compute algorithms that Matlab has; it has
advanced some of the syntax further than TMW and is, for the most part,
syntax equivalent but also doesn't have as rich a set of toolboxen.
OTOH, the price point is hard to argue with... :)

But, quite often the problem in Matlab performance is one that the user
doesn't properly make use of the ability to vectorize and get into the
compiled libraries efficiently. Not saying this is definitely your
colleague's problem; some algorithms just can't be vectorized but it is
something seen often.

But, as I've written before, the convenience and bundling make it a very
productive environment for many purposes. But, like everything,
sometimes "when all one has is a hammer, every problem looks like a
nail".

--

dpb

Feb 4, 2018, 1:49:27 PM

On 2/4/2018 10:57 AM, dpb wrote:
...
...[big snip for brevity--dpb]...

> As for the other comment re: the colleague using Octave and algorithms
> taking hours; Octave is still open-source and hasn't benefited from the
> continuing evolution in the compute algorithms that Matlab has; it has
> advanced some of the syntax further than TMW and is, for the most part,
> syntax equivalent but also doesn't have as rich a set of toolboxen.
> OTOH, the price point is hard to argue with... :)
>
> But, quite often the problem in Matlab performance is one that the user
> doesn't properly make use of the ability to vectorize and get into the
> compiled libraries efficiently. Not saying this is definitely your
> colleague's problem; some algorithms just can't be vectorized but it is
> something seen often.

...

Although I should note that newer releases of Matlab do suffer
performance loss relative to earlier versions from "feature bloat" and,
more significantly (I think, altho I don't have inside access to be able
to prove it), from the migration from procedural interfaces to
object-oriented methods/properties.

TMW also introduced a completely new graphics engine (termed HG2 for
"Handle Graphics 2" vis-a-vis "HG" or "HG1", kinda' like new and old Coke
for a while) a few releases (years) back; it definitely shows
performance degradation, and although TMW is making improvements it's
still a bottleneck for large datasets.

Also, newer data classes such as the |TABLE| are, while quite convenient
general high-level collections with many useful features, clearly quite
a lot slower than base DOUBLEs.

I don't know where Octave might be in its development/support cycle
along similar lines to keep pace with newer MATLAB features; there have
been a few things Octave has implemented first that MATLAB has now
picked up--one being a script can now contain internal functions which
is handy for throwaway code or quick feasibility checks on ideas before
full-fledged development.

I'll note that one thing that is feasible with MATLAB (and I presume
Octave) for performance is if one can illustrate through profiling a
given code section as a bottleneck and can isolate that functionality
sufficiently, one can rewrite it into a "MEX" file in Fortran and/or
C/C++ and have both worlds of the interactive front end/graphics and the
compiled "engine" in the back room doing the heavy lifting. (There I
got Fortran back into the topic!!! :) )

--

David Duffy

Feb 4, 2018, 7:00:53 PM

Beliavsky <beli...@aol.com> wrote:
> Fortran has been be used for statistical analysis ... "Data Science"

> is partly just a cool new name for statistics, but beyond statistics
> is encompasses the acquisition and cleaning of data and the use of
> machine learning algorithms

R still contains a lot of Fortran - a post describing R 2.13.1

https://www.r-bloggers.com/how-much-of-r-is-written-in-r/

has the core code

% code in R: 22.259705
% code in C: 51.626379
% code in Fortran: 26.113916

There are a few thousand R packages that add on functionality, and Fortran is
well represented there as well. A certain amount of the C code is
translations from the original Fortran. It is very easy to call Fortran code
from interpreted R.

Much statistical analysis is done in an iterative interactive fashion,
one liners with a look at graphical and other outputs. Compilation is
for the highly optimized engines underlying the one liners, and for
repetitive stuff. Re graphics, we want them now, not after recompiling
(pace JIT etc), and want to play with them on the fly.

Real "Big Data", ie too big to reside in memory, AIUI uses older-style
workarounds - incremental analysis of running summary statistics (eg the
weights in a neural network model), randomized linear algebra (there is
Fortran, eg https://arxiv.org/pdf/1608.02148.pdf) and state-of-the-art
parallelization strategies.
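
As a small illustration of the running-summary idea (Python for brevity; the chunked data source is invented): Welford's online algorithm updates a mean and variance one value at a time, so the full data set never has to sit in memory. This is one well-known way to do it, not necessarily what any particular package uses; `RunningStats` and `chunks` are hypothetical names.

```python
class RunningStats:
    """Welford's online algorithm for a running mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

def chunks():
    """Stand-in for reading a large file piece by piece."""
    yield [2.0, 4.0, 4.0]
    yield [4.0, 5.0]
    yield [5.0, 7.0, 9.0]

stats = RunningStats()
for chunk in chunks():
    for value in chunk:
        stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept between chunks, whatever the size of the input.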

Cheers, David Duffy.

Lynn McGuire

Feb 5, 2018, 1:53:56 AM

On 2/4/2018 5:00 AM, Thomas Koenig wrote:

No.

Lynn

Thomas Koenig

Feb 6, 2018, 1:53:28 AM

Lynn McGuire <lynnmc...@gmail.com> schrieb:

>>>> Note I have hardly used Visual Studio. Emacs, make and
>>>> gdb constitute my preferred build environment :-)
>>
>>> But they are not integrated. In my experience, integrated build
>>> environments give another level of ease of use.
>>
>> Have you used emacs extensively, especially its gud and EDE modes?
>
> No.

Their integration is actually pretty good. You have to overcome
the initial hurdle of using Emacs, though, which can be high.

Stefano Zaghi

Feb 6, 2018, 3:08:05 AM

Vim rocks :-)

herrman...@gmail.com

Feb 6, 2018, 9:51:53 PM

On Sunday, February 4, 2018 at 4:00:53 PM UTC-8, David Duffy wrote:

(snip)

> % code in R: 22.259705
> % code in C: 51.626379
> % code in Fortran: 26.113916

Your TA would take points off for
excessive significant digits.

Mohammad

Feb 7, 2018, 11:10:22 PM

There are great free IDEs around!

Code::Blocks with the Fortran plugin (C::B) and Photran (Eclipse) are great integrated development environments (IDEs). I especially recommend C::B for its ease of use, the minimal space it requires, and a bunch of options for customization!

Lynn McGuire

Feb 9, 2018, 11:56:54 PM

OK, I will take a look at it.

Thanks,
Lynn

Beliavsky

Apr 17, 2018, 9:22:33 AM

Here is an article about machine learning applied to astronomy and astrophysics.

https://www.nextplatform.com/2018/04/16/gpus-mine-astronomical-datasets-for-golden-insight-nuggets/
GPUs Mine Astronomical Datasets For Golden Insight Nuggets
April 16, 2018 James Cuff

...

Critical to all of this research were GPU accelerators – specifically the Tesla P100s used in the DGX-1 server from Nvidia – which enabled accelerated training of neural networks. They used the Wolfram Language neural network functionality, built on top of the open-source MXNet framework, that in turn uses the cuDNN library for accelerating the training on Nvidia GPUs. ADAM was deployed as the underlying learning algorithm. The significant horsepower of the Blue Waters system, which is also GPU accelerated, was brought to bear for their modeling data and for solving Einstein’s equations via simulation. The group are also looking into generative models GANs (generative adversarial networks) to further reduce the multi-week time taken (even for Blue Waters) for these specific steps.