Go for Data Science

Slonik Az

Jul 16, 2019, 1:18:21 PM
to golang-nuts
Hi Gophers!
I was thinking of starting a Go project in the area of Data Science that would allow for convenient and easy concurrent data processing, but in the end decided against it, mainly for two reasons:

(1) Almost all data science projects start with exploratory data analysis of some sort. Unfortunately, Go does not have a REPL. The Go Playground is not a substitute, for it does not preserve state. On every iteration the Playground recompiles and relaunches the entire program, reads all the data anew, and performs all the calculations again. Not good for interactive "rapid fire".
A REPL in a statically AOT-compiled language is hard, yet Swift somehow managed to implement one.

(2) Even if somebody implements an incremental Go compiler and provides a proper REPL, people will still long for data analysis "at your fingertips", missing a rich pandas-like API, overloaded operators (Python style), and dynamic scoping (as in R). The minimalistic design of Go is unlikely to accommodate all of these "convenience" constructs, and for good reason.

I think Go has a place in highly performant concurrent data pipelines and transformations, but I am less optimistic that it will ever play in the field currently dominated by Python and R, and possibly by Julia in the future. I am curious what I am missing in this line of thinking.

Thanks,
--Leo

Michael Jones

Jul 16, 2019, 3:31:12 PM
to Slonik Az, golang-nuts
Leo,

R is implemented in C and FORTRAN, plus R on top of that. SAS is in C (and some Go here and there), plus the SAS language on top of that. Mathematica is implemented in C/C++ with the "Wolfram Language" on top of that. PARI/GP is implemented in C plus some GP-language code. Macsyma, Maple, Octave, Python, ... follow this pattern too:

3 [glue-like meta-tools that combine various "full stack" tools]: Sage
  :
2 [interactive exploration environment with scripting]: many and various, including R, SAS, MMA, GP, Macsyma, Axiom, Maple, Python, ...
  :
1 [performant heavy duty computation in compiled language]: C/C++
  :
0 [ultra-performant kernels in C/Assembler/..]: GMP, LAPACK, BLAS, ATLAS, ...

You say Data Science is an application domain where Level 2 features make sense, where they facilitate understanding by providing an interactive environment. The evidence supports you, though understand that none of your examples (or mine in the expanded set) actually do much at that level: Level 2 is where "convolve a with b" is specified, but the actual doing happens lower, in Levels 0 and 1, where Go-like compiled software in C, C++, or FORTRAN does the heavy lifting. (I make this point only to clarify what some people seem not to understand in blogs where they write "my Python giant matrix solver is just as fast as C/C++/Go, I don't see why C/C++/Go is not faster" or "I don't see the advantage in compiled languages.")

If Go has a place in interactive, interpretive data science, it seems to me that it would be as the substrate language (Levels 0 and 1). Go certainly has a place in statistics, applied mathematics, and other realms related to data science if you want to include apps that do work and act on results--control systems, analysis tools, etc. But to create an interactive "play" space, I'd (again, just me) be inclined to follow the PARI/GP model, with a Go kind of PARI and a domain-friendly GP.

The high-level GP (Mathematica, Maple, GP, SAS, ...) in the existing systems often seems to me to be weak: not designed as a first-class programming language, but more like an endless accretion of script-enabling fixes and patches. I feel this especially in the way local variables are defined, which often feels brutish and awkward, but it extends to many subtleties. It is natural that it tends this way--developers were focused on the core and just needed "a way" to bind it all together. The successful projects span decades and unanticipated new application domains, and so have accumulated the most duct tape.

Another goodness of this two-level scheme is that the top language can be "faulty" in ways that are comfortable. For example, think how many scalar variables you see in C/C++/FORTRAN/Go: "i := 3" is the bulk of variables. But in R there are (at least when I last looked) no scalar variables(!); you get by with vectors of length 1. This would not do, generally, but for R it may be perfect. The two-level strata design, of which PARI/GP is one of the best implementations, makes this kind of field-of-use tailoring work fine in practice. That's important: it matches the language's exposed concepts to the problem domain.

I don't see any of this as a weakness or strength of Go, or as something to address with a REPL, because it's not Go that you'd want a REPL for, but rather something that knows about data, or Diophantine equations, or moon rocks, or whatever the domain may be and its natural forms of notation.

Michael

--
Michael T. Jones
michae...@gmail.com

Anca Emanuel

Jul 16, 2019, 4:07:29 PM
to golang-nuts
Use the right tool for the job: https://julialang.org/ -- or ask on gonum-dev.

Jesper Louis Andersen

Jul 16, 2019, 4:45:39 PM
to Slonik Az, golang-nuts
On Tue, Jul 16, 2019 at 7:18 PM Slonik Az <slon...@gmail.com> wrote:
> A REPL in a statically AOT-compiled language is hard, yet Swift somehow managed to implement one.


I must disagree. The technique is fairly well known and has a long history; see, e.g., various Common Lisp and Standard ML implementations. If you are willing to accept a hybrid of a byte-code interpreter with a native-code compiler at your disposal, then OCaml and Haskell qualify as well. When a function is defined in the REPL, you just call the compiler and it emits machine code. You then mark that region of memory as executable and jump to it when the function is invoked. In some cases a dispatch table is used so a function can be replaced after the fact. The pure native approach has fallen somewhat out of favor relative to the hybrids, probably because modern computers are fast enough when you are exploring.
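
(To make the "compile, mark executable, jump" idea concrete in Go terms, here is a rough sketch of one REPL step that leans on the stock toolchain's plugin build mode instead of hand-rolled code generation. The file names and the Step symbol are invented for illustration; this is not how any existing Go REPL is built.)

// repl_step.go -- one "REPL step": write a snippet to disk, compile it
// as a plugin with the regular Go compiler, load it, and call into it.
// Linux/macOS only; plugin support is platform-dependent.
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"os/exec"
	"plugin"
)

func compileAndLoad(src string) (func() string, error) {
	if err := ioutil.WriteFile("snippet.go", []byte(src), 0644); err != nil {
		return nil, err
	}
	cmd := exec.Command("go", "build", "-buildmode=plugin", "-o", "snippet.so", "snippet.go")
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return nil, err
	}
	p, err := plugin.Open("snippet.so") // maps the fresh code in, ready to jump to
	if err != nil {
		return nil, err
	}
	sym, err := p.Lookup("Step")
	if err != nil {
		return nil, err
	}
	fn, ok := sym.(func() string)
	if !ok {
		return nil, fmt.Errorf("Step has unexpected type %T", sym)
	}
	return fn, nil
}

func main() {
	fn, err := compileAndLoad(`package main

func Step() string { return "hello from freshly compiled code" }
`)
	if err != nil {
		panic(err)
	}
	fmt.Println(fn()) // the "jump" into the newly generated code
}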

In my experience, most data science is about processing data so that it is suitable for doing science. Exploratory tools are good for understanding the model you are working in. However, real-world data processing can require you to work on several terabytes of data (or more!). There is a threshold where it starts to become beneficial to optimize the processing pipeline, especially the pre-processing parts, and lower-level languages such as Go tend to fare really well here. These lower-level tools can then be hooked into, e.g., R and Python, empowering the exploratory part of the system.
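
(As a small illustration of the kind of pre-processing pipeline I mean -- a fan-out/fan-in worker pool over a stream of records, with strings.Split standing in for whatever real parsing you need:)

package main

import (
	"fmt"
	"strings"
	"sync"
)

func main() {
	lines := make(chan string)    // raw records in
	fields := make(chan []string) // parsed records out

	// Fan out: a fixed pool of workers parses records concurrently.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ln := range lines {
				fields <- strings.Split(ln, ",") // stand-in for real work
			}
		}()
	}
	// Fan in: close the output once every worker is done.
	go func() { wg.Wait(); close(fields) }()

	// Feed the pipeline; in real life this would stream from disk or network.
	go func() {
		for _, ln := range []string{"a,1", "b,2", "c,3"} {
			lines <- ln
		}
		close(lines)
	}()

	for f := range fields {
		fmt.Println(f)
	}
}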

Another important point is that modern computational kernels, for instance TensorFlow, are really compilers from a data-flow graph representation to highly optimized numerical routines, some of which execute on specialized numerical hardware (8-32 bit floating-point SIMD hardware). You can define such a graph in Python, but then export it and use it in other systems and pipelines. As such, Python, your exploratory vehicle, becomes a plug-in for a lower-level processing pipeline. This also allows part of the graph to run inside a mobile client. The plug-in model is also followed by parallel array-processing languages; see, e.g., Futhark (https://futhark-lang.org/): you embed your kernel in another system. If you read Michael Jones's post, there are important similarities.

-- 
J.

Leo R

Jul 16, 2019, 5:23:15 PM
to golang-nuts
Hi Michael,
thanks for your reply. The current problem with the Data Science ecosystem (from Data Analysis all the way to GPU-based ML) is that it employs a whole stack of languages, from low-level ones like C (and sometimes assembler) all the way up to scripting ones like Python or R. In parallel, there are Big Data tools, Spark/Scala being the most popular, that can process massive data sets, but only provided a computation fits their computation model nicely (Map-Reduce and friends). Working with pandas or scipy or R does not feel like programming any longer :-) You are calling into massive APIs written by other people in other languages.

There is a realization in the Deep Learning community that Python does not quite cut it. A proverbial saying is that "the worst thing about PyTorch is Python". Hence the attempts to create monolingual stacks like Julia, or more recently Swift for TensorFlow -- not quite monolingual, but with an ambition to gradually eat into the territory of the C++-based TF kernel. The same goes for Scala/Spark: the JVM, with its high memory pressure, is not the best choice for near-bare-metal calculations (a possible opening for Go).

I am curious whether Go will toss its hat into the ring or leave the field to other players.

--Leo


Leo R

Jul 16, 2019, 5:46:48 PM
to golang-nuts
My point is that the contemporary Data Science stack uses too many different languages, all the way from scripting (R, Python) to statically compiled C/C++, sometimes Fortran (parts of R and some scipy algorithms are in Fortran), and even JVM-based Scala. This creates artificial barriers: data scientists play the Python/R game but struggle with Scala, while software engineers write pipelines in Spark/Scala but have no interest in R. Often, deploying to production requires recoding from one language to another. I hope that as the field matures there will be more consolidation and unification across the language zoo. Language barriers in scientifically heavy fields are not healthy. In Statistics, Python's statsmodels is a pale shadow of R's CRAN. The science community is split along language lines, which spreads already-thin resources even further.
--Leo

Dan Kortschak

Jul 16, 2019, 6:29:50 PM
to golang-nuts
We'd (gonum-dev) likely advise not to use Julia, for reasons that I won't go into here.

However, I can suggest that the OP check out the data-science channel on https://gophers.slack.com/

Also note that gorgonia does the data-flow graph compilation described by Jesper, and there are REPLs available for this kind of work in Go.
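
(For flavor, here is roughly what "define a graph, then run it" looks like in gorgonia, adapted from its own introductory example; treat the details as approximate:)

package main

import (
	"fmt"

	"gorgonia.org/gorgonia"
)

func main() {
	g := gorgonia.NewGraph()

	// Two scalar nodes in the data-flow graph.
	x := gorgonia.NewScalar(g, gorgonia.Float64, gorgonia.WithName("x"))
	y := gorgonia.NewScalar(g, gorgonia.Float64, gorgonia.WithName("y"))

	// z = x + y is defined symbolically; nothing is computed yet.
	z, err := gorgonia.Add(x, y)
	if err != nil {
		panic(err)
	}

	// Compile and run the graph on a VM.
	machine := gorgonia.NewTapeMachine(g)
	defer machine.Close()

	gorgonia.Let(x, 2.0)
	gorgonia.Let(y, 2.5)
	if err := machine.RunAll(); err != nil {
		panic(err)
	}
	fmt.Println(z.Value()) // 4.5
}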

On Tue, 2019-07-16 at 13:07 -0700, Anca Emanuel wrote:
> Use the right tool for the job: https://julialang.org/ -- or ask on gonum-dev.

Leo R

Jul 16, 2019, 7:06:25 PM
to golang-nuts
Regarding a REPL in Go: it is complicated. Currently, lgo seems to be broken as of Go 1.12 (and Go 1.13); see the README.md in its repo, https://github.com/yunabe/lgo. Until there is an official REPL, blessed by the Go core team and shipped as part of the tools, random unexpected REPL breakage remains a sad possibility.

gonum is a very interesting project that plays in the same space as numpy. But is there anything that can replace pandas in the Go universe?
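
(For readers comparing: the numpy-style part is already quite usable. A minimal sketch with gonum's mat package, purely illustrative:)

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

func main() {
	// Dense matrices filled in row-major order, much like numpy arrays.
	a := mat.NewDense(2, 2, []float64{1, 2, 3, 4})
	b := mat.NewDense(2, 2, []float64{0, 1, 1, 0})

	// c = a * b, computed into a reusable receiver.
	var c mat.Dense
	c.Mul(a, b)

	fmt.Println(mat.Formatted(&c)) // pretty-printed 2x2 result
}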

--Leo

Dan Kortschak

Jul 16, 2019, 7:55:38 PM
to Leo R, golang-nuts
There is a project that is intended to implement pandas-like data manipulation: https://github.com/ptiger10/pd

Jason E. Aten

Jul 21, 2019, 1:12:06 PM
to golang-nuts
Hello Leo,

There is a quite capable Go REPL available; it is called GoMacro. It is actually fairly mature. Massimiliano Ghilardi has done a great job with it. There is even a Jupyter kernel for it.


However, using Go directly lacks many niceties that I miss when doing data analysis, not the least of which is named parameters. You can extend Python and R with Go as you wish. Here is one of my projects from some time ago that demonstrates how to write R extensions in Go. Romain Francois was also doing a parallel effort at one point; you might search for it if you are interested.
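
(The general shape of such an extension is a c-shared library with exported functions. A minimal sketch below -- GoSum and the file names are invented for illustration, and the project mentioned above does considerably more:)

// ext.go -- build with: go build -buildmode=c-shared -o libgoext.so ext.go
//
// From R, a .C-style call against the produced library looks like:
//   dyn.load("libgoext.so")
//   .C("GoSum", as.integer(3), as.double(c(1, 2, 3)), out = double(1))$out
package main

import "C"

import "unsafe"

//export GoSum
func GoSum(n *C.int, xs *C.double, out *C.double) {
	// View the C array as a Go slice without copying (Go 1.17+ idiom).
	s := unsafe.Slice((*float64)(unsafe.Pointer(xs)), int(*n))
	total := 0.0
	for _, v := range s {
		total += v
	}
	// .C passes every argument as a pointer, so the result goes out this way.
	*out = C.double(total)
}

// main is required by the c-shared build mode but never runs as a program.
func main() {}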


Extensions for Python are also doable, but a little tricky (possible, but with lots of extra makefile wrangling) if you need portability to, e.g., Windows.

-J

Jason E. Aten

Jul 21, 2019, 8:11:49 PM
to golang-nuts
At the risk of mentioning my own work, there are various other extension languages that are useful for doing data-sciencey things with Go:

https://github.com/gijit/gi is an interactive REPL for Go that is JIT-compiled using LuaJIT (like GoMacro, there is no state reload on each line, and it runs only about 3x slower than compiled Go on average). It works on Windows as well as OS X and Linux. I'm not maintaining this any further, but you could take it in many directions.

A Lisp in Go that can easily integrate compiled Go routines.

Very promising: Alan Donovan's mini-Python, written in Go:

Bindings to Go of the very mature (and fast, incrementally compiling) Chez Scheme:

Jon Conradt

Jul 24, 2019, 4:13:02 PM
to golang-nuts
There is also https://github.com/containous/yaegi described as:

Yaegi is Another Elegant Go Interpreter. It powers executable Go scripts and plugins, in embedded interpreters or interactive shells, on top of the Go runtime.


Features

  • Complete support of Go specification
  • In pure Go, using only standard library
  • Simple interpreter API: New(), Eval(), Use()
  • Works everywhere Go works
  • All Go & runtime resources accessible from script (with control)
  • Security: unsafe and syscall packages not used or exported by default
  • Supports Go 1.11 and Go 1.12 (the latest two major releases)
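
A quick taste of that New()/Eval()/Use() API, following the pattern in the yaegi README (treat the details as approximate):

package main

import (
	"fmt"

	"github.com/containous/yaegi/interp"
	"github.com/containous/yaegi/stdlib"
)

func main() {
	i := interp.New(interp.Options{})
	i.Use(stdlib.Symbols) // make the standard library visible to scripts

	// Scripts import packages just like regular Go source.
	if _, err := i.Eval(`import "strings"`); err != nil {
		panic(err)
	}
	v, err := i.Eval(`strings.ToUpper("hello, yaegi")`)
	if err != nil {
		panic(err)
	}
	fmt.Println(v) // HELLO, YAEGI
}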