(py)torch-like autograd


ml...@topiq.es

Oct 12, 2017, 10:53:50 AM
to clojure-cortex
Hi,

I am writing my master's thesis on Bayesian neural networks and inference (including GANs) and work with pytorch every day. I would like to use Clojure for my research, but when I wrote boltzmann (1) a few years ago during my Bachelor thesis, python+theano was way faster (2x-5x) than clatrix on the CPU alone, and no reasonable GPU support was available. We have come a bit further along in the meantime in Clojure land, and cortex is a nice library. Unfortunately it is not really interesting for me, since it models high-level neural network abstractions instead of building them on top of a simple autograd library like pytorch does. Also, almost the whole research group (4) I am in has moved away from AOT-compiled theano and tensorflow to pytorch, because autograd along the control flow of the language is so much more flexible and the performance penalties are mostly negligible.

So I have gone ahead, looked at the reverse-mode autograd literature for Scheme, and ported the pytorch concept to Clojure over the weekend (2). I have also, over the last few days, reactivated Mike Anderson's core.matrix backend for neanderthal (3), since I want and need to run on the GPU. I think neanderthal is fairly unique and, with custom OpenCL kernels, can provide an edge in ways that are impossible in high-level environments like Python atm.

Clojure is also very well positioned for industry-focused deployment of machine learning, and I understand that cortex focuses on this use case. I would like to know whether you could imagine collaborating, and whether cortex could maybe be implemented on top of a general autograd library, so we avoid friction? I really would like to show that I can do cool research and maths with Clojure instead of reimplementing the whole stack. I hope we can push Clojure further as a good language for state-of-the-art performance.

Best,
Christian

(1) https://github.com/whilo/boltzmann
(2) https://github.com/whilo/clj-autograd
(3) https://github.com/whilo/denisovan
(4) https://hciweb.iwr.uni-heidelberg.de/publications-ial

Chris Nuernberger

Oct 13, 2017, 12:41:28 PM
to clojure-cortex
We actually completely agree that building out the auto-diff system for clojure is a great first step.  I have been pushing ThinkTopic to make the time to start down this path so we are in 100% alignment there.

Have you checked out the tensor backend that cortex uses already?  

This requires a bit more setup but it exposes a few concepts that are necessary in order to optimize these operations enough for us to train larger models:

1.  Those backends do not expose some basic operations on their base data layers - for example, I want to be able to offset a buffer and get back a new buffer, and furthermore I want to know which device a buffer is on and have a transfer mechanism from device to device.
2.  A separation of the data storage mechanism from the description of what is going on.  Note that the cortex tensor system uses descriptions, and these descriptions are where things like transpose, reshape, and select happen.  This means that all the cortex tensor backends, as long as they interpret the description in a standard way, do not have to rebuild the code required to do all of this.  For example, cortex has numpy-style generic transpose, and it worked correctly across the CPU and GPU implementations because they interpret the shape/stride combination in a uniform way (see the sketch after this list).
3.  Concepts associated with devices and streams, so that doing multi-gpu work and scheduling work across several gpu streams is easier - these concepts are in fact implementable across openCL as well, and as such would provide a unified mechanism to work with multiple gpus (or web workers or webgl contexts), meaning several backends do not reimplement the same code.
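To make points 1 and 2 concrete, here is a minimal sketch in plain Clojure data; all names are hypothetical, not the actual cortex API:

```clojure
;; A heavy buffer knows only its device, datatype, and size; cheap
;; description maps say how to interpret it.
(def buf {:device :gpu-0, :datatype :float, :size 4096})

(defn offset-buffer
  "Point 1: a new buffer viewing the same storage from `start` on."
  [buffer start]
  (-> buffer
      (update :size - start)
      (assoc :offset start :parent buffer)))

(defn describe
  "Point 2: attach an interpretation (shape/strides) to a raw buffer."
  [buffer shape strides]
  {:buffer buffer :shape shape :strides strides})

;; The same storage under two unrelated descriptions - plain data
;; manipulation, no new engine classes:
(def as-matrix (describe buf [64 64] [64 1]))
(def as-image  (describe buf [1 16 16 16] [4096 256 16 1]))
```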

A concrete example of using almost all of these concepts at once was this PR:


In order to run ResNet50 on small GPUs we have to reuse buffers across layers that aren't live at the same time. The traversal mechanism implements part of the solution and the binding does another part, but basically I reuse the backing storage by assoc-ing different descriptions to it, drastically reducing the memory usage because completely unrelated layers can reuse the IO buffers at different times.  Cortex also interleaves uploading data to the GPU so that we achieve 100% overlap between computation and upload, which is pretty tough for most apps.

Another piece of the above abstraction hierarchy is the ability to get generic data into cortex as fast as possible.  A lot of NN systems are bottlenecked by their ability to take raw data and get it to the GPU; cortex is not, thanks to our datatype library, which doesn't know anything about tensors and just has fast paths for taking potentially non-contiguous data and copying it as fast as possible into contiguous buffers. This is important for a lot of reasons and obviously benefits from the separation of data from description that is inherent in the cortex tensor design but completely complected in the core.matrix backend designs.

Support for multiple datatypes... While this is possible with the various backends, most of the time it is implemented by copying a class hierarchy; note the tensors use macros or c++ templates to do this, giving us *all* datatypes we want in very dense code.  This has already helped in that, for instance, the random number generation on CUDA does not work for all engines in double arithmetic and has extremely serious performance concerns.  Marshalling assignments are part of the cortex tensor specification and part of numpy, but completely missing from the core.matrix design.
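A minimal sketch of the macro approach (hypothetical names, not the cortex macro itself): one template expands into a type-hinted copy function per primitive datatype, instead of a copied class hierarchy per type.

```clojure
(defmacro def-copy-fns
  "Expand into a type-hinted copy function per primitive datatype."
  [& types]
  `(do
     ~@(for [t types]
         (let [fname (symbol (str "copy-" t "s!"))
               atype (symbol (str t "s"))]   ; doubles, floats, ...
           `(defn ~fname [~(with-meta 'src {:tag atype})
                          ~(with-meta 'dst {:tag atype})]
              (dotimes [i# (alength ~'src)]
                (aset ~'dst i# (aget ~'src i#)))
              ~'dst)))))

;; One template, all datatypes: defines copy-doubles!, copy-floats!,
;; copy-longs!, and copy-bytes! with primitive array hints.
(def-copy-fns double float long byte)
```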

Lastly, the ability to map complex operations directly to the cudnn subsystems is very important and allows us to train models in 1 day that would take a week otherwise.  Note that cudnn has primitives for entire RNN sections and the complete forward and backward passes for convolution that are *extremely* optimized (hand-coded assembly in some cases), and so are a massive gain *if* you can map your mathematical descriptions to these operations.  Note also that they aren't simple; they require in some cases a setup step and data stored specific to that operation, which is certainly not part of the pure math description of what is going on.

Aside from all that (and all that could change with time), I think that arguing about the backends is a waste of time for a few reasons:  First, we have very good reasons for not using arbitrary backends in cortex.  Second, for auto-diff, I don't think a discussion about a backend is very important.

Have you seen this:


I had envisioned the auto-diff system being a pure, datastructure->datastructure translation mechanism, separate from binding to any execution context.  In this way it can map to cortex tensors, potentially to nnvm, tensorflow operations, etc.  It is also easier to document and to test if it is purely a ds->ds system and not bound to any specific backend.
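For illustration, a minimal sketch of what such a pure ds->ds translation could look like; the node format and gradient rules here are hypothetical, not the actual design:

```clojure
;; The forward graph is plain data; gradient "compilation" is a pure
;; function from that data to an extended graph - no execution context.
(def forward-graph
  [{:id :y, :op :mul, :args [:w :x]}
   {:id :l, :op :sum, :args [:y]}])

(defn grad-graph
  "Append backward nodes for each forward node, in reverse order.
  Only :sum and :mul rules are shown; :dl is the seed gradient that
  a backend supplies at execution time."
  [graph]
  (into graph
        (mapcat (fn [{:keys [id op args]}]
                  (case op
                    :sum [{:id (keyword (str "d" (name (first args))))
                           :op :broadcast
                           :args [(keyword (str "d" (name id)))]}]
                    :mul (let [[a b] args]
                           [{:id (keyword (str "d" (name a)))
                             :op :mul, :args [(keyword (str "d" (name id))) b]}
                            {:id (keyword (str "d" (name b)))
                             :op :mul, :args [(keyword (str "d" (name id))) a]}])))
                (reverse graph))))

(grad-graph forward-graph)
;; => forward nodes plus :dy, :dw, :dx nodes - still pure data
```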

So I think *part* of clj-autograd is the perfect direction.  But you aren't going to get competitive performance by binding it to neanderthal; most likely only binding it to nnvm, cortex tensors, or tensorflow is going to get you there.  What about separating the actual auto-diff algorithms from the execution context that is responsible for allocating the data and scheduling execution (potentially across machines or GPUs)?

I would love to help you with this but to get there I need to make sure that some of the harder lessons learned in cortex aren't repeated.  


So, to recap:

1.  Agreed 100% on auto-diff.  People here (at TT) are sick of me preaching about it.
2.  Disagree on choice of backends and the implicit binding of the actual auto-diff algorithm to the execution context.  If you want great openCL support, use nnvm, not neanderthal.
3.  Please take a look at the cortex tensors; we had a reason for building them and it wasn't just random decisions.  We have multiple gpu support, scheduling, great performance competitive with any NN system out there for the use cases we support.  That being said we do not have openCL support but I think point 2 covers that.
4.  See 2 again.  My opinion is the most important thing is separation of the auto-diff algorithm (which as I said should be a pure, ds->ds translation) from any binding to any execution context.  Note that this is precisely the model that cortex *does* in fact follow in that we have a graph that is mapped into an execution context by the traversal system allowing in part the optimization I noted above.

ml...@topiq.es

Oct 15, 2017, 9:05:34 PM
to clojure-cortex
Hi Chris.

Thanks for the detailed write-up. I think it would help to pin down the requirements of a scientific optimization toolbox and what kinds of problems we are trying to solve. I think the problem we are tackling, numeric optimization, is actually fairly complex, so writing down the requirements and the groups of problems might help. I couldn't find the actual requirements in the cortex documentation (design.md), so maybe we can extend it along the way.

I will try to go from generic (mathematical) to concrete (Clojure primitives). I will also try to address your points along the way.

Requirements

A. Numeric optimization primitives

While the topic of numeric optimization is fairly large, we can be somewhat concrete by looking at what people do programmatically in the field. In general we could say that Julia is a very good proxy for the kinds of programs that people in optimization are writing. If we try to break optimization apart, we want to have optimization of discrete and of continuous functions. The latter is relevant for us here and usually involves a form of approximate gradient descent along the linear approximation of the objective function (tangent spaces). This immediately means we need strong numerical routines for linear algebra to do the computations in the tangent spaces. That is not sufficient, though. Often we can derive analytical expressions and cut corners in implementing algorithms, so we also need to be able to write custom numerical routines. As a sidenote: ideally we also want good numeric integrators, as a lot of physical/chemical/financial problems involve integration of differential equations.
All these primitives need to be as fast as possible and numerically stable, as they will be the building blocks of our toolbox. We do not have a free choice here, as the hardware (vendors) determine the abstractions they provide to us. In general this is BLAS, cuDNN and OpenCL or Cuda. neanderthal tries to provide bindings to these interfaces for us, so we can call them from Clojure. cortex does it as well, but in a less exposed way. neanderthal does not provide any convenience API (like core.matrix) and it does not hide performance-critical knobs, most importantly memory layout and copying. nnvm completely outsources these problems and is for me on a completely different, much more declarative level (see C. below). Relying on a declarative DSL only is not how Clojure programmers usually address performance issues. Instead of going completely outside the JVM, it is preferred to expose low-level primitives and achieve Java-like (or whatever the problem requires) performance by gradually opting out of the declarative or functional programming style inside the execution context of Clojure. neanderthal's focus on making the low-level primitives compatible with OpenCL, and on allowing custom kernels to be loaded from Clojure inside its control flow (dynamism), is crucial to make this happen in my opinion. Otherwise you just have some special-purpose AOT-compiled pipeline that will not be good for researchers, Clojure hackers or optimization people in general.
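To make the "building blocks" point concrete, here is a minimal sketch of an optimization loop written directly against neanderthal's CPU API; the toy problem and constants are made up, and while dv, dge, mv, dot and axpy/axpy! are neanderthal's public functions, the sketch should be checked against its docs:

```clojure
;; Gradient descent on f(x) = 1/2 x'Ax - b'x, whose gradient is Ax - b.
(require '[uncomplicate.neanderthal.core :refer [mv dot axpy axpy!]]
         '[uncomplicate.neanderthal.native :refer [dv dge]])

(let [a  (dge 2 2 [2 0 0 2])            ; column-major 2x2 SPD matrix
      b  (dv [2 4])
      lr 0.1]
  (loop [x (dv [0 0]) i 0]
    (let [grad (axpy -1.0 b (mv a x))]  ; grad = Ax - b (pure axpy)
      (if (or (= i 1000) (< (dot grad grad) 1e-12))
        x
        ;; destructive axpy!: x <- x - lr * grad
        (recur (axpy! (- lr) grad x) (inc i))))))
;; => converges to the solution of Ax = b, here [1.0 2.0]
```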

B. Composing functions - Autograd

Now that we have basic primitives that we can use, we would like to define continuous functions to optimize. Forming derivatives is fairly mechanical and often not a lot can be gained by doing it manually (except for bugs ;) ), so we would like to have an autograd mechanism. We do not care how exactly it works, but it should make efficient use of A. and it should not get in our way, because we are focusing on the function semantics. pytorch very much gets that right and it is still able to train resnets and densenets efficiently (my colleague is doing it all day), so I don't think we should dismiss having this intermediary level as inefficient. pytorch also directly computes all values in the control flow of the host language (Python), so values are not calculated lazily and can be used to determine control flow of the host language! This is important for all kinds of recursive architectures, e.g. recurrent/recursive/graph neural networks, gradients over decision trees or differentiable MCMC samplers.  clj-autograd is currently written along these lines, although it is still naive and does not make intelligent re-use of memory or smart broadcasting. In comparison to pytorch, computation does not have to be eager and can be lazy, as long as it is always possible to yield the value of the forward pass, e.g. through calling a protocol on the compute graph up to the current node. The graph can be (and almost is for clj-autograd, except for the backward-fn) completely data, and can leave translation into the lower-level primitives of A up to the execution context as long as it is not invoked.
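A minimal sketch of this lazy-but-forceable idea (scalar values only; hypothetical names, not clj-autograd's actual API): nodes are plain data plus a protocol that yields the forward value on demand, so host-language control flow can branch on intermediate results.

```clojure
(defprotocol IForward
  (forward [node] "Compute (or look up) the value of this node."))

(defrecord Leaf [value]
  IForward
  (forward [_] value))

(defrecord Node [op inputs]
  IForward
  ;; Force inputs recursively; nothing is computed until asked for.
  (forward [_] (apply (case op :add + :mul *) (map forward inputs))))

(defn add [a b] (->Node :add [a b]))
(defn mul [a b] (->Node :mul [a b]))

;; Host-language control flow driven by a forced intermediate value,
;; as in pytorch's eager model:
(let [x (->Leaf 3.0)
      y (mul x x)]
  (if (> (forward y) 5.0)
    (add y (->Leaf 1.0))
    y))
```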
As a sidenote: Julia probably also has many examples of higher-level abstractions that still make efficient use of the underlying hardware. For the following, autograd is central though. Note again that autograd, like in pytorch, should allow opting out and implementing derivatives more efficiently with the help of A., e.g. for units in ResNets where you can reuse buffers, which you do as far as I understand. It would make sense to have a forum where these things are discussed with knowledgeable people from the field. It can still be declarative in doing so (read: it describes the operations in the form of Clojure data).

C. A high-level declarative graph API

This is how I perceive cortex atm., although it has its tensor pipeline internally. The goal of the high-level API, like the ones for pytorch, Caffe, keras etc. (inputs to nnvm), is to describe only what the model (usually a deep neural network) looks like, as minimally as possible, and leave the rest to any implementation. It should be easy to use for people who do not care about optimization or control over gradient descent (the execution context), but are educated enough to do some model selection, e.g. to select a standard pre-trained model for computer vision or to specify the number of layers. This API is perfect for any automatic optimizer. In fact what we are doing is writing a kind of compiler down from this declarative API towards the execution context, which you are modelling atm., and the primitives of our environment. If we can AOT compile our optimizer (because we know the compute graph ahead of time), then we can transparently step out of our runtime completely and emit e.g. native code and compile it, like theano or tensorflow do. This would also allow targeting nnvm or any platform that targets it, like Caffe or mxnet themselves. But if we do this by default, we lose all the flexibility needed for experienced optimization people to build their own optimizers.
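For illustration, a minimal sketch of such a declarative, data-only model description (hypothetical keys, merely in the spirit of cortex's layer DSL):

```clojure
;; The whole network is a plain value that a compiler can walk and
;; lower to any backend (cortex tensors, nnvm, tensorflow, ...).
(def model
  [{:type :input   :size 784}
   {:type :dense   :units 128}
   {:type :relu}
   {:type :dense   :units 10}
   {:type :softmax}])

;; Being data, model selection is just data manipulation:
(defn widen [layers factor]
  (mapv #(if (= :dense (:type %))
           (update % :units * factor)
           %)
        layers))
```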

This last point is what is important for me right now. I understand that your main focus is performance and ease of use, but I think we could lay out how to build a general pipeline/toolbox, especially since Clojure allows describing interfaces as plain data. I think cortex tensors fit in there, but I am not sure whether they belong to A, B and/or C. On a quick dive through the codebase it looked like the high-level neural network semantics (like activation functions) leak all the way through to the dispatch in the execution context. For me neanderthal or your low-level JavaCPP bindings could be a potential building block for your execution pipeline, but I think the buffer management is independent of the wrappers around the native libs, and one should be very thankful for every consistent effort to bring high-perf low-level bindings to Clojure. Or do you think Clojure should give up the control flow of the operations completely, in AOT fashion? For me pytorch shows that this is not necessary to achieve good performance, and researchers hate that approach.

Best,
Christian

Ben Kamphaus

Oct 15, 2017, 9:46:45 PM
to ml...@topiq.es, clojure-cortex
Christian,

This is a great response and there's a lot of good info to mine here, so I'm sure I'll be re-reading it. I have two main follow-on questions, but first some context. I'm not associated with ThinkTopic or Cortex any more but I am very interested in solving this problem. I think you're correct that there's not a lot of design-first work in Cortex and that the requirements it's trying to address are not clearly enumerated at present. That aside, I'd rather not dive into the minutiae there but peel apart one of the assumptions underlying your post.

It seems like you're presenting an either/or framework design which assumes the approach will be derivative of other frameworks, and I lean strongly in the TensorFlow rather than the Torch/PyTorch direction on the either/or. That is, I'm of a similar mind to Kovas and his post here on TensorFlow's design tradeoffs: https://hackernoon.com/ml-framework-design-reliability-composition-flexibility-9314f72d2c73 -- I know that researchers are not big fans of TF's tradeoffs, but researchers in general are not trying to solve the same problems that software engineers are when they wire machine learning into production systems, or stick it in embedded devices, etc. and Clojure is to me a language heavily tilted that way as well - e.g., shipping working software > algebraic data types.

So question (1) is, re: "if we do this by default, we lose all the flexibility needed for experienced optimization people to build their own optimizers." -- is there some better middle ground? Is it possible we could be better at shipping models than PyTorch and better at allowing custom optimization, etc. than TensorFlow, with Cortex or whatever other deep learning solution turns out to be the best for Clojure?

And question (2) is, what exactly is the appeal for you in doing this in Clojure? Just the language design? Some lispy aspect of generating code, etc.? I'm about at the point personally where I don't think a small group of clever and conscientious hackers in Clojure is going to keep up with the momentum in the Python ecosystem and the 100x+ development effort there (or the expanded ecosystem which takes in the various C/C++, Lua, maybe Julia work as well), and if Cortex or w/e is just supposed to be like TensorFlow or like Torch, or even TensorFlow bundled with Keras I'd much rather put that effort personally (and make the case for) getting a viable Clojure on CPython, like we have for JavaScript today, and which gives front end Clojure devs reach to React, Electron, etc. and would give ML devs access to everything.

The ideal case for Clojure (at least the present JVM instance of that language) would be that you had a clear answer to (2) that you could hoist into a better solution for (1) than just following TF or PyT/Torch. The case you make for how to approach the problem, i.e. discarding the Clojure abstractions that get in the way of perf, does not really sound like a compelling case to go to all the trouble to build the framework in Clojure in the first place. If it's just going to boil down to writing Clojure like it's C/C++, well I can get things done in both TensorFlow, PyTorch, and hell throw Caffe2 and MXNet in the mix and solve interesting problems rather than dig down the YAF rabbit hole.

Best,
-B


Chris Nuernberger

Oct 16, 2017, 1:35:19 PM
to clojure-cortex
Christian, Ben - 

This is a good discussion so far so let's keep it going:

1.  First, Christian, neanderthal is written basically as a datatype-specific core.matrix backend.  It does not allow me to, for instance, create a generic buffer of data and put a banded matrix on it at one point and a dense matrix on it at another point.  It is also written with some significant java bindings, similar to how the existing core.matrix backends are implemented.  With the tensors I can create a floating point buffer and associate a description with it in an ad-hoc manner, which is a far more clojure-esque way of doing computation than creating a strided matrix java class and then implementing a set of interfaces.  Furthermore, with the cortex tensors it is easy to schedule computation in various execution environments.  So neanderthal gives me as the engine writer far fewer options to do clever things.  Please note how the cortex tensors are built: there is a distinction between the heavy data buffers, which have no definition at all (aside from device, datatype, and size), and a description mechanism that tells the system how to interpret those buffers.  Here is the complete implementation of transpose in cortex:


This re-works the strides and the shape and produces a new tensor which shares the heavy buffer with the original but has a different description.  Note that I can take the same buffer and assoc different unrelated descriptions to it, and that is just a clojure assoc operation; it isn't creating a subclass of this or that engine implementation.  Note also that it is a completely datatype-independent operation, meaning I can operate on floats, or bytes, or shorts and this doesn't change.
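Since the link above was lost in the archive, here is a sketch of the idea being described, not the actual cortex code: a tensor as a heavy buffer plus a plain-map description, with transpose only permuting shape and strides.

```clojure
(defn transpose
  "Swap the last two dimensions by permuting shape and strides; the
  backing buffer is untouched and shared with the input tensor."
  [{:keys [shape strides] :as tensor}]
  (let [perm (fn [v] (into (vec (drop-last 2 v))
                           [(peek v) (peek (pop v))]))]
    (assoc tensor :shape (perm shape) :strides (perm strides))))

(transpose {:buffer :heavy-buf, :shape [4 3], :strides [3 1]})
;; => {:buffer :heavy-buf, :shape [3 4], :strides [1 3]}
```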

Here is the definition of transpose in neanderthal:


This delegates to a backend-specific implementation which means you have to go engine by engine to get to what is happening because the types of the underlying system actually have to change; it is a far less clojure-esque way of doing things.  This is less power in the user's hands, not more.  

As far as the dynamic kernel loading goes, both systems load kernels dynamically as you have to for cuda, openCL.  I don't understand where you are going here and I stated that the cortex tensors are in fact designed to be compatible with openCL.  Whether I personally have the time to implement this or not is another question.

Basically, we have designed the tensors for cortex to enable everything that we need, and nothing else in the clojure ecosystem has those features.  They implement everything required for what our NN engine supports in far *far* fewer lines of code (and interfaces and layers) than any other implementation, and besides that they can operate on bytes, shorts, integers, etc.   So let's at this point pause the conversation on using neanderthal or cortex tensors, as I do not believe it is actually that important a discussion at this point, nor do I see it resolving.

The benefit of investing in nnvm is that you have offloaded a large amount of very difficult work, fundamentally.  You also have a simpler way to run the inference models of other systems in clojure, and it opens the door to running models trained in clojure in contexts that potentially do not support the JVM in the first place.  NNVM is still lower level than tensorflow, so there is that aspect also.  It is basically a low-level common ground where we can efficiently reuse work done in other ecosystems.  Having written many importers and imported many models into clojure for cortex, I find this appealing, as I do the possibility of running complex models in embedded environments from a language other than clojure.



The entire above discussion, I feel, is actually independent of your points A,B,C.

A - Choose a good set of primitives.  These seem to me to be implementation independent; nothing to do with neanderthal, nnvm, TF, or cortex save for the fact that I would probably carefully consider the TF or caffe2 primitives as a reference set.
B - Build a transformation system that transforms a graph of primitives from A -> a graph of primitives A', potentially encoding runtime information (which branch of a control flow statement, or something like that) discovered during A (see the sketch after this list).
C - A high level framework that allows people to be ignorant of A,B for most use cases.
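A minimal sketch of point B with a hypothetical node format: a graph of primitives as data, and a pure rewrite that folds a runtime control-flow decision into a derived graph A'.

```clojure
(def graph-a
  [{:id :x,    :op :input}
   {:id :cond, :op :greater, :args [:x 0.0]}
   {:id :y,    :op :if, :args [:cond :pos-branch :neg-branch]}])

(defn resolve-branches
  "Rewrite :if nodes using branch decisions observed at runtime,
  producing a straight-line graph suitable for AOT compilation."
  [graph decisions]
  (mapv (fn [{:keys [id op args] :as node}]
          (if (and (= op :if) (contains? decisions id))
            {:id id :op :alias :args [(if (decisions id)
                                        (second args)
                                        (nth args 2))]}
            node))
        graph))

(resolve-branches graph-a {:y true})
;; => the :if node becomes an alias of :pos-branch
```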


I think what you are arguing is that you would like to be able to add to the set of primitives in an ad-hoc fashion.  You are concerned that relying on the primitives exposed by nnvm will artificially limit your ability to add to A,B in an ad-hoc manner, and I can certainly sympathize with this.


I am not as concerned with this as I am with the more practical abilities to train/infer efficiently and reuse existing models along with export new models for other systems.


Is this an accurate description of where the conversation lies, excluding the arguments around which exact implementation to use to write a backend for the primitives in A?


ml...@topiq.es

Oct 18, 2017, 5:58:30 PM
to clojure-cortex
I will try to address your assumptions first, before making further proposals and suggestions. I still think that, no matter what we come up with in the end, this discussion should be documented in structured form. We have also had some discussion here: https://gitter.im/metasoarous/clojure-datascience

Now to your points:


On Monday, October 16, 2017 at 3:46:45 AM UTC+2, Ben Kamphaus wrote:
Christian,

This is a great response and there's a lot of good info to mine here, so I'm sure I'll be re-reading it. I have two main follow-on questions, but first some context. I'm not associated with ThinkTopic or Cortex any more but I am very interested in solving this problem. I think you're correct that there's not a lot of design-first work in Cortex and that the requirements it's trying to address are not clearly enumerated at present. That aside, I'd rather not dive into the minutiae there but peel apart one of the assumptions underlying your post.

It seems like you're presenting an either/or framework design which assumes the approach will be derivative of other frameworks, and I lean strongly in the TensorFlow rather than the Torch/PyTorch direction on the either/or. That is, I'm of a similar mind to Kovas and his post here on TensorFlow's design tradeoffs: https://hackernoon.com/ml-framework-design-reliability-composition-flexibility-9314f72d2c73 -- I know that researchers are not big fans of TF's tradeoffs, but researchers in general are not trying to solve the same problems that software engineers are when they wire machine learning into production systems, or stick it in embedded devices, etc. and Clojure is to me a language heavily tilted that way as well - e.g., shipping working software > algebraic data types.

Actually I have not tried to propose a technology buy-in, but I see that my arguments for neanderthal look that way. Rather, I tried to make the point that just bolting something on top of a current state-of-the-art framework does not yield composable abstractions that feel very Clojury or are very research friendly. I wanted to point out that neanderthal actually provides the computing primitives that *all* current frameworks use, directly in Clojure. It is a big misconception that tensorflow, mxnet or any other framework provides you with much more than wrapped function calls to BLAS, cuDNN, Cuda or OpenCL. In fact most frameworks only build on the low-level nvidia APIs. tensorflow cannot even do gradient calculations in the JVM version, because they have somehow hacked it half in Python and half in C++. I think that we are not just appealing to some high-level users (which is important), but that Clojure developers value access to low-level primitives and knobs to tune their language to the problem and gain reasonable performance themselves. pytorch for me is the proof that something like this can be done with competitive performance, even in a very slow language like Python.

With all due respect to Kovas Boguta, who seems to be knowledgeable about Clojure, his post mostly describes deployment problems of Python, which notoriously sucks at deployment. If you bet on any framework today to have it deployable in 5 years' time, I have to disappoint you. All of them have made breaking changes over the last 2 years and the battle is far from settled. It is also very complected, in my opinion, to try to reach portability between versions by hard-wiring the high-level API of a framework into your language. Take a full Clojure stack on top of neanderthal as a comparison. The only thing that needs to stay stable is the native libraries that vendors provide. Those are actually fairly stable and neanderthal only wraps them. They have barely changed in the last 20 years (for BLAS), and cuDNN will also probably stay the way it is and only be extended. Cuda and OpenCL are standardized and usually you have some backwards-compatible mode to load your code. So I find his arguments for tensorflow not much better than the fanboyism that he criticizes. A jar file with Clojure + the native libs with neanderthal will very likely run in a JVM in 5-10 years' time just fine, or with fairly minimal adjustments, because Clojure is a language that does not break by design. This is a big deal for researchers in the longer run.



So question (1) is, re: "if we do this by default, we lose all the flexibility needed for experienced optimization people to build their own optimizers." -- is there some better middle ground? Is it possible we could be better at shipping models than PyTorch and better at allowing custom optimization, etc. than TensorFlow, with Cortex or whatever other deep learning solution turns out to be the best for Clojure?

I think, and my autograd already does this (I have no problem with describing everything lazily and the backward function as data), we should have a data description of the compute graph. But it should be composable with normal functions and be executable just-in-time whenever you need values during the creation and execution of a graph. Read: it should not be AOT compiled by default, but this compilation should be transparent and always available in the control flow of Clojure. The problem with external frameworks is that they are mostly not good at this, and I think they overvalue compilation, because atm. everybody tries to establish their tool as the future standard (a typical fight in the stages of a network economy).
neanderthal + autograd makes it fairly easy to build reasonable low-level primitives into a graph that is just-in-time executable from Clojure. Nothing speaks against writing an emitter (compiler) from this graph description to external tools like nnvm or tensorflow, and even adding passes to improve numerical stability or apply optimizations on this "AST".
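A minimal sketch of such a pass (hypothetical node format): a pure rewrite over the graph "AST" that fuses log(sum(exp(x))) into a single fused node for numerical stability.

```clojure
(defn stabilize
  "Recursively rewrite [:log [:sum [:exp x]]] into [:log-sum-exp x]."
  [node]
  (if-not (vector? node)
    node
    (let [node' (mapv stabilize node)]
      (if (and (= :log (first node'))
               (vector? (second node'))
               (= :sum (first (second node')))
               (vector? (second (second node')))
               (= :exp (first (second (second node')))))
        [:log-sum-exp (second (second (second node')))]
        node'))))

(stabilize [:log [:sum [:exp :x]]])       ;; => [:log-sum-exp :x]
(stabilize [:mul :a [:log [:sum [:exp :x]]]])
;; => [:mul :a [:log-sum-exp :x]]
```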
 

And question (2) is, what exactly is the appeal for you in doing this in Clojure? Just the language design? Some lispy aspect of generating code, etc.? I'm about at the point personally where I don't think a small group of clever and conscientious hackers in Clojure is going to keep up with the momentum in the Python ecosystem and the 100x+ development effort there (or the expanded ecosystem which takes in the various C/C++, Lua, maybe Julia work as well), and if Cortex or w/e is just supposed to be like TensorFlow or like Torch, or even TensorFlow bundled with Keras I'd much rather put that effort personally (and make the case for) getting a viable Clojure on CPython, like we have for JavaScript today, and which gives front end Clojure devs reach to React, Electron, etc. and would give ML devs access to everything.

Well, I think that way, too, and turned my back on Clojure for optimization tasks and scientific computing during my master's studies over the last two years. On the other hand these languages are mostly doomed, as is pointed out by Julia's design objectives. The ideas you have have been tried before. CPython is way too slow to host Clojure; that is why Timothy Baldridge abandoned clojure-py and wrote Pixie three years ago. He really wanted to make it work. I used Hy in my Bachelor thesis and it is neat, but it is just no match for Clojure. Imperative Lisps are not nearly the same as a functional one like Clojure (you get the decoupling which makes exploratory development feasible in the large). The JVM is a very serious advantage also.
I have also done R and a bit of Julia. Of all contenders Julia is probably the strongest for the future. CPython is heavily bound to its legacy as a scripting language for C libraries (pypy never really happened for this reason), as is R. On the other hand I have tried to help a bit on JyNI (1) to try to bring it to the JVM. This effort would probably be the most pragmatic right now, but not really fun to code. I think this integration is the way forward, and renjin (2) also gives me some hope in this direction, to not fight an uphill battle against the large momentum and value these languages have in scientific computing. But I think they will still not be as approachable for large data processing as Clojure is.


So having a Clojury toolbox that can be used with direct native access for optimal performance will make it possible to beat these languages. For Julia it is a bit of a different story, but it is also still early. Maybe it can be a host for a Clojure dialect, or maybe (the better option) it could be made callable from the JVM without too much overhead.
 

The ideal case for Clojure (at least the present JVM instance of that language) would be that you had a clear answer to (2) that you could hoist into a better solution for (1) than just following TF or PyT/Torch. The case you make for how to approach the problem, i.e. discarding the Clojure abstractions that get in the way of perf, does not really sound like a compelling case to go to all the trouble to build the framework in Clojure in the first place. If it's just going to boil down to writing Clojure like it's C/C++, well I can get things done in both TensorFlow, PyTorch, and hell throw Caffe2 and MXNet in the mix and solve interesting problems rather than dig down the YAF rabbit hole.

Honestly I have no interest in YAF, I am doing machine learning research. But Python as a language is really lacking when you, for example, want to start doing things distributed or do interactive programming (imperative OO does not really work with that). pytorch on the other hand is fairly lean, and its concept can be reasonably well ported to Clojure with the help of neanderthal (hopefully optional, as one of several backends). I think performance in this area is not determined so much by specializing loops with type hints, but by doing linear algebra in a vectorized fashion. This even speeds things up in Julia, and they have not yet beaten the old BLAS libraries and native building blocks. In fact you can view them as the optimal primitives, because this is what the actual hardware vendors provide for *everybody*, including tensorflow or MXNet.

I hope this clarifies things a bit, feel free to ask questions though :)

Best,
Christian

Ben Kamphaus

Oct 18, 2017, 8:07:58 PM
to clojure-cortex
Hi Christian,

Today I can train a model in TensorFlow from Python, export the graph, optimize it through multiple steps (freezing to constants, removing gradients, in-place substitutions, emitting XLA code), etc., and export it in a format I can load in a microservice, monolith app, phone, w/e for inference. It supports all the expression primitives I need to implement, well, any paper I've had an interest in since it came out, and which I would've run (with success) in Theano before. Which has admittedly not been anything with heavily customized optimizers, but does include all kinds of wacky loss functions and data flow models through the network. So it's a bit silly to theorize about how it shouldn't have properties it clearly has, like being cross platform or deployable, etc. Moreover I can do tons of stuff for real (FCNNs, non-trivial RNN/CNN hybrids, loss functions arbitrarily assigned to layers, etc.) which are mostly not realized in the Clojure ecosystem (a few working things in Cortex, hard-won by Chris N. and others at TT) - nor are there decent documented examples in DL4J, etc. in the Java ecosystem. Many of these things are already commodity-level drop-in training with TensorFlow, e.g.: https://github.com/tensorflow/models/tree/master/research/object_detection

The Clojure data science/ML ecosystem has tolerated a lot of untested discussion about how we can just hit the same fast perf targets, or Neanderthal will wrap this and core.matrix will wrap that and it will solve all our problems. We're several years in without much traction in that direction. These, for example, seem to be the CUDNN bindings in Neanderthal you're talking about: https://github.com/uncomplicate/clojurednn -- I totally agree that Clojure is built for stability and its deps, etc. are unlikely to break in comparison with Python's, that containerization is putting a band-aid on the problem, etc. but again you're hand-waving re: native lib deps. If it's CUDNN you want, this is generally one of the most painful points of setup, whether you're on the JVM or in Python, and there are tons of little specific bells and whistles you want to go along with the native libs, specific to your platform, architecture, w/e if you want to push it for all you can. Which is, again, as Chris N. mentions, something you have to take seriously if you want to train in hours instead of days or weeks.

I totally agree that all the TF stuff is just a local optimum, a snapshot of today! And I'm not trying to come across as overly critical here, I love Clojure and would prefer to work in it over Python, even when I'm "just training" a model. But no one will get to that point if they don't take the engineering problems in this space seriously. Assuming, for example, the engineers who built TensorFlow at Google are just clueless simpletons who have never felt the joy of functional programming and built a silly system that doesn't solve real problems is a recipe for disaster. It may be overcomplicated, it may be a local optimum, but there is a hard problem behind it and a reason they landed where they did, and if you can't enumerate the pieces (without hand-waving or glossing over some of the hardest problems, like pushing performance to the max and not assuming you can let it slack 10% here, and 5% here, and 22% over here, and still train a net), and just shoot from the hip when proposing alternatives, you're unlikely to do any better yourself.

Re: interactivity in Python, it's amusing because I agree with you! But no one in the python ecosystem would! The ad hoc command line-ish ness of Jupyter notebooks and IPython at the terminal seemed OK to me before I started using a language with a real REPL, wired into my editor or IDE or w/e for eval/re-eval versus dev, etc. I don't think the average Python developer would think there's much difference between that and using %%autoreload and would probably yell back that you can't have real interactivity in Clojure because the lack of objects means you can't type pd.DataFrame(x). ... and then have it complete all the possible methods for you :) It's true that the perf comes from native BLAS libraries, etc. which is why the Python ecosystem leverages them heavily and more seamlessly than in the JVM experience, (really, no one's been bitten by native bytebuffers or off-heap memory before). And besides, spark, hadoop, etc. give you Python APIs, sklearn will release the GIL, and distributed deep learning and parameter averaging have not proved that interesting in general, etc. etc.

Again: I am not on team Python! It's not ahead because it's objectively good -- it is a mediocre language that I find more and more painful to use the better I get as a generalist dev and Clojurist. But programming language adoption and ecosystem health has a lot more to do with social dynamics, marketing, least cost paths taken by herds of humans, etc. than the direction taken by a few rational optimizers. JavaScript is a case in point, lots of better languages have failed to replace it, it itself is morphing into something different, and people have come up with ways to transpile away its pain or inject themselves into its ecosystem.

Re: Tim Baldridge's previous efforts, yes Python's own mechanics are not great if you're targeting it as a general purpose language, but he was primarily interested in possible performance gains from a tracing JIT, provided by PyPy, and this was always highlighted in the readme, rationale, etc. Being able to take the Clojure-in-Clojure stuff from Cljs as a starting point for persistent data structures, etc. too was not an option during those early efforts. Getting access to CPython for numerics, ML, etc. is a problem with a different shape, and to me not that different from the specific scope of Cljs. That said, I'm just justifying my interest here, less concerned with winning anyone over, until I have something more to show/talk about here.

Also, all of that said, if your goal is to make things work in Clojure, I think getting some consensus between where you are with autodiff and the work (more battle-tested w/ real perf problems than most stuff in the Clojure ecosystem) that Chris N. has put into the tensor/compute stuff in Cortex would be the way to go.

Best,
-B

ml...@topiq.es

Oct 19, 2017, 4:51:09 AM
to clojure-cortex
Hi Chris,

I have some remarks, but first of all I want to emphasize that I think we do not necessarily have a conflict here, so we don't have to argue about too many details, as long as we do not build something non-performant or inflexible in the end. I will try to sketch a proposal at the end.


On Monday, October 16, 2017 at 7:35:19 PM UTC+2, Chris Nuernberger wrote:
Christian, Ben - 

This is a good discussion so far so let's keep it going:

1.  First, Christian, neanderthal is written basically as a datatype-specific core.matrix backend.  It does not allow me to, for instance, create a generic buffer of data and put a banded matrix on it at one point and a dense matrix on it at another point.  It is also written with some significant java bindings, similar to how the existing core.matrix backends are implemented.  With the tensors I can create a floating point buffer and associate a description with it in an ad-hoc manner, which is a far more clojure-esque way of doing computation than creating a strided matrix java class and then implementing a set of interfaces.  Furthermore, with the cortex tensors it is easy to schedule computation in various execution environments.  So neanderthal gives me as the engine writer far fewer options to do clever things.  Please note how the cortex tensors are built: there is a distinction between the heavy data buffers, which have no definition at all (aside from device, datatype, and size), and a description mechanism that tells the system how to interpret those buffers.  Here is the complete implementation of transpose in cortex:


I understand your point about buffer management, but one should be aware that in regard to the actual computing primitives everybody has to use (Cuda, OpenCL, BLAS, cuDNN), you will call the low-level APIs that neanderthal is using anyway, no matter what you do. There is no generic buffer management for BLAS, except for the fact that you can put data in a dense matrix and then operate on it and create some views through these low-level APIs. The specific matrix types have an internal memory layout that is not just an nd-array, but BLAS provides some ways to do this access efficiently itself. I am not sure whether I am getting your banded matrix argument, but I think you mean this kind of generalization. I see that this makes sense in this particular case, but maybe you can clarify a bit more how you do that without support from the low-level backends. neanderthal can also do constant-time transpose (even my core.matrix backend can do it, for this reason) or give you the mappings that BLAS itself provides. neanderthal is very different from core.matrix in that it just exposes low-level APIs to Clojure with a minimum of convenience at the interface (nothing which could hurt performance), and Dragan is very keen on this. In that sense it also does not introduce additional abstractions that would have to change in the future, and these low-level APIs are very standardized and stable. It exposes the concepts for low-level buffer management very openly, i.e. you are supposed to explicitly copy and transfer data whenever you want to. core.matrix has its own generalizing assumptions, more of a numpy/matlab-like high-level API, implements linear algebra routines itself, and supports things like reshaping and fallbacks that can hurt performance. I think core.matrix should still be part of the story though, because it is a general polymorphic interface and established.
 


This re-works the strides and the shape and produces a new tensor which shares the heavy buffer with the original but has a different description.  Note that I can take the same buffer and assoc different unrelated descriptions to it, and that is just a clojure assoc operation; it isn't creating a subclass of this or that engine implementation.  Note also that it is a completely datatype-independent operation, meaning I can operate on floats, or bytes, or shorts and this doesn't change.

Here is the definition of transpose in neanderthal:


This delegates to a backend-specific implementation which means you have to go engine by engine to get to what is happening because the types of the underlying system actually have to change; it is a far less clojure-esque way of doing things.  This is less power in the user's hands, not more.  

I am not sure that I get your argument: you cannot do higher-level buffer management without support from the low-level API to understand your mapped buffers in this way. So in a sense you have to be backend-specific, and neanderthal already abstracts this away in a polymorphic fashion, since all the APIs it is wrapping have the properly performing low-level operations. If I have a buffer that I want to use as a dense matrix and then want to use as a special block-diagonal matrix, I cannot just map it to low-level memory in a "general" case, because there is no general low-level memory layout. These backends gain their performance by laying things out in memory themselves.
 

As far as the dynamic kernel loading goes, both systems load kernels dynamically as you have to for cuda, openCL.  I don't understand where you are going here and I stated that the cortex tensors are in fact designed to be compatible with openCL.  Whether I personally have the time to implement this or not is another question.

Sure. We agree that we want to expose a way to load kernels for bare-metal operations, and neanderthal, compared to core.matrix, stresses the need for this and exposes the low-level APIs nicely, so you can just write a bit of Clojure code to integrate the kernel, plus the kernel itself. See here: http://clojurecl.uncomplicate.org/ https://github.com/uncomplicate/clojurecl/blob/master/test/clojure/uncomplicate/clojurecl/examples/openclinaction/ch05.clj
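For readers who don't follow the links, a compressed sketch in the style of those examples; I am writing the ClojureCL calls from memory, so treat the names and signatures as approximate and check the docs:

```clojure
;; Sketch only - API names from memory of ClojureCL, double-check the docs.
(require '[uncomplicate.clojurecl.core :refer :all]
         '[uncomplicate.commons.core :refer [with-release]])

(def src "__kernel void mul2 (__global float* xs) {
            int i = get_global_id(0);
            xs[i] = 2.0f * xs[i];
          }")

(with-release [dev   (first (devices (first (platforms))))
               ctx   (context [dev])
               queue (command-queue ctx dev)
               prog  (build-program! (program-with-source ctx [src]))
               mul2  (kernel prog "mul2")
               buf   (cl-buffer ctx (* 8 Float/BYTES) :read-write)]
  (set-args! mul2 buf)
  ;; enqueue over 8 work items; results can be read back with enq-read!
  (enq-nd! queue mul2 (work-size [8])))
```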


Basically, we have designed the tensors for cortex to enable everything that we need, and nothing else in the clojure ecosystem has those features.  They implement everything required for what our NN engine supports in far *far* fewer lines of code (and interfaces and layers) than any other implementation, and besides that they can operate on bytes, shorts, integers, etc.   So let's at this point pause the conversation on using neanderthal or cortex tensors, as I do not believe it is actually that important a discussion at this point, nor do I see it resolving.

Ok, but honestly you could document this in a clear fashion somewhere. I am not totally sure that your design is really optimal, because these low-level details matter a lot, and optimization people deal with them a lot to get reasonable performance. Just abstracting with a general tensor library has not been done by any of the hardware vendors, and general-purpose tensor operations have different tradeoffs depending on how you calculate them. tensorflow for example is not necessarily very fast, although it has this tensor attitude upfront. That is why cuDNN is specialized for convolution and is not a generic tensor-library extension to BLAS.


The benefit of investing in nnvm is that you have offloaded a large amount of very difficult work, fundamentally.  You also have a simpler way to run the inference models of other systems in clojure, and it opens the door to running models trained in clojure in contexts that potentially do not support the JVM in the first place.  NNVM is still lower level than tensorflow, so there is that aspect also.  It is basically a low-level common ground where we can efficiently reuse work done in other ecosystems.  Having written many importers and imported many models into clojure for cortex, I find this appealing, as I do the possibility of running complex models in embedded environments from a language other than clojure.

Loading models is annoying, but more of a serialization issue, right? It should not be bolted into a tensor library. I think we should be able to target NNVM though, and I would like to discuss this further with you.
 



The entire above discussion, I feel, is actually independent of your points A,B,C.

A - Choose a good set of primitives.  These seem to me to be implementation independent; nothing to do with neanderthal, nnvm, TF, or cortex save for the fact that I would probably carefully consider the TF or caffe2 primitives as a reference set.
B - Build a transformation system that transforms a graph of primitives from A -> a graph of primitives A' potentially encoding runtime information (which branch of a control flow statement or something like that) discovered during A.
C - A high level framework that allows people to be ignorant of A,B for most use cases.


I think what you are arguing is that you would like to be able to add to the set of primitives in an ad-hoc fashion.  You are concerned that relying on the primitives exposed by nnvm will artificially limit your ability to add to A,B in an ad-hoc manner, and I can certainly sympathize with this.

It does, because it is not adjustable from the REPL and from the Clojure environment, but requires very heavy lifting.


I am not as concerned with this as I am with the more practical abilities to train/infer efficiently and reuse existing models along with export new models for other systems.

I understand.



Is this an accurate description of where the conversation lies, excluding the arguments around which exact implementation to use to write a backend for the primitives in A?



I will try to lay out the design sketch that I have in mind. It might miss some points, so feel free to bluntly criticize it :)

The requirements are:

1. state-of-the-art performance
2. mapping to managed machine learning infrastructure, e.g. nnvm or tensorflow
3. dynamic evaluation inside the control flow of Clojure if needed, but allow AOT if not (most FNNs, CNNs and some RNNs)
4. functional composition through autograd
5. flexible model loading of pretrained stuff

I would also say that we should view the optimization pipeline as a compiler operating on ASTs (graphs) rather than as a direct computation engine, but one that is available at runtime (like eval), so it can be transparent.

A. Declarative NN DSL (like cortex) -> B. autograd functional lazy graph DSL (like pytorch, but data only and lazy) -> either C. just-in-time execution of this graph in Clojure+neanderthal (calling forward and backward) or D. a graph that is AOT compilable to a backend like Clojure+neanderthal (AOT), nnvm, tensorflow

Comments:
- C. or D. depends on the model. For it to be AOT compilable the autograd function with the graph will be evaluated once and the "forward" and "backward" will be executed on the resulting graph repeatedly (or there will be external evaluation).
- I think the graph DSL of B should describe buffers (tensors) generically, like cortex does (but with generalized autograd), and the backends would then take this graph, transform it into something executable, and add the computed values to it in mutable cells like pytorch; or, if they are outside the JVM, they only provide a subset of the values, e.g. just the loss (see the sketch after this list).
- A pure Clojure backend with neanderthal (maybe even going through a core.matrix wrapper for it; I would try that first and then maybe separate the two) will be sufficient for researchers. They can take the data description of the graph and write transforms for it, or map some nodes more efficiently to a direct Cuda/OpenCL kernel in neanderthal. As long as the graph is built with a standardized autograd API that all backends understand and know the primitives of, this should be doable and avoid buy-in for anybody.
- Serialization is not dealt with here, but honestly I think this should be done in a separate library, that should understand the low-level backend formats and be able to directly stream them into another format.
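A minimal sketch (hypothetical format) of the B-level graph as pure data, with mutable cells that a JIT backend (C.) fills pytorch-style and an AOT backend (D.) ignores while emitting nnvm/tensorflow ops:

```clojure
(defn node [op & {:as opts}]
  ;; Each node carries op + shape/inputs, plus atoms for the values and
  ;; gradients an eager backend computes in place.
  (merge {:op op :value (atom nil) :grad (atom nil)} opts))

(def graph
  {:w      (node :param :shape [10 784])
   :x      (node :input :shape [784])
   :target (node :input :shape [10])
   :h      (node :mv :inputs [:w :x])
   :l      (node :softmax-cross-entropy :inputs [:h :target])})
```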

I hope we can figure this out together, as we can then focus on the parts of the pipeline that are important to each of us, while still working together on the big picture.

Best,
Christian

ml...@topiq.es

Oct 19, 2017, 7:56:10 AM
to clojure-cortex
Hi Ben,


On Thursday, October 19, 2017 at 2:07:58 AM UTC+2, Ben Kamphaus wrote:
Hi Christian,

Today I can train a model in TensorFlow from Python, export the graph, optimize it through multiple steps (freezing to constants, removing gradients, in-place substitutions, emitting XLA code), etc., and export it in a format I can load in a microservice, monolith app, phone, w/e for inference. It supports all the expression primitives I need to implement, well, any paper I've had an interest in since it came out, and which I would've run (with success) in Theano before. Which has admittedly not been anything with heavily customized optimizers, but does include all kinds of wacky loss functions and data flow models through the network. So it's a bit silly to theorize about how it shouldn't have properties it clearly has, like being cross platform or deployable, etc. Moreover I can do tons of stuff for real (FCNNs, non-trivial RNN/CNN hybrids, loss functions arbitrarily assigned to layers, etc.) which are mostly not realized in the Clojure ecosystem (a few working things in Cortex, hard-won by Chris N. and others at TT) - nor are there decent documented examples in DL4J, etc. in the Java ecosystem. Many of these things are already commodity-level drop-in training with TensorFlow, e.g.: https://github.com/tensorflow/models/tree/master/research/object_detection

Ok, to be clear: I worked with tensorflow before pytorch and I assumed that it would be a very solid building target. I wanted to stick with it half a year ago. I have built several models with it, one commercially in a Python microservice. I took exactly the arguments you make for granted and just used the high-level API.  But most researchers I talk to atm., from different groups, are considering moving to pytorch if they have not already, and I see that it is much better tailored and factorized for the actual problem of doing optimization (and not a ton of other stuff around it). My point is not about pytorch, but that tensorflow advertised these features as totally superior and as an "enterprisey" platform, while I think it is more of an effort by Google to establish a large framework to get people on their platform, and they are not winning there atm. anymore. Facebook also drives torch and pytorch in the large-scale machine-learning direction (scale-out), and while you can load things on phones with tensorflow, these usually map directly to the low-level APIs of the devices. GPUs on mobile are also very proprietary and mostly support OpenCL or some vendor libraries. There is work done by Google there for sure, but it is not them who actually provide the complicated primitives; they wrap them in a tensor library.

My rethinking came when I worked on https://github.com/kieranbrowne/clojure-tensorflow earlier this year: there simply is no gradient computation in tensorflow outside of Python, and it will not come to the JVM version anytime soon. This really changed my opinion of tensorflow; how serious can they be if they don't take the JVM or other runtimes seriously? I also know people who have studied how backprop is done in the codebase, and they did not find it particularly consistent or clean. tensorflow has a lot of polish as a product, but I think its design has too many of the typical framework problems, which make it clunky and in the end not even very efficient, because their requirements changed a lot during development. Can you show me reasonable benchmarks where it performs well that were not done by the tensorflow authors themselves? Most papers in this area just report the results that are best for their framework, so it is very difficult to judge general performance. Also, if you access the operations directly from Clojure on the JVM, I see no real source of overhead, to be frank. Most of it is matrix multiplications and cuda kernels applied to buffers.

So to summarize, I am not against tensorflow per se; my proposal allows having it as an emission target. But I think a) it is too early to bet on a framework (and Kovas's claim that it will be stable is Python-related and probably false), and b) the assumptions made about tensorflow are simply not true. I don't think its engineers are stupid or that it does not satisfy Google, but it is not as good for the general community as they portray it, and pytorch has revealed this. A much more modular and flexible approach can be built that still has the reach of tensorflow but would also allow moving between vendors (I am not talking about a Clojure framework here); in fact nnvm is about that already. And c) tensorflow has no guarantee of a stable API; its Python API has changed in part over the last two years, and I had to rewrite parts of my code. Clojure has a guarantee of a stable API, and the JVM and the underlying primitives promise the same.



The Clojure data science/ML ecosystem has tolerated a lot of untested discussion about how we can just hit the same fast perf targets, or Neanderthal will wrap this and core.matrix will wrap that and it will solve all our problems. We're several years in without much traction in that direction. These, for example, seem to be the CUDNN bindings in Neanderthal you're talking about: https://github.com/uncomplicate/clojurednn -- I totally agree that Clojure is built for stability and its deps, etc. are unlikely to break in comparison with Python's, that containerization is putting a band-aid on the problem, etc. but again you're hand-waving re: native lib deps. If it's CUDNN you want, this is generally one of the most painful points of setup, whether you're on the JVM or in Python, and there are tons of little specific bells and whistles you want to go along with the native libs, specific to your platform, architecture, w/e if you want to push it for all you can. Which is, again, as Chris N. mentions, something you have to take seriously if you want to train in hours instead of days or weeks.

But everybody uses cuDNN (Dragan hasn't wrapped it yet, but he has it on his agenda); there is no way around it. All you are saying is that tensorflow makes deployment easier for you, which is not true, because you need to manually download cuDNN and put it on the library path due to Nvidia licensing. I don't say you should use low-level primitives for modelling, but I do say that tensorflow (or framework *) provides you linear algebra primitives (BLAS), convolutions (cuDNN) and some access to low-level programming (cuda or opencl). All that they really provide beyond that is gradients and a bit of general-purpose tensor operations, which are not necessarily optimal from the view of the hardware vendor. Tensorflow isn't actually doing gradients in its core library, which I found really odd. Doing back-propagation is actually not that hard, I am sorry. That is not the reason why Clojure has not been successful with neural network attempts. Getting the low-level memory management right is much harder, but also doable, and that is why I am so thankful for neanderthal. Python does not generally have something like it as a building block (numpy is high-level); pytorch has such primitives included, though. These external frameworks are also in part not very fast, and once you bolt all your stuff on top of them, there is no way to tweak the memory management anymore with reasonable effort. Or do you really dive into tensorflow and start to hack its C++ memory management and primitives? I want to have this in Clojure, not because I like Clojure, but because I find these framework attempts lacking, and I have a rough idea of how to build an adaptable functional pipeline. I don't think Google engineers are stupid, but they are very biased towards C++ and their own way of managing infrastructure, not necessarily towards building mathematical toolboxes.
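
To back up the claim that backprop itself is not the hard part, here is a toy reverse-mode sketch over scalars (illustrative only; a real implementation would sort the graph topologically instead of recursing per path):

```clojure
;; Toy reverse-mode autodiff over scalars. A node holds a value, a
;; gradient accumulator, and its parents paired with the local
;; partial derivatives.
(defn variable [v] {:value v :grad (atom 0.0) :parents []})

(defn mul [a b]
  {:value   (* (:value a) (:value b)) :grad (atom 0.0)
   :parents [[a (:value b)] [b (:value a)]]})

(defn add [a b]
  {:value   (+ (:value a) (:value b)) :grad (atom 0.0)
   :parents [[a 1.0] [b 1.0]]})

(defn backward! [node seed]
  (swap! (:grad node) + seed)
  (doseq [[parent local] (:parents node)]
    (backward! parent (* seed local))))

;; d(x*y + x)/dx = y + 1 = 5.0 for x=3, y=4
(let [x (variable 3.0) y (variable 4.0)]
  (backward! (add (mul x y) x) 1.0)
  @(:grad x)) ;; => 5.0
```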


I totally agree that all the TF stuff is just a local optimum, a snapshot of today! And I'm not trying to come across as overly critical here, I love Clojure and would prefer to work in it over Python, even when I'm "just training" a model. But no one will get to that point if they don't take the engineering problems in this space seriously. Assuming, for example, the engineers who built TensorFlow at Google are just clueless simpletons who have never felt the joy of functional programming and built a silly system that doesn't solve real problems is a recipe for disaster. It may be overcomplicated, it may be a local optimum, but there is a hard problem behind it and a reason they landed where they did, and if you can't enumerate the pieces (without hand-waving or glossing over some of the hardest problems, like pushing performance to the max and not assuming you can let it slack 10% here, and 5% here, and 22% over here, and still train a net), and just shoot from the hip when proposing alternatives, you're unlikely to do any better yourself.

Well, the deeplearning4j people also emphasize these engineering problems and try to sell to enterprises, because pure Python researchers don't even know what a REST API is (they say). But I think this does not help to get the core parts right, which is not just an engineering problem of exposing an API, but of doing the math and optimization right and in a flexible way. It is not the same kind of engineering as writing a scalable information-processing infrastructure, building a modular design, or optimizing tight loops. It is linear algebra and low-level numerical math, and if you don't put that at the center but hide it behind a framework, you actually make the engineering harder or impossible. Similar to Java frameworks vs. composition of Clojure functions. So I am careful with this real-world industry bandwagon. Parallelization of model training, for example, is an active research area that I happen to have some interest in, and I think it would be fairly reasonable to build something like it with Clojure compared to the other frameworks. I don't see the hard problem for the core pipeline though, as I said, and pytorch is, if nothing else, the empirical counterexample. So what is actually the hard problem tensorflow or these frameworks solve for you, if it is neither performance, nor deployment, nor a flexible design?


Re: interactivity in Python, it's amusing because I agree with you! But no one in the python ecosystem would! The ad hoc command-line-ishness of Jupyter notebooks and IPython at the terminal seemed OK to me before I started using a language with a real REPL, wired into my editor or IDE or w/e for eval/re-eval versus dev, etc. I don't think the average Python developer would see much difference between that and using %%autoreload, and would probably yell back that you can't have real interactivity in Clojure because the lack of objects means you can't type pd.DataFrame(x). ... and then have it complete all the possible methods for you :) It's true that the perf comes from native BLAS libraries, etc., which is why the Python ecosystem leverages them heavily and more seamlessly than the JVM experience does (really, no one's been bitten by native bytebuffers or off-heap memory before). And besides, spark, hadoop, etc. give you Python APIs, sklearn will release the GIL, and distributed deep learning and parameter averaging have not proved that interesting in general, etc. etc.
 
I think it is not necessarily amusing that we agree; I hope it means we share some consistent reasoning that we can build upon. I also don't want to run off on some coding spree just to build something nobody else is interested in, or that solves the problem poorly. As I said, I am unhappy with the modelling primitives in Python beyond pytorch. I have been thinking about this repeatedly over the last few years and have used these other environments honestly. I see the strength of Python in its scientific computing ecosystem and in the nice matlab-like slicing DSL that allows you to work with tensors expressively, for example. What do you mean by the byte-buffer argument?


Again: I am not on team Python! It's not ahead because it's objectively good -- it is a mediocre language that I find more and more painful to use the better I get as a generalist dev and Clojurist. But programming language adoption and ecosystem health has a lot more to do with social dynamics, marketing, least cost paths taken by herds of humans, etc. than the direction taken by a few rational optimizers. JavaScript is a case in point, lots of better languages have failed to replace it, it itself is morphing into something different, and people have come up with ways to transpile away its pain or inject themselves into its ecosystem.

I agree. But Clojure in general has shown that there is a lot of leverage if you engineer wisely in these environments. I am for integration of Python and R, and thankfully R at least is already integrated on the JVM (with renjin), which I find the most compelling approach. Python is hard, because it is actually just a wrapper around C structs that are banged on from native code in arbitrary ways, which made pypy impossible to realize fully, although very smart people in the Python community tried to fix that for a long time. JyNI is an effort in this pragmatic direction and might work to bring at least the stable core libraries of scientific computing to the JVM, but developing it is no fun. Having multiple runtimes, on the other hand, is a big pain, especially if they are as slow as Python, which I would say is the actual liability for machine-learning pipelines.


Re: Tim Baldridge's previous efforts, yes Python's own mechanics are not great if you're targeting it as a general purpose language, but he was primarily interested in possible performance gains from a tracing JIT, provided by PyPy, and this was always highlighted in the readme, rationale, etc. Being able to take the Clojure-in-Clojure stuff from Cljs as a starting point for persistent data structures, etc. too was not an option during those early efforts. Getting access to CPython for numerics, ML, etc. is a problem with a different shape, and to me not that different from the specific scope of Cljs. That said, I'm just justifying my interest here, less concerned with winning anyone over, until I have something more to show/talk about here.

Have you talked to him? I did a few years back, when I really wanted to bring my Clojure workflow to my Python environment. He was exactly after the scientific computing people and he couldn't get it to work. He then moved to PyPy to get at least his own Lisp (Pixie) done, but he really wanted to integrate the scientific-computing Python ecosystem into the Clojure community. Clojure-in-Clojure or self-hosted cljs will not help you if CPython is too slow; his port in fact was probably better tailored to Python than a general approach like this would be (as the Clojure compiler is still in part written in Java for performance).


Also, all of that said, if your goal is to make things work in Clojure, I think getting some consensus between where you are with autodiff and the work (more battle-tested w/ real perf problems than most stuff in the Clojure ecosystem) that Chris N. has put into the tensor/compute stuff in Cortex would be the right next step.

I am trying to do this; see my proposal in my reply to Chris. My strategy, as discussed with Dragan on the uncomplicate slack channel, would be to keep the people in Clojure who have to leave today but don't want to. I don't think Clojure will be attractive to researchers doing work in Python or R in the short term. What kind of models would be interesting for you? I would like to frame this discussion in terms of deliverables and then start working on a pipeline for a small subset of the whole agenda, but one large enough to deliver value to the participants.

Best,
Christian

Chris Nuernberger

unread,
Oct 19, 2017, 5:25:23 PM10/19/17
to clojure-cortex
Great response, Christian:

I think you missed my point slightly about the design of the backends and the description and such, and I missed the information in neanderthal about the low level primitives.

Here are some problems I do not know how to solve in neanderthal:

Could you show me how to create a byte or integer tensor in neanderthal, like what I would need to operate with either a BufferedImage (ints) or an opencv image (bytes)?

Also, where exactly are these copy primitives in neanderthal?  Currently, in neanderthal, can I schedule a copy operation on one cuda stream while I am doing computation on another?  Can I reuse this code on the cpu and openCL?  In cortex precisely the same code works (streams are implemented for the cpu), so the same overall architecture carries over.

Given I have a large buffer on the device, how do I use it as a dense 2d matrix at one point, a vector at another, and a general dense 4d tensor at yet another point?

cortex code:

```clojure
(let [tens     (ct/tensor whatever-shape-is-largest-thing)
      ;; reinterpret the same buffer as a flat vector...
      vec-tens (assoc tens :shape [vec-size])
      ;; ...and as a strided 4d tensor over the same memory
      n-tensor (assoc tens :shape [1 2 3 4] :strides [204 102 36 9])]
  ;;go on with your life
  )
```

Note that this is *not* a linear algebra operation; the shapes of tens and n-tensor do not have to coincide in any meaningful way.
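
To spell out what interpreting the description means: an n-d index goes through the strides to a flat buffer offset, and that arithmetic is the only thing reshape/select/transpose ever change (sketch):

```clojure
;; Sketch: a shape/stride description maps an n-d index to a flat
;; buffer offset; description changes never touch the buffer itself.
(defn flat-index [strides idx]
  (reduce + (map * strides idx)))

(flat-index [204 102 36 9] [0 1 2 3]) ;; => 201
```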

How do I use the same Cuda context from multiple threads?  Cortex's cuda binding layer allows this transparently.

How do I schedule device->device copies?

Let's say I want to add another binary operation (y = a *op* b).  To add this operation in cortex I basically decide which keyword it gets, and beyond that change exactly three places:

  3.a (recompile the tensor.cu files) - maybe another step but not another change :).

The operation will work across all datatypes at this point -- across integers, floating point numbers, etc., on both cpu and gpu.  It will work across any strides, any sub-matrix, etc., as these are all handled purely in the description mechanism.  Note also that the operation will work for:
a. y = (scalar) op b
b. y = y op b (potentially summing into y due to broadcast rules), should the underlying datatype support CAS operations
c. y = b op y (even if the operation is not commutative)
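
As a rough sketch of the dispatch idea (not the actual cortex internals; elementwise! here stands in for the shared description-driven iteration):

```clojure
;; Hypothetical sketch of keyword-based binary-op registration.
;; Broadcasting, strides and datatypes are handled by the shared
;; description layer, so a new op is just its elementwise rule.
(defmulti binary-op! (fn [op y a b] op))

(defmethod binary-op! :max
  [_ y a b]
  (elementwise! y a b max)) ;; elementwise! assumed provided elsewhere
```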


Now, what changes do you need to make in neanderthal to make that happen?


The design of the cortex tensors is built upon a hierarchy of abstractions, and thus you can choose to work at the level of abstraction you need (this is documented in the various files), but perhaps I should have a design document to make this clear.  There is a lot of overlap between the cortex tensors and neanderthal; the clojure community would certainly benefit if there were less, and I think everyone agrees on this.  I am not sure we are very good at communicating the specific differences between these designs and finding common ground on which to move forward.


The rest of your post I agree with heartily, especially this line:

```
A. Declarative NN DSL (like cortex) -> B. autograd functional lazy graph DSL (like pytorch, but data only and lazy) -> either C. just-in-time execution of this graph in Clojure+neanderthal (calling forward and backward) or D. a graph that is AOT compilable to a backend like Clojure+neanderthal (AOT), nnvm, tensorflow 
```
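
To make B. concrete, such a lazy graph could literally be clojure data that is either walked just-in-time for C. or compiled for D. (keys illustrative):

```clojure
;; Sketch: the autograd graph as pure data; the same description can
;; be executed just-in-time (C.) or compiled to a backend (D.).
(def graph
  [{:id :h1 :op :matmul :args [:w1 :x]}
   {:id :h2 :op :relu   :args [:h1]}
   {:id :y  :op :matmul :args [:w2 :h2]}
   {:id :l  :op :mse    :args [:y :target]}])
```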


I think this is a very exact comment (replacing neanderthal with cortex tensors, of course) and this is really the heart of everything.  Really, it doesn't look to me like what you have done so far is very far from this, aside from the fact that maybe D doesn't exist; where do you feel the clj-autograd project stands?

Ben Kamphaus

unread,
Oct 20, 2017, 11:26:13 AM10/20/17
to clojure-cortex
Hi Christian,

Another great response, and there are only a few things I disagree with, but those few things are quite important:


> My rethinking came when I worked on https://github.com/kieranbrowne/clojure-tensorflow earlier this year: there simply is no gradient computation in tensorflow outside of Python, and it will not come to the JVM version anytime soon. This really changed my opinion of tensorflow; how serious can they be if they don't take the JVM or other runtimes seriously?

Again, I feel you don't identify the real design premise of TensorFlow here. They take those other environments seriously as a target for _inference_, but not for training; these things are decoupled by design in TensorFlow. PyTorch has not taken this seriously, punting on the easy export path to Caffe2, and when you hack production inference concerns on later it's not likely to work out. I think you should re-read Kovas's article more carefully - he's working on the Cortex team at Twitter, and he understands the production pains in the Torch/PyTorch ad hoc ecosystem. This also applies to:


> All you are saying is that tensorflow makes deployment easier for you, which is not true, because you need to manually download cuDNN and put it on the library path due to Nvidia licensing.

Oftentimes for the perf requirements it's sufficient to do CPU inference, with or without the model compression steps, but even in those cases it's a file copy and an environment var at the worst. If I need a Clojure runtime for performance because "someone will solve perf later", it really limits what I can do, but I think you recognize this with your (C) and (D) split, with (D) targeting nnvm in your requirements list, so I won't belabor that point. I just want to emphasize that you have to design for this capability early on; hacking it on later is not some automatic home run.

I did a count through my repos and I've trained 63 different TF models since it came out (not counting toy examples like learning XOR, or things you can retrieve with some get-dataset call built into the framework like MNIST or CIFAR-10), about 48 of those with Keras. The other 15 cases I needed the escape hatch from Keras to TF for, and possibly 2-3 I wouldn't have needed it for if Keras's functional API had been fully formed when I worked on those models. I never did anything more than basics with RNNs/LSTM models, so maybe that would have pushed me to the Torch escape hatch or made me feel TF's alleged perf issues.

I'm sure someone doing more serious deep learning research -- and I have _no doubt_ this includes you -- has a need for the escape hatches to do things, and I'm sure TF probably sucks for that. Your position throughout this has come off to me as saying these escape hatches are so essential that you should center your design around them. For low-level work this is fine; it is not unlike the difference between when to use query vs. datoms in Datomic, or when to jam things around here and there in reagent vs. using a disciplined state management framework like re-frame. And as a result I am super wary that needing an occasional escape hatch turns into a design effort that prioritizes the escape hatches, because escape hatch means imperative control flow, lots of knobs, mutable state, etc., and these programs become a mess to reason about super fast. It is absolutely not the way I like to program, and it's why I prefer TF to Torch/PyTorch and why I prefer Clojure to other languages.


> Doing back-propagation is actually not that hard, I am sorry. That is not the reason why Clojure has not been successful with neural network attempts.

The problems the Clojure ecosystem has had are social, not technical. And yes, it is absolutely one of the reasons why Clojure has not been successful in neural-network attempts: no one has gotten it working with a NN library -- which I'm sure is for social reasons, since a similarly sized community focused on numerics (Julia) has several options for symbolic, tape, etc. autodiff. If it's not technically challenging for the right people, the issue is that the Clojure community has not been able to point the right people at the problem (though maybe that's changing given your work).

A serious social problem the Clojure ecosystem has is a willingness to say "Oh sure it shouldn't be hard, just do X, Y, and Z" --> and then no one ever follows up and does it. Or it turns out the goal doesn't actually flow from X, Y, and Z. There's been a lot of hand waving and smoke and mirrors about data science and machine learning for years in Clojure, multiple conjs, not being realistic about Incanter's perf and viability versus even R or pandas, outright lies about performance vs. other things, a tendency to litter its ecosystem with mediocre wrappers for libraries (how much Spark and TensorFlow builder API crap is out there now? How much of it can you use in a real system?) -- and the laundry list of "Well with Clojure you can do infinite training sets!" (you can do this in C++ with while (true) {} and model checkpoints ffs not to mention it's very clean and reasonable in Python with generators).

I am not blaming you for this or anything like that, I just want to set up this context. The Clojure community w/r/t ML and data science has a long history of getting the facts wrong, over-promising and under-delivering, and has IMO a very serious credibility gap to make up for. When I hear "X is not actually that hard" or "all you have to do is..." type claims for Clojure ML, I'm conditioned such that my alarm bells go off at full force.

Anyways, there are disagreements and information disparities here that won't be resolved with text on google group posts. If the pieces you've identified as easy are easy and you can use them to meet the requirements you listed, you will solve it, with or without the consensus from the TT crew and Chris N. in particular or others in the community. It's a good set of requirements, and the Clojure community would then have a good deep learning/general ML/AI/numerics framework that I and others will happily switch to.

Best,
-B

Chris Nuernberger

unread,
Oct 20, 2017, 12:38:12 PM10/20/17
to clojure-cortex
Ben,  I take issue with a few things.

First, Cortex is successful in terms of being a framework you can use to do image-processing-type machine learning.  It has fewer features than other frameworks, but it does work and it performs very well in any sane comparison of training or inference times.  Its limiting factor is mainly the small number of contributors.

Second, having to deal with the ops issues of deploying python containers is definitely a nontrivial cost that never goes away.  Furthermore, writing things like dataset augmentation and manipulation in clojure (as well as the algorithms to wire the output of several machine learning methods together) means that you can do things like decode video in one thread, augment across all remaining cores, and train to the GPU's max potential.
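
To sketch the kind of thing I mean (decode-frames is a hypothetical decoder function, the rest is clojure.core):

```clojure
;; Sketch of a training input pipeline: decode on one thread, augment
;; across cores, keep a bounded read-ahead buffer for the GPU loop.
(defn training-batches [video-source augment-fn]
  (->> (decode-frames video-source) ;; hypothetical decoder fn
       (pmap augment-fn)            ;; augmentation across cores
       (partition 32)               ;; batch for the GPU
       (seque 4)))                  ;; bounded read-ahead
```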

Unlike specifically every other toolkit, cortex really is clojure all the way down; you don't have to switch to c++ for anything aside from some cuda kernels.

I agree that the problems are community problems, but mainly in the realm of getting good machine learners to contribute and not just complain about missing features.  Even contributing to roadmaps and such would be a large benefit.  I find it odd that Christian has taken the time and effort to not only proof out an auto-diff system but also taken the risk of coming into cortex and potentially getting into a flame war in order to honestly help contribute to the pathway forward for ML in clojure, and your response is basically:


a.  We shouldn't even bother working towards doing machine learning in clojure (!!)
and then 
b.  Well if someone else makes something nice for me then I will use it.  


Fundamentally, the important thing is that the pathway to doing machine learning in clojure will make anyone walking it a better machine learner themselves.  This is a point I think you missed, and it is humbling in that you have to realize how much of the foundation you don't understand and be willing to learn, at times, even basic concepts.  So python parity at this point doesn't really matter; what matters is that we are learning and producing great software while we are at it.  Cortex, for all its flaws, does have some very strong points in simplicity and performance (it is around 20,000 lines of code total).  Small, well-performing systems are great tools to learn with.


In one small library Christian has concretely contributed more to the ML pathway for clojure than you have in years, which I find telling.  You are right in the sense that the community is the problem, but a sense of entitlement and entitled judgement doesn't help.  You yourself aren't contributing anything other than rather negative energy, trying to dissuade any visible contribution to ML in clojure.  I find it continually disappointing that you take the positions above, and striking considering I know that you consider yourself an advanced machine learner.

Ben Kamphaus

unread,
Oct 20, 2017, 2:49:36 PM10/20/17
to clojure-cortex
Hi Chris,

> Second, having to deal with the ops issues of deploying python containers is definitely a nontrivial cost that never goes away.

- Containers are not the portability story for TensorFlow. This was an operational reality of one model contingent on not being able to justify time spent working towards a cleaner inference story.
- And batching data preprocessing steps being done in parallel? Also not a real limitation in Python, as hackish as it might be: https://github.com/fchollet/keras/blob/58d1d0678f1f0dfb2dca976b84fc3d419d9f4618/tests/test_multiprocessing.py#L48-L54
- You can also build preprocessing directly into the TensorFlow graph, whose execution context is not subject to the GIL, where that problem also goes away.
- You are also not correct about how Cortex stacks up against other frameworks; it provides sub-commodity capabilities for vision, where by commodity I mean what you can accomplish via APIs, boilerplate, or just putting data in a directory, e.g. https://github.com/tensorflow/models/tree/master/research/object_detection

Again, these kinds of statements only fall out of a lack of reading, study, use, and understanding of other frameworks. You can have different opinions about Python, other frameworks' design calls and tradeoffs, etc., but you do not get to have different facts about them.

Re: the (a) and (b) points, I think you completely misread my intent here. I'm just trying to reiterate literally the first point I started out with: that Clojure approaches should solve the real problems of deep learning frameworks, and not just the problem of YAF (yet another framework), but in Clojure. That also extends to PyTorch, but in Clojure. Or TensorFlow, but in Clojure. They certainly shouldn't reproduce the bad portions of other frameworks because they failed to learn from them. Anyways, I am doing work in that direction and hope I'll be at a point where I have something to show in a few months. And it's based on building bridges to other communities, not more islands in Clojure.

Otherwise, my specific complaints with what Christian has outlined mostly boil down to: (1) I think he greatly underestimates the difficulty of getting cross-platform numerics code performant and correct in a fully-featured DL framework -- you've spent enough time in array-stride-error and gradient-checker hell to be similarly suspicious of that comment, and (2) he's a researcher who considers production concerns mostly out of scope and is dismissive or inaccurate w/r/t the approaches TF takes (I am completely open to accurate criticisms here). He also sounds like a really smart dude, has raised points I wasn't aware of re: PyTorch, and he may know a lot more than me and pull it off all the way to point (D), where using (and yes, contributing) could be a reality, but I don't find his premise interesting or grounded enough today for me to contribute to anything at this phase, and I was trying to be polite in closing the discussion given my current assessment.

Christian, I can totally see Chris's point about how that comes off as an entitled judgment, and this is not my intent at all. I really just want to take my reservations and get out of your way right now and if I check in later and you've proven me wrong, you'll likely win me over. That's all I'm trying to express.

Re: personal attacks, I'd like to think I was above mentioning or being affected by them, but of course the truth is more complicated, and your criticism re: contributions isn't unfair, whatever other constraints I could enumerate. Anyways, I absolutely want to set the record straight about considering myself an advanced machine learner: I am a very low bar to clear, and my capabilities are right at that commodity/API level (for vision) or below it (for NLP, RNNs), as best as I can tell. _That_, ultimately, is the source of my frustration -- how much less credible would a deep learning researcher at a top-tier company or research institution find the Clojure ML community? How _much worse_ would it be at meeting their expectations and needs?

And from there, how can we work from premises that are grounded in reality to solve those problems and make it attract some of the real experts (not me), instead of turning them away.

Best,
-B

Ben Kamphaus

unread,
Oct 20, 2017, 5:43:35 PM10/20/17
to clojure-cortex
I just want to clarify that the above point about what Cortex provides today is not meant to be taken as a statement about its potential, or what can be achieved with its Tensor abstraction, nor is it meant to disparage any work that has gone into it. I would just like to encourage people to use _facts_ as a basis for design decisions, and as a means for justifying why libraries exist or how they work.

Otherwise, it's apparent that I'm not doing any of the parties in this conversation (me included) any good by participating, so I am stepping out.

Best, 
-B

Chris Nuernberger

unread,
Oct 20, 2017, 6:41:27 PM10/20/17
to clojure-cortex
Hi Christian

I took some time today to create a couple design documents that outline some of the key concepts:


The compute document is first and the tensor document is second and builds on the concepts in the compute document.  I was hoping you could find some time to peruse them because they outline the larger picture regarding the cortex compute abstraction and how the tensor abstraction specifically relates to this.

ml...@topiq.es

unread,
Oct 21, 2017, 11:00:44 AM10/21/17
to clojure-cortex
Hi Chris,

I have a feeling that the google-group type of communication (mail) is a bit too heavyweight for the issues involved and does not allow direct communication. I think we partly had stronger opinions because everybody was writing against a perceived argument of the other; my position against tensorflow intensified too much and the topic shifted. I pointed out your arguments to Dragan on the uncomplicate clojurians slack, where he replied, and I have joined the cortex slack. I suggest that we try to continue discussing there, at least when we are online together. slack is not optimal, but I think we can exchange information there in a better flow.
It is an MVP in this direction, but we would have to agree on a proper data description for tensor-based operations so that it is pluggable (importantly, with broadcasting), e.g. one for cortex and one for core.matrix and/or neanderthal. Does this seem reasonable to you?
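
For the broadcasting part, the shape unification itself is a small backend-independent function (numpy-style semantics, sketch):

```clojure
;; Sketch of numpy-style shape broadcasting: align from the right,
;; a dimension of 1 stretches to match the other side.
(defn broadcast-shapes [s1 s2]
  (let [n   (max (count s1) (count s2))
        pad #(concat (repeat (- n (count %)) 1) %)]
    (mapv (fn [a b]
            (cond (= a b) a
                  (= a 1) b
                  (= b 1) a
                  :else   (throw (ex-info "incompatible shapes"
                                          {:s1 s1 :s2 s2}))))
          (pad s1) (pad s2))))

(broadcast-shapes [3 1] [2 1 4]) ;; => [2 3 4]
```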

ml...@topiq.es

unread,
Oct 23, 2017, 6:04:59 AM10/23/17
to clojure-cortex
Great! I have just seen this post. I will reply here as soon as I have some time.

Alistair Roche

unread,
Oct 27, 2017, 12:10:38 PM10/27/17
to ml...@topiq.es, clojure-cortex
I've been following this conversation with great interest! I'd pay money to watch you guys chat about it on Slack ;)

Maybe it's worth scheduling a time to have a chat on there?




--
Alistair

ml...@topiq.es

unread,
Oct 27, 2017, 1:34:29 PM10/27/17
to clojure-cortex
Hi Chris,

thanks for taking the time to draft more of the design of cortex; this is really helpful for understanding the design objectives. I think the main focus of your tensor backend is on:

https://github.com/thinktopic/cortex/blob/master/docs/compute.md

1. effective buffer management on device (Buffer)
2. management of execution contexts (Stream & Event)
3. a reasonable high-level API to implement a NN framework or autograd (tensor.md)

I would like to focus on one question first: what would need to be done in neanderthal so you could use it? I understand that the copy and transfer primitives in neanderthal are not sufficient for your buffer management, and that seems to be the main concern (understandably, because memory ops are critical for deep learning). You want to be able to do buffer offsetting and parallel copying. I am not sure whether you really need a complete execution context on the device; at least GPUs are often not good at control flow at the moment (I have heard that newer architectures start to fix that, though).
I would then keep the tensor abstractions as Clojure datastructures/objects (as is currently done per tensor.md) that can also be mapped to other backends. This allows implementing the reshaping and broadcasting on top of this low-level API while still leaving the exact on-device implementation and buffer handling to the backend.

I understand that you prefer not to take this approach and to have your own Cuda pipeline, but maybe you can still figure out a way to join forces? I think Dragan is covering a lot more ground for a general optimization toolbox, and it would be really powerful to be able to take cortex primitives and use neanderthal's functions with them (e.g. matrix decompositions and general BLAS routines). You mention that cortex is more generic here, and I think you are right when you just talk about tensors and buffer management. Still, having these libraries work with our primitives would make a big difference.
I might still be mistaken about the feasibility, though. If you think this is a bad idea, please point out again exactly which point(s) make it impossible. I don't think the different tensor types are a big problem, for instance; in general you will only want float (32-bit) or double precision at the moment, and not short, byte or integer matrices, except for index sets.

I would be willing to do part of the work in that direction in coordination with Dragan, but only if you are committed to it and help with the buffer management parts, as I am not sufficiently experienced there.

Does this make sense?

Best,
Christian

P.S.: Ping me on the cortex slack if you like.

ml...@topiq.es

unread,
Oct 27, 2017, 1:36:30 PM10/27/17
to clojure-cortex
Hehe, I have nothing against chatting and also nothing against money :).


Chris Nuernberger

unread,
Nov 2, 2017, 2:05:20 PM11/2/17
to clojure-cortex
When I looked through neanderthal there were a couple of things; I have said them a few times, but they are subtle.


Rebuilding neanderthal on top of the compute and tensor abstraction would be the fastest way to allow the two systems to work together.  I think this is unlikely due to non-technical reasons but here is the reasoning:


Basically, I agree there are some higher-level math constructs in neanderthal, but its lower-level architecture is confused and combines concepts in a way that produces provably less flexibility.  I could implement the neanderthal math constructs on top of the cortex tensors, but the reverse isn't true.


We don't need 2 different cuda device contexts.  I have done extensive testing and have a working, multithread-aware cuda context, and it was finished long before Dragan announced his jcuda-backed one.  This is a level beyond the neanderthal implementation.  Furthermore, I have the entire stream and event abstraction that is implementable across openCL *and* allows you to schedule computations and memory transfers on the device.  Note also that the compute abstraction is built to be a general-purpose programming abstraction for any GPU algorithm; this is a much larger vision than neanderthal's binding to JCuda and JOpenCL.


The architecture of neanderthal is confused in that, as I said, you don't need a protocol for something like in-place transpose.  The tensor design doesn't have this and all backends still work.  I know this is a hard thing to understand, but there aren't two different types of memory on a GPU; there shouldn't be a matrix 'class' at all.  A matrix is a construct that combines some heavy data buffer with a description.  If I change the description, then it isn't a 2d matrix any more, but the memory the system points to is independent.  Dragan and Mike Anderson have confused this concept many times, and it limits the overall programmatic flexibility of the system greatly; it forces duplication in algorithms because multiple things implement a class that do not need to.  You really just need a binding to gemm and the rest of the unary, binary, and ternary operations defined in mutable fashion.  We went to clojure to *avoid* large superfluous class hierarchies, not to re-create them.
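
Concretely, under the buffer-plus-description view, transpose is a pure description change and the device memory is never touched (sketch):

```clojure
;; Sketch: transpose = reverse shape and strides in the description;
;; the underlying buffer (e.g. a device pointer) stays as-is.
(defn transpose [{:keys [shape strides] :as tensor}]
  (assoc tensor
         :shape   (vec (reverse shape))
         :strides (vec (reverse strides))))

(transpose {:buffer :dev-ptr :shape [2 3] :strides [3 1]})
;; => {:buffer :dev-ptr, :shape [3 2], :strides [1 3]}
```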


The different datatypes are necessary for a lot of things you aren't working with at the moment or don't have experience with, I guess.  For one, most image processing always involves converting between datatypes.  For another, you are correct that manipulating index buffers will probably involve integer arrays.  The harder thing, perhaps, that isn't as visible is that I have marshalling assignments where I can assign float data to double arrays (which works with the tensor abstraction).  This is necessary because in cuda, for instance, a subset of the random number generators is either not available or far slower in double land, so it is far faster to generate all of your random numbers in floating-point space and then copy them into double buffers.  In any case, the cortex tensors are designed to work with algorithms outside the context of cortex, and many, many high-performance algorithms deal with different datatypes.  Now, granted, the other datatypes don't have the full math backend; I am not implementing gemm for short datatypes.  I know from experience, however, that it is short-sighted and artificially limiting to dismiss multiple datatypes if you take the wider context of HPC programming into account.


Taking those points together, it seems to me the shortest path would be to implement an opencl backend for cortex and then rebuild neanderthal on some of the cortex abstractions (potentially moving them outside of cortex).  As I said earlier, this is unlikely due to nontechnical reasons, *and I don't think it is the best path at this point*...



I think both the cortex tensors *and* neanderthal are not the best possible thing at this point.  I have spent a bit of time researching tvm (the 'tensor vm' underlying nnvm) and I am certain their overall direction is better than any of the existing math systems for any language.  They basically designed a generic high-performance programming interface to multiple backends (cpu, cuda, opencl, ROCm, apple metal, and verilog).


You can do things like express a set of operations and then try different scheduling constructs to change the performance profile of those operations when you need a new operation (like gemm).  You can do kernel combining (for instance, fusing max pooling with the activation that follows) into one kernel and let tvm's code generator solve the rest.  Here is gemm (an extremely hard method to write well) expressed purely in python:




This is simply incredible; if we had this for clojure you could write complex high-performance algorithms purely in clojure without needing to directly write cuda code or anything like that, and they would be competitive with any implementation in any language.  Plus you would have multiple hardware backends (cuda, opencl, metal, webasm, verilog).
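
Nothing like this exists for clojure yet, but to give a flavor of the algorithm/schedule split, one could imagine describing both as plain data that a binding compiles (purely hypothetical):

```clojure
;; Purely hypothetical sketch: the Halide/TVM separation of 'what'
;; from 'how' as clojure data; no such binding exists today.
(def gemm-algorithm
  {:output :C
   :expr   '(sum k (* (A i k) (B k j)))   ;; the math, scheduling-free
   :axes   [:i :j :k]})

(def gemm-schedule
  [[:tile      :i :j 32 32]              ;; blocking for cache/shared mem
   [:vectorize :j-inner]                 ;; SIMD on the inner axis
   [:bind      :i-outer :cuda-block-x]]) ;; map to GPU blocks
```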


TVM is based on this library:


Check out this video:




Potentially the best thing for clojure, and something that could bring Dragan, Mike Anderson, and myself together, would be working towards a unified high-performance computing system for clojure on top of this system.  Neural networks and math libraries are a subset of the high-performance-computing algorithm set, so we would be able to refactor cortex, core.matrix, and neanderthal afterwards easily enough to use it, and that would be a large, measurable gain across the community, not just for cortex or neanderthal.  This is the only thing that will give us competitive expressiveness *and* performance across many backends with a small amount of pure clojure code; everything else means potentially implementing a ton of things across each new backend (in their specific c-based language), and nothing at all is going to perform as well.

ml...@topiq.es

unread,
Nov 5, 2017, 3:46:35 PM11/5/17
to clojure-cortex
Hi Chris.

Thanks for reiterating your points.

Building on TVM sounds reasonable to me, because then we have a solid building block for tensor operations on all the interesting hosts and can move on from there. Building on neanderthal would require replicating all the broadcasting and tensor handling that you have already implemented yourself. Also, having a language like halide is very beneficial compared to just wrapping primitive APIs; the video was really educational.

Outsourcing in this way sounds very good in general, and judging from my (still limited) vision experience, halide looks amazing. I think Ben would also agree on this. What do you think is necessary? I haven't had a close enough look yet; I guess it would make sense to implement an autograd library on top to make the optimization algorithms transparent, and then build a convenient NN architecture on top of that?

Best,
Christian

ml...@topiq.es

unread,
Nov 28, 2017, 9:14:08 AM11/28/17
to clojure-cortex
Hi Chris.

As you have probably seen, there is another take on pytorch-style autograd, which is probably a bit further along than my attempt: http://aria42.com/blog/2017/11/Flare-Clojure-Neural-Net . Do you still plan to go in the TVM direction?

Best,
Christian

Chris Nuernberger

unread,
Nov 29, 2017, 1:13:56 PM11/29/17
to clojure-cortex
I was sure you were behind that :-).

Yes, for sure. I intend to refactor cortex to remove the old (pre-tensor) math system so that using it in its current form is simpler (I added indexed dimensions, so you can use a tensor as the indexes for a given dimension of another tensor; we needed this for the center-loss implementation).  But past trying to simplify cortex, you can expect that I won't be adding huge features myself, as I think any time I spend there would be better spent working through and documenting TVM so that I can help enable more clojurians to contribute to using it in lots of different ways, one of which will hopefully be cortex 2.0.

That is unless, of course, ThinkTopic has a business need that requires extending cortex; then I will have to work through that.
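
(For anyone unfamiliar with the indexed dimensions I mentioned: it is essentially a gather. A toy version on nested vectors, illustrative only:)

```clojure
;; Toy illustration of an indexed dimension: use one tensor as the
;; index set selecting rows of another.
(defn gather-rows [t idx]
  (mapv #(nth t %) idx))

(gather-rows [[1 2] [3 4] [5 6]] [2 0 2])
;; => [[5 6] [1 2] [5 6]]
```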