PxxxxR0: SG19 Machine Learning Layered list - Invitation to edit


Michael Wong (via Google Docs)

Jan 15, 2019, 11:56:54 AM
to sg...@isocpp.org
Michael Wong has invited you to edit the following document:

Jordi Inglada

Jan 17, 2019, 2:41:47 AM
to sg...@isocpp.org
Hi all,

I hope this is not off-topic, but since I first read

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1360r0.pdf

it is not clear to me what this SG tries to tackle:

1. Language constructs needed for ML
2. Data structures for ML algorithms (tensors, graphs, etc.)
3. A library of common building blocks for ML (optimizers, linear algebra)
4. A full-fledged ML and data science ecosystem (i.e. CERN's Root on steroids)

Also, which community are we targeting? C++ developers
building/deploying production systems or scientists designing the
algorithms and pipelines for data analytics and prediction? These are
completely different communities and the latter does not use C++ to my
knowledge (because they are not taught C++ and are afraid of it, see
SG20).

I would very much like to be able to implement data science and ML
pipelines in C++ instead of doing it in Python (because of the type
system, the performance, etc.), but the lack of a C++ equivalent of
scikit-learn+pandas+dask makes it nearly impossible.

I may be approaching things in the wrong way in my work, but nowadays,
colleagues using Python are much more productive, not because of the
language, but because there are /de facto/ standards
(numpy+pandas+matplotlib) which have allowed building an ecosystem for
data science and machine learning where all higher level libs (SciPy,
scikit-learn, scikit-image) are easy to compose together.

If we want to do this with C++, we spend lots of time and energy choosing
the appropriate lib (I like the Random Forest from Shark, but prefer the
NN from Dlib or the logistic regression from mlpack) and then my code
has to convert data from shark vector to eigen arrays and then to
armadillo ...
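
A minimal sketch of the glue code this implies. The two container types are hypothetical stand-ins, not the real Shark, Eigen, or Armadillo classes; the point is only that every hand-off between libraries costs a copy (and here an element-type conversion as well):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for two libraries' containers; these are NOT the
// real Shark, Eigen, or Armadillo types.
struct LibAVector {                // e.g. what a classifier library expects
    std::vector<double> values;
};

struct LibBArray {                 // e.g. what a linear-algebra library expects
    std::vector<float> storage;    // different element type, so no zero-copy reuse
};

// Glue code: an element-by-element copy on every hand-off from A to B.
LibBArray to_lib_b(const LibAVector& in) {
    LibBArray out;
    out.storage.reserve(in.values.size());
    for (double v : in.values) {
        out.storage.push_back(static_cast<float>(v));
    }
    return out;
}

// And another copy on the way back from B to A.
LibAVector to_lib_a(const LibBArray& in) {
    LibAVector out;
    out.values.assign(in.storage.begin(), in.storage.end());
    return out;
}

// Sanity check: the element count must survive the round trip.
void check_round_trip(const LibAVector& a) {
    assert(to_lib_a(to_lib_b(a)).values.size() == a.values.size());
}
```

With a shared vocabulary type for numeric data, neither conversion function would be needed.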

In C++ we don't even have the equivalent of GNU GSL but there are 3 or 4
implementations of tensor libs with expression templates for deep
learning.

There are efforts, like the Apache Arrow project
(https://arrow.apache.org/), which will make it easier to share in-memory
data between languages, and maybe the need to do ML in C++ will not exist
anymore.

I find the existence of this SG really thrilling, but I am afraid that,
from the outside, the goals and perimeter are not clear. Maybe this is
absolutely OK at this point in time (I have never taken part in such a
group), so don't take my thoughts too seriously!

Thanks for this amazing initiative.

Jordi

Michael Wong

Jan 17, 2019, 10:37:12 AM
to SG19 - Machine Learning

Hi Jordi, here is my opinion, which is not necessarily set in stone. Indeed, I would say much of this can be added to the Layering paper we just started, as I am sure many other people will have similar questions and need these answers.

On Thursday, January 17, 2019 at 2:41:47 AM UTC-5, jordi.inglada wrote:
> Hi all,
>
> I hope this is not off-topic, but since I first read
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1360r0.pdf
>
> it is not clear to me what this SG tries to tackle:
>
> 1. Language constructs needed for ML

Likely

> 2. Data structures for ML algorithms (tensors, graphs, etc.)

Very likely

> 3. A library of common building blocks for ML (optimizers, linear algebra)

Definitely

> 4. A full-fledged ML and data science ecosystem (i.e. CERN's Root on steroids)

Maybe

> Also, which community are we targeting? C++ developers
> building/deploying production systems or scientists designing the
> algorithms and pipelines for data analytics and prediction? These are
> completely different communities and the latter does not use C++ to my
> knowledge (because they are not taught C++ and are afraid of it, see
> SG20).

I think yes to the first as an immediate goal; the second is a longer-term goal.

Yes, I see that most data scientists end up using Python or R, but often it is a case of the language suiting their needs. I am not sure how many will use C++ if similar facilities are available, but we will only know if we build it. However, I also think the field will become even more diversified as more languages and tools (Swift, MATLAB, etc.) try to meet the insatiable demand.

> I would very much like to be able to implement data science and ML
> pipelines in C++ instead of doing it in Python (because of the type
> system, the performance, etc.), but the lack of a
> scikit-learn+pandas+dask C++ equivalent makes it nearly impossible.

This area, or a portability layer, will need discussion on how best to satisfy those needs. We should state that in the paper.

> I may be approaching things in the wrong way in my work, but nowadays,
> colleagues using Python are much more productive, not because of the
> language, but because there are /de facto/ standards
> (numpy+pandas+matplotlib) which have allowed building an ecosystem for
> data science and machine learning where all higher level libs (SciPy,
> scikit-learn, scikit-image) are easy to compose together.
>
> If we want to do this with C++, we spend lots of time and energy choosing
> the appropriate lib (I like the Random Forest from Shark, but prefer the
> NN from Dlib or the logistic regression from mlpack) and then my code
> has to convert data from shark vector to eigen arrays and then to
> armadillo ...
>
> In C++ we don't even have the equivalent of GNU GSL but there are 3 or 4
> implementations of tensor libs with expression templates for deep
> learning.
>
> There are efforts, like the Apache Arrow project
> (https://arrow.apache.org/), which will make it easier to share in-memory
> data between languages, and maybe the need to do ML in C++ will not exist
> anymore.

This, as an in-memory replacement, should be discussed, but in the meantime we should just put these thoughts in the paper to see if there are reactions.

> I find the existence of this SG really thrilling, but I am afraid that,
> from the outside, the goals and perimeter are not clear. Maybe this is
> absolutely OK at this point in time (I have never taken part in such a
> group), so don't take my thoughts too seriously!

Please add these thoughts directly inline in the paper. We don't have to have answers for everything yet, but putting them down will enable a useful discussion to begin.

> Thanks for this amazing initiative.

Thanks for your support.

Michael Wong

Jan 17, 2019, 10:42:46 AM
to SG19 - Machine Learning
Regarding GNU GSL: we do have the special math functions in the Standard, which cover many of the same scientific library functions, and we can patch in anything missing. The other thing we can do is formally adopt the C++ wrappers:

This is the reason to have these questions and issues laid out in writing, either in this forum/reflector or in the paper (I like both, using the second one as an intermediate checkpoint), so we can find directions together.
Thanks.
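
For readers who have not used them, a minimal sketch of the mathematical special functions that C++17 merged into `<cmath>` from ISO/IEC 29124. It assumes a standard library that actually ships them (libstdc++ does; not every implementation does). The wrapper names `zeta`, `beta_fn`, `legendre_p`, and `check_known_values` are hypothetical and exist only for this illustration:

```cpp
// Pre-C++17 libstdc++ exposes these functions only when this macro is defined
// before <cmath> is included; with C++17 and later it is harmless.
#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1
#include <cassert>
#include <cmath>

// Riemann zeta function; zeta(2) equals pi^2 / 6.
double zeta(double s) { return std::riemann_zeta(s); }

// Euler beta function B(x, y) = Gamma(x) * Gamma(y) / Gamma(x + y).
double beta_fn(double x, double y) { return std::beta(x, y); }

// Legendre polynomial P_n(x); for example P_2(x) = (3x^2 - 1) / 2.
double legendre_p(unsigned n, double x) { return std::legendre(n, x); }

// Compares the library results against closed-form values.
void check_known_values() {
    const double pi = std::acos(-1.0);
    assert(std::fabs(zeta(2.0) - pi * pi / 6.0) < 1e-9);         // pi^2 / 6
    assert(std::fabs(beta_fn(2.0, 3.0) - 1.0 / 12.0) < 1e-9);    // 1! * 2! / 4!
    assert(std::fabs(legendre_p(2, 0.5) - (-0.125)) < 1e-9);     // (3 * 0.25 - 1) / 2
}
```

Functions GSL offers that are missing from this set would be the candidates for "patching in".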

Richard Dosselmann

Jan 17, 2019, 12:13:47 PM
to SG19 - Machine Learning, fragga...@gmail.com
When a person talks about "machine learning" in the modern era, they often mean "deep learning" and "artificial neural networks", as this is where much of the action is. We should ensure that this proposed library/system has direct support for deep learning and artificial neural networks to allow C++ to be a leader in machine learning.

Michael Wong

Jan 17, 2019, 12:18:00 PM
to Richard Dosselmann, SG19 - Machine Learning
Please add these words to the document. Thanks.

lennard...@student.uni-tuebingen.de

Jan 17, 2019, 3:07:31 PM
to SG19 - Machine Learning
Hello Jordi, I would like to express my opinion on this too.

> 1. Language constructs needed for ML

Eventually.

> 2. Data structures for ML algorithms (tensors, graphs, etc.)

Definitely.

> 3. A library of common building blocks for ML (optimizers, linear algebra)

Definitely. Once data structures and algorithms are set, it should be very easy to tackle this.

> 4. A full-fledged ML and data science ecosystem (i.e. CERN's Root on steroids)

Most likely not part of the SG.

> Also, which community are we targeting? C++ developers
> building/deploying production systems or scientists designing the
> algorithms and pipelines for data analytics and prediction? These are
> completely different communities and the latter does not use C++ to my
> knowledge (because they are not taught C++ and are afraid of it, see
> SG20).

The question here is: who could benefit from the proposal, eventually leading them to use it?
In my mind I have C++ developers alongside DevOps deploying models from scientists into a fast (low-overhead) and massively parallel environment (features that, e.g., Python lacks).

> I would very much like to be able to implement data science and ML
> pipelines in C++ instead of doing it in Python (because of the type
> system, the performance, etc.), but the lack of a
> scikit-learn+pandas+dask C++ equivalent makes it nearly impossible.
>
> I may be approaching things in the wrong way in my work, but nowadays,
> colleagues using Python are much more productive, not because of the
> language, but because there are /de facto/ standards
> (numpy+pandas+matplotlib) which have allowed building an ecosystem for
> data science and machine learning where all higher level libs (SciPy,
> scikit-learn, scikit-image) are easy to compose together.

I think that this is the right approach. Usually, implementing a standard by example would not be advised.
In the field of ML, however, without a preexisting, well-known ecosystem, there will be little to no chance of adoption.

> If we want to do this with C++, we spend lots of time and energy choosing
> the appropriate lib (I like the Random Forest from Shark, but prefer the
> NN from Dlib or the logistic regression from mlpack) and then my code
> has to convert data from shark vector to eigen arrays and then to
> armadillo ...

I think a serious proposal should dig into building up all the ML foundations (data structures, then algorithms, then possible language constructs) one by one.
All selected libraries should be as compatible with the future STL as possible: e.g., for data structures it is crucial that they use the excellent ranges proposal, and for algorithms, supporting coroutines will become a must.
An excellent example of such a library is https://github.com/cbbowen/graph/.

With all the right things in place it should be trivial to implement ML building blocks.

> In C++ we don't even have the equivalent of GNU GSL but there are 3 or 4
> implementations of tensor libs with expression templates for deep
> learning.
>
> There are efforts, like the Apache Arrow project
> (https://arrow.apache.org/), which will make it easier to share in-memory
> data between languages, and maybe the need to do ML in C++ will not exist
> anymore.

I am unsure whether this should be a definite goal of this proposal.
IMHO it would be better to produce (or consume) libraries that support moving models across the language barrier, such as https://onnx.ai/.
ML is surely not the right place to tackle cross-language INPROC/IPC.

William Tambellini

Feb 13, 2019, 12:44:45 PM
to SG19 - Machine Learning, lennard...@student.uni-tuebingen.de
I also vote for this group to keep an eye on fully open-source, cross-industry standards like ONNX: ONNX relies on tensors and graphs, so I guess it should be compatible with the goals of this group.
Their working group :
Kind