Message from discussion
Best Practices for passing numpy data pointer to C ?
Message-ID: <50140BE6.8040907@yahoo.no>
Date: Sat, 28 Jul 2012 17:57:26 +0200
From: Sturla Molden <sturlamol...@yahoo.no>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2
MIME-Version: 1.0
To: cython-users@googlegroups.com
Subject: Re: [cython-users] Re: Best Practices for passing numpy data pointer
to C ?
References: <CALGmxEJOLLx8VtNtK3jBt-AJ=kNU5i1kbEqgyTv9OErBgOG...@mail.gmail.com> <4666d921-974c-4236-83b3-5dd6ed98d09c@googlegroups.com> <CALGmxE+oftOv4sN6PGE=sp+sVLnmarLTEopEF85QCof9_GS...@mail.gmail.com> <5012E12A.7040...@yahoo.no> <070ba075-6fec-4f47-a293-a753ca4a1...@email.android.com> <50131250.9050...@yahoo.no> <CAMKS98_9KphiTfmKb_6qFh7uXLCAW55+21PN87DYgcj+kpZ...@mail.gmail.com> <5013F9E6.3080...@yahoo.no> <501402E1.7010...@yahoo.no> <CAMKS98_PuTddqs4SGMLLyODB9KR2VnSAz7bge_9LzCHKMgb...@mail.gmail.com>
In-Reply-To: <CAMKS98_PuTddqs4SGMLLyODB9KR2VnSAz7bge_9LzCHKMgb...@mail.gmail.com>
Content-Type: multipart/alternative;
boundary="------------070405010601070103070805"
This is a multi-part message in MIME format.
--------------070405010601070103070805
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
I finally managed to make git/github work...
https://github.com/sturlamolden/memview_benchmarks
You got a pull request. I'm not sure if you already updated your code.
I'm very happy with the speed of memoryviews too, particularly slicing.
The slowness of slicing np.ndarray was the reason I could never use
Cython+NumPy instead of Fortran 95.
I now want to see a more realistic benchmark. I'm not sure whether
porting Scimark would be too much work. Ideally I would like to compare
these on a set of real-world problems:
Python
Python with NumPy
C
C++ using STL
Fortran 77
Fortran 95
Cython with memoryviews
Java (perhaps)
C#.NET (perhaps)
MATLAB (perhaps)
Or perhaps we could use the Debian shootout?
Sturla
On 28.07.2012 17:39, Jake Vanderplas wrote:
> Sturla,
> Thanks for looking at this. I'm still learning the details of
> optimizing memviews - these are very impressive benchmarks! I've
> updated my github repository with your changes:
> https://github.com/jakevdp/memview_benchmarks
> Thanks
> Jake
>
> On Sat, Jul 28, 2012 at 8:18 AM, Sturla Molden <sturlamol...@yahoo.no
> <mailto:sturlamol...@yahoo.no>> wrote:
>
> I found another issue: the memoryview slices were not declared
> contiguous. Declaring them contiguous reduced the runtime from 1.86
> to 1.83 seconds. That puts the overhead of memoryview slices at 2.2%
> compared to raw C pointer arithmetic. The benchmark creates two
> million memoryview slices and computes one million dot products,
> each with vectors of length 1000. I am more than willing to accept
> those 2.2% to avoid those pesky pointers, but it remains to be
> seen how memoryviews perform on a more realistic problem.
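
For reference, the 2.2% figure follows from the timings quoted further down in this thread (1.79 sec for pointer_arith, 1.86 sec for memview before the contiguity fix, 1.83 sec after). A minimal sketch of the arithmetic in Python; the variable and function names are made up for illustration and are not part of the benchmark code:

```python
# Timings in seconds, copied from the benchmark output in this thread.
pointer_arith = 1.79        # raw C pointer arithmetic (baseline)
memview_strided = 1.86      # memoryview slices, not declared contiguous
memview_contiguous = 1.83   # memoryview slices declared contiguous


def overhead_percent(t, baseline):
    """Relative slowdown of t against baseline, in percent."""
    return (t - baseline) / baseline * 100.0


before = overhead_percent(memview_strided, pointer_arith)
after = overhead_percent(memview_contiguous, pointer_arith)
print(f"before: {before:.1f}%  after: {after:.1f}%")
```

With these numbers the overhead comes out at roughly 3.9% before the fix and 2.2% after it.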
>
> Sturla
>
>
>
>
> On 28.07.2012 16:40, Sturla Molden wrote:
>
>
> I prepared some quick-and-dirty benchmarks of the behavior
> I need at https://github.com/jakevdp/memview_benchmarks/
> -- I'd be interested if people more familiar with
> memory-views could take a look and let me know if I'm
> missing anything there.
> Jake
>
>
>
> I took the liberty of updating your benchmarks (see attachment).
> For example, I noticed that GCC was clever enough to optimize
> away all the loops in your pointer_arith.pyx...
>
> Here are the timings I got from the updated version in the
> attachment. I think this gives the correct picture:
>
> D:\memview-benchmarks\new>python runme.py
> numpy_only: 6.86 sec
> cythonized_numpy: 5.74 sec
> cythonized_numpy_2: 10.4 sec
> cythonized_numpy_2b: 6.25 sec
> cythonized_numpy_3: 2.43 sec
> cythonized_numpy_4: 1.78 sec
> pointer_arith: 1.79 sec
> memview: 1.86 sec
>
> There is a table in the attached PDF that should be easier to
> read.
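
Output of this shape is what a small timing harness produces. The following is not the actual runme.py; it is only a sketch of how such a script might look, assuming each benchmark is exposed as a zero-argument callable. The names time_benchmarks and format_timings and the two stand-in workloads are hypothetical:

```python
import time


def time_benchmarks(benchmarks, repeat=3):
    """Return {name: best wall-clock seconds} for each zero-argument callable."""
    timings = {}
    for name, func in benchmarks.items():
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            func()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


def format_timings(timings):
    """Format timings as 'name: X sec' lines, three significant digits."""
    return [f"{name}: {seconds:.3g} sec" for name, seconds in timings.items()]


# Stand-in workloads; the real benchmarks would be the compiled Cython functions.
benchmarks = {
    "sum_loop": lambda: sum(i * i for i in range(10_000)),
    "sum_builtin": lambda: sum(range(10_000)),
}
timings = time_benchmarks(benchmarks)
for line in format_timings(timings):
    print(line)
```

Taking the best of several repeats is one common choice for reducing noise from other processes; a single run per benchmark would work the same way with repeat=1.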
>
> The overhead in the numpy versions comes from slicing the
> ndarray. In comparison, slicing a memoryview has very little
> overhead. If we slice the ndarray in Cython, the result is not
> much better than using plain numpy in Python. But if we
> use memoryviews, slicing is only slightly slower than
> C-style pointer arithmetic.
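
To illustrate what "slicing without copying" means, here is a small pure-Python sketch. It uses Python's built-in memoryview over an array.array as a stand-in; this is not a Cython typed memoryview and says nothing about its speed, but both expose a window onto an existing buffer instead of copying the data. The helper names dot and sliced_dots are hypothetical, and the sizes are tiny compared to the benchmark:

```python
from array import array


def dot(x, y):
    """Plain-Python dot product of two equal-length sequences of doubles."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


def sliced_dots(data, n_vectors, length):
    """Dot product of each pair of neighbouring length-sized slices of data."""
    view = memoryview(data)
    results = []
    for i in range(n_vectors - 1):
        a = view[i * length:(i + 1) * length]        # a view, no copy
        b = view[(i + 1) * length:(i + 2) * length]  # a view, no copy
        results.append(dot(a, b))
    return results


data = array("d", [float(i % 7) for i in range(4 * 5)])
results = sliced_dots(data, n_vectors=4, length=5)
print(results)

# A slice is a window onto the same buffer: writes to data show through.
window = memoryview(data)[0:5]
data[0] = 9.0
print(window[0])
```

The last two lines show the point: the slice was taken before data[0] was changed, yet it reflects the new value because no copy was made.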
>
> And consider this: numerical code using array slicing in
> Fortran 90 with gfortran is often 2x slower than the same code
> using pointer arithmetic in C with GCC, at least in my
> experience. (Fortran 77 is another matter.)
>
> If you wonder why using np.dot was faster than writing out the
> loop in Cython, that is because NumPy in the Enthought
> distribution is linked against Intel MKL.
>
> Conclusion:
>
> Memoryviews are extremely fast, comparable to pointer
> arithmetic in C.
>
> Now we need a real benchmark, e.g. some linear algebra solver
> or an FFT. Something like Scimark perhaps. Cython vs. C vs.
> Fortran 90.
>
> Sturla
--------------070405010601070103070805--