Yeppp


Pierre-Yves Gérardy

Jan 20, 2014, 3:06:55 PM
to juli...@googlegroups.com
Yeppp! is a low-level, high-performance library for vectorized operations. It may be worth a look for elementwise operations on arrays.

It supports the x86(-64), ARM and MIPS architectures, and takes advantage of a lot of SIMD extensions (from SSE to AVX2 on x86, NEON on ARM).

http://www.yeppp.info/

— Pierre-Yves

Stefan Karpinski

Jan 20, 2014, 3:42:29 PM
to Julia Dev
Interesting. Looks like it's MIT licensed, which is good.

Marat Dukhan

Jan 20, 2014, 7:16:24 PM
to juli...@googlegroups.com
Hi, Yeppp! author here.

The goal of the Yeppp! project is to provide an optimized set of high-level operations that can be used as a target by high-level language compilers. In a sense, Yeppp! functions are high-level analogs of LLVM instructions that operate on vectors instead of individual elements. Due to this higher level of operation, Yeppp! functions can exploit more optimization opportunities than C compilers (e.g. the unroll factor for the dot-product functions in Yeppp! is tuned individually for each microarchitecture to maximize the throughput of floating-point instructions). As operations on vectors are an important use case for Julia, I am sure it could benefit from the optimized functions in Yeppp!

I previously considered providing Julia bindings for Yeppp!, and with help from Miles Lubin created a small benchmark to demonstrate that it is beneficial for performance. The demo is available at www.yeppp.info/resources/julia-bindings/Yeppp.tgz (Julia bindings) and www.yeppp.info/resources/julia-bindings/yeppp.jl (the benchmark). The plot below shows performance on an AMD FX-6300 (Piledriver microarchitecture), Julia 0.2.0-prerelease+3788, and Yeppp! 1.0.0:

[plot image not preserved in the archive]
Unfortunately, due to lack of time in the past semester, I didn't complete this work, and the bindings cover only a small part of Yeppp! If you want to try this benchmark on your machine, use the following commands (assuming you are on Linux or Mac):

# first download yeppp-1.0.0.tar.bz2 and Yeppp.tgz from the links above
tar -xjf yeppp-1.0.0.tar.bz2
source yeppp-1.0.0/set-vars.sh
tar -xzf Yeppp.tgz
mv Yeppp $HOME/.julia/
julia yeppp.jl

I welcome your feedback, and if the Julia community is interested, I would be willing to bring Yeppp!-powered speedups to Julia.

Regards,
Marat

Marat Dukhan

Jan 20, 2014, 7:17:35 PM
to juli...@googlegroups.com
Yeppp! is BSD-licensed. However, it uses a custom runtime library (in the runtime subdirectory of the Yeppp! source tree) which is MIT-licensed.

Regards,
Marat

Stefan Karpinski

Jan 20, 2014, 7:26:29 PM
to Julia Dev
This is really impressive stuff – and right up our alley. Fortunately, MIT and BSD are quite compatible. It may make sense to include Yeppp as a compile-time dependency of Julia itself and use its fast implementations of various vector operations. Our multiple dispatch system provides a super convenient way of calling Yeppp functions.
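
For instance, a minimal sketch of what a dispatch-based binding might look like (assuming the shared library is loadable as "libyeppp", and going by Yeppp!'s documented function-naming convention):

function add!(out::Vector{Float64}, x::Vector{Float64}, y::Vector{Float64})
    # elementwise out = x + y via Yeppp!'s double-precision add kernel
    length(x) == length(y) == length(out) || error("length mismatch")
    status = ccall((:yepCore_Add_V64fV64f_V64f, "libyeppp"), Int32,
                   (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Csize_t),
                   x, y, out, length(x))
    status == 0 || error("Yeppp! call failed with status $status")
    return out
end

A Float32 method would dispatch to the V32f variant the same way.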

Viral Shah

Jan 24, 2014, 6:37:15 AM
to juli...@googlegroups.com
In general, we have avoided writing our array library in C, keeping everything in Julia. However, in cases where such high-performance libraries are available, it is certainly worthwhile to integrate them, and multiple dispatch provides for a clean implementation.

-viral

Kevin Squire

Jan 24, 2014, 1:13:25 PM
to juli...@googlegroups.com
It would be great to see a comparison between Yeppp and the @simd macro branch (https://github.com/JuliaLang/julia/pull/5355).


Simon Kornblith

Jan 24, 2014, 1:46:31 PM
to juli...@googlegroups.com
This would be interesting for the basic arithmetic functions, but Julia's transcendental functions are all calls to openlibm, so they are not vectorizable. The Yeppp implementations are likely to be appreciably faster than calling libm on each element individually, but there may be differences in accuracy.
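
Concretely, the difference is between a loop of scalar libm calls and a single whole-array call (sketch below; Yeppp.log! is a hypothetical binding name):

x = rand(10^6); y = similar(x)

# scalar path: one openlibm call per element, so the loop cannot be vectorized
for i = 1:length(x)
    y[i] = log(x[i])
end

# vector path: a single call into a SIMD implementation of log
Yeppp.log!(y, x)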

Kevin Squire

Jan 24, 2014, 3:00:04 PM
to juli...@googlegroups.com
Thanks, Simon.

Stefan Karpinski

Jan 24, 2014, 3:02:54 PM
to Julia Dev
Marat, do you have any info about the accuracy of the Yeppp transcendental functions?

Marat Dukhan

Jan 24, 2014, 5:06:00 PM
to juli...@googlegroups.com
The plots on the Yeppp! website give information about the measured error of Yeppp! functions vs. other implementations.

Here are the plots for the log function: [plot images not preserved in the archive]

And the plots for the exp function: [plot images not preserved in the archive]

In summary, the errors for log and exp are within 2 ULP. The errors for sin and cos are within 3 ULP on [-51471.85, 51471.85], but they quickly become inaccurate beyond this range (the range can be extended to |x| < 1.6e+6 without affecting performance). Yeppp! is slightly more accurate on processors with FMA (Intel Haswell and AMD Piledriver on the plots above).

How to interpret these numbers? The best possible accuracy is correctly rounded, i.e. 0.5 ULP (as far as I know, only CRLibM achieves this level of accuracy). Most scalar libraries aim to be within 1 ULP. For SIMD and vector libraries the accuracy requirements are usually more relaxed: e.g. auto-vectorization by the Intel compiler by default uses functions from Intel SVML with guaranteed accuracy within 4 ULP (measured accuracy is better; see Intel SVML/LA on the plots). Unless it makes things very complicated, I would suggest using OpenLibM for scalar computations and Yeppp! for vector computations (Yeppp! is optimized to work on vectors of 100 elements or more, and it is quite slow for scalar computations).

Regards,
Marat

Viral Shah

Jan 24, 2014, 11:26:12 PM
to juli...@googlegroups.com
This is really amazing. 

I am not sure I like the idea of using openlibm for scalar and Yeppp for vector computations, for the reason that the Julia user will see different results depending on how they write code. Maybe documenting these differences would be sufficient. In any case, we should certainly integrate Yeppp, and these are things we can address as we go along.

-viral

Viral Shah

Jan 24, 2014, 11:35:43 PM
to juli...@googlegroups.com
Marat,

The build process for Yeppp! looks pretty daunting. I guess the best thing to do is simply use the precompiled libraries that you ship?

-viral


Aron Ahmadia

Jan 24, 2014, 11:47:54 PM
to juli...@googlegroups.com
Speaking of build processes, I'm on the ground at MIT next Wednesday.  Is there any interest from any MIT-local Julia developers in meeting to discuss trying to get hashdist/hashstack2 going for your development stacks?


Aron

Marat Dukhan

Jan 25, 2014, 12:07:26 AM
to juli...@googlegroups.com
Yes, building Yeppp! is quite complicated (it needs upstream versions of the toolchains), so I recommend that users use the pre-built binaries unless they have a security requirement to build everything from source. On Linux and Windows, Yeppp! does not have any shared-library dependencies (on OS X it depends on libc), and the pre-built binaries should work on any modern system (Linux minimum: 2.6+ kernel; Windows minimum: Windows XP; OS X minimum: 10.5).

If your worries are about backward compatibility, I can assure you that Yeppp! takes it seriously. All 1.x releases will be API-compatible, and all 1.0.x releases will be ABI-compatible. To make auto-downloads easier for you, I will create a link that redirects to the latest Yeppp! releases in the 1.x and 1.0.x series.

Regards,
Marat

Marat Dukhan

Jan 25, 2014, 12:39:44 AM
to juli...@googlegroups.com
I have set up redirects and suggest that you use them in the Julia build system:

http://get.yeppp.info/stable/yeppp-1.0.x.tar.bz2 redirects to the latest stable release in 1.0.*.* series (ABI and API-compatible with 1.0.0) in tar + bz2 format
http://get.yeppp.info/stable/yeppp-1.0.x.zip redirects to the latest stable release in 1.0.*.* series (ABI and API-compatible with 1.0.0) in zip format
http://get.yeppp.info/stable/yeppp-1.x.tar.bz2 redirects to the latest stable release in 1.*.*.* series (API-compatible with 1.0.0) in tar + bz2 format
http://get.yeppp.info/stable/yeppp-1.x.zip redirects to the latest stable release in 1.*.*.* series (API-compatible with 1.0.0) in zip format
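
For example, a build script could fetch and unpack the latest 1.0.x build like this (a sketch; assumes a Unix system with tar available):

# fetch the latest ABI-compatible Yeppp! release via the stable redirect
download("http://get.yeppp.info/stable/yeppp-1.0.x.tar.bz2", "yeppp-1.0.x.tar.bz2")
run(`tar -xjf yeppp-1.0.x.tar.bz2`)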

Regards,
Marat

Viral Shah

Jan 25, 2014, 12:47:39 AM
to juli...@googlegroups.com
Marat, I took your sources and have created the Yeppp.jl package. Would be great if you can help with it. I will also experiment with it, and I am sure many others will too. 

-viral

Viral Shah

Jan 25, 2014, 12:47:59 AM
to juli...@googlegroups.com

Marat Dukhan

Jan 25, 2014, 1:14:42 AM
to juli...@googlegroups.com
Thanks, I will have a look.

Most of the Yeppp! bindings for Fortran, .Net and JVM are auto-generated, and I could generate a large portion of the Julia bindings as well (for all functions in the yepCore and yepMath modules).
However, to make it work I will need a different file structure, e.g. $PACKAGE_ROOT/$Module/$FunctionGroup.jl, where $Module is Core or Math (probably in lowercase) and $FunctionGroup is the name of a group of functions (e.g. Add, Subtract, Multiply, DotProduct, SumAbs, Log, Exp, Sin, EvaluatePolynomial).
Is such a change feasible?

It would be good to automatically call yepLibrary_Init when the package is loaded and yepLibrary_Release when it is unloaded. Does Julia support that?

Also, when I tested it ~6 months ago, I noticed significant overhead in passing arrays from Julia to Yeppp! (it worked substantially slower than equivalent C code which called Yeppp! on the same machine). It has probably improved since then, but it would be good to check. There is an entropy-computation example in $YEPROOT/examples/c/source/Entropy.c which performs a similar computation to my Julia + Yeppp! benchmark.

Regards,
Marat

Viral Shah

Jan 25, 2014, 3:23:51 AM
to juli...@googlegroups.com
Within src, you can have any structure you want. It would be great if you can auto-generate the wrappers with the directory structure you need, and we will just include those files into the package top-level file.

For now, the init and release will need to be manual, but I believe there is an open issue on this.
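
Something like this should work in the meantime (a sketch, assuming the library is loadable as "libyeppp" and that the init/release entry points take no arguments and return a status code):

status = ccall((:yepLibrary_Init, "libyeppp"), Int32, ())
status == 0 || error("yepLibrary_Init failed with status $status")

# ... call Yeppp! functions ...

status = ccall((:yepLibrary_Release, "libyeppp"), Int32, ())
status == 0 || error("yepLibrary_Release failed with status $status")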

On passing arrays from Julia to C, it would be great to quantify the cost. We certainly want this to be as low as possible. Do you have some kind of a benchmark for this? We can even include it in our performance suite. A lot has changed in the last 6 months, so it may have improved from what you last saw.

-viral

Stefan Karpinski

Jan 25, 2014, 2:18:55 PM
to Julia Dev
On Sat, Jan 25, 2014 at 1:14 AM, Marat Dukhan <mar...@gmail.com> wrote:
Also, when I tested it ~6 months ago, I noticed significant overhead in passing arrays from Julia to Yeppp! (it worked substantially slower than equivalent C code which called Yeppp! on the same machine).

This doesn't make much sense, since we just emit call instructions and pass array data as pointers to the C code. The only thing that might be going on is the C compiler inlining calls, which we can't do, but that seems somewhat unlikely, especially for large vector operations. Do you have any more details about this?
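
For what it's worth, the raw per-call cost is easy to bound with a trivial foreign call (a sketch, no Yeppp! involved):

# time a million calls to a trivial libm function; the per-iteration
# cost is an upper bound on the ccall overhead itself
n = 10^6
t = @elapsed for i = 1:n
    ccall(:floor, Float64, (Float64,), 1.5)
end
println("~", t / n * 1e9, " ns per call")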

Stefan Karpinski

Jan 25, 2014, 2:30:23 PM
to Julia Dev
On Fri, Jan 24, 2014 at 5:06 PM, Marat Dukhan <mar...@gmail.com> wrote:
In summary, the errors for log and exp are within 2 ULP. The errors for sin and cos are within 3 ULP on [-51471.85, 51471.85], but they quickly become inaccurate beyond this range (the range can be extended to |x| < 1.6e+6 without affecting performance). Yeppp! is slightly more accurate on processors with FMA (Intel Haswell and AMD Piledriver on the plots above).
 
How to interpret these numbers? The best possible accuracy is correctly rounded, i.e. 0.5 ULP (as far as I know, only CRLibM achieves this level of accuracy). Most scalar libraries aim to be within 1 ULP. For SIMD and vector libraries the accuracy requirements are usually more relaxed: e.g. auto-vectorization by the Intel compiler by default uses functions from Intel SVML with guaranteed accuracy within 4 ULP (measured accuracy is better; see Intel SVML/LA on the plots). Unless it makes things very complicated, I would suggest using OpenLibM for scalar computations and Yeppp! for vector computations (Yeppp! is optimized to work on vectors of 100 elements or more, and it is quite slow for scalar computations).

On Fri, Jan 24, 2014 at 11:26 PM, Viral Shah <vi...@mayin.org> wrote:
I am not sure I like the idea of using openlibm for scalar and Yeppp for vector computations, for the reason that the Julia user will see different results depending on how they write code. Maybe documenting these differences would be sufficient. In any case, we should certainly integrate Yeppp, and these are things we can address as we go along.

I agree that this is the single biggest blocker to using Yeppp in base Julia. Being within 2-3 ULP on transcendental functions isn't good enough for the default behavior – speed is important, but correctness needs to come first whenever it can without sacrificing too much speed. Even if the error could be reduced to within 1 ULP – which is where we aim to be with openlibm – the fact that scalar and vector operations give different answers prevents using the fast vectorized versions by default. So I think that Yeppp should be an opt-in package that replaces the relevant vector operations with its faster but less accurate versions. The possibility of disagreement between scalar and vector computations should be put front and center in the Yeppp package documentation, and people can make an informed decision about the tradeoff involved.

Tobi

Jan 25, 2014, 2:40:56 PM
to juli...@googlegroups.com
I just want to point out that the autovectorization of loops using SIMD instructions, which will hopefully land in Julia soon, will also change the results of vectorized instructions compared to scalar instructions. This is because SIMD registers are "tight" and use e.g. 64 bits for a double, while scalar floating-point ALUs are 80 bits.

Stefan Karpinski

Jan 25, 2014, 2:43:02 PM
to Julia Dev
Fortunately, annotating with @simd is also an opt-in thing. If there is actually a risk of it changing results, then that needs to be very clearly documented, and the transformation cannot be done automatically without opting in with a macro.

Tobi

Jan 25, 2014, 2:48:49 PM
to juli...@googlegroups.com
I actually thought that the @simd macro would only be needed for some loops, while others are automatically vectorized in that PR.
Coming from more of a C background, I don't have a problem with that. If you use -O3 with GCC, it will also vectorize.

Stefan Karpinski

Jan 25, 2014, 2:52:46 PM
to Julia Dev
We'll have to be careful about what kinds of automatic optimizations we allow. Some will be ok, but others definitely won't.

Tobi

Jan 25, 2014, 3:19:46 PM
to juli...@googlegroups.com
True, but SIMD seems to be unavoidable. We could not use BLAS otherwise.

Dahua

Jan 25, 2014, 7:06:39 PM
to juli...@googlegroups.com
Hi Marat,

This is very interesting, and I wholeheartedly support the integration of such libraries to Julia.

Out of curiosity, how's the performance of Yeppp! compared with Intel IPP or VML? They seem to provide similar functionality.

Dahua

Marat Dukhan

Jan 25, 2014, 9:12:18 PM
to juli...@googlegroups.com
Hi Dahua,

On AMD machines Intel libraries do not perform well, so Yeppp! is always faster (and it is also faster than AMD LibM on those machines).

Relative performance of the vector elementary (mathematical) functions depends on the microarchitecture: the algorithms employed in Yeppp! are very friendly to wide SIMD units. On Haswell (256-bit units + FMA) Yeppp! is faster than VML; on Sandy Bridge (256-bit units, no FMA) they are about on par; on Nehalem (128-bit units, no FMA) Yeppp! falls between the 4-ULP and 1-ULP versions of VML. IPP is slower than MKL and Yeppp! on Intel platforms.

Attached is raw performance data from mid-November 2013 with the latest versions of libraries available at the time (AMD LibM 3.1, SLEEF 2.80, MKL 11.1.0, IPP 8.0.1, FDLibM 5.3 (OpenLibM is based on this library), Cephes 2.7, CRLibM from CVS, last commit on April 11, 2011). SLEEF, FDLibM, Cephes, CRLibM were recompiled for each processor with gcc-4.8.1 -O3 -march=native.

If you want to benchmark the libraries on your own machine, check out the Hysteria utility from BitBucket.

Regards,
Marat
Yeppp-vs-libm-nov-2013.log

Dahua

Jan 25, 2014, 10:25:42 PM
to juli...@googlegroups.com
Hi Marat,

Thanks for the info. This library is amazing. 

The general convention in Julia is to create a package on GitHub (say, Yeppp.jl), providing wrappers of the functions in the package. The interface may look like:

sum(x::Array{Float64}) = ccall(some_yeppp_function_for_f64, ....)
sum(x::Array{Float32}) = ccall(some_yeppp_function_for_f32, ....)

When people want to enjoy the benefit of Yeppp, they can then do

using Yeppp

a = rand(10^6)
...
s = sum(a)
...

And they will find that the statement ``using Yeppp`` boosts performance severalfold.

Also, having the package on GitHub would allow people to file issues or contribute pull requests. When things stabilize, we may then discuss migrating the functions to Julia Base.

- Dahua

Jake Bolewski

Jan 25, 2014, 10:58:11 PM
to juli...@googlegroups.com
I'm curious: why is gcc's performance so much worse than clang's on the dot-product benchmark on your home page?

-Jake

Marat Dukhan

Jan 26, 2014, 5:03:36 AM
to juli...@googlegroups.com
gcc doesn't unroll enough to hide FMA latency. But it does vectorize the code, if that was your concern.

Regards,
Marat

Toivo Henningsson

Jan 26, 2014, 8:10:45 AM
to juli...@googlegroups.com
I don't think we want to monkey-patch sum and other functions like this. That would mean that if anyone loads Yeppp.jl, all code will use those alternate definitions. Better to provide functions with separate names, or e.g. a different sum function that falls back to the regular one if it doesn't know how to handle the input.

Dahua Lin

Jan 26, 2014, 9:05:25 AM
to juli...@googlegroups.com
Ah, yes. With the approach I proposed, once somebody imports the module there is no way to opt out. I agree that approach is problematic.

What if the Yeppp package defined its own sum function (not extending Base.sum)? Then people could write Yeppp.sum(x), which provides an explicit way to leverage Yeppp while maintaining the familiar syntax.
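
A sketch of that approach (the kernel name here is assumed from Yeppp!'s naming convention, so treat it as illustrative):

module Yeppp

# deliberately not exported and not extending Base.sum:
# users opt in by writing Yeppp.sum(x) explicitly
function sum(x::Vector{Float64})
    out = [0.0]
    status = ccall((:yepCore_Sum_V64f_S64f, "libyeppp"), Int32,
                   (Ptr{Float64}, Ptr{Float64}, Csize_t),
                   x, out, length(x))
    status == 0 || error("Yeppp! sum failed with status $status")
    return out[1]
end

end # module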

Dahua

Dahua Lin

Jan 26, 2014, 9:07:13 AM
to juli...@googlegroups.com
For this approach, Yeppp should never export its sum function or any other functions with the same names as Base functions.

Jonathan Malmaud

Jan 26, 2014, 10:37:59 AM
to juli...@googlegroups.com
On the other hand, having all code use the alternative definitions might be exactly what the user wants: a general speed-up in all loaded packages that operate on arrays, at the cost of some accuracy.

Kevin Squire

Jan 26, 2014, 11:08:14 AM
to juli...@googlegroups.com
That could still be provided in, e.g., a submodule of Yeppp.jl.

Jonathan Malmaud

Jan 26, 2014, 1:06:47 PM
to juli...@googlegroups.com
Good idea.

Jed Brown

Jan 26, 2014, 2:36:07 PM
to Tobi, juli...@googlegroups.com
Tobi <tobias...@googlemail.com> writes:

> I just want to point out that the autovectorization of loops using SIMD
> instructions that will hopefully land in Julia soon, will also change the
> result of vectorized instructions compared to scalar instructions. This is
> because SIMD registers are "tight" and use e.g. 64bit for a double while
> scalar floating point ALUs are 80bit.

Which compilers are generating legacy x87 instructions for modern (e.g.,
x86-64) hardware? Not LLVM or GCC; they use SSE instructions (addsd,
mulsd, etc., are scalar instructions).

On 32-bit x86, GCC uses x87 by default, but -mfpmath=sse is recommended
when supported.