BLIS 0.6.1 now available

Field Van Zee

Jan 14, 2020, 5:08:20 PM
to blis-devel
Friends of BLIS,

BLIS 0.6.1 is now available.

This is mostly a bugfix release. However, there are three new features that may be of interest to some of you: a high-performance dgemm microkernel for the IBM Power9; the ability to pre-broadcast during the packing of B; and the option to selectively pack A and/or B in the existing gemmsup framework. This third feature was made possible thanks to contributions facilitated by our partnership with AMD.

A complete list of improvements present in 0.6.1 follows.

Framework:
- Added support for pre-broadcast when packing B. This causes elements of B to be repeated (broadcast) in the packed copy of B so that subsequent vector loads will result in the element already being pre-broadcast into the vector register.
- Added support for selective packing to gemmsup (controlled via environment variables and/or the rntm_t object). (AMD)
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar and a separate bug in the order of typecasting intermediate products in sdsdot_(). (Simon Lukas Märtens, Devin Matthews)
- Fixed an obscure bug in bli_acquire_mpart_[mn]dim(). (Minh Quan Ho)
- Fixed a subtle and complicated bug that only manifested via the BLAS test drivers in the generic subconfiguration, and possibly any other subconfiguration that did not register complex-domain gemm ukernels, or registered ONLY real-domain ukernels as row-preferential. (Dave Love)
- Always use sumsqv to compute normfv instead of the "dot product trick" that was previously employed for performance reasons. (Roman Yurchak, Devin Matthews, and Isuru Fernando)
- Fixed bug in thrinfo_t debugging/printing code.
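
To illustrate the pre-broadcast item above, here is a minimal sketch of packing a panel of B so that each element is repeated in the packed buffer. The function name pack_b_bcast, the row-major layout with leading dimension ldb, and the duplication factor dup are assumptions made for this example only; this is not the packing code or API that BLIS itself uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): pack a k x nr panel of a
   row-major matrix B into 'packed', storing each element 'dup' times in
   a row. 'packed' must have room for k * nr * dup doubles. */
void pack_b_bcast(const double *b, size_t ldb,
                  size_t k, size_t nr, size_t dup,
                  double *packed)
{
    assert(dup > 0 && ldb >= nr);

    for (size_t p = 0; p < k; ++p) {           /* row of the panel       */
        for (size_t j = 0; j < nr; ++j) {      /* element within the row */
            double beta = b[p * ldb + j];
            for (size_t d = 0; d < dup; ++d) { /* repeat the element     */
                packed[(p * nr + j) * dup + d] = beta;
            }
        }
    }
}
```

With dup set to the number of elements in a vector register, a kernel could fetch a register full of copies of one element of B with an ordinary contiguous vector load rather than a separate broadcast. The packed buffer grows by a factor of dup in exchange.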

Kernels:
- Implemented and registered an optimized dgemm microkernel for the 'power9' kernel set. (Nicholai Tukanov)
- Pacify a 'restrict' warning in the gemmtrsm4m1 reference ukernel. (Dave Love, Devin Matthews)

Build system:
- Fixed parsing in vpu_count() on some SkylakeX workstations. (Dave Love)
- Reimplemented bli_cpuid_query() for ARM to use stdio-based functions instead of popen(). (Dave Love)
- Use -march=znver1 for clang on 'zen2' subconfig.
- Updated -march flags for sandybridge, haswell subconfigurations to use newer syntax (e.g. 'haswell' instead of 'core-avx2' and 'sandybridge' instead of 'corei7-avx').
- Correctly use -qopenmp-simd for reference kernels when compiling with icc. (Victor Eijkhout)
- Added -march support for select gcc version ranges where flag syntax changes or new flags are added. The ranges we identify are: versions older than 4.9.0; versions older than 6.1.0 (but newer than 4.9.0); versions older than 9.1.0 (but newer than 6.1.0).
- Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang.
- Updated MC cache blocksizes used by 'haswell' subconfig.
- Updated NC cache blocksizes used by 'zen' subconfig.
- Fixed a typo in the context registration of the 'cortexa53' subconfiguration in bli_gks.c. (Francisco Igual)
- Output a more informative error when the user manually targets a subconfiguration that configure places in the configuration blacklist. (Tze Meng Low)
- Set execute bits of shared library at install-time. (Adam J. Stewart)
- Added missing thread-related symbols for export to shared libraries. (Kyungmin Lee)
- Removed (finally) the 'attic/windows' directory since we offer Windows DLL support via AppVeyor's build artifacts, and thus that directory was likely only confusing people.

Testing:
- Fixed latent testsuite microkernel module bug for 'power9' subconfig. (Jeff Hammond)
- Added 'test/1m4m' driver directory for test drivers related to the 1m paper.
- Added libxsmm support to test/sup drivers. (Robert van de Geijn)
- Updated .travis.yml and do_sde.sh to automatically accept SDE license and download SDE directly from Intel. (Devin Matthews, Jeff Hammond)
- Updated standalone test drivers to iterate backwards through the specified problem space. This often helps avoid the situation whereby the CPU doesn't immediately throttle up to its maximum clock frequency, which can produce strange discontinuities (sharply rising "cliffs") in performance graphs.
- Pacify an unused variable warning in blastest/f2c/lread.c. (Jeff Hammond)
- Various other minor fixes/tweaks to test drivers.
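
As a rough sketch of the backwards iteration described above, the helper below lists problem sizes from largest to smallest. The names sizes_descending, p_min, p_max, and p_inc are assumptions for illustration and are not taken from the BLIS test drivers; a real driver would time an operation at each size rather than just record it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the BLIS test drivers): write the
   problem sizes p_max, p_max - p_inc, ... that are >= p_min into
   sizes[], largest first, and return how many were written
   (at most 'cap'). */
size_t sizes_descending(size_t p_min, size_t p_max, size_t p_inc,
                        size_t *sizes, size_t cap)
{
    assert(p_inc > 0 && p_min > 0 && p_min <= p_max);

    size_t n = 0;
    for (size_t p = p_max; n < cap; p -= p_inc) {
        sizes[n++] = p;
        /* Stop before the next step would drop below p_min; this also
           guards against unsigned wraparound in the decrement. */
        if (p < p_min + p_inc) break;
    }
    return n;
}
```

Because the largest problems run first, the core has more time to reach its full clock frequency before the small, short-running problems are measured.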

Documentation:
- Added libxsmm results to docs/PerformanceSmall.md.
- Added BLASFEO results to docs/PerformanceSmall.md.
- Added the page size and location of the performance drivers to docs/Performance.md and docs/PerformanceSmall.md. (Dave Love)
- Added notes to docs/Multithreading.md regarding the nuances of setting multithreading parameters the manual way vs. the automatic way. (Jérémie du Boisberranger)
- Added a section on reproduction to docs/Performance.md and docs/PerformanceSmall.md. (Dave Love)
- Documented Eigen -march=native hack in docs/Performance.md and docs/PerformanceSmall.md. (Sameer Agarwal)
- Inserted multithreading links and disclaimers to BuildSystem.md. (Jeff Diamond)
- Fixed typo in description for bli_?axpy2v() in docs/BLISTypedAPI.md. (Shmuel Levine)
- Added "How to Download BLIS" section to README.md. (Jeff Diamond)
- Various other minor documentation fixes.

Special thanks to Dave Love, Devin Matthews, Nicholai Tukanov, Jeff Hammond, Roman Yurchak, Isuru Fernando, Minh Quan Ho, Simon Lukas Märtens, Victor Eijkhout, Tze Meng Low, Francisco Igual, Adam J. Stewart, Kyungmin Lee, Robert van de Geijn, Jérémie du Boisberranger, Sameer Agarwal, Jeff Diamond, Shmuel Levine, and the CPU libraries group at AMD for their numerous helpful bug reports, suggestions, contributions, and help in tracking down issues. (And please forgive me if I inadvertently left out your name.) Please continue to give us feedback--it is always appreciated!

As always, a full list of changes is available in the CHANGELOG:

  https://github.com/flame/blis/blob/master/CHANGELOG

Please let us know if you have any questions.

Field