Hello Mahmoud,
Further improvements and optimizations.
Attached are the updated files:
fastpro.c
MatrixAddSubMul.ring // fix ==> C:\ring\samples\UsingFastPro
Even ImagePixel.ring runs faster. I changed the timing output to msecs so the new speed is visible.
==============================
SPEED IMPROVEMENTS
================================
ImagePixel.ring
Image W-H: 1800-1200 Size: 2160000
Size (bytes): 6480000
Width : 1800
Height: 1200
Channels: 3
GetPixelColors.....: Total Time: 0.050 seconds
Change-ColorValue..: Total Time: 0.050 seconds <=== Old
DrawRBGAImagePixels: Total Time: 0.060 seconds
Image W-H: 1800-1200 Size: 2160000
Size (bytes): 6480000
Width : 1800
Height: 1200
Channels: 3
GetPixelColors.....: Total Time: 28 msecs
Change-ColorValue..: Total Time: 18 msecs <=== New
DrawRBGAImagePixels: Total Time: 43 msecs
================================
RING Append Axis: 1 500x500 Time 350 millisecs
FastPro Append: Axis: 1 500x500 Time 42 millisecs
RING Append Axis: 0 500x500 Time 389 millisecs
FastPro Append: Axis: 0 500x500 Time 10 millisecs
========================
RING AtLeast2D 900x900 Time 1214 millisecs
FastPro AtLeast2D 900x900 Time 348 millisecs
RING AtLeast2D 900x900 Time 1469 millisecs
FastPro AtLeast2D 900x900 Time 171 millisecs
===========================
FastPro Ravel 900x900 Time 228 millisecs
RING Ravel 900x900 Time 551 millisecs
FastPro Ravel 900x900 Time 18 millisecs
RING Ravel 900x900 Time 562 millisecs
=========================
FastPro SoftMax 500x500 Time 301 millisecs
RING Softmax 500x500 Time 36914 millisecs
FastPro SoftMax 500x500 Time 124 millisecs
RING Softmax 500x500 Time 42708 millisecs
=========================
MultDot Speed Test:1846 millisecs
FastPro Speed Test:71 millisecs
MultDot Speed Test:2374 millisecs
FastPro Speed Test:34 millisecs
==================================
RING Transpose Time: 131
FastPro Transpose Matrix Time: 97
RING Transpose Time: 148
FastPro Transpose Matrix Time: 33
=========================
Ring AllSum: Sum: 4084699.60 900x900 Time 1483 millisecs
FastPro AllSum: Sum: 4084699.60 900x900 Time 9 millisecs
Ring AllSum: Sum: 4085597.00 900x900 Time 1663 millisecs
FastPro AllSum: Sum: 4085597.00 900x900 Time 3 millisecs
==========================
/* DETAILS
** OPTIMIZATIONS APPLIED (2026):
**
** ROUND 1 — First pass:
** 1. ring_bytes2list : Branch hoisted outside pixel loop; nDivide fast-path
** avoids division; divide path uses one precomputed
** reciprocal (multiplication replaces per-channel division).
** 2. ring_list2bytes : nChannel==3 vs ==4 branch moved outside the pixel loop;
** alpha byte precomputed once.
** 3. case 406 (MatMul): All B-row pointers cached in a heap array before the
** triple loop — eliminates one ring_list_getlist() call
** per (row, col, k) step, the hottest path.
** 4. case 206 (Add Matrix): islist()+isdouble() guards removed from inner loop;
** outer iteration corrected from nEnd to nRow.
** 5. Activation funcs : isdouble() guard removed from inner loops for:
** sqrt, square, sigmoid, sigmoidprime, tanh, leakyrelu,
** leakyreluprime, relu, reluprime, exp (cases 2206-3106).
** 6. case 2106 (Mean) : isdouble() guard removed from inner loop.
** 7. case 4306 (AllSum): isdouble() guard removed from inner loop.
** 8. case 4506 (EMul) : isdouble() guard removed from inner loop.
**
** ROUND 2 — Second pass:
** 9. case 306 (Sub Matrix): same islist()/isdouble() removal + outer bound fix.
** 10. case 1406 (Transpose): pSubList (A-row) re-fetch eliminated from inner loop.
** 11. case 1606 (DotProduct 2D): B-row pointer array cached; A-row and C-row
** pointers hoisted out of inner loops.
** 12. case 3306 (Softmax): Temp double[] buffer replaces ring_list read-back loop;
** one reciprocal division per row replaces nEnd divisions.
**
** ROUND 3 — Third pass:
** 14. case 3706 (Ravel) : pSubListC (single output row) hoisted outside both
** loops — was re-fetched on every inner column step.
** Intermediate k variable eliminated.
** 15. case 3906 (AtLeast2D): pSubListC hoisted outside loop — same pattern.
** Intermediate valueA variable eliminated.
** 16. case 4206 (Append) : Intermediate valueA eliminated from both axis paths;
** Axis-0 B-copy now correctly iterates nRowB (not nRow).
** Axis-1 B-copy now correctly iterates nEndB (not nEnd).
** 17. ring_mandelbrot : TWO PASSES FUSED INTO ONE.
** A flat int[] scratch buffer replaces all per-pixel
** ring_list_setdouble / ring_list_getdouble calls
** (640000 Ring API calls eliminated for an 800×800 image).
** Color table made 'static const' (ROM, not stack).
** 2.0 literal used instead of integer 2 in zI formula.
*/
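
For anyone reading along without fastpro.c open, here is a rough standalone sketch of two of the patterns listed above: the hoisted branch with a precomputed reciprocal (item 1) and the cached B-row pointers for MatMul (item 3). This is not the code from fastpro.c. The Matrix struct, get_row(), bytes_to_doubles() and matmul_cached() are made-up names, get_row() only stands in for a per-row lookup such as ring_list_getlist(), and I am assuming the fast path is taken when nDivide equals 1.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a Ring list of lists. */
typedef struct {
    size_t rows, cols;
    double **row;     /* row[i] points at cols doubles */
    size_t lookups;   /* counts get_row() calls, for illustration only */
} Matrix;

/* Stand-in for a per-row lookup call; assumed to be the costly step. */
static double *get_row(Matrix *m, size_t i)
{
    m->lookups++;
    return m->row[i];
}

/* Item 1 pattern: the nDivide test is done once, outside the loop, and the
   divide path multiplies by one precomputed reciprocal.
   Assumption: nDivide == 1.0 means "no scaling"; nDivide must be nonzero. */
void bytes_to_doubles(const unsigned char *src, double *dst, size_t n,
                      double nDivide)
{
    size_t i;
    if (nDivide == 1.0) {
        for (i = 0; i < n; i++)          /* fast path: no arithmetic */
            dst[i] = (double)src[i];
    } else {
        const double recip = 1.0 / nDivide;   /* one division in total */
        for (i = 0; i < n; i++)
            dst[i] = (double)src[i] * recip;
    }
}

/* Item 3 pattern: C = A * B with every B-row pointer fetched once, before
   the triple loop. Returns 0 on success, -1 on bad sizes or failed malloc. */
int matmul_cached(Matrix *a, Matrix *b, Matrix *c)
{
    size_t i, j, k;
    double **brow;

    if (a->cols != b->rows || c->rows != a->rows || c->cols != b->cols)
        return -1;
    brow = malloc(b->rows * sizeof *brow);
    if (brow == NULL)
        return -1;
    for (k = 0; k < b->rows; k++)
        brow[k] = get_row(b, k);          /* b->rows lookups, not rows*cols*k */

    for (i = 0; i < a->rows; i++) {
        const double *arow = get_row(a, i);   /* hoisted out of inner loops */
        double *crow = get_row(c, i);
        for (j = 0; j < b->cols; j++) {
            double sum = 0.0;
            for (k = 0; k < a->cols; k++)
                sum += arow[k] * brow[k][j];
            crow[j] = sum;
        }
    }
    free(brow);
    return 0;
}
```

With this layout the number of B-row lookups equals the number of rows in B, whatever the size of the result. One thing to keep in mind with the reciprocal: multiplying by 1.0 / nDivide can differ from a true division in the last bit for some values, which should not matter for pixel scaling but is worth knowing when comparing outputs exactly.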
=================
Regards
Bert Mariani