Your FastCopy v2.11 is very old; the latest version is v3.41.
I recommend using the latest version.
But that is not the important point.
I ran some tests, and almost all of the results indicate FastCopy is 20% to 50% faster than RichCopy v4.0. (I increased the I/O threads to 6 in RichCopy; my CPU has 6 cores.)
Only in the "4GB file test" was FastCopy just a little faster, but the time RichCopy reports is not the real copy time.
It doesn't include the time to flush to the device, because RichCopy uses the OS cache. (FastCopy uses Direct I/O, so FastCopy bypasses the OS cache.)
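To make the difference concrete, here is a minimal Python sketch (it is not how FastCopy or RichCopy is implemented) that times one buffered write at two points: when the write calls have returned, and after os.fsync() has asked the OS to flush its cache to the device. The function name timed_write and the sizes are hypothetical, chosen only for illustration.

```python
import os
import tempfile
import time


def timed_write(path, total_bytes, chunk_bytes=1024 * 1024):
    """Write total_bytes to path.

    Returns (seconds_until_written, seconds_until_flushed).
    """
    chunk = b"\0" * chunk_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = total_bytes
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()                 # push Python's own buffer to the OS cache
        written = time.perf_counter() - start
        os.fsync(f.fileno())      # ask the OS to flush its cache to the device
        flushed = time.perf_counter() - start
    return written, flushed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        w, fl = timed_write(os.path.join(d, "test.bin"), 64 * 1024 * 1024)
        print(f"writes returned after {w:.3f}s, data flushed after {fl:.3f}s")
```

A tool that stops its timer at the first point reports a shorter time than one that waits for the second, even though the device is still busy.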
You can confirm it in this image. (Note that the image shows XP Explorer.)
An easy way to confirm it is to watch HDD usage in the Performance tab of Task Manager.
Anyway, the test patterns I tried are listed below, and I tested the full 1-4 * a-d matrix.
Test file patterns (your average test file size seems to be 1-2MB)
1. 1023KB files, total size 4GB (Why 1KB less: FastCopy's best pattern is 1024KB, and 1023KB adds file-truncation overhead for FastCopy)
2. 2048KB files, total size 4GB
3. 4GB files
4. cygwin dir (5000 files, 141MB)
Device patterns
a. HDD to HDD
b. HDD to SSD
c. SSD to HDD
d. SSD to SSD
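The 1-4 * a-d matrix above can be enumerated with a short Python sketch. The names build_matrix and files_needed are hypothetical, and the file counts assume the total is exactly 4GB (4 * 1024^3 bytes), which is only an approximation of the real test data.

```python
from itertools import product

KB = 1024
GB = 1024 ** 3

# File patterns 1-4 and device patterns a-d, as listed above.
FILE_PATTERNS = {
    "1": "1023KB files, 4GB total",
    "2": "2048KB files, 4GB total",
    "3": "4GB files",
    "4": "cygwin dir (5000 files, 141MB)",
}
DEVICE_PATTERNS = {
    "a": ("HDD", "HDD"),
    "b": ("HDD", "SSD"),
    "c": ("SSD", "HDD"),
    "d": ("SSD", "SSD"),
}


def build_matrix():
    """Return every (file pattern, device pattern) pair, e.g. ("1", "a")."""
    return list(product(FILE_PATTERNS, DEVICE_PATTERNS))


def files_needed(file_kb, total_bytes=4 * GB):
    """Assumed file count: how many whole files fit in total_bytes."""
    return total_bytes // (file_kb * KB)


if __name__ == "__main__":
    for f, d in build_matrix():
        src, dst = DEVICE_PATTERNS[d]
        print(f"{f}-{d}: {FILE_PATTERNS[f]}, {src} to {dst}")
    print("pattern 1 files:", files_needed(1023))
    print("pattern 2 files:", files_needed(2048))
```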
All tests indicate FastCopy is faster than RichCopy.
So I can't believe your test results.
Did you clear the OS cache before starting each test?
And I recommend watching HDD usage even after RichCopy says "Finish". (The OS may still be writing to the HDD to flush its cache.)
I don't know why you think increasing the number of I/O threads (beyond the number of devices) improves performance.
I think it causes many seeks on an HDD and hurts performance.
If you know the principle/mechanism (have you studied computer science?), please explain.
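As a rough illustration of the seek concern, here is a toy Python model, not a measurement. It assumes files are laid out contiguously on one disk and ignores OS read-ahead, command queuing, and fragmentation; all names are hypothetical. It counts how often the next block accessed is not the one directly after the previous block.

```python
def count_seeks(accesses):
    """Count accesses that do not directly follow the previous block."""
    seeks = 0
    prev = None
    for block in accesses:
        if prev is not None and block != prev + 1:
            seeks += 1
        prev = block
    return seeks


def sequential_order(num_files, blocks_per_file):
    """One thread: read each file start to end, one file after another."""
    return [f * blocks_per_file + b
            for f in range(num_files)
            for b in range(blocks_per_file)]


def interleaved_order(num_files, blocks_per_file, threads, chunk_blocks):
    """`threads` files are read at once, round-robin, chunk_blocks at a time."""
    order = []
    for group_start in range(0, num_files, threads):
        group = range(group_start, min(group_start + threads, num_files))
        for offset in range(0, blocks_per_file, chunk_blocks):
            end = min(offset + chunk_blocks, blocks_per_file)
            for f in group:
                order.extend(f * blocks_per_file + b
                             for b in range(offset, end))
    return order


if __name__ == "__main__":
    one = sequential_order(6, 256)
    six = interleaved_order(6, 256, threads=6, chunk_blocks=16)
    print("1 thread :", count_seeks(one), "seeks")
    print("6 threads:", count_seeks(six), "seeks")
```

In this model both orders read exactly the same blocks; only the number of head jumps differs. On an SSD, where a jump costs almost nothing, the same count would matter far less.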