Hi all.
[ I posted this over in tesseract-ocr the other day, and (understandably) didn't get much of a response. Someone pointed out that this would be a better forum, so I am reposting it (edited) here. In the meantime there has been some discussion of the idea on one of my pull requests, so I've folded that into the repost here. ]
I've been playing with integrating Tesseract with Ghostscript for the past couple of weeks. I have it working nicely, and I've started passing back some of the tweaks I've made along the way as pull requests on GitHub - thanks to all the reviewers/commentators who have helped to get those in.
The biggest of these is an implementation of matrixDotVector for NEON-equipped ARMs (intsimdmatrixneon.cpp). This makes a massive difference to the speed on ARM devices (such as my Raspberry Pi), but (depending on what language data I feed it) profiles still show 30-55% of the runtime in this function.
I have a couple of ideas for improving this a bit. Both spring from the final stage of the calculation. The initial parallel SIGMA(a*b) calculations all happen in the SIMD registers, in the integer domain. At that point, for each value, we do the following (sketched after the list):
1) Cast to double.
2) Divide by 127.
3) Add the bias (as an integer that needs to be converted to a double each time we run through).
4) Multiply by the scale.
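For concreteness, in scalar form that per-value epilogue is roughly the following (a sketch only - the variable names are mine, not the actual identifiers in intsimdmatrix):

    // Rough scalar rendering of the current epilogue; 'total' is the
    // integer SIGMA(a*b) accumulated in the SIMD registers, 'bias' the
    // integer bias, and 'scale' the per-output double scale.
    double result = static_cast<double>(total);  // 1) cast to double
    result /= 127.0;                             // 2) divide by 127
    result += bias;                              // 3) add the bias (int -> double each time)
    result *= scale;                             // 4) multiply by the scale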
First idea: We could rejig this a bit (sketched after the list) to be:
1) Add bias*127 (staying in the integer domain, so no conversion required; no multiplication required either on those architectures where adding (bias<<7)-bias is cheaper than adding bias*127).
2) Cast to double.
3) Multiply by scale/127. (And the /127 can be rolled into the scale values at deserialisation time, I think).
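In the same sketchy scalar form (again with made-up names), that becomes:

    // 'scale_over_127' is scale/127, precomputed at deserialisation time.
    total += bias * 127;                         // 1) integer add (or (bias << 7) - bias)
    double result = static_cast<double>(total);  // 2) cast to double
    result *= scale_over_127;                    // 3) multiply by scale/127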
Straight off the bat, that saves us a floating-point divide, an int -> double conversion, and a floating-point add per value, at the cost of an integer multiply and add. That's probably a win, right?
More importantly, at least some of that can be done within the SIMD domain (for all three of the architectures we have SIMD implementations for, I believe).
Second idea: It'd be nice to do the whole thing in SIMD, but NEON (at least) doesn't support doubles. So can we use floats instead?
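To make that concrete, here's roughly what an all-SIMD float epilogue might look like with NEON intrinsics. This is purely illustrative, not code from my branch; it assumes the bias*127 and scale/127 precomputation from the first idea, with per-lane vectors of biases and scales:

    #include <arm_neon.h>

    // Process four output values at once: integer bias add, int -> float
    // conversion, then the combined scale/127 multiply, all in NEON registers.
    static inline float32x4_t epilogue(int32x4_t totals, int32x4_t biases_x127,
                                       float32x4_t scales_over_127) {
      totals = vaddq_s32(totals, biases_x127);  // bias*127 add, integer domain
      float32x4_t f = vcvtq_f32_s32(totals);    // int32 -> float32
      return vmulq_f32(f, scales_over_127);     // scale (with /127 folded in)
    }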
I was pointed at a previous discussion during which the idea of using floats instead of doubles came up, here:
https://github.com/tesseract-ocr/tesseract/issues/943#issuecomment-303239929
The consensus at that time was that while it might be a win, it was a bad idea: numerical algorithms really shouldn't do SIGMA(a * b) in the float domain, as the errors would add up.
That discussion predates the appearance of intsimdmatrix, in which the SIGMA(a * b) stage moved into the int domain. So, that objection, AIUI, no longer applies.
It may be that there are other good reasons why moving to floats (at least on some architectures) is a bad idea, but I couldn't see any in that discussion.
Thanks in advance for any help/insight people can offer.