I am trying to optimize integer rescaling (see example at the end).
There is a range or integer elements (in the example, elements are of
type unsigned short), current range maximum, and the desired maximum.
Straight forward way is to multiply by new maximum, and then to divide
by old maximum (taking care about the overflow).
Are there better ways of rescaling the integers?
///////////////////////////////////
#include <iostream>
unsigned short a = 0x1234; // value in the range to rescale
unsigned short b = 0x3fff; // new maximum range value
unsigned short c = 0x0800; // current maximum range value
int main()
{
const unsigned int d = a * b;
const unsigned short e = d / c;
std::cout << "rescaled value = 0x" << std::hex << e << std::endl;
}
///////////////////////////////////
Optimize? As in "speed it up"?
> There is a range or integer elements (in the example, elements are of
> type unsigned short), current range maximum, and the desired maximum.
> Straight forward way is to multiply by new maximum, and then to divide
> by old maximum (taking care about the overflow).
>
> Are there better ways of rescaling the integers?
Better? In what way? To make it more precise perform the operation
using [long] doubles and then truncate or round the results.
> ///////////////////////////////////
> #include <iostream>
>
> unsigned short a = 0x1234; // value in the range to rescale
It's *not* in the range. The "current maximum" is smaller than this value.
> unsigned short b = 0x3fff; // new maximum range value
> unsigned short c = 0x0800; // current maximum range value
>
>
> int main()
> {
> const unsigned int d = a * b;
> const unsigned short e = d / c;
It is better to store the intermediate value in a type that is
*guaranteed* to be larger. 'int' is not necessarily larger than
'short'. The actual solution apparently depends on the implementation
which controls the sizes of integral values. I would probably seek a
platform-specific solution.
> std::cout << "rescaled value = 0x" << std::hex << e << std::endl;
> }
> ///////////////////////////////////
V
--
I do not respond to top-posted replies, please don't ask
Yes, like use operations which are going to be faster then one
multiplication + one division (as in the example)
>
>> There is a range or integer elements (in the example, elements are of
>> type unsigned short), current range maximum, and the desired maximum.
>> Straight forward way is to multiply by new maximum, and then to divide
>> by old maximum (taking care about the overflow).
>>
>> Are there better ways of rescaling the integers?
>
> Better? In what way? To make it more precise perform the operation
> using [long] doubles and then truncate or round the results.
To do that, you need to convert integer to double, multiply and divide,
and back to integer. That is not going to be faster. Might be more
precise, but it shouldn't be.
>
>> ///////////////////////////////////
>> #include <iostream>
>>
>> unsigned short a = 0x1234; // value in the range to rescale
>
> It's *not* in the range. The "current maximum" is smaller than this value.
Yes, it is not a range, but lets imagine it is a randomly taken element
from a range :) Then, having a range or randomly chosen element doesn't
make much difference *how* will you rescale it.
>
>> unsigned short b = 0x3fff; // new maximum range value
>> unsigned short c = 0x0800; // current maximum range value
My mistake about the current maximum :
unsigned short c = 0x2800; // current maximum range value
>>
>>
>> int main()
>> {
>> const unsigned int d = a * b;
>> const unsigned short e = d / c;
>
> It is better to store the intermediate value in a type that is
> *guaranteed* to be larger. 'int' is not necessarily larger than
> 'short'. The actual solution apparently depends on the implementation
> which controls the sizes of integral values. I would probably seek a
> platform-specific solution.
Good point. Which NG would you suggest for x86?
As Victor noted your 'b' is out of range. That means that you've not paid
attention to detail. The details may or may not be important depending on the
problem at hand, which you do not describe.
In the general case you need to first decide on the distribution you want. For
example, are your maximums inclusive or exclusive? The value 0x0800 seems like
an exclusive maximum, while 0x3fff seems like an inclusive one.
When you have decided /what/ your rescaling should do, then you should implement
it correctly. That means e.g. avoiding overflow. The above code can easily
overflow, so as portable code it's generally incorrect no matter the details of
the desired scaling (it's like a car with square wheels, doesn't matter whether
it's meant to be a sports car or a lorry, it's just wrong).
So, yes, there are better ways.
The first step on the road to betterness is to define the desired effect, and
the second step is to implement that correctly. Only then worry about micro
efficiency. If at all (and if you do, first measure, Measure, MEASURE).
Cheers & hth.,
- Alf
--
blog at <url: http://alfps.wordpress.com>
[snip old code]
> As Victor noted your 'b' is out of range. That means that you've not
> paid attention to detail. The details may or may not be important
> depending on the problem at hand, which you do not describe.
>
Actually, the first part of the algorithm is calculating histogram of an
image. Another part of the algorithm is using cumulative sum of the
histogram as a lookup table.
Since the maximum value in the histogram is number of points in the
image (one point is 16 bits), I need to rescale all values in the
histogram to the maximum value of 16 bits (which is 0xffff).
All parts of the algorithm are optimized (at least I think that is
true), except for this rescaling. I am doing it after I calculate the
cumulative sum, and I am doing it as shown in the code example (one
multiplication and one division). It works fine, but I was wondering if
there is a faster way.
> In the general case you need to first decide on the distribution you
> want. For example, are your maximums inclusive or exclusive? The value
> 0x0800 seems like an exclusive maximum, while 0x3fff seems like an
> inclusive one.
>
Yes, as I said else thread, that is a small mistake. For completeness,
here is the correct version :
///////////////////////////////////
#include <iostream>
unsigned short a = 0x1234; // value in the range to rescale
unsigned short b = 0x3fff; // new maximum range value
unsigned short c = 0x2800; // current maximum range value
int main()
{
const unsigned int d = a * b;
const unsigned short e = d / c;
std::cout << "rescaled value = 0x" << std::hex << e << std::endl;
}
///////////////////////////////////
> When you have decided /what/ your rescaling should do, then you should
> implement it correctly. That means e.g. avoiding overflow. The above
> code can easily overflow, so as portable code it's generally incorrect
> no matter the details of the desired scaling (it's like a car with
> square wheels, doesn't matter whether it's meant to be a sports car or a
> lorry, it's just wrong).
>
I didn't really think about code portability, because my target is set
(x86 - some pentium processor). I also wasn't thinking about the
overflow, because that is easily fixed in proper unit tests.
The problem I am facing is the algorithm can not be used, because it
uses too much CPU :(
> So, yes, there are better ways.
>
> The first step on the road to betterness is to define the desired
> effect, and the second step is to implement that correctly. Only then
> worry about micro efficiency. If at all (and if you do, first measure,
> Measure, MEASURE).
>
I did measure, and the profiler told that 30% time is spent on rescaling.
ps Is this another problem for comp.programming ? Is there a NG where to
post optimization questions?
"Vladimir Jovic" <vlada...@gmail.com> wrote in message
news:i6887u$k0$1...@news.albasani.net...
> Alf P. Steinbach /Usenet wrote:
>
>> When you have decided /what/ your rescaling should do, then you should
>> implement it correctly. That means e.g. avoiding overflow. The above code
>> can easily overflow, so as portable code it's generally incorrect no
>> matter the details of the desired scaling (it's like a car with square
>> wheels, doesn't matter whether it's meant to be a sports car or a lorry,
>> it's just wrong).
>>
>
> I didn't really think about code portability, because my target is set
> (x86 - some pentium processor). I also wasn't thinking about the overflow,
> because that is easily fixed in proper unit tests.
>
What overflow exactly? If your platform uses 16 bits for unsigned short and
32 bits for unsigned int then overflow should not occur, i.e. 0xFFFF *
0xFFFF < 0xFFFFFFFF.
/Leigh
comp.lang.asm.x86, maybe?
Consider utilizing SSE (that's what those are called, IIRC). There you
can vectorize your multiplications, divisions, additions... Ask in the
x86 newsgroup. Actually, a decent compiler should be able to vectorize
some of those for you. BTW, "too much CPU" is not a good description of
your code's deficiencies. Too much compared to WHAT?
"Vladimir Jovic" <vlada...@gmail.com> wrote in message
news:i6887u$k0$1...@news.albasani.net...
>
>> When you have decided /what/ your rescaling should do, then you should
>> implement it correctly. That means e.g. avoiding overflow. The above code
>> can easily overflow, so as portable code it's generally incorrect no
>> matter the details of the desired scaling (it's like a car with square
>> wheels, doesn't matter whether it's meant to be a sports car or a lorry,
>> it's just wrong).
>>
>
> I didn't really think about code portability, because my target is set
> (x86 - some pentium processor). I also wasn't thinking about the overflow,
> because that is easily fixed in proper unit tests.
>
Ignore my other reply, I posted too quick as usual. unsigned short *
unsigned short is promoted to int not unsigned int hence the overflow
possibility. To avoid this cast the two operands to (unsigned int) before
the multiply to avoid possible overflow even though it probably makes no
difference to the final result on your platform.
/Leigh
The SSE doesn't have integer division.
Will try x86 group
> some of those for you. BTW, "too much CPU" is not a good description of
> your code's deficiencies. Too much compared to WHAT?
IMO the description is ok if you are trying to optimize the code, but
that is not relevant to the question. If your algorithm spends too much
time in one loop, you know you have to optimize everything in that loop.
For example, your response in else thread to convert to double, multiply
and divide and convert back to integer is going to slow down my algorithm.
You can multiply by the reciprocal.
Since your input is 16-bits only, float instead of double should be enough.
Floating point to int conversion is quite slow in C/C++ on x86, you
should look at SSE compiler intrinsics (that stuff is portable across
the major x86/x64 compilers). With SSE, you can also do 4 conversions in
parallel.
\\
You seem to have compile time constants for your scaling. If you make
those available to the compiler (declare them as integral const values),
the compiler will be able to optimize away the division (at least Visual
C++, GCC do).
--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>