
Elementary string handling -- trying to optimise big O


Paul

Jun 5, 2015, 12:10:43 PM
I am writing code to compress string representations by representing (for example) "aaa" as "a3". The details of the problem, together with my solution, are provided below. I'm a bit concerned about my Append functions. The reason I wrote them this way, rather than simply using the standard library function, append, is that push_back guarantees amortized constant time (I think), meaning that my version of Append is linear in the length of the appended word. In contrast, I couldn't find any guarantee that the standard append is any better than linear in the length of the whole concatenated result. Is this reasoning sound? Many thanks for your help.

Paul

#include <string>

using std::string;
using std::to_string;

// Implementing a method to perform basic string compression using the counts
// of repeated characters. For example, the string aabcccccaaa would become
// a2b1c5a3. If the compressed string would not become smaller than the original
// string, return the original string. You can assume the string has only upper
// and lower case letters (a - z).

void Append(string& str1, const string& str2)
{
    for (auto c : str2)
        str1.push_back(c);
}

void Append(string& str1, int x)
{
    Append(str1, to_string(x));
}

string compression(string input)
{
    const int inputSize = static_cast<int>(input.length());

    if (inputSize <= 2)
        return input;

    string results;
    int repeated = 1; // Counting occurrences of each character

    for (int i = 1; i < inputSize; ++i)
    {
        if (input[i] == input[i - 1])
            ++repeated;
        else
        {
            results.push_back(input[i - 1]);
            Append(results, repeated);
            repeated = 1;
        }
    }

    // Flush the final run.
    results.push_back(input[inputSize - 1]);
    Append(results, repeated);

    return results.length() < input.length() ? results : input;
}
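
For concreteness, a minimal test driver exercising the expected behaviour (the assertions follow from the spec in the comment above; this assumes the code above compiles as posted):

#include <cassert>
#include <iostream>

int main()
{
    assert(compression("aabcccccaaa") == "a2b1c5a3");
    assert(compression("abc") == "abc"); // "a1b1c1" would be longer, so the original is returned
    assert(compression("aa") == "aa");   // short inputs are returned unchanged
    std::cout << "all tests passed\n";
}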

Paavo Helde

Jun 5, 2015, 12:54:43 PM
Paul <peps...@gmail.com> wrote in
news:11bd72d2-fdbf-4447...@googlegroups.com:

> I am writing code to compress string representations by representing
> (for example) "aaa" by a3.

So, how can you tell later if the original string was "aaa" or "a3"? Oh,
I see, it's always a char and a count, so "abc" gets "compressed" to
"a1b1c1". OK, fine, but how do you uncompress "a111b5"? It might be
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbb" or "a1bbbbb".

> The details of the problem, together with
> my solution, are provided below. I'm a bit concerned about my Append
> functions. The reason I wrote them this way, rather than simply using
> the normal library function, append, is that push_back guarantees
> amortized constant time (I think), meaning that my version of append
> is linear in the length of the appended word. In contrast to this, I
> couldn't find any guarantee that standard append is any better than
> being linear in the length of the concatenated word. Is this
> reasoning sound? Many thanks for your help.

Regardless of the standardese, your Append() is most probably slower than
the library string append. Your Append() pushes each char separately,
meaning that for each char std::string needs to check for free buffer
space and update the string length field. The standard append does these
things only once for the whole appended string.

But as always with optimizations, there is no way to find out the
real truth other than to measure it. Please post back your speed testing
results here; it would be interesting to know such things.
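
A rough sketch of such a measurement (the chunk size and iteration count are arbitrary; a serious test should repeat runs and keep the optimizer from discarding the work):

#include <chrono>
#include <iostream>
#include <string>

// Compare the char-by-char push_back loop against std::string::append.
int main()
{
    using clock = std::chrono::steady_clock;
    const std::string chunk(32, 'x');
    const int iterations = 1000000;

    std::string a;
    auto t0 = clock::now();
    for (int i = 0; i < iterations; ++i)
        for (char c : chunk)
            a.push_back(c);   // capacity check + length update per char
    auto t1 = clock::now();

    std::string b;
    for (int i = 0; i < iterations; ++i)
        b.append(chunk);      // one capacity check per chunk
    auto t2 = clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "push_back loop: " << ms(t1 - t0).count() << " ms\n"
              << "append:         " << ms(t2 - t1).count() << " ms\n"
              << "(sizes: " << a.size() << ", " << b.size() << ")\n"; // keep the work observable
}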

>
> Paul
>
> #include <string>
> // Implementing a method to perform basic string compression using the counts
> // of repeated characters. For example, the string aabcccccaaa would become
> // a2b1c5a3. If the compressed string would not become smaller than the original
> // string, return the original string. You can assume the string has only upper
> // and lower case letters (a - z).
>
> void Append(string& str1, const string& str2)
> {
> for(auto c: str2)
> str1.push_back(c);
> }
>
> void Append(string& str1, int x)
> {
> Append(str1, to_string(x));
> }
>
> string compression(string input)

If you are concerned about speed, this should be const string& input.

> {
> const int inputSize = input.length();
>
> if(inputSize <= 2)
> return input;
>
> string results;
> int repeated = 1; // Counting occurrences of each character
>
> for(int i = 1; i < inputSize; ++i)
> if(input[i] == input[i-1])
> ++repeated;
> else
> {
> results.push_back(input[i-1]);
> Append(results, repeated);
> repeated = 1;
> }
>
> results.push_back(input[inputSize - 1]);
> Append(results, repeated);
>
> return results.length() < inputSize ? results : input;

Oh, I see you still cannot tell if "a3" should be uncompressed to "aaa"
or "a3" ;-)

>
> }
>

Cheers
Paavo

Paavo Helde

Jun 5, 2015, 1:12:39 PM
Paavo Helde <myfir...@osa.pri.ee> wrote in
news:XnsA4B0CA859B431m...@216.166.105.131:
>> I'm a bit concerned about my Append
>> functions.
>> void Append(string& str1, int x)
>> {
>> Append(str1, to_string(x));
>> }

Another note: it appears the string append is called only once per
to_string() call. If this undisclosed to_string() indeed does a
conversion to the decimal string representation (equivalent to printf("%d")),
then you can probably forget any concerns about the speed of your Append()
- the int-to-string conversion is a fairly heavyweight operation and will
probably dominate the timings. Your profiler will tell you more exactly.
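
If that conversion did turn out to dominate, one could sidestep the temporary string entirely. A sketch, assuming non-negative counts as in this use case (the name AppendInt is mine):

#include <string>

// Append the decimal representation of x directly into str,
// avoiding the temporary created by to_string. Sketch only:
// assumes x >= 0, as it is for a repeat count.
void AppendInt(std::string& str, int x)
{
    char buf[16];
    int n = 0;
    do {
        buf[n++] = static_cast<char>('0' + x % 10);
        x /= 10;
    } while (x > 0);
    while (n > 0)
        str.push_back(buf[--n]); // digits were produced in reverse order
}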

hth
Paavo

Luca Risolia

Jun 5, 2015, 1:22:03 PM
On 05/06/2015 18:10, Paul wrote:
> I'm a bit concerned about my Append
> functions. The reason I wrote them this way, rather than simply
> using the normal library function, append, is that push_back
> guarantees amortized constant time (I think), meaning that my version
> of append is linear in the length of the appended word. In contrast
> to this, I couldn't find any guarantee that standard append is any
> better than being linear in the length of the concatenated word. Is
> this reasoning sound?

Theory is one thing, reality is (often) another. For example,
std::vector is known to be preferable to std::list in many practical
cases where complexity theory would otherwise suggest using the latter.
As a general rule, I suggest that you at least measure the performance
of your standard library implementation against concrete use cases
before considering reinventing it.

Paavo Helde

Jun 5, 2015, 1:50:18 PM
> I am writing code to compress string representations by representing
> (for example) "aaa" by a3. The details of the problem, together with

> string compression(string input)
> {
[...]
>
> return results.length() < inputSize ? results : input;
> }

It is interesting to note that this approach attempts to defeat the
pigeonhole principle and compress all inputs to at most the size of the
original. However, this is mathematically not possible (assuming lossless
compression): there are more strings of length n than there are strings
shorter than n, so any lossless scheme that shrinks some inputs must
leave others at least as long. See:

http://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_applications

So, it seems some more work is needed with the design of this compression
algorithm ;-)

Cheers
Paavo

Mel

Jun 5, 2015, 2:49:25 PM
On Fri, 05 Jun 2015 12:50:03 -0500, Paavo Helde
<myfir...@osa.pri.ee> wrote:
> Paul <peps...@gmail.com> wrote in
> news:11bd72d2-fdbf-4447...@googlegroups.com:


> > I am writing code to compress string representations by representing
> > (for example) "aaa" as "a3". The details of the problem, together with


> > string compression(string input)
> > {
> [...]
> >
> > return results.length() < inputSize ? results : input;
> > }


> It is interesting to note that this approach attempts to defeat the
> pigeonhole principle and compress all inputs to at most the size of
> the original. However, this is mathematically not possible (assuming
> lossless compression). See:

The question is: is this compression or hashing? :)))

--
Press any key to continue or any other to quit

Paul

Jun 5, 2015, 2:51:17 PM
Thanks, Paavo

You make a number of interesting points; I'll consider some of them in turn. Uncompression is uniquely determined by the constraint, given in the comments, that inputs contain only chars that are letters of the English alphabet. Furthermore, suppose we ignore this constraint and thereby conclude that some strings can be uncompressed in different ways. I don't understand why this would be any type of problem -- we would simply have a non-injective function. Nobody said that we wanted to uncompress strings.
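
To make that concrete, here is a minimal sketch of a decompressor relying only on the letters-only constraint: any digit must be part of a count, so parsing is unambiguous. (The name decompress is mine, and it assumes its input actually took the compressed char-plus-count form.)

#include <cctype>
#include <string>

std::string decompress(const std::string& s)
{
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        char c = s[i++];           // the repeated letter
        int count = 0;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
            count = count * 10 + (s[i++] - '0');
        out.append(count, c);      // expand the run
    }
    return out;
}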

Excellent point about the function signature -- I was careless here.
I don't understand why you object to this code:

return results.length() < inputSize ? results : input;

If you believe that it sometimes causes a runtime error or incorrect output, can you please give inputs that demonstrate this? If you believe that it is inefficient, can you please explain why?

Thank You,

Paul

Paul

Jun 5, 2015, 2:55:34 PM
I am trying to implement the above function, and "compression" is merely the name I chose to give it. If I had called the function "hashing" or "elephant", that wouldn't have changed my question. I don't understand the computer science theory here at all, but I'm reading CLR on algorithms, so perhaps they explain this.

Paul

Paavo Helde

Jun 5, 2015, 3:26:54 PM
Paul <peps...@gmail.com> wrote in
news:2ad532cc-675c-48e5...@googlegroups.com:

> Uncompression is uniquely determined by the constraint,
> given in the comments, that inputs only contain chars that are letters
> of the English alphabet.

OK, I had overlooked that. Yes, if each input byte carries less than 6
bits of information (log2(52) is about 5.7), then a sufficiently long
input can always be compressed into a shorter string.

> Furthermore, supposing that we ignore this
> constraint and thereby conclude that some strings can be uncompressed
> in different ways. I don't understand why this would be any type of
> problem -- we would simply have a non-injective function. Nobody said
> that we wanted to uncompress strings.

In that case 'return "";' should be the fastest implementation.

> I don't understand why you object to this code:
>
> return results.length() < inputSize ? results : input;
>
> If you believe that it sometimes results in a runtime error, please
> can you give inputs that cause incorrect output? If you believe that
> this is inefficient, please can you explain why?

No, I somehow assumed all bits in the input were used, and if that were
so, this statement would make lossless compression impossible.

Cheers
Paavo

Vir Campestris

Jun 5, 2015, 4:45:47 PM
On 05/06/2015 19:51, Paul wrote:
> Uncompression is uniquely determined by the constraint, given in the comments, that inputs only contain chars that are letters of the English alphabet.

So you've only got 52 possible input characters? Or is it 26?

Surely the first stage is to throw away all the spare bits. You only
need 6, and that leaves you some spares to mean things like "There
follows a 4 bit repeat count" or such. Unless of course one of the
requirements is that the output should still be printable.

But I wonder why you don't just use a standard compression scheme, such as
zip. If the output needs to be printable, run it through base64.
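
A sketch of that first stage, assuming the input is letters only (the name pack6 is mine; codes 52-63 are left free for escape markers such as repeat counts, which this sketch doesn't implement):

#include <cstdint>
#include <string>
#include <vector>

// Pack letters into 6 bits each: 'a'-'z' -> 0-25, 'A'-'Z' -> 26-51.
std::vector<std::uint8_t> pack6(const std::string& in)
{
    std::vector<std::uint8_t> out;
    std::uint32_t acc = 0;
    int bits = 0;
    for (char ch : in) {
        std::uint32_t code = (ch >= 'a') ? ch - 'a' : ch - 'A' + 26;
        acc = (acc << 6) | code;
        bits += 6;
        while (bits >= 8) {   // emit whole bytes as they fill up
            out.push_back(static_cast<std::uint8_t>(acc >> (bits - 8)));
            bits -= 8;
        }
    }
    if (bits > 0)             // flush remaining bits, zero-padded
        out.push_back(static_cast<std::uint8_t>(acc << (8 - bits)));
    return out;
}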

Andy