C++ programming challenge

Peter Jansson

unread,

Jun 10, 2009, 6:54:34 AM6/10/09

to

Dear news group,

I have created a small programming challenge for those of you who are
interested in challenging your Standard C++ programming skills. The
challenge is about counting character frequency in large texts,
perhaps useful for spam filtering or classical crypto analysis. You
can read more about it here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

With kind regards,
Peter Jansson

Ioannis Vranos

unread,

Jun 10, 2009, 7:40:41 AM6/10/09

to

May you provide the about.com relevant URL?

--
Ioannis A. Vranos

C95 / C++03 Developer

http://www.cpp-software.net

Peter Jansson

unread,

Jun 10, 2009, 8:05:00 AM6/10/09

to

On 10 Juni, 13:40, Ioannis Vranos <ivra...@freemail.gr> wrote:
> Peter Jansson wrote:
> > Dear news group,
>
> > I have created a small programming challenge for those of you who are
> > interested in challenging your Standard C++ programming skills. The
> > challenge is about counting character frequency in large texts,
> > perhaps useful for spam filtering or classical crypto analysis. You
> > can read more about it here:
>

> >http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

>
> > With kind regards,
> > Peter Jansson
>
> May you provide the about.com relevant URL?
>
> --
> Ioannis A. Vranos
>
> C95 / C++03 Developer
>
> http://www.cpp-software.net

There is no relevant about.com relevant URL at the time of this
writing. If they approve and use my challenge suggestion, I will
update my weblog post accordingly.

blargg

unread,

Jun 10, 2009, 10:47:16 AM6/10/09

to

Peter Jansson wrote:
[...]
> http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html
[from the blog:]
> Write a program that calculate the frequency of characters in a
> given large text file as fast as possible. Write the frequency
> results and how long time it took on standard output in the format
> given below. Submit your code in a comment to this post. Rules:
>
> 1. Your program should be case insensitive when it counts letters.
>
> 2. The program should be written using Standard C or C++ so that
> it can easily be compiled and run on various platforms.
>
> 3. The text file to use for counting can be downloaded in a
> compressed zip archive using one of the following ways:
>
> * Via bittorrent using this torrent file.
> * Via GNUnet (link will be written here when upload is complete).
>
> 3. You must begin timing your code just before the text file is
> opened and you must end the timing after the results have been
> written to standard output. For timing, you can use the code given
> below (it works on POSIX and Windows systems).

This seems silly. Either you have an array integers and increment based on
the incoming character (read in large chunks for speed), then later
combine the upper- and lower-case letters, or simply print out a
pre-computed list of frequencies, as the rules state that a known file
will be fed to the program. In the former (non-cheating) case, the
program's run-time will be based almost entirely on I/O speed, and thus
simply a measure of the method used to read the file. In C++, there are
only two ways: iostreams and C-style FILEs.

What am I missing?

Chris M. Thomasson

unread,

Jun 10, 2009, 11:01:10 AM6/10/09

to

"Peter Jansson" <webm...@jansson.net> wrote in message
news:12ba30e3-b04b-4c32...@g20g2000vba.googlegroups.com...

Here is a stupid simple solution; needs ASCII:
____________________________________________________________________
#include <stdio.h>

#define BUFFER 32768U

static
unsigned long
count_chars(
void const* buf,
size_t size,
unsigned long* counts
) {
unsigned char const* cur = buf;
unsigned char const* const end = cur + size;
unsigned long total = 0;
while (cur < end) {
if (*cur >= 'a' && *cur <= 'a' + 26) {
++total;
++counts[*cur - 'a'];
} else if (*cur >= 'A' && *cur < 'A' + 26) {
++total;
++counts[*cur - 'A'];
}
++cur;
}
return total;
}

int main(void) {
size_t i;
FILE* file;
unsigned long total = 0;
unsigned long counts[26] = { 0 };

file = fopen("./data.txt", "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
total += count_chars(buf, size, counts);
printf("\rcharacters processed: %lu", total);
if (feof(file) || ferror(file)) break;
}

fclose(file);
}

puts("\n____________________________________________________");

for (i = 0; i < 26; ++i) {
printf("%c %%%f\n", i + 'a',
(total) ? ((double)counts[i] / total) * 100 : 0.0);
}

return 0;
}
____________________________________________________________________

Of course, if you wanted to parse huge multi-gigabyte files, this simplistic
approach would not be ideal. I can think of several different ways to
improve this thing through parallelization and memory mapped files. But,
that would not be standard C or C++.

Chris M. Thomasson

unread,

Jun 10, 2009, 11:08:22 AM6/10/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:FUPXl.12607$GD4....@newsfe15.iad...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ARGHGHGHAHREHAH!!!!

If course that should be:

if (*cur >= 'a' && *cur < 'a' + 26) {

> ++total;
> ++counts[*cur - 'a'];
> } else if (*cur >= 'A' && *cur < 'A' + 26) {
> ++total;
> ++counts[*cur - 'A'];
> }
> ++cur;
> }
> return total;
> }

> [...]

I repost program:
____________________________________________________________________
#include <stdio.h>
#include <assert.h>

#define BUFFER 32768U

static
unsigned long
count_chars(
void const* buf,
size_t size,
unsigned long* counts
) {
unsigned char const* cur = buf;
unsigned char const* const end = cur + size;
unsigned long total = 0;
while (cur < end) {

if (*cur >= 'a' && *cur < 'a' + 26) {
assert(*cur - 'a' >= 0 && *cur - 'a' < 26);

++total;
++counts[*cur - 'a'];
} else if (*cur >= 'A' && *cur < 'A' + 26) {

assert(*cur - 'A' >= 0 && *cur - 'A' < 26);

++total;
++counts[*cur - 'A'];
}
++cur;
}
return total;
}

int main(void) {
size_t i;
FILE* file;
unsigned long total = 0;
unsigned long counts[26] = { 0 };

file = fopen("c:/newvz/vztest/data.txt", "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
total += count_chars(buf, size, counts);
printf("\rcharacters processed: %lu", total);
if (feof(file) || ferror(file)) break;
}

fclose(file);
}

puts("\n____________________________________________________");

for (i = 0; i < 26; ++i) {
printf("%c %%%f\n", i + 'a',
(total) ? ((double)counts[i] / total) * 100 : 0.0);
}

return 0;
}

____________________________________________________________________

Sorry about that retarded mistake!

;^(...

Ioannis Vranos

unread,

Jun 10, 2009, 1:12:57 PM6/10/09

to

When you say characters" you mean letters, symbols, everything in the basic character set (1-127), or it may
include wide_characters?

Ioannis Vranos

unread,

Jun 10, 2009, 1:27:39 PM6/10/09

to

Ioannis Vranos wrote:
> Peter Jansson wrote:
>> Dear news group,
>>
>> I have created a small programming challenge for those of you who are
>> interested in challenging your Standard C++ programming skills. The
>> challenge is about counting character frequency in large texts,
>> perhaps useful for spam filtering or classical crypto analysis. You
>> can read more about it here:
>>
>> http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html
>>
>>
>> With kind regards,
>> Peter Jansson
>
>
> When you say characters" you mean letters, symbols, everything in the
> basic character set (1-127), or it may include wide_characters?

Since this is comp.lang.c++, I will post a high-level C++ solution.

Preliminary codes, while waiting for an answer.

#include <map>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

map<char, unsigned long> characterFrequencies;

// The array where the read characters will be stored.
char buffer[BUFSIZ];

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// We disable synchronisation with stdio, to speed up C++ I/O.
ios_base::sync_with_stdio(false);

string characters= "ABDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile.good())
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<<"\n\n\nThe letter frequencies are:\n";

for(string::size_type i= 0; i< characters.size(); ++i)
cout<< characters[i]<< ": "<< characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ]<< "\n";

// We "stop" timing.
time2= clock();

// We convert he timing to seconds.
double totalTimeInSeconds= (time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< fixed<< totalTimeInSeconds<< " seconds\n";

cout<<"\n\nHave a nice day!\n";

Ioannis Vranos

unread,

Jun 10, 2009, 1:55:15 PM6/10/09

to

High level C++ solution:

Code 1:

#include <map>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.
map<char, long double> characterFrequencies;

return EXIT_FAILURE;
}

string characters= "ABDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

}while(not inputFile.eof());

cout<<fixed<< "\n\n\nThe letter frequencies are:\n";

long double totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)

totalcharacterFrequencies+= characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ];

for(string::size_type i= 0; i< characters.size(); ++i)

cout<< characters[i]<< ": "<< (characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ])/ totalcharacterFrequencies* 100<< "%\n";

// We "stop" timing.
time2= clock();

// We convert the timing to seconds.

double totalTimeInSeconds= (time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< fixed<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}

Code 2:

The above code more "inlined" (denoted with "==>").

This code is not any faster with my compiler (as it shouldn't), than the original code above.

#include <map>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.
map<char, long double> characterFrequencies;

// The array where the read characters will be stored.
char buffer[BUFSIZ];

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// We disable synchronisation with stdio, to speed up C++ I/O.
ios_base::sync_with_stdio(false);

string characters= "ABDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

// We start timing.
time1= clock();

// We open the file
ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile.good())
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

==> ADDED: const streamsize SIZE= inputFile.gcount();

==> CHANGED: for(streamsize i= 0; i< SIZE; ++i)
++characterFrequencies[ buffer[i] ];

}while(not inputFile.eof());

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.

cout<<fixed<< "\n\n\nThe letter frequencies are:\n";

long double totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)

totalcharacterFrequencies+= characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ];

for(string::size_type i= 0; i< characters.size(); ++i)

cout<< characters[i]<< ": "<< (characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ])/ totalcharacterFrequencies* 100<< "%\n";

// We "stop" timing.
time2= clock();

// We convert the timing to seconds.

double totalTimeInSeconds= (time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< fixed<< totalTimeInSeconds<< " seconds.\n";

Ioannis Vranos

unread,

Jun 10, 2009, 1:57:25 PM6/10/09

to

Corrected:

High level C++ solution:

Code 1:

#include <map>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.
map<char, long double> characterFrequencies;

// The array where the read characters will be stored.
char buffer[BUFSIZ];

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// We disable synchronisation with stdio, to speed up C++ I/O.
ios_base::sync_with_stdio(false);

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

}while(not inputFile.eof());

long double totalcharacterFrequencies= 0;

Code 2:

return EXIT_FAILURE;
}

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

blargg

unread,

Jun 10, 2009, 2:49:36 PM6/10/09

to

> Peter Jansson wrote:
> [...]
> >
http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html
> [from the blog:]
> > Write a program that calculate the frequency of characters in a
> > given large text file as fast as possible. Write the frequency
> > results and how long time it took on standard output in the format
> > given below. Submit your code in a comment to this post. Rules:
> >
> > 1. Your program should be case insensitive when it counts letters.
> >
> > 2. The program should be written using Standard C or C++ so that
> > it can easily be compiled and run on various platforms.
> >
> > 3. The text file to use for counting can be downloaded in a
> > compressed zip archive using one of the following ways:
> >
> > * Via bittorrent using this torrent file.
> > * Via GNUnet (link will be written here when upload is complete).
> >
> > 3. You must begin timing your code just before the text file is
> > opened and you must end the timing after the results have been
> > written to standard output. For timing, you can use the code given
> > below (it works on POSIX and Windows systems).

[...]

Here's a simple solution that reads from stdin and doesn't do any timing
(for that, just run with the Unix "time" command). I see no reason to
complicate it with iostreams.

#include <limits.h>
#include <stdio.h>
#include <ctype.h>

int main()
{
size_t f [UCHAR_MAX+1] = { 0 };
size_t size = 0;
int c;

while ( (c = getchar()) != EOF )
{
f [c]++;
size++;
}

for ( c = 0; c <= UCHAR_MAX; ++c )
if ( islower( c ) )
f [toupper( c )] += f [c];

for ( c = 0; c <= UCHAR_MAX; ++c )
if ( f [c] && !islower( c ) && isprint( c ) )
printf( "%c %.3f%%\n", c, f[c]*100.0/size );

return 0;
}

Bart van Ingen Schenau

unread,

Jun 10, 2009, 3:31:17 PM6/10/09

to

Peter Jansson wrote:

Lets see how the std algorithms measure up to the rest (only tested on
its own source so far):

#include <fstream>
#include <iostream>
#include <numeric>
#include <map>
#include <ctype.h>
#include <iterator>
#include <algorithm>
#include <time.h>

struct accum_original
{
std::map<char, long> counts;
long total;
accum_original() : counts(), total(0) {}
void count_char(char c)
{
total++;
counts[tolower((unsigned char)c)]++;
}
const std::map<char, long>& getCounts()
{
return counts;
}
long getTotal()
{
return total;
}
};

struct accum_fast
{
struct data {
std::map<char, long> counts;
long total;
int refcount;
data() : counts(), total(0), refcount(1) {}
}* payload;
accum_fast() : payload(new data) {}
accum_fast(const accum_fast& other) : payload(other.payload) {
payload->refcount++; }
~accum_fast() { if (--payload->refcount == 0) { delete payload; } }
void swap(accum_fast& other)
{
data* temp(other.payload);
other.payload = payload;
payload = temp;
}
accum_fast& operator=(accum_fast rhs)
{
swap(rhs);
return *this;
}
void count_char(char c)
{
payload->total++;
payload->counts[tolower((unsigned char)c)]++;
}
const std::map<char, long>& getCounts()
{
return payload->counts;
}
long getTotal()
{
return payload->total;
}
};
typedef accum_fast accum;

accum count_chars(accum acc, char c)
{
acc.count_char(c);
return acc;
}

struct scaler
{
long total;
scaler(long total_) : total(total_) {}
std::pair<const char, double> operator()(std::pair<const char, long>
value)
{
return std::pair<const char, double>(value.first,
(100.0*value.second)/total);
}
};

namespace std {
ostream& operator<<(ostream& os, const pair<const char, double>& value)
{
os << '\'' << value.first << "' " << value.second << "%\n";
return os;
}
}

int main(int argc, char** argv)
{
clock_t time1, time2;
accum acc;

if (argc < 2)
{
time1 = clock();
acc = std::accumulate(std::istream_iterator<char>(std::cin),
std::istream_iterator<char>(), acc, count_chars);
}
else
{
time1 = clock();
std::ifstream file(argv[1]);
acc = std::accumulate(std::istream_iterator<char>(file),
std::istream_iterator<char>(), acc, count_chars);
}

std::transform(acc.getCounts().begin(), acc.getCounts().end(),
std::ostream_iterator<std::pair<const char, double>
>(std::cout, ""),
scaler(acc.getTotal()));
time2 = clock();
std::cout << "Time taken: " << (time2 - time1) << " clock ticks = " <<
(time2 - time1)*1.0/CLOCKS_PER_SEC << " seconds" << std::endl;
}

--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

Peter Jansson

unread,

Jun 10, 2009, 4:53:48 PM6/10/09

to

Hi,

I am continously updating my weblog with timing results for all entries,
posted at:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Sincerely,
Peter Jansson

agelos.an...@gmail.com

unread,

Jun 10, 2009, 6:55:16 PM6/10/09

to

> http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

>
> With kind regards,
> Peter Jansson

Does this work? :P

#include <fstream>
#include <iostream>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

int frequency[128];
memset(frequency, 0, sizeof(int) * 128);

const int BUFFER_SIZE = 4096;
char buffer[BUFFER_SIZE];
unsigned int totalChars = 0;

if (argc != 2)

{
cerr << "\nUsage: " << argv[0] << " fileNameToRead\n\n";
return EXIT_FAILURE;
}

clock_t time1, time2;

time1 = clock();

ifstream inputFile(argv[argc -1]);

if (!inputFile.good())
{
cerr << "Bad file, ciao!\n";
return EXIT_FAILURE;
}

int *p;
char *end;

do
{
inputFile.read(buffer, BUFFER_SIZE);
int bytesRead = inputFile.gcount();

totalChars += bytesRead;

end = buffer + bytesRead;
for (p = (int *)buffer; p <= (int *)(end - sizeof(int)); p++)
{
frequency[(*p & 0xff000000) >> 24]++;
frequency[(*p & 0x00ff0000) >> 16]++;
frequency[(*p & 0x0000ff00) >> 8 ]++;
frequency[(*p & 0x000000ff) ]++;
}

}while(!inputFile.eof());

char *c = (char *)p;
while (c != end)
{
frequency[*c]++;
c++;
}

inputFile.close();

double dTotalChars = (double)totalChars;

for (int i = 'A'; i <= 'Z'; i++)
{
cout << (char)i << " = " << (frequency[i] / dTotalChars) * 100.0 <<
"%\n";
}

for (int i = 'a'; i <= 'z'; i++)
{
cout << (char)i << " = " << (frequency[i] / dTotalChars) * 100.0 <<
"%\n";
}

time2 = clock();

double totalTimeInSeconds = (time2- time1);
cout << "Time elapsed (ms) : " << totalTimeInSeconds << endl;
}

agelos.an...@gmail.com

unread,

Jun 10, 2009, 7:10:25 PM6/10/09

to

On 10 Ιούν, 13:54, Peter Jansson <webmas...@jansson.net> wrote:

> http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

>
> With kind regards,
> Peter Jansson

Actually, just changing the line :

ifstream inputFile(argv[argc -1]);

with
ifstream inputFile(argv[argc -1],std::ios::binary);

beats any other post (on my PC at least :))

Here's the new code (just added the binary open mode)

#include <fstream>
#include <iostream>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

int frequency[128];
memset(frequency, 0, sizeof(int) * 128);

const int BUFFER_SIZE = 4096;
char buffer[BUFFER_SIZE];
unsigned int totalChars = 0;

if (argc != 2)
{
cerr << "\nUsage: " << argv[0] << " fileNameToRead\n\n";
return EXIT_FAILURE;
}

clock_t time1, time2;

time1 = clock();

ifstream inputFile(argv[argc -1],std::ios::binary);

Chris M. Thomasson

unread,

Jun 10, 2009, 7:42:22 PM6/10/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:o%PXl.5964$f36....@newsfe19.iad...

>
> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
> news:FUPXl.12607$GD4....@newsfe15.iad...
>> "Peter Jansson" <webm...@jansson.net> wrote in message
>> news:12ba30e3-b04b-4c32...@g20g2000vba.googlegroups.com...
>>> Dear news group,
>>>
>>> I have created a small programming challenge for those of you who are
>>> interested in challenging your Standard C++ programming skills. The
>>> challenge is about counting character frequency in large texts,
>>> perhaps useful for spam filtering or classical crypto analysis. You
>>> can read more about it here:
>>>
>>> http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html
>>
>> Here is a stupid simple solution; needs ASCII:

>> [...]

Here, let me repost this crappy little thing using a bigger buffer, without
using any asserts and calls to printf during the actual processing phase:
_________________________________________________________________________
#include <stdio.h>

#define BUFFER 65536U

static
unsigned long
count_chars(
void const* buf,
size_t size,
unsigned long* counts
) {
unsigned char const* cur = buf;
unsigned char const* const end = cur + size;
unsigned long total = 0;
while (cur < end) {
if (*cur >= 'a' && *cur < 'a' + 26) {

++total;
++counts[*cur - 'a'];
} else if (*cur >= 'A' && *cur < 'A' + 26) {
++total;
++counts[*cur - 'A'];
}
++cur;
}
return total;
}

int main(void) {
size_t i;
FILE* file;
unsigned long total = 0;
unsigned long counts[26] = { 0 };

file = fopen("./data.txt", "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
total += count_chars(buf, size, counts);

if (feof(file) || ferror(file)) break;
}

fclose(file);
}

for (i = 0; i < 26; ++i) {

printf("%c %%%f\n", i + 'a',
(total) ? ((double)counts[i] / total) * 100 : 0.0);
}

return 0;
}
_________________________________________________________________________

That just might shave one or two milliseconds off... lol

;^)

Ioannis Vranos

unread,

Jun 10, 2009, 8:00:36 PM6/10/09

to

This is an improved version of the original program on run-time performance.

I replaced std::map with std::valarray. std::map is an associative container performing ordering, with worse
random access performance than std::vector. In usual applications it doesn't matter much, but since we use
timings, here is the updated code.

The code continues to be independent on how the basic character set is implemented (e.g. ASCII, EBCDIC).
However since other codes posted with ASCII implementation assumptions were considered equivalent by the OP to
generic portable code, I may post ASCII-specific C++ code tomorrow.

Here's the code:

#include <valarray>

#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.

//
// The C++ basic character set is using the value range [0, 127].
// If we used vector<long double>, it would not have any run-time difference in any modern compiler.
valarray<long double> characterFrequencies(127);

return EXIT_FAILURE;
}

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

}while(not inputFile.eof());

long double totalcharacterFrequencies= 0;

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

Chris M. Thomasson

unread,

Jun 10, 2009, 8:02:46 PM6/10/09

to

"Peter Jansson" <webm...@jansson.net> wrote in message
news:12ba30e3-b04b-4c32...@g20g2000vba.googlegroups.com...

Well, try this one out too:
_______________________________________________________________________________
#include <stdio.h>
#include <limits.h>

#define BUFFER 65536U

int main(int argc, char** argv) {
size_t i;

unsigned long total = 0;

unsigned long counts[UCHAR_MAX + 1] = { 0 };
FILE* file = fopen(argv[1], "rb");

if (file) {
size_t size;
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {

while (size) {
--size;
++counts[buf[size]];
}

if (size < BUFFER) {
if (feof(file) || ferror(file)) break;
}
}

fclose(file);
}

for (i = 0; i < 26; ++i) {

total += counts[i + 'a'];
total += counts[i + 'A'];
}

for (i = 0; i < 26; ++i) {
printf("%c %%%f\n", i + 'a',

(total) ? ((double)(counts[i + 'a'] + counts[i + 'A']) / total) *
100.0 : 0.0);
}

return 0;
}
_______________________________________________________________________________

It just might beat my previous submission by a little bit...

;^)

Ioannis Vranos

unread,

Jun 10, 2009, 8:11:31 PM6/10/09

to

Corrected:

This is an improved version of the original program on run-time performance.

I replaced std::map with std::valarray. std::map is an associative container performing ordering, with worse
random access performance than std::vector. In usual applications it doesn't matter much, but since we use
timings, here is the updated code.

The code continues to be independent on how the basic character set is implemented (e.g. ASCII, EBCDIC).
However since other codes posted with ASCII implementation assumptions were considered equivalent by the OP to
generic portable code, I may post ASCII-specific C++ code tomorrow.

Here's the code:

#include <valarray>

#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.

//
// The C++ basic character set is using the value range [0, 127].
// If we used vector<long double>, it would not have any run-time difference in any modern compiler.

valarray<long double> characterFrequencies(128);

return EXIT_FAILURE;
}

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

}while(not inputFile.eof());

long double totalcharacterFrequencies= 0;

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

giannis t

unread,

Jun 10, 2009, 8:11:40 PM6/10/09

to

check this out:

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>

const int MAX_TEXT_LENGTH = 8192;

int freq[256];
double pctFreq[256];

int main(int argc, char** argv) {

for (int i = 0; i < 256; i++) {
freq[i] = 0;
pctFreq[i] = 0;
}
unsigned char text[MAX_TEXT_LENGTH];
int count = 0;
char c;
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);
std::ifstream file(argv[1], std::ios::binary);
for (; file.read((char*) text, MAX_TEXT_LENGTH);) {
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
c = text[i];
if ((c >= 'A') && (c <= 'Z')) {
++freq[c];
count++;
} else if ((c >= 'a') && (c <= 'z')) {
++freq[c - diff];
count++;
}

}
}
file.close();
for (int i = ('A'); i <= 'Z'; i++) {
pctFreq[i] = round((double(freq[i]) / count)*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
}
return (EXIT_SUCCESS);
}

Ioannis Vranos

unread,

Jun 10, 2009, 8:20:32 PM6/10/09

to

Reading a text file as binary, is not guaranteed to work. It just happens that binary reading is implemented
as text reading in your platform.

Ioannis Vranos

unread,

Jun 10, 2009, 8:22:55 PM6/10/09

to

... and is platform specific (ASCII).

Ioannis Vranos

unread,

Jun 10, 2009, 8:24:15 PM6/10/09

to

It doesn't compile.

john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar main.cc: In function οΏ½int main(int, char**)οΏ½:
main.cc:10: error: οΏ½memsetοΏ½ was not declared in this scope
main.cc:19: error: οΏ½EXIT_FAILUREοΏ½ was not declared in this scope
main.cc:31: error: οΏ½EXIT_FAILUREοΏ½ was not declared in this scope
john@john-laptop:~/Projects/anjuta/cpp/src$

Ioannis Vranos

unread,

Jun 10, 2009, 8:25:55 PM6/10/09

to

agelos.an...@gmail.com wrote:
>
> Actually, just changing the line :
> ifstream inputFile(argv[argc -1]);
> with
> ifstream inputFile(argv[argc -1],std::ios::binary);
>
> beats any other post (on my PC at least :))
>
> Here's the new code (just added the binary open mode)

Binary mode isn't the same as text mode (it will not work) in all platforms.

Ioannis Vranos

unread,

Jun 10, 2009, 8:28:00 PM6/10/09

to

It doesn't compile:

john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar

main.cc: In function ï¿½long unsigned int count_chars(const void*, size_t, long unsigned int*)ï¿½:
main.cc:14: error: invalid conversion from ï¿½const void*ï¿½ to ï¿½const unsigned char*ï¿½
main.cc: In function ï¿½int main()ï¿½:
main.cc:53: warning: format ï¿½%cï¿½ expects type ï¿½intï¿½, but argument 2 has type ï¿½long unsigned intï¿½
john@john-laptop:~/Projects/anjuta/cpp/src$

Also binary reading isn't the same with text reading in all platforms (that is, the program will not work).

Ioannis Vranos

unread,

Jun 10, 2009, 8:32:22 PM6/10/09

to

Ioannis Vranos wrote:
>
> ... and is platform specific (ASCII).

I mean your code is ASCII-specific. Not that text reading is ASCII-specific.

Chris M. Thomasson

unread,

Jun 10, 2009, 9:02:26 PM6/10/09

to

"Ioannis Vranos" <ivr...@freemail.gr> wrote in message
news:h0pj2d$j61$5...@news.grnet.gr...

> Chris M. Thomasson wrote:
>> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
>> news:o%PXl.5964$f36....@newsfe19.iad...
>>>
>>> "Chris M. Thomasson" <n...@spam.invalid> wrote in message
>>> news:FUPXl.12607$GD4....@newsfe15.iad...
>>>> "Peter Jansson" <webm...@jansson.net> wrote in message
>>>> news:12ba30e3-b04b-4c32...@g20g2000vba.googlegroups.com...
>>>>> Dear news group,
>>>>>
>>>>> I have created a small programming challenge for those of you who are
>>>>> interested in challenging your Standard C++ programming skills. The
>>>>> challenge is about counting character frequency in large texts,
>>>>> perhaps useful for spam filtering or classical crypto analysis. You
>>>>> can read more about it here:
>>>>>
>>>>> http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html
>>>>
>>>> Here is a stupid simple solution; needs ASCII:
>>>> [...]
>>
>> Here, let me repost this crappy little thing using a bigger buffer,
>> without using any asserts and calls to printf during the actual
>> processing phase:
>> _________________________________________________________________________

>> [...]

>> _________________________________________________________________________
>
>
> It doesn't compile:
>
>
> john@john-laptop:~/Projects/anjuta/cpp/src$ g++ main.cc -o foobar
> main.cc: In function ï¿½long unsigned int count_chars(const void*, size_t,
> long unsigned int*)ï¿½:
> main.cc:14: error: invalid conversion from ï¿½const void*ï¿½ to ï¿½const
> unsigned char*ï¿½
> main.cc: In function ï¿½int main()ï¿½:
> main.cc:53: warning: format ï¿½%cï¿½ expects type ï¿½intï¿½, but argument 2 has
> type ï¿½long unsigned intï¿½
> john@john-laptop:~/Projects/anjuta/cpp/src$

Its a C program; use gcc.

> Also binary reading isn't the same with text reading in all platforms
> (that is, the program > will not work).

I don't see a real need to open for text reading. Oh well, shi% happens.

Chris M. Thomasson

unread,

Jun 10, 2009, 9:03:46 PM6/10/09

to

"Ioannis Vranos" <ivr...@freemail.gr> wrote in message

news:h0piug$j61$4...@news.grnet.gr...

> agelos.an...@gmail.com wrote:
>>
>> Actually, just changing the line :
>> ifstream inputFile(argv[argc -1]);
>> with
>> ifstream inputFile(argv[argc -1],std::ios::binary);
>>
>> beats any other post (on my PC at least :))
>>
>> Here's the new code (just added the binary open mode)
>
>
> Binary mode isn't the same as text mode (it will not work) in all
> platforms.

I only care about the platforms in which it does work. ;^)

Andrew Tomazos

unread,

Jun 10, 2009, 9:40:32 PM6/10/09

to

On Jun 10, 12:54 pm, Peter Jansson <webmas...@jansson.net> wrote:
> Dear news group,
>
> I have created a small programming challenge for those of you who are
> interested in challenging your Standard C++ programming skills. The
> challenge is about counting character frequency in large texts,
> perhaps useful for spam filtering or classical crypto analysis. You
> can read more about it here:
>

> http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...
>
> With kind regards,
> Peter Jansson

If you like these sort of mini-program competitions go and take a look
at the algorithm arena at TopCoder <http://www.topcoder.com>.

I warn you - you have to be prepared to have your ass handed to you by
some prepubescent Russian wunderkind. If you want your programming
ego intact *stay away* from that place. :)
-Andrew.

giannis t

unread,

Jun 10, 2009, 10:24:45 PM6/10/09

to

If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it works. Now
a newer approach taking word alignment in mind.

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>

#include <ctime>

const int MAX_TEXT_LENGTH = 65536;

int freq[256];
double pctFreq[256];

int count = 0;

inline
void checkRange(char c) {
if ((c >= 'A') && (c <= 'z')) {
if ((c <= 'Z') || (c >= 'a')) {
++freq[c];
count++;
}
}
}

int main(int argc, char** argv) {

memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
unsigned char text[MAX_TEXT_LENGTH];

int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);

clock_t time1, time2;
time1 = clock();

std::ifstream file(argv[1], std::ios::in);
//std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);

for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {

checkRange(text[i]);
}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {
pctFreq[i] = round((double(freq[i] + freq[i + diff]) / count)

*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
}

time2 = clock();
double totalTimeInSeconds = (time2 - time1) / CLOCKS_PER_SEC;
std::cout << "\n\nThe whole process took " << std::fixed <<
totalTimeInSeconds << " seconds.\n";
return (0);
}

Chris M. Thomasson

unread,

Jun 10, 2009, 10:34:40 PM6/10/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:oQXXl.35$Ks...@newsfe02.iad...

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

lol. How is the that last if block ever going to executed!

> [...]

> _______________________________________________________________________________
>
>
>
> It just might beat my previous submission by a little bit...

This one actually tries to check for errors:
_______________________________________________________________________________
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];

#define BUFFER 65536U

#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {

if (argc > 1) {

size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };

FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;

static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {

size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {

for (i = 0; i < 26; ++i) {

total += GET_COUNT(counts, i);
}

for (i = 0; i < 26; ++i) {

printf("%c %%%.3f\n", (char)(i + 'a'),
(total) ? ((double)GET_COUNT(counts, i) / total) * 100.0 : 0.0);
}

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
_______________________________________________________________________________

GRRRR!

giannis t

unread,

Jun 10, 2009, 10:37:57 PM6/10/09

to

This is the best I have:

#include <stdlib.h>
#include <iostream>
#include <string.h>
#include <iomanip>
#include <math.h>
#include <fstream>

#include "CStopWatch.hpp"

const int MAX_TEXT_LENGTH = 65536;

int freq[256];
double pctFreq[256];
int count = 0;

int main(int argc, char** argv) {

memset(freq, 0, sizeof (int) * 256);
memset(pctFreq, 0, sizeof (double) * 256);
unsigned char text[MAX_TEXT_LENGTH];
int diff = 'a' - 'A';
std::ios_base::sync_with_stdio(false);

CStopWatch s;
s.startTimer();
//std::ifstream file(argv[1], std::ios::in);

std::ifstream file(argv[1], std::ios::in | std::ios::binary);
do {
file.read((char *) text, MAX_TEXT_LENGTH);
for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {

++freq[text[i]];

}
} while (!file.eof());
file.close();
for (char i = 'A'; i <= 'Z'; i++) {

freq[i] += freq[i + diff];
count += freq[i];

}
for (char i = 'A'; i <= 'Z'; i++) {

pctFreq[i] = round((double(freq[i]) / count)*10000) / 100;
std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
}

s.stopTimer();
std::cout << "Time [microseconds]: " << s.getElapsedTime()*1.0E6
<< '\n';
return (EXIT_SUCCESS);
}

Chris M. Thomasson

unread,

Jun 11, 2009, 12:10:38 AM6/11/09

to

Here is my version with some timing code (e.g., `clock()'):
________________________________________________________________
#include <stdio.h>
#include <time.h>

#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];

#define BUFFER 65536U

#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {

if (argc > 1) {

size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };

clock_t const start = clock();

FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;

static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {

size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {

clock_t stop;

for (i = 0; i < 26; ++i) {

total += GET_COUNT(counts, i);

}

for (i = 0; i < 26; ++i) {

printf("%c %%%.3f\n", (char)(i + 'a'),

(total) ? ((double)GET_COUNT(counts, i) / total)
* 100.0 : 0.0);
}

stop = clock();

printf("\nelapsed time: %lums\n",
(unsigned long)((((double)stop - (double)start)
/ CLOCKS_PER_SEC) * 1000.0));

Peter Jansson

unread,

Jun 11, 2009, 1:39:20 AM6/11/09

to

On Wed, 10 Jun 2009, agelos.an...@gmail.com wrote:

> Here's the new code (just added the binary open mode)
>
> #include <fstream>
> #include <iostream>
> #include <ctime>
>
> int main(int argc, char **argv)
> {

/ ... /
> }
>

On my machine I have to add

#include <cstdlib> // EXIT_FAILURE
#include <cstring> // memset

in order to avoid compilation errors.

Can you also check that the elapsed time calculation is correct?

/Peter

James Kanze

unread,

Jun 11, 2009, 4:59:58 AM6/11/09

to

On Jun 10, 4:47 pm, blargg....@gishpuppy.com (blargg) wrote:
> Peter Jansson wrote:

> [...]
> This seems silly. Either you have an array integers and
> increment based on the incoming character (read in large
> chunks for speed), then later combine the upper- and
> lower-case letters, or simply print out a pre-computed list of
> frequencies, as the rules state that a known file will be fed
> to the program. In the former (non-cheating) case, the
> program's run-time will be based almost entirely on I/O speed,
> and thus simply a measure of the method used to read the file.
> In C++, there are only two ways: iostreams and C-style FILEs.

And any differences in speed between iostreams and FILEs will
depend on the library implementation, not my code.

On the other hand, he did say that the actual test would be run
on Ubantu, a flavor of Linux. So the fastest solution is
probably to use mmap.

But as you say, the test is silly. The fastest code on one
machine will not be the fastest on another. The best solution
is to write the code in the cleanest way possible, then profile
the results.

(Also, the proposal says nothing about the encoding of the input
file. If he's running under Ubantu, the default encoding is
UTF-8, and properly handling case folding for UTF-8 could be
significantly more time consuming than a naïve implementation
which ignores the reality of accented characters.)

--
James Kanze (GABI Software) email:james...@gmail.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

James Kanze

unread,

Jun 11, 2009, 5:04:27 AM6/11/09

to

On Jun 10, 8:49 pm, blargg....@gishpuppy.com (blargg) wrote:
> > Peter Jansson wrote:
> > [...]

> [...]

> Here's a simple solution that reads from stdin and doesn't do
> any timing (for that, just run with the Unix "time" command).
> I see no reason to complicate it with iostreams.

How would iostream represent any additional complication? I
think the code would actually be slightly simpler.

If speed were really a problem, I'd experiment with mmap.

> #include <limits.h>
> #include <stdio.h>
> #include <ctype.h>

> int main()
> {
> size_t f [UCHAR_MAX+1] = { 0 };
> size_t size = 0;
> int c;
>
> while ( (c = getchar()) != EOF )

With iostream, these two lines would be:

char c ;
while ( std::cin.get( c ) )

That looks even simpler than what you've written.

Juha Nieminen

unread,

Jun 11, 2009, 5:05:40 AM6/11/09

to

James Kanze wrote:
> The fastest code on one
> machine will not be the fastest on another. The best solution
> is to write the code in the cleanest way possible, then profile
> the results.

I have yet to see a compiler/system where C++ streams are actually
faster then C streams. Thus it's pretty safe to say that if you want
good I/O speed, using C streams will give you an edge in most, if not
all systems. Even if there was a system where C++ streams are as fast as
C streams, you wouldn't be losing anything in speed even there.

(Of course a different issue is that with C streams your code will
become more complicated and error-prone.)

Chris M. Thomasson

unread,

Jun 11, 2009, 5:21:27 AM6/11/09

to

"James Kanze" <james...@gmail.com> wrote in message
news:a6dd6d1e-2ebd-4d93...@s16g2000vbp.googlegroups.com...
> [...]

> (Also, the proposal says nothing about the encoding of the input
> file. If he's running under Ubantu, the default encoding is
> UTF-8, and properly handling case folding for UTF-8 could be

> significantly more time consuming than a na�ve implementation

> which ignores the reality of accented characters.)

Well, then perhaps one can attempt to create a local table and map in
unsigned shorts/UNICODE values directly, or whatever as concrete indexes
into said table? I assume I can create sufficient sized table to map some
UNICODE character values directly into indexes of a local counter array.
Well, perhaps of course...

:^)

Chris M. Thomasson

unread,

Jun 11, 2009, 5:22:53 AM6/11/09

to

"Juha Nieminen" <nos...@thanks.invalid> wrote in message
news:EN3Yl.33$_M...@read4.inet.fi...

> James Kanze wrote:
>> The fastest code on one
>> machine will not be the fastest on another. The best solution
>> is to write the code in the cleanest way possible, then profile
>> the results.
>

> [...]

>
> (Of course a different issue is that with C streams your code will
> become more complicated and error-prone.)

Perhaps... ;^o

Yikes!!!!!!!!!!!!

Chris M. Thomasson

unread,

Jun 11, 2009, 5:25:35 AM6/11/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message

news:604Yl.13763$jT6....@newsfe17.iad...

Assume NO hash collisions...

Bo Persson

unread,

Jun 11, 2009, 6:17:28 AM6/11/09

to

giannis t wrote:
>
> inline
> void checkRange(char c) {
> if ((c >= 'A') && (c <= 'z')) {
> if ((c <= 'Z') || (c >= 'a')) {
> ++freq[c];
> count++;
> }
> }
> }
>

You haven't seen this page, have you?

http://en.wikipedia.org/wiki/EBCDIC_500

Bo Persson

Peter Jansson

unread,

Jun 11, 2009, 6:42:57 AM6/11/09

to

Dear news group,

I have now included all programs submitted so far in the challenge. 10
programs up to now, with 9 of them written in this group thread. You
can view the most recent results here:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Sincerely,
Peter Jansson

Ioannis Vranos

unread,

Jun 11, 2009, 6:48:56 AM6/11/09

to

OK, I have no problem if you can't provide a better solution, after all it is just a challenge. When I say it
will not work, I mean you will get inaccurate results, and there is a chance that the program already does not
work in your platform, that is there is a chance you are already getting inacaccurate results.

Second, are you the same person with the poster I replied (agelos.an...@gmail.com)?

Ioannis Vranos

unread,

Jun 11, 2009, 6:58:13 AM6/11/09

to

There is an error in your blog. Code named "Ioannis A. Vranos, Code 3 (C++): 2 s" isn't mine.

Also the "Ioannis A. Vranos, Code 4 (C++): 1 s"

is my original code improved (you may say corrected),

so you can erase the original code "Ioannis A. Vranos, Code 2 (C++): 13 s" and "Ioannis A. Vranos, Code 1
(C++): 13 s" (both of which were essentially the same and thus were in the same message).

Ioannis Vranos

unread,

Jun 11, 2009, 7:05:43 AM6/11/09

to

Corrected:

Ioannis Vranos wrote:
> Peter Jansson wrote:
>> Dear news group,
>>
>> I have now included all programs submitted so far in the challenge. 10
>> programs up to now, with 9 of them written in this group thread. You
>> can view the most recent results here:
>>
>> http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html
>>
>>
>> Sincerely,
>> Peter Jansson
>
>

==> There is an error in your blog. Code named "Ioannis A. Vranos, Code 4
==> (C++): 2 s" isn't mine.
>
>
==> Also the "Ioannis A. Vranos, Code 3 (C++): 2 s"
>
==> is my original code improved (you may say corrected),

Chris M. Thomasson

unread,

Jun 11, 2009, 7:08:20 AM6/11/09

to

"Ioannis Vranos" <ivr...@freemail.gr> wrote in message

news:h0qnek$f2q$1...@news.grnet.gr...

> Chris M. Thomasson wrote:
>> "Ioannis Vranos" <ivr...@freemail.gr> wrote in message
>> news:h0piug$j61$4...@news.grnet.gr...
>>> agelos.an...@gmail.com wrote:
>>>>
>>>> Actually, just changing the line :
>>>> ifstream inputFile(argv[argc -1]);
>>>> with
>>>> ifstream inputFile(argv[argc -1],std::ios::binary);
>>>>
>>>> beats any other post (on my PC at least :))
>>>>
>>>> Here's the new code (just added the binary open mode)
>>>
>>>
>>> Binary mode isn't the same as text mode (it will not work) in all
>>> platforms.
>>
>> I only care about the platforms in which it does work. ;^)
>
>
> OK, I have no problem if you can't provide a better solution, after all it
> is just a challenge. When I say it will not work, I mean you will get
> inaccurate results, and there is a chance that the program already does
> not work in your platform, that is there is a chance you are already
> getting inacaccurate results.

I only care about the platforms in which it works.

> Second, are you the same person with the poster I replied
> (agelos.an...@gmail.com)?

No. I post with email address of (n...@spam.invalid).

What makes you think that? I post a lot over in `comp.programming.threads'.

Ioannis Vranos

unread,

Jun 11, 2009, 7:31:49 AM6/11/09

to

Because you replied to a message of mine addressed to agelos.an...@gmail.com, as you can also see from
the quotations.

Peter Jansson

unread,

Jun 11, 2009, 7:39:16 AM6/11/09

to

On Thu, 11 Jun 2009, Ioannis Vranos wrote:

>> so you can erase the original code "Ioannis A. Vranos, Code 2 (C++): 13 s"
>> and "Ioannis A. Vranos, Code 1 (C++): 13 s" (both of which were essentially
>> the same and thus were in the same message).

My mistake, have fixed it.

Sincerely,
Peter Jansson

Alf P. Steinbach

unread,

Jun 11, 2009, 8:47:22 AM6/11/09

to

* Peter Jansson:

I made 2 small changes to Ioannis' program.

It sort of sped up... ;-)

<code>

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

typedef char sassert[
'Z' - 'A' == 25 &&
'z' - 'a' == 25 ? 1 : -1
];

//#define BUFFER 65536U
#define BUFFER 1025*1024U // !Alf

#define GET_COUNT(mp_self, mp_index) ( \
(mp_self)[(mp_index) + 'a'] + \
(mp_self)[(mp_index) + 'A'] \
)

int main(int argc, char** argv) {
if (argc > 1) {
size_t i;
unsigned long total = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
clock_t const start = clock();
FILE* file = fopen(argv[1], FILE_MODE);

setvbuf( file, 0, _IONBF, 0 ); // !Alf

if (size < BUFFER) break;
}

stop = clock();

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
</code>

Cheers & hth.,

- Alf

PS: It's possible to optimize even more but what's the point?

--
Due to hosting requirements I need visits to <url: http://alfps.izfree.com/>.
No ads, and there is some C++ stuff! :-) Just going there is good. Linking
to it is even better! Thanks in advance!

Ioannis Vranos

unread,

Jun 11, 2009, 9:00:43 AM6/11/09

to

Alf P. Steinbach wrote:
> * Peter Jansson:
>> On Thu, 11 Jun 2009, Ioannis Vranos wrote:
>>
>>>> so you can erase the original code "Ioannis A. Vranos, Code 2 (C++):
>>>> 13 s" and "Ioannis A. Vranos, Code 1 (C++): 13 s" (both of which
>>>> were essentially the same and thus were in the same message).
>>
>>
>> My mistake, have fixed it.
>
> I made 2 small changes to Ioannis' program.
>
> It sort of sped up... ;-)

Either you are joking which is OK, or you made a mistake because this has not any relevance with my code.

Alf P. Steinbach

unread,

Jun 11, 2009, 9:12:27 AM6/11/09

to

* Ioannis Vranos:

> Alf P. Steinbach wrote:
>> * Peter Jansson:
>>> On Thu, 11 Jun 2009, Ioannis Vranos wrote:
>>>
>>>>> so you can erase the original code "Ioannis A. Vranos, Code 2
>>>>> (C++): 13 s" and "Ioannis A. Vranos, Code 1 (C++): 13 s" (both of
>>>>> which were essentially the same and thus were in the same message).
>>>
>>>
>>> My mistake, have fixed it.
>>
>> I made 2 small changes to Ioannis' program.
>>
>> It sort of sped up... ;-)
>
>
> Either you are joking which is OK, or you made a mistake because this
> has not any relevance with my code.

Oh, sorry, perhaps it was someone else's code.

Anyways, it was the code at top at the website, or perhaps second: I aimed my
mouse at "Ioannis" there but may have missed.

The modified code is about four hundred times faster, near as I can tell. ;-)

Cheers & hth.,

- Alf

--

James Kanze

unread,

Jun 11, 2009, 9:53:46 AM6/11/09

to

On Jun 11, 11:21 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "James Kanze" <james.ka...@gmail.com> wrote in message

> news:a6dd6d1e-2ebd-4d93...@s16g2000vbp.googlegroups.com...

> > [...]
> > (Also, the proposal says nothing about the encoding of the
> > input file. If he's running under Ubantu, the default
> > encoding is UTF-8, and properly handling case folding for
> > UTF-8 could be significantly more time consuming than a

> > naïve implementation which ignores the reality of accented
> > characters.)

> Well, then perhaps one can attempt to create a local table and
> map in unsigned shorts/UNICODE values directly, or whatever as
> concrete indexes into said table?

How to fold UTF-8 input rapidly would be the challenge. There
are over a million code points (and UTF-8 actually supports over
4 billion), so a Unicode character doesn't fit into a short on
most machines, and the tables would be very, very big. And very
sparce, since very few characters (proprotionally) actually have
a case difference.

> I assume I can create sufficient sized table to map some
> UNICODE character values directly into indexes of a local
> counter array. Well, perhaps of course...

The approach I've taken, in such cases, is to use a tri,
starting with the first byte, and progressing byte by byte.
Whether this is optimal or not, I don't know; it's probably
optimal for the case where encodings greater than 127 are rare,
which is my case, and less optimal otherwise. The more
classical implementation, I think, is to convert the UTF-8 to
UTF-32, and use a two stage table (with default values, since in
many cases, there can be a good default value covering most
entries). In both cases, implementing something like islower or
toupper can require quite a bit of memory, however, even if only
a small subset of the character codes are involved. (None of
the larger scripts, like Chinese, have case.)

Frank Neuhaus

unread,

Jun 11, 2009, 10:04:30 AM6/11/09

to

"Alf P. Steinbach" <al...@start.no> schrieb im Newsbeitrag
news:h0qucv$g0m$1...@news.eternal-september.org...

> #define BUFFER 1025*1024U // !Alf

> [...]

> setvbuf( file, 0, _IONBF, 0 ); // !Alf

Just out of curiosity: Why 1025*1024 and not 1024*1024?

Cheers,
Frank

Alf P. Steinbach

unread,

Jun 11, 2009, 10:22:13 AM6/11/09

to

* Frank Neuhaus:

Just a typo.

Cheers,

- Alf

James Kanze

unread,

Jun 11, 2009, 10:34:01 AM6/11/09

to

On Jun 11, 4:24 am, giannis t <ogian...@gmail.com> wrote:
> On Jun 11, 3:20 am, Ioannis Vranos <ivra...@freemail.gr> wrote:

Not to pick on your code particularly, but in general, in the
context of the challenge, what is done to ensure that the code
is legal and portable C++? That it works correctly with
multibyte characters, or EBCDIC, or with all of the more exotic
implementations?

It may be perfectly valid to not require all of this, but if so,
the acceptable restrictions should be stated up front.

> If the encoding is 1-byte long (ASCII, ISO-8859-?, etc), it
> works.

He's testing on a Linux machine. The default encoding under
Linux is UTF-8, a multibyte encoding. Unless otherwise stated,
it seems safe to assume that code which doesn't handle multibyte
encodings correctly is incorrect.

> Now a newer approach taking word alignment in mind.

> const int MAX_TEXT_LENGTH = 65536;

Either the name is a lie, or the constraint is unreasonable.
Files of several Gigabytes are frequent today, and I've seen
some which are in the Terabyte range.

> int freq[256];

This is actually a count. And on most implementations, an int
will overflow long before you reach the end of a large file.

> double pctFreq[256];
> int count = 0;

> inline
> void checkRange(char c) {
> if ((c >= 'A') && (c <= 'z')) {

Always false with EBCDIC ('A' is 193, and 'z' 169).

> if ((c <= 'Z') || (c >= 'a')) {
> ++freq[c];
> count++;
> }
> }
> }

> int main(int argc, char** argv) {
> memset(freq, 0, sizeof (int) * 256);
> memset(pctFreq, 0, sizeof (double) * 256);

And formally, this isn't guaranteed to set the values in pctFreq
to 0.0. This is actually what got me wondering to begin with:
reading a double initialized in this way is formally undefined
behavior, so your program is doesn't work. Except that I've
never heard of an implementation where it won't actually work.
If the requirement is "portable C++", then he must provide some
sort of filter which will detect this error, and reject your
program because of it. If the requirement isn't portable C++,
of course, then on Ubantu, I'd use mmap to read the file. And
if he's willing to restrict the input to the single byte
encodings of UTF-8, something like blargg's suggestion, combined
with mmap, should be close to unbeatable. Except that if he has
a multicore machine, it might be worth creating some additional
threads (or even forking new processes), each which handles a
contiguous block of data in the file, and then merging the
results. (Using different processes would even allow using mmap
for files which don't fit into memory.)

Until we know exactly what the requirements are, it's impossible
to write a program and know whether it fulfills these
requirements.

(FWIW: with any reasonably good implementation, I would expect
using std::fill to be faster than memset. As well as working
for all types, and not just for integral types.)

> unsigned char text[MAX_TEXT_LENGTH];
> int diff = 'a' - 'A';
> std::ios_base::sync_with_stdio(false);
> clock_t time1, time2;
> time1 = clock();
> std::ifstream file(argv[1], std::ios::in);
> //std::ifstream file(argv[1], std::ios::in | std::ios::binary);
> do {
> file.read((char *) text, MAX_TEXT_LENGTH);
> for (std::streamsize i(0), num(file.gcount()); i < num; ++i) {
> checkRange(text[i]);

Why not just:
++ freq[ text[ i ] ] ;
here, and do the case folding once, before the output.
(Blargg's suggestion, but the obvious "rapid" solution.)

> }
> } while (!file.eof());

Almost. (I'm actually impressed that someone got this close to
correct.) If there's a hardware read error, you've got an
endless loop. Just use "while ( file )" as the condition.

> file.close();
> for (char i = 'A'; i <= 'Z'; i++) {
> pctFreq[i] = round((double(freq[i] + freq[i + diff]) / count)
> *10000) / 100;
> std::cout << i << ": " << std::setw(5) << std::setfill('0') <<
> std::fixed << std::setprecision(2) << pctFreq[i] << "%\n";
> }
> time2 = clock();
> double totalTimeInSeconds = (time2 - time1) / CLOCKS_PER_SEC;
> std::cout << "\n\nThe whole process took " << std::fixed <<
> totalTimeInSeconds << " seconds.\n";
> return (0);
> }

--

Thomas J. Gritzan

unread,

Jun 11, 2009, 10:46:54 AM6/11/09

to

Ioannis Vranos schrieb:

>> Peter Jansson wrote:
>>> Dear news group,
>>>

>>> I have created a small programming challenge for those of you who are
>>> interested in challenging your Standard C++ programming skills. The
>>> challenge is about counting character frequency in large texts,
>>> perhaps useful for spam filtering or classical crypto analysis. You
>>> can read more about it here:

> Since this is comp.lang.c++, I will post a high-level C++ solution.

What's the high-level about this solution?

> do
> {
> inputFile.read(buffer, sizeof(buffer));
>
>
> for(streamsize i= 0; i< inputFile.gcount(); ++i)
> ++characterFrequencies[ buffer[i] ];
>
> }while(not inputFile.eof());

Using istream::eof or ::feof() as loop-condition is always wrong.
It might be that the program counts the last block twice, or that it
will be stuck in an infinite loop when the read fails:

http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5

--
Thomas

Peter Jansson

unread,

Jun 11, 2009, 11:00:23 AM6/11/09

to

On Thu, 11 Jun 2009, Alf P. Steinbach wrote:

> It sort of sped up... ;-)
>

/.../

> Cheers & hth.,
>
> - Alf
>
> PS: It's possible to optimize even more but what's the point?

Alf, I have included your modified program as well in the comparison. It
is now on the weblog page.

Sincerely,
Peter Jansson

PS: How would you optimize it even more?

blargg

unread,

Jun 11, 2009, 11:36:48 AM6/11/09

to

"Chris M. Thomasson" wrote:
[...]

> #define GET_COUNT(mp_self, mp_index) ( \
> (mp_self)[(mp_index) + 'a'] + \
> (mp_self)[(mp_index) + 'A'] \
> )

[...]

Why do you keep posting non-portable programs that assume ASCII?
toupper/tolower are your friends; use them! If you think they pose a
performance problem, you're doing the algorithm wrong, as you should only
combine lower and uppercase counts AFTER you've counted frequencies, not
during.

Ioannis Vranos

unread,

Jun 11, 2009, 11:46:13 AM6/11/09

to

Thomas J. Gritzan wrote:
> Ioannis Vranos schrieb:
>>> Peter Jansson wrote:
>>>> Dear news group,
>>>>
>>>> I have created a small programming challenge for those of you who are
>>>> interested in challenging your Standard C++ programming skills. The
>>>> challenge is about counting character frequency in large texts,
>>>> perhaps useful for spam filtering or classical crypto analysis. You
>>>> can read more about it here:
>
>> Since this is comp.lang.c++, I will post a high-level C++ solution.
>
> What's the high-level about this solution?

That it uses high-level standard library facilities (std::string, std::vector or valarray, std::ifstream,
etc), instead of practically reimplementing existing high-level standard library functiuonality in the code,
using low-level facilities like built-in arrays, getchar(), UCHAR_MAX, etc.

>
>> do
>> {
>> inputFile.read(buffer, sizeof(buffer));
>>
>>
>> for(streamsize i= 0; i< inputFile.gcount(); ++i)
>> ++characterFrequencies[ buffer[i] ];
>>
>> }while(not inputFile.eof());
>
> Using istream::eof or ::feof() as loop-condition is always wrong.
> It might be that the program counts the last block twice, or that it
> will be stuck in an infinite loop when the read fails:
>
> http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5

About the above code:

The worst case is when a reading reads the last character, and the next reading reads no characters. In this
case, gcount() will be zero, so there will be no side-efrects.

The link you provided talks about a while construct, and not a do-while construct. In the do-while case, the
EOF will have already been reached.

That is, the FAQ mentions:

---------- From the FAQ ----------

For example, the following code might have an off-by-one error with the count i:

int i = 0;
while (! std::cin.eof()) { // WRONG! (not reliable)
std::cin >> x;
++i;
// Work with x ...
}

What you really need is:

int i = 0;
while (std::cin >> x) { // RIGHT! (reliable)
++i;
// Work with x ...
}

----------------------------------

If you convert it to do-while the fist becomes:

int i = 0;

do
{
std::cin >> x;
++i;
// Work with x ...
}while (! std::cin.eof());

which catches the EOF, since the reading takes place before the condition.

Ioannis Vranos

unread,

Jun 11, 2009, 11:53:57 AM6/11/09

to

As a hint to Chris M. Thomasson:

We know that the basic character set is guaranteed to be implemented in the range [0, 127].

Counting all the chars read in the style

// The read characters are stored in the char array buffer.
char buffer[BUFSIZ]= {};

vector<unsigned long> charFrequencies(128);

// We read and store the characters in the char array buffer.
//
// sizeof(buffer)== BUFSIZ.
//
for(size_t i= 0; i< sizeof(buffer); ++i)
charFrequencies[ buffer[i] ]++;

will work for all characters independently of the implementation of the *basic character set* (ASCII, EBCDIC,
whatever).

Thomas J. Gritzan

unread,

Jun 11, 2009, 12:28:47 PM6/11/09

to

Ioannis Vranos schrieb:

> Thomas J. Gritzan wrote:
>> Ioannis Vranos schrieb:

>>> do
>>> {
>>> inputFile.read(buffer, sizeof(buffer));
>>>
>>>
>>> for(streamsize i= 0; i< inputFile.gcount(); ++i)
>>> ++characterFrequencies[ buffer[i] ];
>>>
>>> }while(not inputFile.eof());

By the way, this "not" instead of "!" is horrible. I would write:

while (!inputFile.eof());

Which is still wrong because of eof. Correct is:

while (inputFile);

>> Using istream::eof or ::feof() as loop-condition is always wrong.
>> It might be that the program counts the last block twice, or that it
>> will be stuck in an infinite loop when the read fails:
>>
>> http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5
>
>
> About the above code:
>
> The worst case is when a reading reads the last character, and the next
> reading reads no characters. In this case, gcount() will be zero, so
> there will be no side-efrects.

No. The worst case is when the stream sets failbit or badbit, because of
a hardware or read error, as James mentions elsethread.
In this case, the loop will never stop.

> The link you provided talks about a while construct, and not a do-while
> construct. In the do-while case, the EOF will have already been reached.

[...]

> If you convert it to do-while the fist becomes:
>
>
> int i = 0;
>
> do
> {
> std::cin >> x;
> ++i;
> // Work with x ...
> }while (! std::cin.eof());

Assuming x equals i... (should we tell the author?)

>
> which catches the EOF, since the reading takes place before the condition.

If the read fails because of eof, x (or i) will contain garbage (like
the previous value) and this value will be evaluated (again).

If the input stream doesn't contain a number, the read fails and failbit
is set but not eof, so the loop will go infinitly.

So the rule is: Never use eof as loop condition.

--
Thomas

Andrew Tomazos

unread,

Jun 11, 2009, 12:50:53 PM6/11/09

to

On Jun 11, 3:53 pm, James Kanze <james.ka...@gmail.com> wrote:
> How to fold UTF-8 input rapidly would be the challenge. There
> are over a million code points (and UTF-8 actually supports over
> 4 billion), so a Unicode character doesn't fit into a short on
> most machines, and the tables would be very, very big. And very
> sparce, since very few characters (proprotionally) actually have
> a case difference.

UTF-8 only supports about 1 million code points. 5 and 6 byte
sequences are explicitly illegal.
-Andrew.

Andrew Tomazos

unread,

Jun 11, 2009, 12:52:57 PM6/11/09

to

On Jun 11, 4:34 pm, James Kanze <james.ka...@gmail.com> wrote:

> He's testing on a Linux machine. The default encoding under
> Linux is UTF-8, a multibyte encoding. Unless otherwise stated,
> it seems safe to assume that code which doesn't handle multibyte
> encodings correctly is incorrect.

Who defined that the "default encoding under Linux is UTF-8" ? In
what standard is this specified? Or where did you hear it?
-Andrew.

blargg

unread,

Jun 11, 2009, 12:53:10 PM6/11/09

to

Ioannis Vranos wrote:
[...]

> We know that the basic character set is guaranteed to be implemented in
> the range [0, 127].

[...]

> char buffer[BUFSIZ]= {};
> vector<unsigned long> charFrequencies(128);

[...]

> charFrequencies[ buffer[i] ]++;
>
> will work for all characters independently of the implementation of the
> *basic character set* (ASCII, EBCDIC, whatever).

And we can use UCHAR_MAX to ensure we cover ALL char values (cast char to
unsigned char first, of course). No need for any magic constants.

(BTW, your lines are longer than 80 characters, in case you didn't realize)

Alf P. Steinbach

unread,

Jun 11, 2009, 1:39:31 PM6/11/09

to

* Peter Jansson:

> On Thu, 11 Jun 2009, Alf P. Steinbach wrote:
>
>> It sort of sped up... ;-)
>>
> /.../
>
>> Cheers & hth.,
>>
>> - Alf
>>
>> PS: It's possible to optimize even more but what's the point?
>
>
> Alf, I have included your modified program as well in the comparison. It
> is now on the weblog page.

Uhm, I'm sorry to report that I was lead astray by two Windows "features":

1. In Windows, 'clock', used in that program to measure the time, reports
wall time, not process time.

2. For some obscure reason Windows XP, or perhaps it's the anti-virus, or
whatever, causes a process to run extra slowly the first few times. In
this case I ran the original program twice with a certain medium size
text file, then with modification it ran 400 times faster!, which I could
not believe was an arbitrary effect. But now testing it, and with a much
larger file, it is as you reported, about the same time.

> Sincerely,
> Peter Jansson
>
> PS: How would you optimize it even more?

On my machine, Windows XP,

#define BUFFER 16*1024U

for the same program text that you have, yields the best times (statistically).

You might want to put that also in your list. :-)

I don't have a bittorrent client so couldn't obtain your file, but with a half
gigabyte text file I constructed from a dictionary text, the processing is about
1/27 of the i/o time. So a marginal improvement, on that order, could be to read
e.g. four bytes (a long) at a time from the buffer to avoid memory accesses.

Just to be thorough I tested using the lower level Posix standard open etc., but
it turned out that it was not significantly faster for this than FILE*.

A less insignificant improvement could be to use a memory mapped file, because
that could conceivably leverage some OS virtual memory optimization. However,
while the Boost library does have some functionality for mapping a file to
memory, it beats me how to use that. It's not exactly well-documented.

Cheers & hth.,

- Alf (victim of Windows (the Eagles' newest greatest hit!))

Chris M. Thomasson

unread,

Jun 11, 2009, 7:06:25 AM6/11/09

to

"Ioannis Vranos" <ivr...@freemail.gr> wrote in message

news:h0qnek$f2q$1...@news.grnet.gr...

> Chris M. Thomasson wrote:
>> "Ioannis Vranos" <ivr...@freemail.gr> wrote in message
>> news:h0piug$j61$4...@news.grnet.gr...
>>> agelos.an...@gmail.com wrote:
>>>>
>>>> Actually, just changing the line :
>>>> ifstream inputFile(argv[argc -1]);
>>>> with
>>>> ifstream inputFile(argv[argc -1],std::ios::binary);
>>>>
>>>> beats any other post (on my PC at least :))
>>>>
>>>> Here's the new code (just added the binary open mode)
>>>
>>>
>>> Binary mode isn't the same as text mode (it will not work) in all
>>> platforms.
>>
>> I only care about the platforms in which it does work. ;^)
>
>
> OK, I have no problem if you can't provide a better solution, after all it
> is just a challenge. When I say it will not work, I mean you will get
> inaccurate results, and there is a chance that the program already does
> not work in your platform, that is there is a chance you are already
> getting inacaccurate results.

I only care about the platforms in which it works.

> Second, are you the same person with the poster I replied
> (agelos.an...@gmail.com)?

No. I post with email address of (n...@spam.invalid).

Ioannis Vranos

unread,

Jun 11, 2009, 2:26:05 PM6/11/09

to

Thomas J. Gritzan wrote:
> Ioannis Vranos schrieb:
>> Thomas J. Gritzan wrote:
>>> Ioannis Vranos schrieb:
>>>> do
>>>> {
>>>> inputFile.read(buffer, sizeof(buffer));
>>>>
>>>>
>>>> for(streamsize i= 0; i< inputFile.gcount(); ++i)
>>>> ++characterFrequencies[ buffer[i] ];
>>>>
>>>> }while(not inputFile.eof());
>
> By the way, this "not" instead of "!" is horrible. I would write:

Horrible? "not" is a *built in* keyword, as "and" and "or". They are more user readable.

>
>>> Using istream::eof or ::feof() as loop-condition is always wrong.
>>> It might be that the program counts the last block twice, or that it
>>> will be stuck in an infinite loop when the read fails:
>>>
>>> http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5
>>
>> About the above code:
>>
>> The worst case is when a reading reads the last character, and the next
>> reading reads no characters. In this case, gcount() will be zero, so
>> there will be no side-efrects.
>
> No. The worst case is when the stream sets failbit or badbit, because of
> a hardware or read error, as James mentions elsethread.
> In this case, the loop will never stop.

My code doesn't take into account hardware failures. Also if I am not wrong, while(inputFile) does not check
for EOF, neither while(cin>> something) that the FAQ uses.

Chris M. Thomasson

unread,

Jun 11, 2009, 2:30:28 PM6/11/09

to

"Alf P. Steinbach" <al...@start.no> wrote in message
news:h0qvs2$m1p$1...@news.eternal-september.org...

>* Ioannis Vranos:
>> Alf P. Steinbach wrote:
>>> * Peter Jansson:
>>>> On Thu, 11 Jun 2009, Ioannis Vranos wrote:
>>>>
>>>>>> so you can erase the original code "Ioannis A. Vranos, Code 2 (C++):
>>>>>> 13 s" and "Ioannis A. Vranos, Code 1 (C++): 13 s" (both of which were
>>>>>> essentially the same and thus were in the same message).
>>>>
>>>>
>>>> My mistake, have fixed it.
>>>
>>> I made 2 small changes to Ioannis' program.
>>>
>>> It sort of sped up... ;-)
>>
>>
>> Either you are joking which is OK, or you made a mistake because this has
>> not any relevance with my code.
>
> Oh, sorry, perhaps it was someone else's code.

Its my code that I posted here:

http://groups.google.com/group/comp.lang.c++/msg/2f6fd68d71c477aa

> Anyways, it was the code at top at the website, or perhaps second: I aimed
> my mouse at "Ioannis" there but may have missed.
>
> The modified code is about four hundred times faster, near as I can tell.
> ;-)

Four hundred times faster? Well, unfortunately I am not seeing the benefits
on my platform (Old P4 HyperThreaded 3.06gz 512mb ram). It actually goes
slower.

Chris M. Thomasson

unread,

Jun 11, 2009, 2:31:21 PM6/11/09

to

"blargg" <blarg...@gishpuppy.com> wrote in message
news:blargg.ei3-11...@192.168.1.4...

I felt no need to cover anything other than plain ol ASCII. Sorry.

Chris M. Thomasson

unread,

Jun 11, 2009, 2:39:04 PM6/11/09

to

"blargg" <blarg...@gishpuppy.com> wrote in message
news:blargg.ei3-11...@192.168.1.4...

> "Chris M. Thomasson" wrote:
> [...]
>> #define GET_COUNT(mp_self, mp_index) ( \
>> (mp_self)[(mp_index) + 'a'] + \
>> (mp_self)[(mp_index) + 'A'] \
>> )
> [...]
>
> Why do you keep posting non-portable programs that assume ASCII?

I have no good reason why.

> toupper/tolower are your friends; use them! If you think they pose a
> performance problem, you're doing the algorithm wrong, as you should only
> combine lower and uppercase counts AFTER you've counted frequencies, not
> during.

Actually, I am combining the counts after processing the data:

/* processing phase */

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

/* combine to acquire total */

for (i = 0; i < 26; ++i) {
total += GET_COUNT(counts, i);
}

Perhaps I will code something up that will handle something other than
ASCII.

Peter Jansson

unread,

Jun 11, 2009, 3:32:31 PM6/11/09

to

On Thu, 11 Jun 2009, Alf P. Steinbach wrote:

/.../

> I don't have a bittorrent client so couldn't obtain your file, but with a

/.../

You can download the file via a web based torrent relay, e.g.
http://www.torrentrelay.com

Sincerely,
Peter Jansson

Ioannis Vranos

unread,

Jun 11, 2009, 4:46:42 PM6/11/09

to

Alf is a troll with persistent storage (for years).

Alf P. Steinbach

unread,

Jun 11, 2009, 4:55:14 PM6/11/09

to

* Ioannis Vranos:

> Peter Jansson wrote:
>> On Thu, 11 Jun 2009, Alf P. Steinbach wrote:
>>
>> /.../
>>> I don't have a bittorrent client so couldn't obtain your file, but
>>> with a
>> /.../
>>
>> You can download the file via a web based torrent relay, e.g.
>> http://www.torrentrelay.com
>>
>>
>> Sincerely,
>> Peter Jansson
>
>
> Alf is a troll with persistent storage (for years).

As I read it you're trying to make a personal attack.

That reflects badly on you.

- Alf

Thomas J. Gritzan

unread,

Jun 11, 2009, 6:36:38 PM6/11/09

to

Ioannis Vranos schrieb:
> Thomas J. Gritzan wrote:
>> Ioannis Vranos schrieb:
>>> Thomas J. Gritzan wrote:
>>>> Ioannis Vranos schrieb:
>>>>> do
>>>>> {
>>>>> inputFile.read(buffer, sizeof(buffer));
>>>>>
>>>>>
>>>>> for(streamsize i= 0; i< inputFile.gcount(); ++i)
>>>>> ++characterFrequencies[ buffer[i] ];
>>>>>
>>>>> }while(not inputFile.eof());
>>
>> By the way, this "not" instead of "!" is horrible. I would write:
>
>
> Horrible? "not" is a *built in* keyword, as "and" and "or". They are
> more user readable.

You don't have to use it just because it's a keyword. register also is a
keyword, and you shouldn't use it, because it doesn't have any meaning
with current compilers.

About "not" or "!", C++ programmers are more used to the symbol. I
rarely see the use of not/and/or in this newsgroup. It might be more
readable to you, but it's not more readable in general.

>>>> Using istream::eof or ::feof() as loop-condition is always wrong.
>>>> It might be that the program counts the last block twice, or that it
>>>> will be stuck in an infinite loop when the read fails:
>>>>
>>>> http://www.parashift.com/c++-faq-lite/input-output.html#faq-15.5
>>>
>>> About the above code:
>>>
>>> The worst case is when a reading reads the last character, and the next
>>> reading reads no characters. In this case, gcount() will be zero, so
>>> there will be no side-efrects.
>>
>> No. The worst case is when the stream sets failbit or badbit, because of
>> a hardware or read error, as James mentions elsethread.
>> In this case, the loop will never stop.
>
>
> My code doesn't take into account hardware failures. Also if I am not
> wrong, while(inputFile) does not check for EOF, neither while(cin>>
> something) that the FAQ uses.

while(inputFile) doesn't check for EOF, and that's necessary.
For example:

int i;
while (file >> i)
{
// process i
}

If the extraction reads a number at the end of the file, the EOF bit
will be set. If the condition would check EOF, the last number wouldn't
be processed. But the next time you try to extract a number, failbit (or
badbit?) will be set and the loop ends correctly.

If there's an idiomatic usage, use it.

--
Thomas

Ioannis Vranos

unread,

Jun 11, 2009, 8:20:50 PM6/11/09

to

Thomas J. Gritzan wrote:
>
> About "not" or "!", C++ programmers are more used to the symbol. I
> rarely see the use of not/and/or in this newsgroup. It might be more
> readable to you, but it's not more readable in general.

It is more human readable. Also things eventually progress. Since it is not bad style to use "and", "or", and
"not" for conditions, but instead it makes the code more readable, it is a good style to use them.

> while(inputFile) doesn't check for EOF, and that's necessary.
> For example:
>
> int i;
> while (file >> i)
> {
> // process i
> }
>
> If the extraction reads a number at the end of the file, the EOF bit
> will be set. If the condition would check EOF, the last number wouldn't
> be processed. But the next time you try to extract a number, failbit (or
> badbit?) will be set and the loop ends correctly.
>
> If there's an idiomatic usage, use it.

If the extraction reads the last number stored in the file, the EOF bit will not be set. At the next read, EOF
will be set, however the condition will continue to be true. That means that although EOF has been reached, i
will still be processed. At the next read, EOF will be read again, and the condition will fail.

So I think the EOF check should be placed in the condition.

Chris M. Thomasson

unread,

Jun 11, 2009, 9:52:30 PM6/11/09

to

"Peter Jansson" <webm...@jansson.net> wrote in message
news:12ba30e3-b04b-4c32...@g20g2000vba.googlegroups.com...

> Dear news group,
>
> I have created a small programming challenge for those of you who are
> interested in challenging your Standard C++ programming skills. The
> challenge is about counting character frequency in large texts,
> perhaps useful for spam filtering or classical crypto analysis. You
> can read more about it here:
>

> http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Here is what I think will be my final submission... It should please some
people simply because it does not rely on pure ASCII. It can operate on
other character sets as well (e.g., EBCDIC). It should maintain its speed
wrt my other submission that is currently getting the best timing results.

One note, this version caculates to types of frequences. It gives output
like:

287924224 | 226951168
___________________________________________________
a %5.454 | %6.920
b %0.916 | %1.162
c %3.315 | %4.205
d %2.615 | %3.317
e %9.179 | %11.645
f %2.017 | %2.559
g %1.494 | %1.895
h %3.013 | %3.823
i %6.163 | %7.818
j %0.080 | %0.101
k %0.504 | %0.639
l %2.677 | %3.397
m %1.866 | %2.368
n %5.412 | %6.865
o %7.395 | %9.381
p %2.208 | %2.801
q %0.100 | %0.126
r %6.200 | %7.865
s %4.780 | %6.064
t %6.954 | %8.822
u %2.344 | %2.974
v %0.930 | %1.180
w %1.181 | %1.498
x %0.159 | %0.202
y %1.838 | %2.332
z %0.031 | %0.040

This was computed using the textfile I downloaded from the torrent on Peter
Jansson's blog. The first column is the displays the total number of
characters in the file, and the relative frequencies for that number. The
second column displays the total number of "valid" characters, and the
relative frequencies for that particular number. A valid character passes
the following constraint:

bool is_valid(int c) {
return (isalpha(c) && isprint(c));
}

Okay, here is the actual code:
_____________________________________________________________________
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>
#include <ctype.h>

#if ! defined (FILE_MODE)
# define FILE_MODE "rb"
#endif

#define BUFFER 65536U

#define CALC_FREQ(mp_count, mp_total) ( \
(mp_total) ? ((double)(mp_count) / \
(double)(mp_total)) * 100.0 : 0.0 \
)

int main(int argc, char** argv) {

if (argc > 1) {
clock_t const start = clock();
FILE* file = fopen(argv[1], FILE_MODE);

if (file) {
size_t size;
int status = EXIT_SUCCESS;
unsigned long total_chars = 0;
unsigned long total_valid_chars = 0;
unsigned long counts[UCHAR_MAX + 1] = { 0 };
static unsigned char buf[BUFFER];

while ((size = fread(buf, 1, BUFFER, file))) {
size_t tmpsize = size;

while (tmpsize) {
++counts[buf[--tmpsize]];
}

if (size < BUFFER) break;
}

if (ferror(file)) status = EXIT_FAILURE;
if (fclose(file)) status = EXIT_FAILURE;

if (status == EXIT_SUCCESS) {
size_t i;
clock_t stop;

for (i = 0; i <= UCHAR_MAX; ++i) {
total_chars += counts[i];
if (counts[i] && isalpha(i) && isprint(i)) {
total_valid_chars += counts[i];
if (isupper(i)) {
counts[(unsigned char)tolower(i)] += counts[i];
if ((unsigned char)tolower(i) > i) {
total_chars -= counts[i];
total_valid_chars -= counts[i];
}
}
}
}

printf("%lu | %lu\n"
"___________________________________________________\n",
total_chars, total_valid_chars);

for (i = 0; i <= UCHAR_MAX; ++i) {
if (isalpha(i) && isprint(i) && islower(i)) {
printf("%c %%%.3f | %%%.3f\n",
(char)tolower(i), CALC_FREQ(counts[i], total_chars),
CALC_FREQ(counts[i], total_valid_chars));
}
}

stop = clock();

printf("\nelapsed time: %lums\n",
(unsigned long)((((double)stop - (double)start)
/ CLOCKS_PER_SEC) * 1000.0));

return EXIT_SUCCESS;
}
}
}

fprintf(stderr, "file error...");

assert(0);

return EXIT_FAILURE;
}
_____________________________________________________________________

Any thoughts?

Chris M. Thomasson

unread,

Jun 12, 2009, 2:33:41 AM6/12/09

to

"James Kanze" <james...@gmail.com> wrote in message
news:695c01d2-f49e-437b...@h11g2000yqb.googlegroups.com...

On Jun 11, 11:21 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "James Kanze" <james.ka...@gmail.com> wrote in message

> > news:a6dd6d1e-2ebd-4d93...@s16g2000vbp.googlegroups.com...

> > > [...]
> > > (Also, the proposal says nothing about the encoding of the
> > > input file. If he's running under Ubantu, the default
> > > encoding is UTF-8, and properly handling case folding for
> > > UTF-8 could be significantly more time consuming than a

> > > na�ve implementation which ignores the reality of accented
> > > characters.)

> > Well, then perhaps one can attempt to create a local table and
> > map in unsigned shorts/UNICODE values directly, or whatever as
> > concrete indexes into said table?

> How to fold UTF-8 input rapidly would be the challenge. There
> are over a million code points (and UTF-8 actually supports over
> 4 billion), so a Unicode character doesn't fit into a short on
> most machines, and the tables would be very, very big. And very
> sparce, since very few characters (proprotionally) actually have
> a case difference.

> [...]

Keeping in line with my previous attitude, if the requirements of the test
dictate that I should provide support for UNICODE, then I would make quick
and dirty example which works on some platforms as a premature optimization
for said platforms wrt future "portable" implementation.

Yikes!

Chris M. Thomasson

unread,

Jun 12, 2009, 2:37:15 AM6/12/09

to

"Chris M. Thomasson" <n...@spam.invalid> wrote in message
news:h0ssre$o3j$1...@news.ett.com.ua...

Well, if I knew that an optimization would apply to most customers, then why
not go ahead and make it for them in an non-portable extension?

Peter Jansson

unread,

Jun 12, 2009, 2:46:25 AM6/12/09

to

On Thu, 11 Jun 2009, Chris M. Thomasson wrote:

>
> Any thoughts?
>

Thank you Chris, I have included your submission as well.

As someone suggested, I will run all the codes some number of times and
take the average execution time to have more reliable comparison.

Sincerely,
Peter

fifoforlifo

unread,

Jun 12, 2009, 3:22:56 AM6/12/09

to Peter Jansson

I thought this competition sounded fun, so I gave it a shot. The
following is a 2-threaded program that overlaps reads and
computation. I had to use boost::thread for this, but this should all
be doable in C++0x -- but I'll let you be the final judge of whether
it's an acceptable entry. As many pointed out, the problem is I/O
bound, so that's what I tackled first.
The computation part is nearly identical to Chris' code, I could not
beat the simple accumulate-into-256-wide-table. So cheers for finding
the best solution there :-)
FYI timings on my machine were:
Read the entire 280MB file, do no processing : 0.25 seconds
Chris' program : 0.6 seconds
This program : 0.4 seconds

/// This program counts plain ol' alphabet characters' frequency in a
text file.
/// Text file's encoding is assumed to be a superset of ASCII
/// (for example SHIFT-JIS or UTF-8 would work).

/*
CStopWatch s;
s.startTimer();
The_Stuff_You_Want_To_Time_Executes_Here();
s.stopTimer();
You may then get the elapsed time in seconds
via the method getElapsedTime.
*/
#ifdef WIN32
#include <windows.h>
class CStopWatch
{
private:
typedef struct {
LARGE_INTEGER start;
LARGE_INTEGER stop;
} stopWatch;
stopWatch timer;
LARGE_INTEGER frequency;
double LIToSecs( LARGE_INTEGER & L)
{
return ((double)L.QuadPart/(double)frequency.QuadPart);
}
public:
CStopWatch()
{
timer.start.QuadPart=0;
timer.stop.QuadPart=0;
QueryPerformanceFrequency( &frequency );
}
void startTimer() { QueryPerformanceCounter(&timer.start); }
void stopTimer() { QueryPerformanceCounter(&timer.stop); }
double getElapsedTime()
{
LARGE_INTEGER time;
time.QuadPart=timer.stop.QuadPart-timer.start.QuadPart;
return LIToSecs( time );
}
};
#else
#include <sys/time.h>
class CStopWatch
{
private:
typedef struct {
timeval start;
timeval stop;
} stopWatch;
stopWatch timer;
public:
void startTimer( ) { gettimeofday(&(timer.start),0); }
void stopTimer( ) { gettimeofday(&(timer.stop),0); }
double getElapsedTime()
{
timeval res;
timersub(&(timer.stop),&(timer.start),&res);
return res.tv_sec + res.tv_usec/1000000.0;
}
};
#endif

#include <stdio.h>
#include <boost/cstdint.hpp>
#include <boost/thread.hpp>
#include <boost/thread/mutex.hpp>
#include <boost/thread/condition.hpp>

#define BLOCK_SIZE 0x10000u
/// BLOCK_COUNT must be >= 3 for the FIFO to work properly
#define BLOCK_COUNT 0x10u
#define INBUF_SIZE (BLOCK_SIZE * BLOCK_COUNT)

static unsigned char inbuf[INBUF_SIZE];
static volatile int inbufAmt[BLOCK_COUNT];
static volatile size_t readCursor = 0;
static volatile size_t writeCursor = 0;
static size_t totalRead = 0;
static CStopWatch stopWatch;
/// fileReadDone is protected by ioMutex -- otherwise you get a race
condition at EOF
static volatile int fileReadDone = 0;

static boost::mutex ioMutex;
static boost::condition readAvailable;
static boost::condition writeAvailable;

static int totals[256] = {0};

struct Reader {
const char* const pFileName;

Reader(const char* pFileName_) : pFileName(pFileName_) {}

void operator()() {
stopWatch.startTimer();
FILE* pFile = fopen(pFileName, "rb");
if (!pFile)
return;

while (!feof(pFile) && !ferror(pFile)) {
inbufAmt[writeCursor] = fread((void*)&inbuf[BLOCK_SIZE *
writeCursor],
1, BLOCK_SIZE, pFile);
totalRead += inbufAmt[writeCursor];

const size_t nextWriteCursor = (writeCursor + 1) %
BLOCK_COUNT;
while (nextWriteCursor == readCursor) {
boost::mutex::scoped_lock lk(ioMutex);
readAvailable.notify_one();
writeAvailable.wait(lk);
}
writeCursor = nextWriteCursor;
readAvailable.notify_one();
}

{
boost::mutex::scoped_lock lk(ioMutex);
fileReadDone = 1;
readAvailable.notify_one();
}

fclose(pFile);
}
};

static void AccumulateTotals(const unsigned char* pBuffer, size_t
size) {
const unsigned char* pc = pBuffer;
const unsigned char* const pcEnd = pc + size;
for (; pc != pcEnd; ++pc) {
const unsigned char c = *pc;
++totals[c];
}
}

int main(int argc, char** argv) {

if (argc < 2) {
printf("\nusage:\n\t%s <text_file_name>\n", argv[0]);
return -1;
}

// launch a reader thread
Reader reader(argv[1]);
boost::thread readerThread(reader);

// accumulate totals from buffers as they are
while (!fileReadDone) {
while (writeCursor == readCursor) {
boost::mutex::scoped_lock lk(ioMutex);
if (fileReadDone)
break;
writeAvailable.notify_one();
readAvailable.wait(lk);
}
if (fileReadDone)
break;

AccumulateTotals(&inbuf[BLOCK_SIZE * readCursor], inbufAmt
[readCursor]);
readCursor = (readCursor + 1) % BLOCK_COUNT;
writeAvailable.notify_one();
}

long totalAlphaChars = 0;
for (size_t u = 0; u < 26; ++u) {
totalAlphaChars += totals['A' + u] + totals['a' + u];
}

for (size_t u = 0; u < 26; ++u) {
double result = (totals['A' + u] + totals['a' + u]) * 100.0 /
totalAlphaChars;
printf("%c %%%.3f\n", 'a' + u, result);
}

stopWatch.stopTimer();

const double elapsed = stopWatch.getElapsedTime();
printf("\nelapsed time: %f [seconds]\n", elapsed);

return 0;
}

Bart van Ingen Schenau

unread,

Jun 12, 2009, 3:44:01 AM6/12/09

to

Ioannis Vranos wrote:

> Thomas J. Gritzan wrote:
>>
>> while(inputFile) doesn't check for EOF, and that's necessary.
>> For example:
>>
>> int i;
>> while (file >> i)
>> {
>> // process i
>> }
>>
>> If the extraction reads a number at the end of the file, the EOF bit
>> will be set. If the condition would check EOF, the last number
>> wouldn't be processed. But the next time you try to extract a number,
>> failbit (or badbit?) will be set and the loop ends correctly.
>>
>> If there's an idiomatic usage, use it.
>
>
> If the extraction reads the last number stored in the file, the EOF
> bit will not be set.

That is not true. If the very last character in the stream is a digit,
then ios_base::eofbit will be set when the last number is successfully
read from the stream.

> At the next read, EOF will be set, however the
> condition will continue to be true. That means that although EOF has
> been reached, i will still be processed. At the next read, EOF will be
> read again, and the condition will fail.
>
> So I think the EOF check should be placed in the condition.

That will give you a loop that has an off-by-one error under certain
conditions. Depending on where the test is located, you either process
one item too few or one too many.

Bart v Ingen Schenau
--
a.c.l.l.c-c++ FAQ: http://www.comeaucomputing.com/learn/faq
c.l.c FAQ: http://c-faq.com/
c.l.c++ FAQ: http://www.parashift.com/c++-faq-lite/

Ioannis Vranos

unread,

Jun 12, 2009, 5:55:46 AM6/12/09

to

Bart van Ingen Schenau wrote:
>
>>
>> If the extraction reads the last number stored in the file, the EOF
>> bit will not be set.
>
> That is not true. If the very last character in the stream is a digit,
> then ios_base::eofbit will be set when the last number is successfully
> read from the stream.
>
>> At the next read, EOF will be set, however the
>> condition will continue to be true. That means that although EOF has
>> been reached, i will still be processed. At the next read, EOF will be
>> read again, and the condition will fail.
>>
>> So I think the EOF check should be placed in the condition.
>
> That will give you a loop that has an off-by-one error under certain
> conditions. Depending on where the test is located, you either process
> one item too few or one too many.
>
> Bart v Ingen Schenau

Here is a test you may do:

#include <iostream>
#include <fstream>

int main(int argc, char **argv)
{
using namespace std;

char c;

ifstream inputFile(argv[argc- 1]);

while(inputFile.read(&c, 1))
{
static int i= 0;

cout<< ++i<< endl;
}

}

Create a text file with one character of the basic character set, and run the program in the style:

./myprogram test.txt

This will print

1
2

and will terminate.

The failfbit is not be set before EOF is read twice.

Alf P. Steinbach

unread,

Jun 12, 2009, 6:13:29 AM6/12/09

to

* Ioannis Vranos:

>
> #include <iostream>
> #include <fstream>
>
> int main(int argc, char **argv)
> {
> using namespace std;
>
> char c;
>
>
> ifstream inputFile(argv[argc- 1]);
>
> while(inputFile.read(&c, 1))
> {
> static int i= 0;
>
> cout<< ++i<< endl;
> }
>
> }
>
> Create a text file with one character of the basic character set, and
> run the program in the style:
>
> ./myprogram test.txt
>
> This will print
>
> 1
> 2
>
> and will terminate.

I get the output

1

with g++ and MSVC.

Have you checked that the size of your data file is exactly 1 byte?

If the file really is 1 byte, then what is your compiler and system?

> The failfbit is not be set before EOF is read twice.

�27.6.1.3/27 "[when] end-of-file occurs on the input sequence ... calls
setstate(failbit|eofbit)"

Cheers & hth.,

Ioannis Vranos

unread,

Jun 12, 2009, 6:17:22 AM6/12/09

to

I am wrong.

Ioannis Vranos

unread,

Jun 12, 2009, 6:27:41 AM6/12/09

to

Based on the input of other people I corrected the loop condition. Also I improved the timing mechanism to be
more accurate:

#include <valarray>
#include <fstream>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <string>
#include <cctype>
#include <ctime>

int main(int argc, char **argv)
{
using namespace std;

// Warning: long double has problems with MINGW compiler for Windows.
//
// The C++ basic character set is using the value range [0, 127].
// If we used vector<long double>, it would not have any run-time difference in any modern compiler.
valarray<long double> characterFrequencies(128);

// The array where the read characters will be stored.
char buffer[BUFSIZ];

// If argc!= 2, then either the number of arguments is not correct, or the platform does not
// support arguments.
if(argc!= 2)
{
cerr<< "\nUsage: "<< argv[0]<< " fileNameToRead\n\n";

return EXIT_FAILURE;
}

// We disable synchronisation with stdio, to speed up C++ I/O.
ios_base::sync_with_stdio(false);

string characters= "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

clock_t time1, time2;

// We start timing.
time1= clock();

// We open the file

ifstream inputFile(argv[argc -1]);

// An error happened
if(not inputFile.good())
{
cerr<< "\nCould not open file for reading, exiting...\n\n";

return EXIT_FAILURE;
}

do
{
inputFile.read(buffer, sizeof(buffer));

for(streamsize i= 0; i< inputFile.gcount(); ++i)
++characterFrequencies[ buffer[i] ];

}while(inputFile);

// Since rule 1 is: "Your program should be case insensitive when it counts letters",
// we add the results of lowercase characters and their equivallent uppercase letters together.
cout<<fixed<< "\n\n\nThe letter frequencies are:\n";

long double totalcharacterFrequencies= 0;

for(string::size_type i= 0; i< characters.size(); ++i)
totalcharacterFrequencies+= characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ];

for(string::size_type i= 0; i< characters.size(); ++i)
cout<< characters[i]<< ": "<< (characterFrequencies[ characters[i] ]+ characterFrequencies[
tolower(characters[i]) ])/ totalcharacterFrequencies* 100<< "%\n";

// We "stop" timing.
time2= clock();

// We convert the timing to seconds.
double totalTimeInSeconds= static_cast<double>(time2- time1)/ CLOCKS_PER_SEC;

cout<<"\n\nThe whole process took "<< totalTimeInSeconds<< " seconds.\n";

cout<<"\n\nHave a nice day!\n";
}

In my machine, the program produces:

john@john-laptop:~/Projects/anjuta/cpp/src$ g++ -ansi -pedantic-errors -Wall -O3 main.cc -o foobar
john@john-laptop:~/Projects/anjuta/cpp/src$ ./foobar LetterFrequencyInput.txt

The letter frequencies are:
A: 6.919578%
B: 1.162287%
C: 4.205169%
D: 3.317211%
E: 11.644528%
F: 2.559197%
G: 1.895033%
H: 3.822553%
I: 7.818366%
J: 0.101068%
K: 0.638897%
L: 3.396621%
M: 2.367889%
N: 6.865435%
O: 9.381317%
P: 2.801040%
Q: 0.126336%
R: 7.865290%
S: 6.064106%
T: 8.821831%
U: 2.974300%
V: 1.180335%
W: 1.497979%
X: 0.202137%
Y: 2.331793%
Z: 0.039705%

The whole process took 1.080000 seconds.

Have a nice day!
john@john-laptop:~/Projects/anjuta/cpp/src$

blargg

unread,

Jun 12, 2009, 11:43:25 AM6/12/09

to

Chris M. Thomasson wrote:
[...]

> unsigned long counts[UCHAR_MAX + 1] = { 0 };
> static unsigned char buf[BUFFER];
>
> while ((size = fread(buf, 1, BUFFER, file))) {
> size_t tmpsize = size;
>
> while (tmpsize) {
> ++counts[buf[--tmpsize]];

I wouldn't be surprised if this were slower than going forward, as
backwards might foil a processor's read-ahead of memory into the cache.
You can preserve your loop's style by having your index count from -size
to 0:

unsigned char* const buf_end = buf + size;
ptrdiff_t i = -(ptrdiff_t) size;
do {
++counts[buf_end[i]];
}
while ( ++i );

Peter Jansson

unread,

Jun 12, 2009, 3:25:39 PM6/12/09

to

On Fri, 12 Jun 2009, fifoforlifo wrote:

> I thought this competition sounded fun, so I gave it a shot. The
> following is a 2-threaded program that overlaps reads and
> computation. I had to use boost::thread for this, but this should all
> be doable in C++0x -- but I'll let you be the final judge of whether
> it's an acceptable entry. As many pointed out, the problem is I/O
> bound, so that's what I tackled first.
> The computation part is nearly identical to Chris' code, I could not
> beat the simple accumulate-into-256-wide-table. So cheers for finding
> the best solution there :-)

/.../

Hi,

I have included your code in the list of results at
http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

My Ubuntu machine had Boost 1.37 so that is what was used during
compilation.

Cheers,
Peter Jansson

Ioannis Vranos

unread,

Jun 12, 2009, 4:22:51 PM6/12/09

to

I think you are relaxing the rules. I know Qt programming, and most (if not all) Linux distributions provide
the Qt libraries.

May I use Qt?

Chris M. Thomasson

unread,

Jun 12, 2009, 10:13:26 PM6/12/09

to

"blargg" <blarg...@gishpuppy.com> wrote in message

news:blargg.ei3-12...@192.168.1.4...

Yeah. Your right. I don't really know why I did it that way. I think I would
just go with something like:

while ((size = fread(buf, 1, BUFFER, file))) {

size_t tmpsize;

for (tmpsize = 0; tmpsize < size; ++tmpsize) {
++counts[buf[tmpsize]];
}

if (size < BUFFER) break;
}

Peter Jansson

unread,

Jun 13, 2009, 4:21:51 AM6/13/09

to

On Fri, 12 Jun 2009, Ioannis Vranos wrote:

> I think you are relaxing the rules. I know Qt programming, and most (if not
> all) Linux distributions provide the Qt libraries.
>
>
> May I use Qt?

It would be an interesting exercise to compare the result with a Qt
variant. Please provide the necessary compile flags (-Ix -Lx -lx) if you
have time to do it.

Sincerely,
Peter Jansson

Jonathan Lee

unread,

Jun 13, 2009, 12:11:24 PM6/13/09

to

> I have created a small programming challenge for those of you who are
> interested in challenging your Standard C++ programming skills.

If this has already been brought up then please ignore me, but how are
you ensuring the programs are reading from disk? For example, the
current program listed as fastest on your blog clocks in at .13s. So I
downloaded the source, downloaded the text file and unzipped it, and
ran it. Result: 0.17s. Fine, but I notice that my hard disk light
doesn't flash. So I restart the computer and run it again. Result:
6.1s.

I suspect that some of the times you've posted may be benefiting from
the fact that the file is in memory, and others are genuinely reading
it from disk.

--Jonathan

Jonathan Lee

unread,

Jun 13, 2009, 1:26:00 PM6/13/09

to

Here's a submission in C++. Nothing that interesting, but to
illustrate a
point: the speed of this program is dominated by the printing of the
floating point value representing percent. i.e., this line:

cout << ("abcdefghijklmnopqrstuvwxyz")[i] << ' ' << p <<"%\n";

if you remove "<< p" it will only take 0.158s on my computer. With
that in, it takes
0.514s. Perhaps it's just g++'s implementation of cout's float to text
processing.
In any event, I think we're at the mercy of the compiler's libs.

For what it's worth I also coded up a version that read longs instead
of chars, and
accumulated 16 bits at a time in the "count" array (fixing the
calculation at the
end). This didn't make much of a difference, though, in light of the
cout penalty
noted above.

//-----------------------------------------------------------------------------------

#include <cstdio>
#include <iostream>
using ::std::cout;
using ::std::endl;

#define BUFFSIZE 4096

int main(int argc, char* argv[]) {
if( argc <= 1 ) return (printf("Please provide a file name\n"), 1);
CStopWatch s;
s.startTimer();

if (FILE* f = fopen(argv[1], "r")) {
unsigned char buff[BUFFSIZE];
unsigned long count[256] = {0};
unsigned long total_alpha[26] = {0};

size_t bytesread;

while ((bytesread = fread(buff, 1, BUFFSIZE, f)) > 0) {
for (size_t i = 0; i < bytesread; ++i) {
++count[buff[i]];
}
}

unsigned long total = 0;
for (int i = 0; i < 26; ++i) {
unsigned long x;
x = count[("abcdefghijklmnopqrstuvwxyz")[i]]
+ count[("ABCDEFGHIJKLMNOPQRSTUVWXYZ")[i]];
total_alpha[i] = x;
total += x;
}
float p2 = 100.0f / total;
for (int i = 0; i < 26; ++i) {
float p = p2 * total_alpha[i];
cout << ("abcdefghijklmnopqrstuvwxyz")[i] << ' ' << p <<"%\n";
}

cout << endl;
fclose(f);
}
s.stopTimer();
cout << "Time elapsed: " << s.getElapsedTime() << endl;

return 0;
}

Peter Jansson

unread,

Jun 13, 2009, 3:34:15 PM6/13/09

to

On Sat, 13 Jun 2009, Jonathan Lee wrote:

> Here's a submission in C++. Nothing that interesting, but to
> illustrate a
> point: the speed of this program is dominated by the printing of the
> floating point value representing percent. i.e., this line:

/.../

Hi, you are quite right about the unreliable outcomes we see when counting
wall clock time during I/O. I have, however, included your program at the
comparison page:

http://blog.p-jansson.com/2009/06/programming-challenge-letter-frequency.html

Sincerely,
Peter Jansson

Keith H Duggar

unread,

Jun 13, 2009, 5:24:52 PM6/13/09

to

On Jun 13, 3:34 pm, Peter Jansson <webmas...@jansson.net> wrote:
> http://blog.p-jansson.com/2009/06/programming-challenge-letter-freque...

Why is that link programmed to refresh every three seconds? Can you
please change that to something more reasonable (15 seconds at least)
as it is very annoying.

Ioannis Vranos

unread,

Jun 13, 2009, 5:49:48 PM6/13/09

to

Opening the text file in binary mode is a mistake, so I think you should remove such answers as valid answers.

Hendrik Schober

unread,

Jun 13, 2009, 6:41:01 PM6/13/09

to

Chris M. Thomasson wrote:
> [...]

> I only care about the platforms in which it works.

And what about customers? Do you also only care about
customers for which it works? (I'd starve my kids doing
it this way.)

> [...]

Schobi

Chris M. Thomasson

unread,

Jun 13, 2009, 9:22:07 PM6/13/09

to

"Hendrik Schober" <spam...@gmx.de> wrote in message
news:h119ts$d5v$1...@hoshi.visyn.net...

> Chris M. Thomasson wrote:
>> [...]
>> I only care about the platforms in which it works.
>
> And what about customers?

You know of companies that are interested in the programs that this contest
has spawned?

> Do you also only care about
> customers for which it works? (I'd starve my kids doing
> it this way.)

I do not foresee anybody purchasing the little program I created for this
contest.

Jerry Coffin

unread,

Jun 13, 2009, 10:04:35 PM6/13/09

to

In article <blargg.ei3-10...@192.168.1.4>, blargg.ei3
@gishpuppy.com says...

[ ... ]

> Here's a simple solution that reads from stdin and doesn't do any timing

Simple, but wrong. The percentages it computes will (at least usually)
be incorrect.

[ ... ]

> while ( (c = getchar()) != EOF )
> {
> f [c]++;
> size++;

At least as I read the proposal, only letters should be counted toward
the total.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Jerry Coffin

unread,

Jun 13, 2009, 10:30:37 PM6/13/09

to

In article <12ba30e3-b04b-4c32-9625-
7bd26c...@g20g2000vba.googlegroups.com>, webm...@jansson.net
says...
> Dear news group,

>
> I have created a small programming challenge for those of you who are

> interested in challenging your Standard C++ programming skills. The
> challenge is about counting character frequency in large texts,
> perhaps useful for spam filtering or classical crypto analysis. You
> can read more about it here:

Yet another bit of code:

#include <iostream>
#include <fstream>
#include <vector>
#include <numeric>
#include <iomanip>

class counter {
std::vector<double> counts;
long total;
public:
static const int num = 'z'-'a'+1;

counter() : counts(num), total(0) {}

counter operator+(char c) {
char ch = tolower(c);
if (isalpha(ch)) {
++total;
++counts[ch-'a'];
}
return *this;
}

friend std::ostream &operator<<(std::ostream &os, counter const &c)
{
for (int i=0; i<num; ++i)
if (isalpha('a'+i))
// The single most complex part of the program is getting the
// output formatted as your web site shows it!
os << char('a'+ i) << " "
<< std::fixed
<< std::setw(6)
<< std::setprecision(3)
<< c.counts[i]/c.total*100.0 << "%\n";
return os;
}
};

int main(int argc, char **argv) {

std::ifstream input(argv[1]);

counter &counts = std::accumulate(
std::istream_iterator<char>(input),
std::istream_iterator<char>(),
counter());

std::cout << counts;
return 0;
}

I suppose it's open to argument whether I might be abusing
std::accumulate and/or operator overloading. I (obviously) don't really
think so, but I'll admit that especially in the latter case there's room
for question.

I don't have the right sort of machine handy to test it with, but I
believe this should work with something like EBCDIC. Making it work
correctly with something like almost any Unicode encoding would take
substantial modifications.

Chris M. Thomasson

unread,

Jun 13, 2009, 11:01:35 PM6/13/09

to

"Jonathan Lee" <cho...@shaw.ca> wrote in message
news:9c0295b2-66b2-4592...@x5g2000yqk.googlegroups.com...

It might be interesting to see a posted result that executed on a system in
which portions of the target file were not in memory. Then, run the program
10 or 20 times; take the average time (e.g., time command) and post that
result as well.