OWSTL benchmark experiment.

Peter C. Chapin

unread,

Jun 9, 2006, 1:18:29 AM6/9/06

to

One of the URLs posted by Stephen contained a link to a posting by
Stroustrup with a simple STL benchmark program. I downloaded the program
and gave it a spin.

I had to make a number of adjustments to account for unimplemented items
in OWSTL. For example the deque tests didn't compile; I believe we are
missing certain necessary deque iterator operations. I didn't look at it
very closely; I just commented that test out for now. The tests also
used template constructors to initialize various containers. We don't
have those so I changed the initialization code to use a different
method.

The tests exercise several sequences: a plain array, a vector being
accessed by pointers, a vector being accessed by vector iterators, and a
list. Each sequence is filled with a randomly shuffled collection of
doubles and then sorted. In the original program, std::unique() was also
called, but we don't have that yet so I skipped that part.

The tests also exercise the set and multiset associative containers. The
set is loaded up with data but otherwise nothing is done with it. The
multiset is loaded with data and then iterated over. Multiple occurances
of equal times are then removed. Such occurances exist: the data used to
initialize the containers has exactly two copies of every value.

The tests are run on different sized containers from n == 10 to n ==
1000000. The number of times the tests are run are scaled in such a way
as to cause the overall run time to be consistent for the different n
values. Thus for the n == 10 case you see a lot of setup overhead. For
the n == 1000000 case you see, basically, std::sort (for the sequences
anyway).

Note that we don't have list::sort() yet so the list test doesn't
actually sort anything. This is why it is abnormally fast in the numbers
below. However, I used exactly the same program on three different
compilers so there is consistency from run to run at least. I did all
runs on the same hardware and operating system (Pentium M processor,
Windows XP). Note that g++ was run under Cygwin.

The results...

Open Watcom v1.6beta (using wcl386 -ox bench.cpp)

size array vec_p vec_i list set multiset
10 1.76 2 2 1.83 1.96 5.21
100 1.56 1.74 1.74 0.881 1.37 3.15
1000 1.49 1.57 1.56 0.6 1.2 2.49
10000 1.47 1.56 1.59 0.461 1.45 3.04
100000 1.52 1.58 1.58 0.371 3.58 10.9
^C

I killed the n = 1000000 run after about five minutes. I suspect there
may be some problem with set or multiset (note the rather dramatic jump
in time for multiset when n == 100000). It should also be mentioned that
OW's random_shuffle() might not be producing the desired results for
sequences that are greater than 32K items. It should probably be using a
random number generator that produces 32 bit random values.

Visual C++ v8.0 (using cl /EHsc /Ox bench.cpp)

size array vec_p vec_i list set multiset
10 0.721 2.62 2.64 3.01 2.47 4.77
100 0.461 0.711 0.711 1.42 1.52 2.82
1000 0.501 0.57 0.571 1.78 1.61 2.88
10000 0.561 0.641 0.651 1.87 1.53 2.99
100000 0.581 0.671 0.681 1.56 2.79 6.46
1000000 0.571 0.681 0.691 1.36 4.45 6.98

It appears that VC++'s std::sort is about twice as fast as ours.
Probably switching to QuickSort would help with that. Notice also that
OW seemed to be able to optimize the overhead cruft better than VC++. I
don't understand why OW's list test is so much faster. Maybe OW has a
better memory allocator? (Recall that the list test doesn't sort... it
just builds a list).

g++ v3.4.4 (using g++ -o bench -O bench.cpp)

size array vec_p vec_i list set multiset
10 1.14 4.55 4.67 9.04 6.31 11.8
100 0.411 0.751 0.881 4.48 3.58 6.34
1000 0.421 0.51 0.641 3 2.59 4.44
10000 0.42 0.491 0.641 2.22 2.1 3.52
100000 0.421 0.48 0.581 1.71 2.22 4.14
1000000 0.501 0.541 0.62 1.57 3.83 5.82

Overall g++ does pretty well. The overhead costs seem particularly high.
I also notice g++ adds a penalty for using vector iterators. That's not
so good.

In conclusion: considering OWSTL's immature state I think it did okay on
this test. We could use a faster sort. There might be something broken
in set or multiset (or elsewhere... there was some kind of problem with
the n == 10000000 case). And of course we need to implement more of the
STL.

For those of you who want to play I'm attaching the program to the
bottom of this message. It's not particularly long. I think a program
like this would be a nice addition to bld\bench\owstl.

Peter

----> cut <----

#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <ctime>

#include <algorithm>
#include <deque>
#include <iterator>
#include <list>
#include <set>
#include <vector>

#include <iostream>
#include <iomanip>

typedef double element_t;

using namespace std;

// Results are pushed into this vector
vector<double> result_times;

void do_head()
{
cout << "size" <<
"\tarray"
"\tvec_p"
"\tvec_i"
// "\tdeque"
"\tlist"
"\tset"
"\tmultiset" << '\n';
}

void do_row(int size)
{
cout << size;
cout << setprecision(3);
for (size_t i = 0; i < result_times.size(); ++i)
cout << '\t' << result_times[i];
cout << '\n';
}

clock_t start_time;

inline void start_timer() { start_time = clock(); }

inline double timer()
{
clock_t end_time = clock();
return (end_time - start_time)/double(CLOCKS_PER_SEC);
}

typedef void(*Test)(element_t*, element_t*, int);

void run(Test f, element_t* first, element_t* last, int number_of_times)
{
start_timer();
while (number_of_times-- > 0) f(first, last, number_of_times);
result_times.push_back(timer());
}

void array_test(element_t* first, element_t* last, int)
{
element_t* array = new element_t[last - first];
copy(first, last, array);
sort(array, array + (last - first));
// unique(array, array + (last - first));
delete [] array;
}

void vector_pointer_test(element_t* first, element_t* last, int)
{
// vector<element_t> container(first, last);
vector<element_t> container;
copy(first, last, back_inserter(container));
element_t *p = &*container.begin();
sort(p, p + (last - first));
// unique(p, p + (last-first));
}

void vector_iterator_test(element_t* first, element_t* last, int)
{
// vector<element_t> container(first, last);
vector<element_t> container;
copy(first, last, back_inserter(container));
sort(container.begin(), container.end());
// unique(container.begin(), container.end());
}

#ifdef NEVER
void deque_test(element_t* first, element_t* last, int)
{
// deque<element_t> container(first, last);
deque<element_t> container;
copy(first, last, back_inserter(container));
copy(first, last, container.begin());
sort(container.begin(), container.end());
// unique(container.begin(), container.end());
}
#endif

void list_test(element_t* first, element_t* last, int)
{
// list<element_t> container(first, last);
list<element_t> container;
copy(first, last, back_inserter(container));
// container.sort();
// container.unique();
}

void set_test(element_t* first, element_t* last, int)
{
// set<element_t> container(first, last);
set<element_t> container;
while (first != last) {
container.insert(*first);
++first;
}
}

void multiset_test(element_t* first, element_t* last, int)
{
// multiset<element_t> container(first, last);
multiset<element_t> container;
while (first != last) {
container.insert(*first);
++first;
}

typedef multiset<element_t>::iterator iterator;
{
iterator first = container.begin();
iterator last = container.end();

while (first != last) {
iterator next = first;
if (++next == last) break;
if (*first == *next)
container.erase(next);
else
++first;
}
}
}

void initialize(element_t* first, element_t* last)
{
element_t value = 0.0;
while (first != last) {
*first++ = value;
value += 1.0;
}
}

double logtwo(double x)
{
return log(x)/log(2.0);
}

const int largest_size = 1000000;

int number_of_tests(int size)
{
double n = size;
double largest_n = largest_size;
return int(floor((largest_n * logtwo(largest_n)) / (n * logtwo(n))));
}

void run_tests(int size)
{
const int n = number_of_tests(size);
const size_t length = 2*size;
result_times.clear();

// make a random test set of the chosen size:
vector<element_t> buf(length);
element_t* buffer = &*buf.begin();
element_t* buffer_end = buffer + length;
initialize(buffer, buffer + size);
initialize(buffer + size, buffer_end);
random_shuffle(buffer, buffer_end);

// test the containers:
run(array_test, buffer, buffer_end, n);
run(vector_pointer_test, buffer, buffer_end, n);
run(vector_iterator_test, buffer, buffer_end, n);
// run(deque_test, buffer, buffer_end, n);
run(list_test, buffer, buffer_end, n);
run(set_test, buffer, buffer_end, n);
run(multiset_test, buffer, buffer_end, n);
do_row(size);
}

int main()
{
do_head();
const int sizes [] = {10, 100, 1000, 10000, 100000, 1000000};
const int n = sizeof(sizes)/sizeof(int);
for (int i = 0; i < n; ++i) run_tests(sizes[i]);
return 0;
}

jonathan bailey

unread,

Jun 9, 2006, 1:18:32 AM6/9/06

to

"Peter C. Chapin" <pch...@sover.net> wrote in message news:Xns97BADED5D9A0...@206.225.93.118...

--------------------

Here's what I got:

Intel Compiler 9.0 (latest version):

size array vec_p vec_i list set multiset

10 0.484 1.73 1.63 4.69 3.34 6.42
100 0.313 0.468 0.422 2.27 1.88 3.31
1000 0.485 0.531 0.531 1.61 1.58 3.23
10000 0.64 0.657 0.656 1.83 1.75 3.45
100000 0.532 0.656 0.625 1.5 4.05 6.14
1000000 0.547 0.593 0.594 1.33 5.42 7.47

OW 1.6Beta:

size array vec_p vec_i list set multiset

10 1.72 2.11 2.11 1.91 1.66 6.3
100 1.77 1.88 2.02 0.906 1.42 3.64
1000 1.78 1.86 1.86 0.61 1.38 3.36
10000 1.88 2.05 1.92 0.484 1.95 5.59
100000 2.16 2.27 2.22 0.391 21 38.6
1000000 2.94 2.91 2.75 0.359 43.8 537
---------------

John

Stephen Howe

unread,

Jun 9, 2006, 1:18:32 AM6/9/06

to

> It appears that VC++'s std::sort is about twice as fast as ours.
> Probably switching to QuickSort would help with that.

It would, with using what is currently known as best.
And that means

(i) Musser's hybrid introsort - quicksort at best, heapsort at worst
(ii) 3-way partition rather than 2-way partition. Good for duplicates.
(iii) median selection-by-3 (and 9 for large numbers)

Some tinkering would be interesting as well.
a) Sedgewick talks about not running insertion sort on small sequences
internally but instead doing a final 1-pass insertion sort on a nearly
sorted final sequence. I am unconvinced by that. First, if heapsort has been
run over any portion of the array because quicksort failed to partition, its
portion's wont need a final insertion sort. Second, if a small insertion
sort is being run, the chances are the sequence has been recently
partitioned and therefore is in the chips L1/L2 cache.
STLPort did do this. I am not sure if they still do.
b) I would be interested if binary insertion sort gives better figures
instead of straight insertion sort for small sequences. Some instrumented
class with counts for cctors, ctors, assignment ops, dtors would be good.
Since sort() uses random interators, this can be done. It might be
unmeasureable.
c) I would be interested in substituting shellsort with decent gaps for
heapsort in worst cases for low N. For 16-bit OW, heapsort probably be
#ifdefined out as I think N, the cross-over point where heapsort overtakes
shellsort is well beyond pow(2 , 16). Some verification needed.
One of Heapsorts weaknesses (and there are not many) is that it has poor
locality of reference (cache unfriendly).
d) Sedgewick talks about Bentley & McIlroys work in 1993 in that getting a
good 3-way partition involves partitioning into 5 regions (duplicates found
on left, less than elements, unknown middle being currently partitioned,
more than elements, duplicates found on right) and then swapping duplicates
into centre afterwards into final position. This seems extraordinary. This
is the Dutch National Flag problem. And from what I know algorithmically it
is 2/3 N at worst with Dikstras algorithm or 5/9 N worst using an
alternative (current best). Still Bentley & McIlroys are geniuses, it just
seem odd.
This is probably the _CRITICAL_ bit to get right, as quicksort spends most
of its time partitioning, therefore getting a 3-way partition to optimally
partition with the minimum number of swaps is vital.
e) And on median selection, Bentley's & McIlroys methods have influenced
libraries today. You will find it in OW's qsort().
They use median-of-9 (median-of 3 of a median-of-3) for N >= 41 and
median-of-3 for N < 41. 41 seems a low limit for this, but again, they did
test this.
I would have thought with 41 elements, the effort to find the median-of-9
would not bring substantial benefits for the partition. Again, they must hav
measured this

> In conclusion: considering OWSTL's immature state I think it did okay on
> this test. We could use a faster sort. There might be something broken

> in set or multiset...

might be.

> For those of you who want to play I'm attaching the program to the
> bottom of this message. It's not particularly long. I think a program
> like this would be a nice addition to bld\bench\owstl.

I agree with you. Thanks for the work.
But I think the bits that dont work should be commented out and we will
uncomment them as the compiler/libraries support them later on.

Stephen Howe

Daniel

unread,

Jun 9, 2006, 1:18:34 AM6/9/06

to

> The results...
>
> Open Watcom v1.6beta (using wcl386 -ox bench.cpp)
>
> size array vec_p vec_i list set multiset
> 10 1.76 2 2 1.83 1.96 5.21
> 100 1.56 1.74 1.74 0.881 1.37 3.15
> 1000 1.49 1.57 1.56 0.6 1.2 2.49
> 10000 1.47 1.56 1.59 0.461 1.45 3.04
> 100000 1.52 1.58 1.58 0.371 3.58 10.9
> ^C
>
> I killed the n = 1000000 run after about five minutes. I suspect there
> may be some problem with set or multiset (note the rather dramatic jump
> in time for multiset when n == 100000). It should also be mentioned that
> OW's random_shuffle() might not be producing the desired results for
> sequences that are greater than 32K items. It should probably be using a
> random number generator that produces 32 bit random values.
>

my results are similar:

size array vec_p vec_i list set multiset

10 1.09 1.42 1.42 1.48 1.28 4.92
100 1.36 1.5 1.48 0.719 1.11 2.81
1000 1.39 1.47 1.45 0.469 1.06 2.38
10000 1.45 1.51 1.5 0.375 1.42 3.77
100000 1.67 1.72 1.72 0.297 20.5 27.9
1000000 1.98 2.03 2.02 0.266 36.2 369

I'm not going to say that there definitely isn't a problem with rb tree because
as far as I know no one else has looked at the code in any detail yet but... I'm
not so sure if it is the culprit. As you said Peter, the profiler reveals
_nfree_ as taking ~~73% when just running the set test.

I ran across this problem before. The free function looks to me like it looks
for pointers recently allocated and then degrades to a linear search if it
fails. I guess the order allocs and deallocs are done for the set gets mangled
up due to the sorting but not for list etc. and this means it ends up having to
use linear search all the time. I took my nasty block allocator from the bench
directory and added a quick fudge to deallocate the chunks when it is destroyed.
Got these timings when using this allocator for the set and multiset tests:

size array vec_p vec_i list set multiset

10 1.13 1.45 1.45 1.49 1.39 4.22
100 1.39 1.53 1.51 0.719 0.86 2.14
1000 1.42 1.5 1.48 0.485 0.828 1.7
10000 1.47 1.55 1.55 0.375 0.796 1.74
100000 1.7 1.77 1.75 0.281 2 4.53
1000000 2 2.06 2.05 0.265 2.25 5.02

Still not sure it is quite right, as I don't know why it doubles between 10^4
and 10^5. Something to look into...

I guess we might need to discuss the standard alloc/dealloc code. Is it ok for
what it does?

heres a diff of the files if you want to have a go:

15a16,17
> #include "fastalloc.hpp"
>
115c117,120
< set<element_t> container;
---
> set<element_t,less< element_t >,BlockAlloc<element_t> > s_t;
> //set<element_t> s_t;
> s_t container;
>
125c130,132
< multiset<element_t> container;
---
> typedef multiset<element_t,less< element_t >,BlockAlloc<element_t> > ms_t;
> //typedef multiset<element_t> ms_t;
> ms_t container;
131c138

< typedef multiset<element_t>::iterator iterator;

---
> typedef ms_t::iterator iterator;

21c21,26
< ~BlockAlloc( ) { }
---
> ~BlockAlloc( ) {
> std::list<pointer>::iterator i;
> for( i = mAllocatedChunks.begin(); i!=mAllocatedChunks.end(); ++i){
> delete (*i);
> }
> }
43a49
> mAllocatedChunks.push_front( mFreeChunk );
76a83
> std::list<pointer> mAllocatedChunks;

Stephen Howe

unread,

Jun 9, 2006, 1:18:41 AM6/9/06

to

I have just inspected <memory>.
The allocator is calling new operator (the high-level version that also will
call set_new_handler)
It should be calling operator new (the low-level version that just allocates
raw memory) - which is right - that is what allocator does - just
allocate/deallocate raw memory.
And that is not semantic word games.
That change alone should increase 0.5%, possibly even more.

Instead of

>>>>>>>>
pointer allocate( size_type n, allocator< void >::const_pointer = 0 )
{ return( reinterpret_cast< Type * >(
new unsigned char[ n * sizeof( Type ) ] ) ); }

void deallocate( pointer p, size_type )
{ delete [] reinterpret_cast< unsigned char * >( p ); }
>>>>>>>>

it should be

>>>>>>>>
pointer allocate( size_type n, allocator< void >::const_pointer = 0 )
{ return (reinterpret_cast< Type * >operator new(n * sizeof( Type )); }

void deallocate( pointer p, size_type )
{ operator delete (reinterpret_cast< void * >(p)); }
>>>>>>>>

at the very least.

> Can we provide a std::allocator that does something a little smarter
> than just calling new? After all std::allocator<T> knows the type it is
> allocating. Surely that information would be good for something. I
> haven't reviewed this part of the standard recently; I'm not sure how
> much flexibility we have here.

I think there is room. Before you said all this I thought that
std::allocator<T> has to call operator new.
Now having reviewed standard and online arguments, I think my belief was
mistaken.

Stephen Howe

unread,

Jun 9, 2006, 1:18:41 AM6/9/06

to

> Can we provide a std::allocator that does something a little smarter
> than just calling new? After all std::allocator<T> knows the type it is
> allocating. Surely that information would be good for something. I
> haven't reviewed this part of the standard recently; I'm not sure how
> much flexibility we have here.

Found it. And it means operator new _MUST_ be called
(SJHowe note: I am surprised at this. It does not seem to leave room for
custom allocators. Buut it may well do. I have a sneaking suspicion that
this is describing std::allocator. The room may be open to use other
allocators, even for standard containers. I think I will ask on
comp.lang.c++.moderated what the "wiggle" room is on this).
You want 20.4.1.1 in the standard

>>>>>>>>>>>>>>>>>>>>
20.4.1.1 allocator members [lib.allocator.members]

pointer address(reference x) const;

1 Returns: &x.
const_pointer address(const_reference x) const;

2 Returns: &x.
pointer allocate(size_type n, allocator<void>::const_pointer hint=0);

3 Notes: Uses ::operator new(size_t) (18.4.1).

4 Requires: hint either 0 or previously obtained from member allocate and
not yet passed to member
deallocate. The value hint may be used by an implementation to help improve
performance216).

5 Returns: a pointer to the initial element of an array of storage of size n
* sizeof(T), aligned appropriately
for objects of type T.

6 Note: the storage is obtained by calling ::operator new(size_t), but it is
unspecified when or
how often this function is called. The use of hint is unspecified, but
intended as an aid to locality if
an implementation so desires.

7 Throws: bad_alloc if the storage cannot be obtained.

216) In a container member function, the address of an adjacent element is
often a good choice to pass for this argument.
>>>>>>>>>>>>>>>>>>>>

Stephen Howe

Daniel

unread,

Jun 9, 2006, 1:18:57 AM6/9/06

to

I've attached another little test. I think it shows the problem is in the
memory management, not that that means there isn't a problem with rbtree!

I'm getting timings like this:

length: 10 loops: 300000 time: 0.453
length: 100 loops: 30000 time: 0.484
length: 300 loops: 10000 time: 0.516
length: 1000 loops: 3000 time: 0.531
length: 3000 loops: 1000 time: 0.563
length: 10000 loops: 300 time: 0.828
length: 30000 loops: 100 time: 2.156
length: 100000 loops: 30 time: 5.141
length: 200000 loops: 15 time: 13.797

on one machine and another is simular:

length: 10 loops: 300000 time: 0.422
length: 100 loops: 30000 time: 0.468
length: 300 loops: 10000 time: 0.453
length: 1000 loops: 3000 time: 0.532
length: 3000 loops: 1000 time: 0.781
length: 10000 loops: 300 time: 0.969
length: 30000 loops: 100 time: 1.234
length: 100000 loops: 30 time: 2.734
length: 200000 loops: 15 time: 4.094

but with the shuffle rem'ed out this:

C:\prog\code\test\bench3.exe
length: 10 loops: 300000 time: 0.296
length: 100 loops: 30000 time: 0.282
length: 300 loops: 10000 time: 0.297
length: 1000 loops: 3000 time: 0.281
length: 3000 loops: 1000 time: 0.297
length: 10000 loops: 300 time: 0.297
length: 30000 loops: 100 time: 0.328
length: 100000 loops: 30 time: 0.328
length: 200000 loops: 15 time: 0.328

and

length: 10 loops: 300000 time: 0.25
length: 100 loops: 30000 time: 0.281
length: 300 loops: 10000 time: 0.281
length: 1000 loops: 3000 time: 0.266
length: 3000 loops: 1000 time: 0.312
length: 10000 loops: 300 time: 0.282
length: 30000 loops: 100 time: 0.281
length: 100000 loops: 30 time: 0.281
length: 200000 loops: 15 time: 0.281

profiler shows time most of the time in the shuffled case is in
__MemFree. I just tried the other bench mark again and that says __MemFree is
the worst, not _nfree. I'm not sure what I did differently today. __MemFree is
called from _nfree, I supose it is all part of the same thing :S

I guess the allocator is being used differently to how it usually is in c for
this not to have been noted before. We will need to provide some nicer c++
allocators me thinks.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

I'm sure Peter wont mind me quoting his results as well:

I tried your benchmark program out here. I observed the same effect you
described: much faster timings with the random_shuffle() commented out.
Here is what I got on my hardware *with* the random_shuffle():

OW v1.6beta

length: 10 loops: 300000 time: 0.581
length: 100 loops: 30000 time: 0.47
length: 300 loops: 10000 time: 0.461
length: 1000 loops: 3000 time: 0.491
length: 3000 loops: 1000 time: 0.581
length: 10000 loops: 300 time: 0.841
length: 30000 loops: 100 time: 1.001
length: 100000 loops: 30 time: 1.562
length: 200000 loops: 15 time: 2.905

Interestingly, OW was typically faster than VC++ or g++. Here is VC++
v8.0:

length: 10 loops: 300000 time: 0.981
length: 100 loops: 30000 time: 0.841
length: 300 loops: 10000 time: 0.901
length: 1000 loops: 3000 time: 1.262
length: 3000 loops: 1000 time: 1.402
length: 10000 loops: 300 time: 1.522
length: 30000 loops: 100 time: 1.653
length: 100000 loops: 30 time: 2.113
length: 200000 loops: 15 time: 2.694

It seems like in this case OW's performance *degrades* to that of VC++.
Maybe the library allocator isn't too bad after all.

g++ v3.4.4 is even worse:

length: 10 loops: 300000 time: 2.623
length: 100 loops: 30000 time: 2.444
length: 300 loops: 10000 time: 2.423
length: 1000 loops: 3000 time: 2.434
length: 3000 loops: 1000 time: 2.433
length: 10000 loops: 300 time: 2.484
length: 30000 loops: 100 time: 2.453
length: 100000 loops: 30 time: 2.554
length: 200000 loops: 15 time: 2.894

bench3.cpp

Michal Necasek

unread,

Jun 9, 2006, 1:18:33 AM6/9/06

to

Would it be too much trouble to not quote reams of irrelevant text?

Michal

Stephen Howe

unread,

Jun 9, 2006, 1:18:58 AM6/9/06

to

> I guess the allocator is being used differently to how it usually is in c
for
> this not to have been noted before. We will need to provide some nicer
c++
> allocators me thinks.

Yes you are right. But also need to evaluate whether one memory allocator is
suitable for all OS's.
For 16-bit programs, there is no way the memory allocator will ever be able
to allocate over 1,000,000 blocks - therefore something simple should be
sufficient. For 32-bit and 64-bit programs, the allocator could be asked to
maange something.

Also one thing VC has had since 6.0 is a small memory allocator. It is most
effective for small C++ objects.
Above a settable threshold, it goes to the main allocator but below that the
small memory allocator kicks in.
Interestingly the small memory allocator was accidentally turned off in VC
7.0 (VC 2003) and it was noticed.
It is supposed be turned back on in the anticipated service pack for VC 8.0
(VC 2005).

See
http://tinyurl.com/gy6ap

Stephen Howe