Is 64 Bit Faster Than 32 Bit

0 views

Skip to first unread message

Luz Ignasiak

unread,

Aug 3, 2024, 4:42:26 PM8/3/24

to tramunibeat

The good news is that none of us are starting from scratch. Humans are social animals by nature, and we collaborate regularly in all facets of our lives. Structures that encourage us to harness, rather than oppress these well-developed muscles, could have an immediate impact on our ability to collaborate effectively with each other.

Science supports the notion that individuals are limited in our ability to change. In particular, we do not seem biologically capable of accelerating our abilities to process things emotionally. However, groups are capable of transcending the limitations of its individual members, provided that they are structured optimally and that they interact effectively.

I create and run Collaboration Gyms, spaces for changemakers to practice the skills they need to work effectively in groups. I also partner with other skilled practitioners to develop and give away toolkits for helping groups achieve high-performance, including DIY Strategy / Culture.

Previously, I spent a decade helping groups learn how to come alive and collaborate more skillfully together. I co-founded two social change consultancies in this pursuit: Blue Oxen Associates and Groupaya. I worked with C-level business leaders and social activists, rocket scientists and spies, billionaires and hackers, foundations and farmers. My work took me to nine countries across five different continents.

Check your input/output methods. In C++, using cin and cout is too slow. Use these, and you will guarantee not being able to solve any problem with a decent amount of input or output. Use printf and scanf instead.

Can someone please clarify this? Is really using scanf() in C++ programs faster than using cin >> something ? If yes, that is it a good practice to use it in C++ programs? I thought that it was C specific, though I am just learning C++...

EDIT: User clyfish points out below that the speed difference is largely due to the iostream I/O functions maintaining synchronization with the C I/O functions. We can turn this off with a call to std::ios::sync_with_stdio(false);:

Probably scanf is somewhat faster than using streams. Although streams provide a lot of type safety, and do not have to parse format strings at runtime, it usually has an advantage of not requiring excessive memory allocations (this depends on your compiler and runtime). That said, unless performance is your only end goal and you are in the critical path then you should really favour the safer (slower) methods.

There is a very delicious article written here by Herb Sutter "The String Formatters of Manor Farm" who goes into a lot of detail of the performance of string formatters like sscanf and lexical_cast and what kind of things were making them run slowly or quickly. This is kind of analogous, probably to the kind of things that would affect performance between C style IO and C++ style. The main difference with the formatters tended to be the type safety and the number of memory allocations.

I was getting TLE (time limit exceeded) on my submissions. On these problem solving online judge sites, you have about a 2-3 second time limit to handle potentially thousands of test cases used to evaluate your solution. For computationally intensive problems like this one, every microsecond counts.

The thing is: In C++, whenever you use cin and cout, a synchronization process takes place by default that makes sure that if you use both scanf and cin in your program, then they both work in sync with each other. This sync process takes time. Hence cin and cout APPEAR to be slower.

There are stdio implementations (libio) which implements FILE* as a C++ streambuf, and fprintf as a runtime format parser. IOstreams don't need runtime format parsing, that's all done at compile time. So, with the backends shared, it's reasonable to expect that iostreams is faster at runtime.

Yes iostream is slower than cstdio.
Yes you probably shouldn't use cstdio if you're developing in C++.
Having said that, there are even faster ways to get I/O than scanf if you don't care about formatting, type safety, blah, blah, blah...

The problem is that cin has a lot of overhead involved because it gives you an abstraction layer above scanf() calls. You shouldn't use scanf() over cin if you are writing C++ software because that is want cin is for. If you want performance, you probably wouldn't be writing I/O in C++ anyway.

Of course it's ridiculous to use cstdio over iostream. At least when you develop software (if you are already using c++ over c, then go all the way and use it's benefits instead of only suffering from it's disadvantages).

Even if scanf were faster than cin, it wouldn't matter. The vast majority of the time, you will be reading from the hard drive or the keyboard. Getting the raw data into your application takes orders of magnitude more time than it takes scanf or cin to process it.

I'd like to address the comment that nothing indicates that the different jump instructions take the same amount of time. This one is a little tricky to answer, but here's what I can give: In the Intel Instruction Set Reference, they are all grouped together under one common instruction, Jcc (Jump if condition is met). The same grouping is made together under the Optimization Reference Manual, in Appendix C. Latency and Throughput.

If one thinks about the actual circuitry used to implement the instructions, one can assume that there would be simple AND/OR gates on the different bits in EFLAGS, to determine whether the conditions are met. There is then, no reason that an instruction testing two bits should take any more or less time than one testing only one (Ignoring gate propagation delay, which is much less than the clock period.)

Historically (we're talking the 1980s and early 1990s), there were some architectures in which this was true. The root issue is that integer comparison is inherently implemented via integer subtractions. This gives rise to the following cases.

Now, when A < B the subtraction has to borrow a high-bit for the subtraction to be correct, just like you carry and borrow when adding and subtracting by hand. This "borrowed" bit was usually referred to as the carry bit and would be testable by a branch instruction. A second bit called the zero bit would be set if the subtraction were identically zero which implied equality.

So, on some machines, using a "less than" comparison might save one machine instruction. This was relevant in the era of sub-megahertz processor speed and 1:1 CPU-to-memory speed ratios, but it is almost totally irrelevant today.

Assuming we're talking about internal integer types, there's no possible way one could be faster than the other. They're obviously semantically identical. They both ask the compiler to do precisely the same thing. Only a horribly broken compiler would generate inferior code for one of these.

On PowerPC, first this performs a floating point comparison (which updates cr, the condition register), then moves the condition register to a GPR, shifts the "compared less than" bit into place, and then returns. It takes four instructions.

This requires the same work as compare_strict above, but now there's two bits of interest: "was less than" and "was equal to." This requires an extra instruction (cror - condition register bitwise OR) to combine these two bits into one. So compare_loose requires five instructions, while compare_strict requires four.

At the very least, if this were true a compiler could trivially optimise a b), and so even if the comparison itself were actually slower, with all but the most naive compiler you would not notice a difference.

Other answers have concentrated on x86 architecture, and I don't know the ARM architecture (which your example assembler seems to be) well enough to comment specifically on the code generated, but this is an example of a micro-optimisation which is very architecture specific, and is as likely to be an anti-optimisation as it is to be an optimisation.

There are probably some architectures where this is an optimisation, but I know of at least one architecture where the opposite may be true. The venerable Transputer architecture only had machine code instructions for equal to and greater than or equal to, so all comparisons had to be built from these primitives.

Even then, in almost all cases, the compiler could order the evaluation instructions in such a way that in practice, no comparison had any advantage over any other. Worst case though, it might need to add a reverse instruction (REV) to swap the top two items on the operand stack. This was a single byte instruction which took a single cycle to run, so had the smallest overhead possible.

Whether or not a micro-optimisation like this is an optimisation or an anti-optimisation depends on the specific architecture you are using, so it is usually a bad idea to get into the habit of using architecture specific micro-optimisations, otherwise you might instinctively use one when it is inappropriate to do so, and it looks like this is exactly what the book you are reading is advocating.

They have the same speed. Maybe in some special architecture what he/she said is right, but in the x86 family at least I know they are the same. Because for doing this the CPU will do a substraction (a - b) and then check the flags of the flag register. Two bits of that register are called ZF (zero Flag) and SF (sign flag), and it is done in one cycle, because it will do it with one mask operation.

This would be highly dependent on the underlying architecture that the C is compiled to. Some processors and architectures might have explicit instructions for equal to, or less than and equal to, which execute in different numbers of cycles.

You should not be able to notice the difference even if there is any. Besides, in practice, you'll have to do an additional a + 1 or a - 1 to make the condition stand unless you're going to use some magic constants, which is a very bad practice by all means.