Generally speaking, these kinds of differences can have several origins that make them frustrating, if not impossible, to track down and eliminate.
CPU manufacturer, compiler, and compiler settings (debug vs. release, among others) can all make a difference.
For example, when doing floating-point calculations with 64-bit doubles, Intel's legacy x87 unit actually uses an 80-bit representation internally. Intermediate results carry extra precision and only get rounded back to 64 bits when stored, which can give slightly different answers than formal IEEE 754 rounding at every step. Other processor architectures do not do this.
I would certainly expect differences between x86 and arm64.
Some compilers have flags to force formal IEEE 754 compliance (some 'robust' geometry algorithms will fail without it). Microsoft, for example, provides /fp:fast, /fp:precise, and /fp:strict. An optimizing compiler can also introduce a fused multiply-add (FMA) instruction that combines the two operations without any intermediate rounding. That instruction will not be emitted in a debug build -- or perhaps by a different compiler that misses that particular optimization opportunity.
I believe Java includes strict IEEE 754 compliance as a part of the language spec (it originally did, was relaxed for a while, and strict evaluation became the default again in Java 17) -- but there has historically been a performance penalty in raw number crunching because of it. (I am not saying Java is slow, or other languages are fast.) It is nice in that it guarantees consistency.
In my experience, these kinds of problems are much more prevalent with float than with double. Code-Eli (the curve/surface library behind OpenVSP's Bezier math) is all templated C++ code, so it can easily be run with any data type. The unit tests are set up for float, double, and long double. A long time ago, we even tested a quad-double library (it had lots of other problems). The unit tests there are a nightmare of numeric comparisons -- trying to work around all the different results we get based on data type, CPU, compiler, etc.
Single-precision floats are a nightmare. Yes, they take up less memory and are theoretically faster -- but their error is _huge_ compared to a double's (and compared to relevant quantities). I can't tell you how often I've seen algorithms fail with float -- not just subtle differences, but failure to converge, or failure to calculate anything resembling the correct result.
One of the hardest things in OpenVSP (in terms of numerical precision) is the fact that we're dimensionless -- the user chooses the scale of the model. Some people choose to model their aircraft in mm -- the span of a 747 is a very big number in mm. Other people model their aircraft in m -- the span of a small drone is a very small number in m. This means you need to handle numbers for the 'same quantity' that vary by six orders of magnitude in practice. Now you need to write (for example) a Newton's method solver that will project a 3D point onto a Bezier surface. You're going to be calculating derivatives dXYZ/dUV that will vary by six orders of magnitude. Doing this with floats would be an absolute disaster.
Scaling / shifting the problem before serious computation is an important step (one OpenVSP does not take often enough). We also don't have a good enough unit testing framework -- you should test the 'same' problems at a huge range of scales. Instead, the developers usually model medium-sized aircraft and usually work in feet -- so a certain range of scales gets tested much more frequently than the ones others would choose.
Shifting can be as important as scaling -- working at a point near the origin gives small numbers by definition, while working near the wingtip gives much larger ones.
Watch out for quantities that vary greatly in scale. Integrating a trajectory might work great for many projects -- until you decide to work on a solar-powered aircraft that flies for weeks or months (the rounding error of a 32-bit float measuring one month in seconds is about 0.15 seconds; for an hour, it is about 0.00021 s). What is your timestep? What is your tolerance?
If someone is going to write a serious numerical algorithm with single precision, they should make sure it is trivial to swap in double precision -- through a preprocessor #define, templates / generic programming, or whatever it takes in their language. This provides two advantages:
1) They can unit test and regression test float vs. double. This will help convince them that floats are OK for their application.
2) They can measure the memory and performance 'gains' of using floats -- is all this hassle worth it?
Developing a reliable numeric algorithm using floats is going to take a lot more diligence and attention to esoteric details than doing the same thing with doubles. Those skills aren't practiced and taught the way they were 30 years ago. I suppose a code that has had all of these problems chased out of it with float will be an even better program with double -- by that logic, perhaps we should write and debug our programs with half-precision numbers so these issues become easier to find and life becomes more frustrating...
Rob