Komodo 2.03 SSE42 available

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12545
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Komodo 2.03 SSE42 available

Post by Dann Corbit »

Dann Corbit wrote:
rbarreira wrote:
Dann Corbit wrote:
Dann Corbit wrote:
rbarreira wrote:
Dann Corbit wrote: And if an intermediate calculation for the product of two 8-byte floats is stored in an 80-bit register and an IRQ fires, where is the data stored?
In memory, probably using these instructions or similar, which as mentioned in the link save and restore "the entire floating-point unit state".
Dann Corbit wrote:Why not read this:
http://msdn.microsoft.com/en-us/library/e7s85ffb.aspx
And specifically where they talk about 80 bit operations and loss of precision
I read it. It doesn't talk about IRQs or context switches at all, only about code generation, just as every compiler documentation I've seen about floating point optimization options.
"Improves the consistency of floating-point tests for equality and inequality by disabling optimizations that could change the precision of floating-point calculations, which is required for strict ANSI conformance. By default, the compiler uses the coprocessor's 80-bit registers to hold the intermediate results of floating-point calculations. This increases program speed and decreases program size. Because the calculation involves floating-point data types that are represented in memory by less than 80 bits, however, carrying the extra bits of precision (80 bits minus the number of bits in a smaller floating-point type) through a lengthy calculation can produce inconsistent results."
Imagine this bit in all caps:
Because the calculation involves floating-point data types that are represented in memory by less than 80 bits, however, carrying the extra bits of precision (80 bits minus the number of bits in a smaller floating-point type) through a lengthy calculation can produce inconsistent results.
Again, that doesn't refer to context switching. It refers to keeping full-precision when not requested by the source code, which may give different results from stricter code generation which rounds and writes to memory (to a data type with less precision) at every assignment operator in the source code.

Until you find even a single authoritative source which talks about loss of fp precision on context switching in, say, Windows or Linux, I'm going to ignore further posts in this thread...
Apparently, my information was dated. Here is the latest from Agner Fog's assembly reference:
6.1 Can floating point registers be used in 64-bit Windows?
There has been widespread confusion about whether 64-bit Windows allows the use of the
floating point registers ST(0)-ST(7) and the MM0 - MM7 registers that are aliased upon
these. One early technical document found at Microsoft’s website says "x87/MMX registers
are unavailable to Native Windows64 applications" (Rich Brunner: Technical Details Of
Microsoft® Windows® For The AMD64 Platform, Dec. 2003). An AMD document says: "64-
bit Microsoft Windows does not strongly support MMX and 3Dnow! instruction sets in the
64-bit native mode" (Porting and Optimizing Multimedia Codecs for AMD64 architecture on
Microsoft® Windows®, July 21, 2004). A document in Microsoft’s MSDN says: "A caller
must also handle the following issues when calling a callee: [...] Legacy Floating-Point
Support: The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are volatile. That
is, these legacy floating-point stack registers do not have their state preserved across
context switches" (MSDN: Kernel-Mode Driver Architecture: Windows DDK: Other Calling
Convention Process Issues. Preliminary, June 14, 2004; February 18, 2005). This
description is nonsense because it confuses saving registers across function calls and
saving registers across context switches. Some versions of the Microsoft assembler ml64
(e.g. v. 8.00.40310) give the following message when attempts are made to use floating
point registers in 64 bit mode: "error A2222: x87 and MMX instructions disallowed; legacy
FP state not saved in Win64".
However, a public discussion forum quotes the following answers from Microsoft engineers
regarding this issue: "From: Program Manager in Visual C++ Group, Sent: Thursday, May
26, 2005 10:38 AM. It does preserve the state. It’s the DDK page that has stale information,
which I’ve requested it to be changed. Let them know that the OS does preserve state of
x87 and MMX registers on context switches." and "From: Software Engineer in Windows
Kernel Group, Sent: Thursday, May 26, 2005 11:06 AM. For user threads the state of legacy
floating point is preserved at context switch. But it is not true for kernel threads. Kernel
mode drivers can not use legacy floating point instructions."
(www.planetamd64.com/index.php?showtopic=3458&st=100).
The issue has finally been resolved with the long overdue publication of a more detailed ABI
for x64 Windows in the form of a document entitled "x64 Software Conventions", well hidden
in the bin directory (not the help directory) of some compiler packages. This document says:
"The MMX and floating-point stack registers (MM0-MM7/ST0-ST7) are preserved across
context switches. There is no explicit calling convention for these registers. The use of
these registers is strictly prohibited in kernel mode code." The same text has later appeared
at the Microsoft website (msdn2.microsoft.com/en-us/library/a32tsf7t(VS.80).aspx).
My tests indicate that these registers are saved correctly during task switches and thread
switches in 64-bit mode, even in an early beta version of x64 Windows.
The Microsoft C++ compiler version 14.0 never uses these registers in 64-bit mode, and
doesn’t support long double precision. The Intel C++ compiler for x64 Windows supports
long double precision and __m64 in version 9.0 and later, while earlier versions do not.
The conclusion is that it is safe to use floating point registers and MMX registers in 64-bit
Windows, except in kernel mode drivers.
In my defense, it was the documents of AMD, Intel, and Microsoft that led me astray.

See also:
http://www.rhinocerus.net/forum/lang-as ... ister.html
Dann Corbit
Posts: 12545
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Komodo 2.03 SSE42 available

Post by Dann Corbit »

Here is an example of GCC documentation saying that with certain compiler options, registers are not preserved:

FPU, MMX, SSE, and SSE2 Support

The x87 math coprocessor and on-chip FPU are software compatible, and are supported by VxWorks using the INCLUDE_HW_FP configuration macro.

There are two types of floating-point contexts and a set of routines associated with each type. The first type is 108 bytes and is used for older FPUs (i80387, i80487, Pentium) and older MMX technology. The routines fppSave( ), fppRestore( ), fppRegsToCtx( ), and fppCtxToRegs( ) are used to save and restore the context and to convert to or from FPPREG_SET. The second type is 512 bytes and is used for newer FPUs, newer MMX technology, and SSE technology (Pentium II, III, 4). The routines fppXsave( ), fppXrestore( ), fppXregsToCtx( ), and fppXctxToRegs( ) are used to save and restore the context and to convert to or from FPPREG_SET. The type of floating-point context used is automatically detected by checking the CPUID information in fppArchInit( ). The routines fppTaskRegsSet( ) and fppTaskRegsGet( ) then access the appropriate floating-point context. The bit interrogated for the automatic detection is the "Fast Save and Restore" feature flag.

Saving and restoring floating-point registers adds to the context switch time of a task. Therefore, floating-point registers are not saved and restored for every task. Only those tasks spawned with the task option VX_FP_TASK will have floating-point state, MMX technology state, and streaming SIMD state saved and restored. If a task executes any floating-point operations, MMX operations, or streaming SIMD operations, it must be spawned with VX_FP_TASK.

Executing floating-point operations from a task spawned without the VX_FP_TASK option results in serious and difficult to find errors. To detect this type of illegal, unintentional, or accidental floating-point operation, a new API and a new mechanism have been added to this release. The mechanism involves enabling or disabling the FPU by toggling the TS flag in the CR0 register of the new task switch hook routine, fppArchSwitchHook( ), respecting the VX_FP_TASK option. If the VX_FP_TASK option is not set in the switching-in task, the FPU is disabled. Thus, the device-not-available exception is raised if the task attempts to execute any floating-point operations. This mechanism is disabled in the default VxWorks configuration. To enable the mechanism, call the enabler, fppArchSwitchHookEnable( ), with a parameter TRUE (1). The mechanism is disabled using the FALSE (0) parameter.

There are six FPU exceptions that can send an exception to the CPU. They are controlled by the exception mask bits of the control word register. VxWorks disables these exceptions in the default configuration. The exceptions are as follows:

Precision

Overflow

Underflow

Division by zero

Denormalized operand

Invalid operation

Mixing MMX and FPU Instructions

A task with the VX_FP_TASK option enabled saves and restores the FPU and MMX state when performing a context switch. Therefore, the application does not need to save or restore the FPU and MMX state if the FPU and MMX instructions are not mixed within the task. Because the MMX registers are aliased to the FPU registers, care must be taken to prevent the loss of data in the FPU and MMX registers, and to prevent incoherent or unexpected results, when making transitions between FPU instructions and MMX instructions. When mixing MMX and FPU instructions within a task, Intel recommends the following guidelines:

Keep the code in separate modules, procedures, or routines.

Do not rely on register contents across transitions between FPU and MMX code modules.

When transitioning between MMX code and FPU code, save the MMX register state (if it will be needed in the future) and execute an EMMS instruction to empty the MMX state.

When transitioning between FPU and MMX code, save the FPU state, if it will be needed in the future.
Dann Corbit
Posts: 12545
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Komodo 2.03 SSE42 available

Post by Dann Corbit »

Also:

Code:

The pitfalls of verifying floating-point computations
David Monniaux
CNRS / Laboratoire d'informatique de l'École normale supérieure
http://www.di.ens.fr/~monniaux

3.1 x87 floating-point unit
Processors of the IA32 architecture (Intel 386, 486, Pentium etc.
and compatibles) feature a floating-point unit often known as
“x87” [Int05, chapter 8].
It supports the floating-point, integer, and packed BCD integer
data types and the floating-point processing algorithms and exception
handling architecture defined in the IEEE Standard 754 for
Binary Floating-Point Arithmetic.
This unit has 80-bit registers internally in “extended double” format
(64-bit mantissa and 15-bit exponent), often associated to the long
double C type; it can read and write data to memory in this 80-bit
format or in standard IEEE-754 single and double precision. By default,
all operations performed on CPU registers are done with 64-
bit precision, but it is possible to reduce precision to 24-bit (same as
IEEE single precision) and 53-bit (same as IEEE double precision)
mantissas by setting some bits in the unit’s control register.[Int05,
§8.1.5.2] Note, however, that these precision settings do not affect
the range of exponents available, and only affect a limited number
of operations (containing all operations specified in IEEE-754).
The most usual way of generating code for the IA32 is to hold
temporaries —and, in optimised code, program variables —in the
x87 registers. Doing so yields more compact and efficient code
than always storing register values into memory and reloading
them. However, it is not always possible to do everything inside
registers, and compilers then generally store extra temporary values
to main memory using the type of the value per the typing rules of
the language. This means that the final result of the computations
depend on how the compiler allocates registers, since temporaries
(and possibly variables) will incur or not incur rounding whether or
not they are spilt to main memory.
As an example, the following program compiled with gcc 4.0.1
[Fre] under Linux will print 10^308 (1E308):
double v = 1E308;
double x = (v * v) / v;
printf("%g %d\n", x, x==v);
How is that possible? v * v done in double precision will overflow,
and thus yield +∞, and the final result should be +∞.
However, since all computations are performed in extended precision,
the computations do not overflow. If, however, we use the
-ffloat-store option, which forces gcc to store floating-point
variables in memory, we obtain +∞.
The result of computations can actually depend on compilation
options or compiler versions, or anything that affects propagation.
With the same compiler and system, the following program prints
10^308 when compiled in optimised mode (-O), while it prints +∞ when compiled in default mode.
double foo(double v) {
double y = v * v;
return (y / v);
}
main() { printf("%g\n", foo(1E308));}
Examination of the assembly code shows that when optimising, the
compiler reuses the value of y stored in a register, while it saves
and reloads y to and from main memory in non-optimised mode.
A common optimisation is inlining—that is, replacing a call to
a function by the expansion of the code of the function at the point
of call. For simple functions (such as small arithmetic operations,
e.g. x ↦ x²), this can increase performance significantly, since
function calls induce costs (saving registers, passing parameters,
performing the call, handling return values). C [ISO99, §6.7.4] and
C++ have an inline keyword in order to pinpoint functions that
should be inlined (however, compilers are free to inline or not to
inline such functions; they may also inline other functions when
it is safe to do so). However, on x87, whether or not inlining is
performed may change the semantics of the code!
Consider what gcc 4.0.1 on IA32 does with the following program,
depending on whether the optimisation switch -O is passed:
static inline double f(double x) {
return x/1E308;
}
double square(double x) { return x*x; }
int main(void) {
printf("%g\n", f(square(1E308)));
}
gcc does not inline functions when optimisation is turned off.
The square function returns a double, but the calling convention
is to return floating-point values in an x87 register — thus in
long double format. Thus, when square is called, it returns
approximately 10^616, which fits in the long double but not the double
format. But when f is called, the parameter is passed on the stack
— thus as a double, +∞. The program therefore prints +∞. In
comparison, if the program is compiled with optimisation on, f is
inlined; no parameter passing takes place, thus no conversion to
double before division, and thus the final result printed is 10^308.
It is somewhat common for programmers to add a comparison
check to 0 before computing a division, in order to avoid possible
division-by-zero exceptions or the generation of infinite results. A
first objection to this practise is that, anyway, computing 1/x for
x very close to zero will generate very large numbers that will
result in overflows later. Another objection is that it may actually
not work, depending on what the compiler does.
Consider the following source code (see footnote 3):
void do_nothing(double *x) { }
int main(void) {
double x = 0x1p-1022, y = 0x1p100, z;
do_nothing(&y);
z = x / y;
if (z != 0) {
do_nothing(&z);
assert(z != 0);
}
}
This program exhibits different behaviours depending on various
factors, even when one uses the same compiler (gcc version
4.0.2 on IA32):
• If it is compiled without optimisation, x / y is computed as a
long double then converted into an IEEE-754 double precision
number (0) in order to be saved into memory variable z. The if
statement is thus not taken.
• If it is compiled as a single source code with optimisation, gcc
performs some kind of global analysis which understands that
do_nothing does nothing. Then, it does constant propagation,
sees that z is 0, thus that the if statement is not taken, and
finally that main() performs no side effect. It then effectively
compiles main() as a “no operation”.
• If it is compiled as two source codes (one for each function),
gcc cannot do constant propagation. The z != 0 test is performed
on a nonzero long double quantity and thus is taken. However,
after the second do_nothing() call, z is reloaded from
main memory as the value 0 (because conversion to double precision
flushed it to 0). As a consequence, the program detects an assertion
failure and aborts.
• If, with the same compilation setup, one removes the second
do_nothing() call, z stays in a register, and the assertion
z != 0 is checked on the nonzero extended-precision value, so it
succeeds. Note that cursory program analysis, optimisation,
or naive static analysis may well conclude that the assertion
z != 0 is true throughout the if branch.
One should therefore be extra careful with strict comparisons, because
these may be performed on the extended precision type.
We are surprised by these discrepancies. After all, the C specification
says [ISO99, 5.1.2.3, program execution, §12, ex. 4]:
Implementations employing wide registers have to take care to
honor appropriate semantics. Values are independent of whether
they are represented in a register or in memory. For example, an
implicit spilling of a register is not permitted to alter the value.
Also, an explicit store and load is required to round to the precision
of the storage type.
3 C99 introduces hexadecimal floating-point literals in source code. [ISO99,
§6.4.4.2] Their syntax is as follows: 0xmmmmmm.mmmm p±ee, where
mmmmmm.mmmm is a mantissa in hexadecimal, possibly containing a point,
and ee is an exponent, possibly preceded by a sign. They are interpreted as
[mmmmmm.mmmm]_16 × 2^ee. See also Sect. 4.4.
However, this paragraph, being an example, is not normative. [ISO99,
foreword, §6].
Let us note, finally, that common debugging practises that, apparently,
should not change the computational semantics, may actually
alter the result of computations. Adding a logging statement in
the middle of a computation may alter the scheduling of registers,
for instance by forcing some value to be spilt into main memory
and thus undergo additional rounding. As an example, simply inserting
a printf("%g\n", y); call after the computation of y in
the above foo function forces y to be flushed to memory, and thus
the final result then becomes +∞, regardless of optimisation.
Also, it is commonplace to disable optimisation when one intends
to use a software debugger, because in optimised code, the
compiled code corresponding to distinct statements may become
fused, variables may not reside in a well-defined location, etc. However,
as we have seen, simply disabling or enabling optimisation
may change computational results.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Komodo 2.03 SSE42 available

Post by rbarreira »

Dann Corbit wrote:Here is an example of GCC documentation saying that with certain compiler options, registers are not preserved:

FPU, MMX, SSE, and SSE2 Support

The x87 math coprocessor and on-chip FPU are software compatible, and are supported by VxWorks using the INCLUDE_HW_FP configuration macro.
See bolded - this relates to vxWorks which is a real-time OS, typically for embedded systems.

I saw the one you posted from Agner Fog... apparently there was a mistake in the documentation which implied that floating-point registers couldn't be used at all under x64, not specifically a loss of precision, if I read correctly.

Anyway, it seems the matter is settled regarding context switches. So what's left to worry about are the numerous other problems with floating-point math, exacerbated by certain compilers / compiler options.
Dann Corbit
Posts: 12545
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Komodo 2.03 SSE42 available

Post by Dann Corbit »

rbarreira wrote:
Dann Corbit wrote:Here is an example of GCC documentation saying that with certain compiler options, registers are not preserved:

FPU, MMX, SSE, and SSE2 Support

The x87 math coprocessor and on-chip FPU are software compatible, and are supported by VxWorks using the INCLUDE_HW_FP configuration macro.
See bolded - this relates to vxWorks which is a real-time OS, typically for embedded systems.

I saw the one you posted from Agner Fog... apparently there was a mistake in the documentation which implied that floating-point registers couldn't be used at all under x64, not specifically a loss of precision, if I read correctly.

Anyway, it seems the matter is settled regarding context switches. So what's left to worry about are the numerous other problems with floating-point math, exacerbated by certain compilers / compiler options.
The implication was that floating-point registers could not be used reliably.

At any rate, I do admit that for many years I have been under a misapprehension about at least one reason for numerical oddities.