A note for C programmers

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: A note for C programmers

Post by syzygy »

syzygy wrote:
lucasart wrote:The example you show is -- I believe -- an abusive optimization by the compiler (IOW a true GCC bug). AFAIK there is nothing in the standard that allows the compiler to perform that optimization. The "as if" of the standard does not apply here, unless the compiler can prove that the ereport() function cannot change the control flow. Obviously, if there is an exit() statement somewhere in the *possible* call tree of ereport(), as well as a longjmp() or a throw in C++ for example, the compiler cannot make such an assumption.
I agree, even if a function is not guaranteed to not return, in a specific case it might not return and the undefined behavior would/should never be triggered.

More generally, I don't think "provably" undefined behavior later in the program should allow a compiler to optimise the whole program away. The computer should do exactly what the programmer told it to do (in particular cause all the externally visible side effects), up to the point that it actually runs into the undefined behavior. In other words, undefined behavior may cause demons to fly out of your nose, but it may not result in time travel.
Perhaps using -flto fixes the problem, as the code is generated at link time from the bytecode produced at compile time (so the code of ereport() is visible, rather than just an extern declaration). I think this one is a true GCC bug.
You probably mean clang bug, if clang in fact does what the author suggests it does.

But maybe I should read the paper again to be sure that I understand the author's point and that I am not reading too much into it.

edit: oops, this was indeed about gcc. More reason to study the paper a bit better.
It seems this example from the paper is correct in the sense that this really did occur. Following the link I found this bug report. The example code is:

Code: Select all

#include <stdio.h>

int testdiv(int i, int k) {
        if (k == 0) printf("found divide by zero\n");
        return(i/k);
}

int main() {
        int i = testdiv(1,0);
        return(i);
}
Compiled on UltraSPARC with "gcc -O3", executing this code resulted in a floating-point exception without the line "found divide by zero" being printed.

Two people pointed out that the program is invoking undefined behavior and the bug got status "RESOLVED INVALID".

I think it is very clear that this is in fact a compiler bug. That certain code has undefined behavior does NOT mean it is unreachable. I do not believe that anything in the C standard allows statements with undefined behavior to be treated as unreachable.

The C standard defines the semantics of a C program. Certain aspects it leaves undefined, e.g. the result of a division by zero is undefined. That means that once a division by zero is encountered, anything can happen. It does not mean that once a division by zero is encountered, the program can travel back in time and undo the side effect of the printf() that was executed fully in conformance with defined behavior.

So this is/was simply a gcc bug.
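For completeness, here is a defensive rewrite of the testdiv() example — a sketch of my own, not from the bug report — in which the check actually guards the division, so no execution path performs i/k with k == 0 and there is no undefined behavior left for the compiler to exploit:

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical rewrite of testdiv(): the zero check now guards the
 * division itself, so no path reaches i/k with k == 0.  INT_MIN / -1
 * overflows and is also undefined in C, so it is guarded as well. */
int testdiv_checked(int i, int k) {
    if (k == 0 || (i == INT_MIN && k == -1)) {
        printf("division would be undefined\n");
        return 0;   /* arbitrary but well-defined fallback value */
    }
    return i / k;
}
```

Written this way, the printf() and the division are on mutually exclusive paths, so the question of whether the compiler may reorder them never arises.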
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A note for C programmers

Post by bob »

Sven Schüle wrote:
bob wrote:
lucasart wrote:
mar wrote:
Sven Schüle wrote:See my comments above.
They never read, they never listen, they're always right, they're never wrong. Guess who? :wink:
Indeed. Impossible to have a discussion without these arrogant lecturers jumping in the middle of it, trying to patronize you about trivial stuff that has nothing to do with the subject...
Arrogant indeed. Look at the topic of the thread "C programmers". We don't have any "bool" data type. We have ints and floats. Helps if you read along the thread before popping off with something that is irrelevant...

This was ALWAYS about C.
The point was never about "bool" data type but about a conditional expression which is also called a "boolean" expression. That certainly exists in C.
Exactly, where true != 0 and false == 0. Exactly as was pointed out multiple times. Given that, his two-element array approach does not work for C. Would only work for languages where they have a boolean type and true=1, false=0. Quite unlike (say) fortran which originally used true = -1 and false = 0 for "logical" variable values (at least on the Cray).
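The normalization described here is easy to check: in C the equality and relational operators yield an int that is exactly 0 or 1 (C99 6.5.8 and 6.5.9), even though any nonzero value counts as "true" in a condition. A minimal sketch (the function name is made up):

```c
/* a[x != 0] is always in range: the != operator yields exactly 0 or 1
 * per C99 6.5.8/6.5.9.  Plain a[x], by contrast, would index out of
 * bounds for any x outside {0, 1}, e.g. x == 7 or x == -1. */
int pick(const int a[2], int x) {
    return a[x != 0];
}
```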
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A note for C programmers

Post by bob »

lucasart wrote:
mar wrote:
lucasart wrote:According to the first paper, the C standard permits even crazier optimizations. Apparently (I was not aware of it and still cannot believe it myself), these two expressions are not equivalent:

Code: Select all

a[x ? 1 : 0]
a[x != 0]
The second one is fine, and no standard compliant compiler will break this code. But the first one is dangerous. The compiler is allowed to assume that x can only be 1 or 0, hence replace it by a[x]. This is a "nice" optimization because it produces branch-less code. The problem is that when x is not 0 or 1, all hell breaks loose. The program could cause segfault (if you're lucky!) or break silently...
The C standard says this:
1. the controlling expression of if statement shall be of scalar type
2. the statement after if executes if the expression compares unequal to 0
3. the statement after else (if present) executes if the expression compares equal to 0

Thus a compiler that would turn if (x) A[1] = y; else A[0] = y; into A[x] = y; would simply break the standard if x can ever be anything else than 0 or 1.
I think the author is simply wrong on this one.
Thanks for taking the time to look at the standard, and clarifying this point. It really did not make sense, but the author of the paper seemed to know what he was talking about, so I started to fear that he could be correct. This "optimization" would almost certainly have broken my code, and many others'.
I can't find any compiler that does this. I think the claim is bogus, unless he just happened to run into a compiler with a bug.

How can he state "the compiler is allowed to assume that x = 0 or 1" when the semantics of all statements using a conditional value say 0 = false, != 0 = true?
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A note for C programmers

Post by Sven »

bob wrote:
lucasart wrote:
mar wrote:
lucasart wrote:According to the first paper, the C standard permits even crazier optimizations. Apparently (I was not aware of it and still cannot believe it myself), these two expressions are not equivalent:

Code: Select all

a[x ? 1 : 0]
a[x != 0]
The second one is fine, and no standard compliant compiler will break this code. But the first one is dangerous. The compiler is allowed to assume that x can only be 1 or 0, hence replace it by a[x]. This is a "nice" optimization because it produces branch-less code. The problem is that when x is not 0 or 1, all hell breaks loose. The program could cause segfault (if you're lucky!) or break silently...
The C standard says this:
1. the controlling expression of if statement shall be of scalar type
2. the statement after if executes if the expression compares unequal to 0
3. the statement after else (if present) executes if the expression compares equal to 0

Thus a compiler that would turn if (x) A[1] = y; else A[0] = y; into A[x] = y; would simply break the standard if x can ever be anything else than 0 or 1.
I think the author is simply wrong on this one.
Thanks for taking the time to look at the standard, and clarifying this point. It really did not make sense, but the author of the paper seemed to know what he was talking about, so I started to fear that he could be correct. This "optimization" would almost certainly have broken my code, and many others'.
I can't find any compiler that does this. I think the claim is bogus, unless he just happened to run into a compiler with a bug.

How can he state "the compiler is allowed to assume that x = 0 or 1" when the semantics of all statements using a conditional value say 0 = false, != 0 = true?
Once more for you :-) The author of the paper used "cond", Lucas translated "cond" into "x", and there is the little sloppiness. Most of us will understand "cond" as a boolean expression but "x" as an integer variable (just by intuition, even if there is no formal reason for it). We can only guess whether the author really meant that a compiler were allowed to optimize the "if (x) A[1] = ...; else A[0] = ...;" into "A[x] = ...;" but I would simply assume he didn't, and he always meant "cond" as a boolean expression like "x != 0".
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: A note for C programmers

Post by syzygy »

wgarvin wrote:(1) Dangerous Optimizations and the Loss of Causality is a classic presentation about the growing trend of compiler writers exploiting the freedom of undefined behavior. The spec lets them assume you never invoke undefined behavior in your code; as they are increasingly taking advantage of that, more and more "working" old code is put at risk of failing in subtle and unexpected ways. Code that invokes any undefined behavior is basically a potential time bomb that might someday, with the help of an eager compiler writer, decide to blow up your program, or (much worse) introduce subtle security vulnerabilities into it. Every C or C++ programmer ought to know enough about this stuff to avoid falling into the bear pit.
I may be proven wrong, but I believe that Seacord, the author of this presentation, makes a crucial mistake (apart from believing that the C standard prescribes that "if (cond)" gives undefined behavior if "cond" evaluates to something else than 0 or 1).

On page 6, bottom, and page 7, top, he essentially explains what I have tried to express in my previous post. The C standard essentially defines the behavior of an abstract machine. A compiler must produce code that provides this behavior in so far as it is specified and in so far as it is visible to the outside, e.g. in terms of side effects. If at some point during its execution the abstract machine encounters a code segment with undefined behavior, then anything goes and the compiler may do whatever it pleases, e.g. nothing. But before any undefined behavior is encountered, it seems to me that defined side effects must be respected.

I therefore believe that Seacord is wrong about compilers being allowed to implement the "total license" policy he defines on page 7, bottom:

Code: Select all

Total license: treat any possible undefined behavior as a "can't happen" condition. This permits aggressive optimizations.
It seems to me that the correct formulation is: treat any possible undefined behavior as "anything goes". This is not the same as "can't happen". "Can't happen" implies that a program that provably encounters undefined behavior at some point during its execution could be fully optimised away, including all its side effects before the undefined behavior is encountered.

According to paragraph 4 of the C99 standard:
A program that is correct in all other aspects, operating on correct data, containing unspecified behavior shall be a correct program and act in accordance with 5.1.2.3.
Paragraph 5.1.2.3, which explains in what sense a compiled program must implement the behaviour of the abstract machine, contains no indication that side effects do not have to be produced if at some later point in time undefined behavior is encountered. It only states that an evaluation can be optimised away if its value is not used and no "needed" side effects are produced.

The way I understand "undefined behavior", it just means that the compiler writer may supplement the specification of the abstract machine with whatever he likes, for example with whatever seems to be most efficient. Different compilers and compilers for different systems may fill in the behavior left undefined by the standard in their own way. This seems to me to be the only natural interpretation of a standard leaving something "undefined". Turning that into a "can't happen" is nothing more than a distortion of the standard.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: A note for C programmers

Post by syzygy »

Hmmm, it seems I am wrong. At least, if Regehr is right:
For now, we can ignore the existence of compilers. There is only the “C implementation” which — if the implementation conforms to the C standard — acts the same as the “C abstract machine” when executing a conforming program. The C abstract machine is a simple interpreter for C that is described in the C standard. We can use it to determine the meaning of any C program.

The execution of a program consists of simple steps such as adding two numbers or jumping to a label. If every step in the execution of a program has defined behavior, then the entire execution is well-defined. Note that even well-defined executions may not have a unique result due to unspecified and implementation-defined behavior; we’ll ignore both of these here.

If any step in a program’s execution has undefined behavior, then the entire execution is without meaning. This is important: it’s not that evaluating (1<<32) has an unpredictable result, but rather that the entire execution of a program that evaluates this expression is meaningless. Also, it’s not that the execution is meaningful up to the point where undefined behavior happens: the bad effects can actually precede the undefined operation.
So according to Regehr encountering undefined behaviour may result in time travel.

I still fail to see how this would follow from the C standard. The C standard admittedly distinguishes between "unspecified" and "undefined" behavior, but it defines "unspecified behavior" as behavior where the standard provides two or more possibilities and there is no requirement which possibility is chosen in any instance. This does not mean that the presence of any "undefined behavior" renders the whole program meaningless. The examples of undefined behavior that are given rather suggest that you can simply not rely on what will happen upon execution of a code construct with undefined behavior:
Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
Apparently such a program may not even compile, but that would seem to refer to "undefined behavior" in the form of e.g. illegal characters in the source file (see Annex J.2. of the standard).

I wonder where Regehr's and Seacord's understanding of "undefined behavior" comes from.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: A note for C programmers

Post by Rein Halbersma »

syzygy wrote: Paragraph 5.1.2.3, which explains in what sense a compiled program must implement the behaviour of the abstract machine, contains no indication that side effects do not have to be produced if at some later point in time undefined behavior is encountered. It only states that an evaluation can be optimised away if its value is not used and no "needed" side effects are produced.

The way I understand "undefined behavior", it just means that the compiler writer may supplement the specification of the abstract machine with whatever he likes, for example with whatever seems to be most efficient. Different compilers and compilers for different systems may fill in the behavior left undefined by the standard in their own way. This seems to me to be the only natural interpretation of a standard leaving something "undefined". Turning that into a "can't happen" is nothing more than a distortion of the standard.
This got quite a lot of coverage on StackOverflow. I think that for a C compiler, you are right. But at least for a C++ compiler, such an optimization is allowed since the Standard in 1.9/5 contains this gem of a quote:
A conforming implementation executing a well-formed program shall produce the same observable behavior
as one of the possible executions of the corresponding instance of the abstract machine with the same program
and the same input. However, if any such execution contains an undefined operation, this International
Standard places no requirement on the implementation executing that program with that input (not even
with regard to operations preceding the first undefined operation).
Note the final clause in parentheses. This allows the compiler to reorder the instructions and prune everything after the undefined division by zero. Effectively, undefined behavior allows time travel in C++.

The C++ Standard is considerably more detailed than the C Standard in this regard. I have no idea if the C Standard is supposed to have the same program execution as the C++ Standard under these circumstances, or that the absence in C of a similar clause to C++ 1.9/5 is meant as a no-go.
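The difference matters in practice. The fragment below (a sketch for illustration; the function name is invented) is valid in both languages. Under the reading of the C standard defended earlier in the thread, the printf would have to appear before the trap; under the C++ clause quoted above, an implementation is free to drop it, since every execution with d == 0 later performs an undefined operation:

```c
#include <stdio.h>

/* Valid as both C and C++.  With d == 0 the division is undefined.
 * Under C++ 1.9/5 the implementation is unconstrained even for
 * operations *preceding* the undefined one, so the printf may legally
 * never appear; whether C permits the same is exactly the question
 * being debated in this thread. */
int report_then_divide(int n, int d) {
    if (d == 0)
        printf("about to divide by zero\n");
    return n / d;    /* undefined when d == 0 */
}
```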
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: A note for C programmers

Post by syzygy »

Sven Schüle wrote:We can only guess whether the author really meant that a compiler were allowed to optimize the "if (x) A[1] = ...; else A[0] = ...;" into "A[x] = ...;" but I would simply assume he didn't, and he always meant "cond" as a boolean expression like "x != 0".
Now that I look at page 8 again, you might indeed be right. But the author has really chosen a terribly obscure way of presenting this.
A total license implementation could determine that, in the absence of any
undefined behavior, the condition cond must have value 0 or 1.
There is really nothing in the code example other than the letters "cond" that could suggest that the expression cond could only evaluate to 0 or 1. I find it hard not to read this sentence as "if (cond) is undefined if cond evaluates to anything else than 0 or 1", which clearly is wrong. Only in the bottom sheet on page 8 is there a suggestion that he meant something else: "If undefined behavior occurred somewhere in the integer arithmetic of cond".

I don't think anyone would find it shocking if the compiler produced A[!!a] = X when cond equals "!!a", or A[ptr1 < ptr2] when cond equals "ptr1 < ptr2". The latter conditional expression can in fact be undefined. But as a C programmer I interpret "condition" essentially as any integer expression.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: A note for C programmers

Post by Sven »

syzygy wrote:
Sven Schüle wrote:We can only guess whether the author really meant that a compiler were allowed to optimize the "if (x) A[1] = ...; else A[0] = ...;" into "A[x] = ...;" but I would simply assume he didn't, and he always meant "cond" as a boolean expression like "x != 0".
Now that I look at page 8 again, you might indeed be right. But the author has really chosen a terribly obscure way of presenting this.
Fully agreed!
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: A note for C programmers

Post by syzygy »

Rein Halbersma wrote:This got quite a lot of coverage on StackOverflow. I think that for a C compiler, you are right. But at least for a C++ compiler, such an optimization is allowed since the Standard in 1.9/5 contains this gem of a quote:
A conforming implementation executing a well-formed program shall produce the same observable behavior
as one of the possible executions of the corresponding instance of the abstract machine with the same program
and the same input. However, if any such execution contains an undefined operation, this International
Standard places no requirement on the implementation executing that program with that input (not even
with regard to operations preceding the first undefined operation).
Note the final clause in parentheses. This allows the compiler to reorder the instructions and prune everything after the undefined division by zero. Effectively, undefined behavior allows time travel in C++.
This definition indeed does allow C++ compilers to consider undefined behavior as "can't happen".

This approach seems to have been thought up by a theoretical computer scientist used to thinking of programs processing predefined input. If the abstract machine leaves undefined any aspect of the execution of that program on that input, then the combination (program, input) has no meaning. But in reality programs interact with users and the undefined behavior can be triggered by unexpected user input that the user thinks up a long time after the program has started to run. It seems a C++ program could claim to be able to look into the future and refuse to run, because it "knows" that the user would cause it to execute undefined behavior at some point in the future, and therefore it does not have to run at all.
The C++ Standard is considerably more detailed than the C Standard in this regard. I have no idea if the C Standard is supposed to have the same program execution as the C++ Standard under these circumstances, or that the absence in C of a similar clause to C++ 1.9/5 is meant as a no-go.
It seems to me that the absence means that side-effects occurring before the encounter of undefined behavior have to be respected. But clearly other people see this differently.

Thanks a lot for the pointers!

Btw (as also pointed out on Stack Overflow), either way the ereport() example was compiled wrongly, because ereport() might simply not return, in which case no undefined behavior would occur. But the gcc bug report is a bit different (although I'm not sure it is valid to assume that printf() always returns).
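The ereport() situation can be sketched as follows (error() stands in for ereport(); the names are invented). Because the call may terminate the program, a compiler cannot prove that the division is reached when arg2 == 0, and so cannot treat the zero case as impossible:

```c
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's ereport(): for a fatal error it never
 * returns.  At a call site that only sees an extern declaration, the
 * compiler must assume the call might not return. */
static void error(const char *msg) {
    fprintf(stderr, "%s\n", msg);
    exit(1);                        /* control flow ends here */
}

int guarded_div(int arg1, int arg2) {
    if (arg2 == 0)
        error("division by zero");  /* does not return */
    return arg1 / arg2;             /* reached only when arg2 != 0 */
}
```

Hoisting the division above the error() call is therefore not a valid "as if" transformation: in the arg2 == 0 case the abstract machine never evaluates arg1 / arg2 at all.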