A note for C programmers

bob · Post by **bob** » Fri Dec 06, 2013 5:09 pm

Rein Halbersma wrote:
bob wrote: So how about stopping with the hand waving and simply explain how to break a hash probe where I get an 8 byte signature that does not go with the corresponding 8 bytes of score and stuff. Then we can talk.
No, the burden of proof is on you. You claim that the compiler has no choice but to emit a sane set of instructions that give you the expected result. However, both the C11 and C++11 Standard leave no doubt that reads and writes from different threads to the same variable constitute undefined behavior.

I made no such claim. My claim is that the compiler is required to semantically do EXACTLY what I say.

I want to write w1 and w2 to memory, w1 and w2 are 8 byte values. I simply write w1^w2 for the first value, and w2 for the second. If anything else gets written, I won't get a match since w1 (original signature) xored with anything other than the original score/etc will NOT produce the original signature to give me a match.

It really is that simple. You are reading far more into this than there is. The only issue here occurs when two threads write to the same address, one writes {a1, a2} as the two 8 byte values, the other writes {b1,b2} as the two values. There are exactly 4 possible outcomes. After both complete the writes, memory contains one of the following pairs of words: {a1, a2}, {a1, b2}, {b1,a2}, {b1, b2}

{a1, a2} will be recognized, as will {b1, b2}. The other two will not decode to valid signatures and no match occurs.

Now exactly how hard is that to grasp? If you have some bizarre hardware that writes byte by byte rather than double word by double word, still works perfectly. If your hardware writes bit by bit, STILL works perfectly.

You might get away with it for a long time on many different architectures, and no compiler will probably emit completely bogus instrutions. The most likely failure scenario that I can think of is that the compiler will eliminate the offending code in its entirety. The code example posted by Ronald is very instructive

Compiler can't eliminate ANYTHING. It won't even KNOW about the race.

Code: Select all
#include <stdio.h> 

int main&#40;) 
&#123; 
  int i, k = 0; 

  for &#40;i = 1; i > 0; i += i&#41; 
    k++; 
  printf&#40;"k = %d\n", k&#41;; 
  return 0; 
&#125;
Running gcc -O0 will produce 31 on architectures where an int is 32-bits. However, with gcc -O2 and higher, the compiler will recognize that "i += i" yields signed overflow UB. It will then eliminate the entire expression, and further optimize this code to an infinite loop. Such a scenario is also possible to happen with your XOR trick on multicore machines. Compiler routinely optimize away UB code instructions. You need extra compiler instructions or code modifcations to get the compiler to do what you intended.

Don't know what planet you compile on. Here's that run on my macbook:

scrappy% cc -O -o tst tst.c
.scrappy% ./tst
k = 31
scrappy% cc -O2 -o tst tst.c
scrappy% ./tst
k = 31
scrappy% cc -O3 -o tst tst.c
scrappy% ./tst
k = 31

gcc version 4.7.3 (MacPorts gcc47 4.7.3_3)

Compilers do NOT always behave as you seem to imagine they do. BTW, this does break intel's compiler. But for the race condition, the compilers have no clue there is even a race in the first place.

You can detect such optimizations with -Wstrict-overflow=5 and gcc will warn you about it (and with -Werrors it will also fail to compile). Even better is to pass a flag -fwrapv and then you are guaranteed it will output CHAR_BIT * sizeof(int) - 1. Best of all is not to write such insane loops in the first place.

bob · Post by **bob** » Fri Dec 06, 2013 5:13 pm

Rein Halbersma wrote:
rbarreira wrote: What I mean is: if a piece of code has defined behavior for 1 thread but undefined behavior for more than 1 thread, the only way that the compiler can "entirely eliminate pieces of code it can prove to be undefined behavior" is if it generates two different pieces of code: one which is used when only 1 thread is running, and another which is used when more than one thread is running.

While theoretically possible, this is very unlikely to happen with current (and near/medium-term) compiler technology. The reason being of course that the compilers are nowhere near smart enough to make high-level analysis of C code to determine its threading strategies, let alone generate code which makes runtime decisions based on the current thread layout.
You may be right, but all I'm saying is that it's UB and therefore a time bomb in the code. Any sufficiently advanced compiler is free to exploit this for optimization purposes. The UB has been diagnosed and if the response of a programmer is "I'll believe it when I see it", then best of luck to him. The proper way to deal with it is to specify the memory order or to introduce synchronization.

I'm glad Bob is making chess programs. I shudder at the thought of car, medical or defense systems being made by people who ignore undefined behavior. Fortunately, the MISRA C/C++ rules explicitly forbid the use of undefined behavior in any life-critical software system, and current international jurisprudence is to hold firms and individuals to a strict liability standard.

This is silly. We execute the SAME code for one thread or multiple threads. The compiler can not break the one thread code. Therefore it also can't break the multiple-thread code.

And in fact, it can't detect this condition. Some use fork() and the compiler can not see beyond that system call to determine what might be a race condition. Not just hard. Absolutely impossible. It simply doesn't have enough information available at compile time.

You seem to be stuck deep in this "if it says undefined, the compiler can do whatever it wants." Fortunately, the compiler guys are not stuck at the same place. The idea is to "do the right thing". Not try and see what they can break. Just because the standard says something is undefined doesn't mean a thing in this case, because the compiler has no clue that a race even exists. Just because there is an unprotected store does not mean there is a race, in fact. There are lockless locks that work on intel due to in-order writes, for one example.

syzygy · Post by **syzygy** » Fri Dec 06, 2013 8:43 pm

bob wrote:
syzygy wrote:
bob wrote:
AlvaroBegue wrote:The standard defines "unspecified behavior", not "unspecified". Similarly with "undefined behavior". You may not like these names, but they are used every day by a lot of people.

"Undefined behavior" means it works when you first run it, it passes all your tests and it explodes in your face when you show it to your boss or your most important customer.

Although you are generally right that people shouldn't go around breaking code that has worked forever, I recommend you just fix the code and move on.
Undefined behavior does NOT mean it will explode. Not one single instance of strcpy() will break on the example I gave where you do strcpy(st, st+n). Integer overflow does NOT make checksum break.
It means it MAY explode. The standard allows it. Compiler writers are making more and more use of this freedom. It makes programs faster.
Code: Select all
#include <stdio.h>

int main&#40;)
&#123;
  int i, k = 0;

  for &#40;i = 1; i > 0; i += i&#41;
    k++;
  printf&#40;"k = %d\n", k&#41;;
  return 0;
&#125;
What do you expect as output?
back to my point. Apple MADE it explode. If the source and destination overlap, the program crashes, period.

As far as the above, it would depend on word size. On my mac, where int defaults to 32 bits, I'd expect 31. On a Cray, I would expect 63. On an old 8086, I would expect 7.

Wrong. Compiled with gcc, you get no output at all. The program hangs in an infinite loop.

If the compiler can't deal with the overflow/underflow, it is broken.

If a compiler complies with the standard, then by definition it is not broken.

syzygy · Post by **syzygy** » Fri Dec 06, 2013 8:51 pm

bob wrote:
Rein Halbersma wrote:
bob wrote:Food for thought. The English language is well-defined.
Think again: http://english.stackexchange.com/questi ... nt-regions

Human languages are anything but well-defined, they are extremely context-sensitive (time, place, person).
I'll bet you won't find one native English-speaker that would say "undefined" and "unspecified" don't mean the same basic concept unless you want to get down to a minute semantic war based on what is meant by "is" for example.

This is about a standard. The standard defines both terms. The standard leaves no doubt that the two terms have different meanings. If you find that ridiculous, then you have never really read a standard nor thought about the importance of being precise.

syzygy · Post by **syzygy** » Fri Dec 06, 2013 9:02 pm

bob wrote:I don't care what the compiler does, it can't break the lockless hashing code. It is as simple as that. If you want to even try to convince me otherwise, here's the challenge:

You certainly do care what the compiler does.

a hash entry is 16 bytes. I have to write it as 2 8-byte blocks, each of which is certainly guaranteed to be atomic in x86 since that is what is written across the bus in one cycle.

Nope, the compiler is free to implement storing an uint64 as 8x storing a byte. Nothing in the standard stops the compiler from doing just that.

The compiler most likely won't do that, but here you are depending on the compiler and not on the C language.

bob · Post by **bob** » Fri Dec 06, 2013 9:20 pm

syzygy wrote:
bob wrote:
Rein Halbersma wrote:
bob wrote:Food for thought. The English language is well-defined.
Think again: http://english.stackexchange.com/questi ... nt-regions

Human languages are anything but well-defined, they are extremely context-sensitive (time, place, person).
I'll bet you won't find one native English-speaker that would say "undefined" and "unspecified" don't mean the same basic concept unless you want to get down to a minute semantic war based on what is meant by "is" for example.
This is about a standard. The standard defines both terms. The standard leaves no doubt that the two terms have different meanings. If you find that ridiculous, then you have never really read a standard nor thought about the importance of being precise.

Please. One can't be precise while at the same time defining "green" to mean "round objects > 10" in length" while "red" means "anything made of any material other than steel, aluminum or titanium." If you need a new word, coin a new word. Don't redefine an existing word. That is anything BUT a "clear standard."

syzygy · Post by **syzygy** » Fri Dec 06, 2013 9:21 pm

mar wrote:
Rein Halbersma wrote:Such a scenario is also possible to happen with your XOR trick on multicore machines.
Not at all. Consider a shared library. How can you know in advance that the code that's exported is executed from multiple threads? The only undefined behavior by accessing unguarded variable from multiple threads is a simple race condition. There is no way the compiler can use this to optimize this kind of UB because it can't know at compile time.

As far as I know, there is no reason that a C99-compliant compiler lets you have any shared datastructures between threads. For example, global datastructures could be allocated in scratchpad memory local to the processor so that threads on other processors don't even have physical access to it.

If we start talking about multithreading (and not the C11 way), then I think we have to read the C99 standard together with the pthread standard or whatever document describes the guaranteed behavior of the threading system you intend to use.

Just basing yourself on the C99 standard, you cannot make any assumptions on how the compiler maps a C program to the hardware, because the C99 standard is completely silent on it.

bob · Post by **bob** » Fri Dec 06, 2013 9:22 pm

syzygy wrote:
bob wrote:I don't care what the compiler does, it can't break the lockless hashing code. It is as simple as that. If you want to even try to convince me otherwise, here's the challenge:
You certainly do care what the compiler does.

a hash entry is 16 bytes. I have to write it as 2 8-byte blocks, each of which is certainly guaranteed to be atomic in x86 since that is what is written across the bus in one cycle.
Nope, the compiler is free to implement storing an uint64 as 8x storing a byte. Nothing in the standard stops the compiler from doing just that.

The compiler most likely won't do that, but here you are depending on the compiler and not on the C language.

And if you read what I wrote, that won't bother my code one scintilla. Nor if the compiler decides to store bit by bit somehow. Still works perfectly.

Do you have ANOTHER point to try?

BTW the compiler is not going to have much luck storing those bytes, intel cache takes care of that by combining writes. So I am actually not depending on the compiler OR the hardware whatsoever here. My code works on the most conservative hardware, or the most speculative out-of-order-write hardware. Doesn't matter. Hence one CAN deal with undefined behavior, unless the compiler author decides to be an ass. But in this case, there is no way the compiler can detect races anyway, so it is all 100% moot.

Rein Halbersma · Post by **Rein Halbersma** » Fri Dec 06, 2013 9:24 pm

bob wrote:
Code: Select all
#include <stdio.h> 

int main&#40;) 
&#123; 
  int i, k = 0; 

  for &#40;i = 1; i > 0; i += i&#41; 
    k++; 
  printf&#40;"k = %d\n", k&#41;; 
  return 0; 
&#125;
Running gcc -O0 will produce 31 on architectures where an int is 32-bits. However, with gcc -O2 and higher, the compiler will recognize that "i += i" yields signed overflow UB. It will then eliminate the entire expression, and further optimize this code to an infinite loop. Such a scenario is also possible to happen with your XOR trick on multicore machines. Compiler routinely optimize away UB code instructions. You need extra compiler instructions or code modifcations to get the compiler to do what you intended.
Don't know what planet you compile on. Here's that run on my macbook:

scrappy% cc -O -o tst tst.c
.scrappy% ./tst
k = 31
scrappy% cc -O2 -o tst tst.c
scrappy% ./tst
k = 31
scrappy% cc -O3 -o tst tst.c
scrappy% ./tst
k = 31

gcc version 4.7.3 (MacPorts gcc47 4.7.3_3)

Online example using gcc -O3 (gcc 4.6.4 on Linux)

bob · Post by **bob** » Fri Dec 06, 2013 9:27 pm

simon wrote:
wgarvin wrote: Okay, we get that you're still annoyed about it, although I'm surprised that two weeks have passed and you still haven't admitted yet that you're completely wrong!
You're probably the only person who is surprised.

WHAT am I "wrong" about?

Certainly not my original point, namely that Apple broke something that did not need to be broken. That Apple, in doing so, caused a lot of debugging effort on my part, and on the part of many others, unnecessarily.

I'm neither supporting OR railing against using undefined behavior if one is careful. I do believe the compiler should "do the best thing" as opposed to "let's be anal and break the code..."

A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers

Re: A note for C programmers