How could a compiler break the lockless hashing method?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:Unless I'm mistaken, the C standard allows the compiler to load a value multiple times. Since the C99 standard does not know about threads, it may load the hash entry, do the xor-trick to verify that the entry can be used, and load the hash entry again. Between the two loads the hash entry may change its content due to other threads. There you go.
Doesn't break a thing, however.

Think about it.
I can only assume you did not read.

A compiler can legally rewrite:

Code: Select all

    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if (word2 == temp_hashkey)
      break;
as

Code: Select all

  word2 = htable->word2 ^ htable->word1;
  if (word2 == temp_hashkey)
    break;
  word1 = htable->word1;
See the problem?
Don't see any problem at all. It doesn't change a thing. The next line after exiting the loop accesses word1, so it has to load word1 anyway. This is a bad optimization, btw: it saves nothing whatsoever and adds extra code to load word1 after the loop breaks rather than within the loop. We don't need that kind of anti-optimization, and I don't believe any good compiler would choose to do something so broken...
You don't see a problem, yet you need to argue that a good compiler won't do this anti-optimization.

I suppose this means that you DO see the problem?

So this is what I should have written:

Code: Select all

    word2 = htable->word2 ^ htable->word1;
    if (word2 == temp_hashkey)
      break;
  }
  if (entry < 4) {
    word1 = htable->word1;
    ...
Compare to your hash.c.
This rewrite is legal. The compiler may take your code and effectively produce the above code.
This rewrite breaks lockless hashing.
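
For concreteness, here is a minimal sketch of the scheme under discussion; the struct, field, and function names are hypothetical stand-ins, not Crafty's actual code. The writer publishes data together with key ^ data, so a torn entry simply fails the check and no lock is needed — but only as long as the reader uses the very value it validated:

Code: Select all

#include <stdint.h>

typedef struct {
  uint64_t word1;   /* data: best move, score, draft, ... */
  uint64_t word2;   /* key ^ data                         */
} HashEntry;

/* Writer: the two stores may interleave with another writer's, but a
   torn entry then fails the xor check below. */
void hash_store(HashEntry *htable, uint64_t key, uint64_t data) {
  htable->word1 = data;
  htable->word2 = key ^ data;
}

/* Reader: sound only if word1 is loaded ONCE and that same value is
   both validated and used.  If the compiler re-reads htable->word1
   after the check, a concurrent writer can slip in between, and the
   caller gets data the check never saw. */
int hash_probe(const HashEntry *htable, uint64_t key, uint64_t *out) {
  uint64_t word1 = htable->word1;      /* the one and only load */
  if ((htable->word2 ^ word1) == key) {
    *out = word1;                      /* use the validated copy */
    return 1;
  }
  return 0;
}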

On architectures with few registers, a compiler may actually do this to avoid having to spill word1 onto the stack. Why store word1 on the stack when it can just as easily be reloaded from htable->word1?

Whether we can find an existing compiler for x86 that actually does this is not so interesting. What matters is that your code is broken.
Again, it won't change a thing. I could even FORCE it to be re-read later by declaring it volatile. But I don't care...
And you really do not realise that reloading defeats the purpose of the xor-trick? :roll: :roll: :roll:
OK, let's do this technically.

Here is the actual code, again:

Code: Select all

  for (entry = 0; entry < 4; entry++, htable++) {
    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if (word2 == temp_hashkey)
      break;
  }
  ...
  /* use word1 to extract draft and such later, right here */

A compiler has three choices. I'll be the compiler and "explain" what I see and what I will do.

choice 1. Before I can do ANYTHING inside that loop, I have to read htable->word1 AND htable->word2. I now have them in registers. I do the xor and the comparison, and if it matches I break out of the loop. At this point, htable->word1, htable->word2, and htable itself are all in registers. I now notice there are no future references to either htable->word2 or htable, so I free up those registers. I keep the register with htable->word1 for use when the code extracts draft and such. Two memory accesses to the table, period.

choice 2. Registers are tight. Again I have to load htable->word1 and htable->word2 into registers, but when I exit the loop I can't keep htable->word1 in a register, so I spill it to local memory (the variable named word1 in my code). When I need that value, I load it again from memory. Three memory reads rather than two. Slower.

choice 3. The choice you think can happen, by the way. Registers are tight, so again I load htable->word1 and htable->word2 into registers to do the comparison. But when I get the match, I can't keep word1 in a register, and rather than spill it to the local data I will just reload it from scratch when it is needed. I now need it. I obviously do NOT have htable in a register, since it is NOT needed after the loop; otherwise I would have simply put word1 in that register and kept it instead. So I go back to the loop, and when I finish I spill htable to a local variable and just drop the other two. Now I need word1. I first go to memory to get the pointer (stored in htable) and then use that to go to memory to fetch htable->word1. How does that compare? Two memory reads as always, a memory write, then two MORE memory reads. That is slower than if it had just done EXACTLY what my code says to do, as happens with -O0.

It is an anti-optimization. It makes the code slower. Choice 1 is what gcc actually does, and you can verify that easily enough with -S, as I did. On a 32-bit box, choice 2 might happen, since there are fewer registers (eight fewer) to use. Choice 3 will NEVER happen, because the purpose of the optimizer is not to make the code slower.

Now how about stopping with "the compiler could do this or could do that"? It could do most anything, but in most such cases it is broken and will eventually be fixed. This would be one of those cases.

The rewrite never happens. I could force it, though, by using htable->word1 EVERYWHERE and declaring the underlying data volatile. Then it has to do exactly what you suggest, and it also runs slower. But I didn't declare it volatile. And rewriting as you suggest is simply broken, because it makes the code slower than compiling with no optimization at all...
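
For reference, a sketch of that volatile variant, reusing the hypothetical HashEntry layout from the sketch earlier in the thread. A volatile-qualified pointer forces a separate load for every access, so nothing is cached in a register — which is slower, and the second load may observe a different value than the one that passed the check:

Code: Select all

#include <stdint.h>

typedef struct { uint64_t word1, word2; } HashEntry;  /* hypothetical */

/* Every htable->word1 below compiles to its own load. */
uint64_t probe_volatile(volatile HashEntry *htable, uint64_t key) {
  uint64_t word2 = htable->word2 ^ htable->word1;  /* load #1 of word1 */
  if (word2 == key)
    return htable->word1;  /* load #2: possibly a DIFFERENT value if
                              another thread stored in between */
  return 0;                /* sketch only: 0 doubles as "no match" */
}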

I suppose if you want to hang your hat on a compiler bug, then yes, lockless hashing can be broken. As can a truly locked implementation, because compilers have been known to move the code that acquires or clears locks around to make the code faster. That will also break a locked hashing approach. But who cares? I can't do a thing about compiler bugs. I can only plan on what is reasonable. And what you suggest is not going to happen, for the reasons I gave.
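
A note for readers on the locking side of this comparison: the kind of "lock" a compiler really can move code across is a plain integer flag, which imposes no atomicity or ordering constraint at all. Real primitives (pthread_mutex, C11 atomics) contain the barriers that forbid such movement. A sketch of the broken pattern, with hypothetical names:

Code: Select all

#include <stdint.h>

int lock_flag = 0;                   /* plain int: NOT a real lock */
uint64_t shared_word1, shared_word2;

/* Broken: these are ordinary loads and stores, so the compiler may
   hoist the reads above the "acquire", collapse the spin loop into a
   single test, or sink the "release" store.  It is also racy outright:
   two threads can both observe lock_flag == 0 and proceed. */
uint64_t broken_locked_probe(uint64_t key) {
  while (lock_flag)                  /* may never re-load the flag */
    ;
  lock_flag = 1;
  uint64_t w1 = shared_word1;
  uint64_t w2 = shared_word2;
  lock_flag = 0;
  return ((w2 ^ w1) == key) ? w1 : 0;
}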
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:I compile search.c, hash.c and other files SEPARATELY. How can the compiler be certain there are no locks around HashStore() when it is called from search (which it can't see when compiling hash.c) and around HashProbe() when it is called?
Apart from this being a non-argument given that compilers nowadays can easily optimise across compilation units, you have apparently forgotten about crafty.c:

Code: Select all

/* last modified 01/18/09 */
/*
 *******************************************************************************
 *                                                                             *
 *   This module is designed for the most efficient compiling, as it includes  *
 *   all of the source files into one large wad so that the compiler can see   *
 *   all function calls and inline whatever is appropriate.  The includes are  *
 *   loosely ordered so that the most common functions occur first, to help    *
 *   with cache layout when the code is actually loaded.                       *
 *                                                                             *
 *******************************************************************************
 */
#include "search.c"
#include "thread.c"
#include "repeat.c"
#include "next.c"
#include "killer.c"
#include "quiesce.c"
#include "evaluate.c"
#include "movgen.c"
#include "make.c"
#include "unmake.c"
#include "hash.c"
...
Please grow up. Can you do that to Linux? Give it a whirl and report back. 99% of my crafty compiles are individual files for obvious speed reasons.
What on earth is the matter with you? Crafty evidently compiles as a single unit. I think it is time for you to consult your physician.
What is the matter with you? It compiles as a single unit "in final release mode", not in development mode. I think it is time for you to visit a psychiatrist...
bob wrote:I compile search.c, hash.c and other files SEPARATELY.
So during development you are safe and you don't care about the final release? Is that how I should understand it?

Methinks you are really overstepping the boundaries of civil behavior. I see demons flying out of...

If you state "I do X" and somebody else indisputably shows that you do "not X", then you should admit that. You have no other choice, unless you are not a sane, reasonable adult. But no, you lack any such decency.
Because YOU choose to try to post nonsense. If you had just looked at the Crafty Makefile you would see two distinct versions of the "objects =" assignment, with one commented out.

For years I only compiled separately. I started doing the one-big-file compile when some system I ran on gave that option, with the idea that it would rearrange the procedure ordering to place procedures executed close together in time close together in memory. I've not even verified that it makes any difference for Crafty on the PC.

Actually I just did. This is gcc 4.7.3, PGO-compiled as one big source, and then as separate files.

log.001: time=11.82 n=56025487 afhm=1.18 predicted=0 50move=0 nps=4.7M
log.002: time=11.72 n=56025487 afhm=1.18 predicted=0 50move=0 nps=4.8M

Strangely enough, the second run, compiling as separate files, is a tiny bit faster.

Please show me where you "indisputably showed I compile as one large file." I just pointed you to evidence that directly disputes that. The Makefile is made to do it either way. The source files are split naturally, and only combined via the #include trick in crafty.c. I just "indisputably proved" that I do it BOTH ways. As I always have.

Again, please grow up and do just a bit of research. Your example above was bad, because it produces code slower than no optimization. NO compiler will do that unless it has bugs, which makes it pointless to discuss.

The C standard does NOT require that I compile everything as one file. Quite the opposite, it explicitly allows multiple files. It has ALWAYS been that way.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

rbarreira wrote:
hgm wrote:
rbarreira wrote:That would be a very strange decision ...
I am sure I would have nothing to gain from this -flto whatsoever. I wouldn't write code that could benefit from it. If global optimization tricks existed that could make my program faster, they would already have been used in my code; and if they were not, there would be a reason why I didn't want them used, and I would not want to give the compiler a chance to wreck my code with these so-called 'optimizations'.

For me, a programming language is a way to instruct the hardware as to what it should do, not a blank cheque to the compiler writer to do whatever he dreams up in his crazy mind, and hope for the best.
Then you should probably use a language other than C and/or stop upgrading your compiler.

As for your statement that whole-program optimization could never speed up your programs, you might be surprised if you actually tried it. Unless of course you're a superhuman who has analyzed everything related to your software.
To date, whole-program optimization gives one advantage: inlining. I did encounter a compiler that used PGO to re-order procedures to improve the memory layout slightly, but as of today, inlining is what you get. That's rarely a significant win. For me, it does nothing. I just tried a single-file compile using PGO, and then a multiple-file compile using PGO; the single-file compile was slightly slower, NOT faster.

Here's a slightly longer run than the previous post used:

log.001: time=23.72 n=117008633 afhm=1.19 predicted=0 50move=0 nps=4.9M
log.002: time=23.68 n=117008633 afhm=1.19 predicted=0 50move=0 nps=4.9M

The second is compiling each source file separately, still using PGO; the first is compiling one large source file with PGO.

It is not always a win, although that small difference is really more noise than anything else. I'm not "superhuman". But I have done more benchmarking, profiling, and such on Crafty than you can possibly imagine. I have compiled with -S and looked at individual procedures to see how the asm looks. I have modified my C at times to help the compiler produce better code (optimizers are NOT as good as some of you think they are). They are "good", to be sure. They are FAR from "great".
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: How could a compiler break the lockless hashing method?

Post by rbarreira »

bob wrote:
rbarreira wrote:
hgm wrote:Oh, then there isn't any real problem. Just never use the -flto, like you would avoid any -fauto-crash flag. If you are asking for trouble, you usually get it.
That would be a very strange decision in the context of this thread - the only reason why you would risk using undefined behavior for lockless hashing is better performance (avoiding spinlocks or mutexes). Since it's performance you're after, you're crippling yourself by not using flto / IPO (as icc calls it). What's the rationale for that? At the very least you'd owe it to yourself to compare safe hashing + flto vs unsafe hashing without flto.

This is not an academic question btw - many chess programs are being compiled with whole-program optimization (and in the case of Crafty it does a low-tech version of it by compiling the whole program as one .c file which #includes the other c files).
Did you bother to stop and ask yourself the question "Why is this option NOT the default?"

#1. to date, all it really does is allow cross-module inlining.

#2. it is slow.
Nice strawman you got going there - I never said you should always use it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

rbarreira wrote:
bob wrote:
rbarreira wrote:
hgm wrote:Oh, then there isn't any real problem. Just never use the -flto, like you would avoid any -fauto-crash flag. If you are asking for trouble, you usually get it.
That would be a very strange decision in the context of this thread - the only reason why you would risk using undefined behavior for lockless hashing is better performance (avoiding spinlocks or mutexes). Since it's performance you're after, you're crippling yourself by not using flto / IPO (as icc calls it). What's the rationale for that? At the very least you'd owe it to yourself to compare safe hashing + flto vs unsafe hashing without flto.

This is not an academic question btw - many chess programs are being compiled with whole-program optimization (and in the case of Crafty it does a low-tech version of it by compiling the whole program as one .c file which #includes the other c files).
Did you bother to stop and ask yourself the question "Why is this option NOT the default?"

#1. to date, all it really does is allow cross-module inlining.

#2. it is slow.
Nice strawman you got going there - I never said you should always use it.
No straw man. Explain this:
Since it's performance you're after, you're crippling yourself by not using flto
Not using it is bad. At least per your statement. I've tested it multiple times since it first came along. No win for me anywhere. Did you mean "you're crippling yourself <SOMETIMES>" which is NOT what you wrote...

One thing I do regularly is to compile Crafty a couple of hundred times, trying out each of the various optimizations that default to disabled, to see if any help. The answer is almost uniformly "no", which is probably why they default to disabled. They might help a program here or there, but not in the general case.

It sometimes helps to actually try this stuff, as opposed to just assuming it will do some good. For example, I just ran a long PGO compile using one source file, and then using individual source files. I expected the one large file to be slightly faster, because the compiler can inline within a file but not across files that are compiled separately. Interestingly, the one-file compile runs consistently at 4.9M nps, the separate-file compile consistently at 5.0M: a couple of percent (a consistent percentage) slower. Why? I don't know, and I am NOT going to try to look at several hundred thousand lines of ASM to see what is going wrong. I'll just continue to compile separately as I have for years and let it go at that...
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: How could a compiler break the lockless hashing method?

Post by rbarreira »

bob wrote:
rbarreira wrote:
bob wrote:
rbarreira wrote:
hgm wrote:Oh, then there isn't any real problem. Just never use the -flto, like you would avoid any -fauto-crash flag. If you are asking for trouble, you usually get it.
That would be a very strange decision in the context of this thread - the only reason why you would risk using undefined behavior for lockless hashing is better performance (avoiding spinlocks or mutexes). Since it's performance you're after, you're crippling yourself by not using flto / IPO (as icc calls it). What's the rationale for that? At the very least you'd owe it to yourself to compare safe hashing + flto vs unsafe hashing without flto.

This is not an academic question btw - many chess programs are being compiled with whole-program optimization (and in the case of Crafty it does a low-tech version of it by compiling the whole program as one .c file which #includes the other c files).
Did you bother to stop and ask yourself the question "Why is this option NOT the default?"

#1. to date, all it really does is allow cross-module inlining.

#2. it is slow.
Nice strawman you got going there - I never said you should always use it.
No straw man. Explain this:
Since it's performance you're after, you're crippling yourself by not using flto
Not using it is bad. At least per your statement. I've tested it multiple times since it first came along. No win for me anywhere. Did you mean "you're crippling yourself <SOMETIMES>" which is NOT what you wrote...

One thing I do regularly is to compile Crafty a couple of hundred times, trying out each of the various optimizations that default to disabled, to see if any help. The answer is almost uniformly "no", which is probably why they default to disabled. They might help a program here or there, but not in the general case.

It sometimes helps to actually try this stuff, as opposed to just assuming it will do some good. For example, I just ran a long PGO compile using one source file, and then using individual source files. I expected the one large file to be slightly faster, because the compiler can inline within a file but not across files that are compiled separately. Interestingly, the one-file compile runs consistently at 4.9M nps, the separate-file compile consistently at 5.0M: a couple of percent (a consistent percentage) slower. Why? I don't know, and I am NOT going to try to look at several hundred thousand lines of ASM to see what is going wrong. I'll just continue to compile separately as I have for years and let it go at that...
I was replying to someone who said never use lto. And I mentioned actually testing the performance, so don't pretend I was acting ignorant please.

As for your experiments, there are a few unstated variables: which compiler, which options, and which version. I believe some gcc versions had buggy lto.
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

bob wrote:choice 3. The choice you think can happen, by the way. Registers are tight, so again I load htable->word1 and htable->word2 into registers to do the comparison. But when I get the match, I can't keep word1 in a register, and rather than spill it to the local data I will just reload it from scratch when it is needed. I now need it. I obviously do NOT have htable in a register, since it is NOT needed after the loop, (...)
htable IS needed after the loop, see hash.c.
It is an anti-optimization. It makes the code slower. Choice 1 is what gcc actually does, and you can verify that easily enough with -S, as I did. On a 32-bit box, choice 2 might happen, since there are fewer registers (eight fewer) to use. Choice 3 will NEVER happen, because the purpose of the optimizer is not to make the code slower.
As if optimisers never make anything slower.

You are betting that this "anti-optimization" will never happen. Just as you were betting that your abuse of strcpy() would never fail.

Face it, the implementation of your xor-trick is broken. Unfortunately there is no good way to fix it without using compiler-specific tricks.
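
For what it is worth, C11 (which postdates the C99 standard discussed in this thread) added a portable, non-compiler-specific way to pin the loads down: the compiler may neither duplicate nor invent an atomic load, so the value that passes the check is the value used. Relaxed ordering suffices because the xor check itself detects torn entries. A minimal sketch, not Crafty code:

Code: Select all

#include <stdint.h>
#include <stdatomic.h>   /* C11 */

typedef struct {
  _Atomic uint64_t word1;
  _Atomic uint64_t word2;
} HashEntry;               /* hypothetical layout, as in earlier sketches */

/* Each explicit atomic load is performed exactly once. */
int probe_c11(HashEntry *htable, uint64_t key, uint64_t *out) {
  uint64_t word1 = atomic_load_explicit(&htable->word1, memory_order_relaxed);
  uint64_t word2 = atomic_load_explicit(&htable->word2, memory_order_relaxed);
  if ((word2 ^ word1) == key) {
    *out = word1;          /* guaranteed to be the validated value */
    return 1;
  }
  return 0;
}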
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

bob wrote:
syzygy wrote:If you state "I do X" and somebody else indisputably shows that you do "not X", then you should admit that. You have no other choice, unless you are not a sane, reasonable adult. But no, you lack any such decency.
Because YOU choose to try to post nonsense. If you had just looked at the Crafty Makefile you would see two distinct versions of the "objects =" assignment, with one commented out.
Ok, let's do that. I have crafty-23.6 here:

Code: Select all

#objects = search.o thread.o repeat.o next.o killer.o quiesce.o evaluate.o     \
       movgen.o make.o unmake.o hash.o  attacks.o swap.o boolean.o utility.o  \
       history.o probe.o book.o data.o drawn.o edit.o epd.o epdglue.o init.o  \
       input.o interrupt.o iterate.o main.o option.o output.o ponder.o        \
       resign.o root.o learn.o setboard.o test.o time.o validate.o annotate.o \
       analyze.o evtest.o bench.o
objects = crafty.o
What were you saying?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:If you state "I do X" and somebody else indisputably shows that you do "not X", then you should admit that. You have no other choice, unless you are not a sane, reasonable adult. But no, you lack any such decency.
Because YOU choose to try to post nonsense. If you had just looked at the Crafty Makefile you would see two distinct versions of the "objects =" assignment, with one commented out.
Ok, let's do that. I have crafty-23.6 here:

Code: Select all

#objects = search.o thread.o repeat.o next.o killer.o quiesce.o evaluate.o     \
       movgen.o make.o unmake.o hash.o  attacks.o swap.o boolean.o utility.o  \
       history.o probe.o book.o data.o drawn.o edit.o epd.o epdglue.o init.o  \
       input.o interrupt.o iterate.o main.o option.o output.o ponder.o        \
       resign.o root.o learn.o setboard.o test.o time.o validate.o annotate.o \
       analyze.o evtest.o bench.o
objects = crafty.o
What were you saying?
That the Makefile works either way. Remove the "#" on the first definition of objects, which is exactly what I have on my local machines, and add a "#" to the second definition. For release, for whatever reason lost in history, the standard Makefile has the one-file option enabled.

I don't distribute my local Makefiles because I change them all the time, and I don't want to create additional issues for new versions when "make" suddenly doesn't work.

So what is your point? It compiles EITHER way. EITHER way is correct, although I now know that separate files produce an executable about 2% faster with current gcc, and approximately the same speed with icc. What more is there to say? You think I am legally required to run the same Makefile that I distribute? Even though said Makefile has two options, with this comment in front that you left off:

#
# one of the two following definitions for "objects" should be used. The
# default is to compile everything separately. However, if you use the
# definition that refers to crafty.o, that will compile using the file crafty.c
# which #includes every source file into one large glob. This gives the
# compiler max opportunity to inline functions as appropriate. You should try
# compiling both ways to see which way produces the fastest code.
#

Notice that the COMMENT actually relates to what I normally do, namely "compile everything separately". When/how/why it was changed to the one-big-file option I don't remember, and I really don't care, since the difference is small in the first place. You DID read those comments, right?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:choice 3. The choice you think can happen, by the way. Registers are tight, so again I load htable->word1 and htable->word2 into registers to do the comparison. But when I get the match, I can't keep word1 in a register, and rather than spill it to the local data I will just reload it from scratch when it is needed. I now need it. I obviously do NOT have htable in a register, since it is NOT needed after the loop, (...)
htable IS needed after the loop, see hash.c.
It is an anti-optimization. It makes the code slower. Choice 1 is what gcc actually does, and you can verify that easily enough with -S, as I did. On a 32-bit box, choice 2 might happen, since there are fewer registers (eight fewer) to use. Choice 3 will NEVER happen, because the purpose of the optimizer is not to make the code slower.
As if optimisers never make anything slower.

You are betting that this "anti-optimization" will never happen. Just as you were betting that your abuse of strcpy() would never fail.

Face it, the implementation of your xor-trick is broken. Unfortunately there is no good way to fix it without using compiler-specific tricks.
1. You are correct about htable. I had forgotten about updating the age when I get a match on a probe. Case 1 is STILL the right answer: keep htable in a register, keep word1 in a register. That's what a rational compiler will do until it encounters a register jam, which won't happen in hash.c due to its simplicity.

2. I am certainly betting that the compiler, in its "common subexpression" optimization pass, will realize that going BACK to memory is about a thousand times worse than recomputing a typical common subexpression, and it will NOT generate multiple memory accesses. In fact, for gcc, you can remove the local word1/word2 declarations and replace all of them with htable->word1 or htable->word2, and the compiler produces the same code. As it should. I simply wrote it this way to make it easy to read, knowing that word1/word2 really won't be touched anyway.

3. I'm not using any "compiler-specific tricks" to make it work. It works, as intended, unless someone breaks the optimizer. And even then there is no ill effect, since I check the hash move for validity, and a random signature collision has no effect on the final move choice (see the paper by myself and Cozzie). And it has no performance penalty due to locks.
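
As a closing note, the "compiler-specific trick" mentioned earlier in the thread is typically a compiler barrier. A gcc/clang sketch, with hypothetical names again: the empty asm with a "memory" clobber tells the compiler that memory may have changed, so it cannot re-derive word1 from htable->word1 after the check and must keep the value it actually loaded. The data race remains formally undefined without atomics; the barrier only blocks the problematic code motion in practice:

Code: Select all

#include <stdint.h>

typedef struct { uint64_t word1, word2; } HashEntry;  /* hypothetical */

#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

int probe_barrier(const HashEntry *htable, uint64_t key, uint64_t *out) {
  uint64_t word1 = htable->word1;   /* single load of each word */
  uint64_t word2 = htable->word2;
  COMPILER_BARRIER();               /* loads may not be re-done below */
  if ((word2 ^ word1) == key) {
    *out = word1;                   /* the value that was checked */
    return 1;
  }
  return 0;
}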