How could a compiler break the lockless hashing method?

Discussion of chess software programming and technical issues.

Moderator: Ras

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: How could a compiler break the lockless hashing method?

Post by rbarreira »

wgarvin wrote:
bob wrote:So it looks like MOST of the long-term C programmers have quite a different view from those of you that are apparently a product of the Java abstract-machine crowd. What kind of language would target assembly language programming? One that lets you get to the hardware to do the things the hardware is capable of, WITHOUT getting in the way and being an obstructionist.
You keep saying "Java" and "abstract-machine" together, but the abstract machine stuff in the C language specification pre-dates the existence of Java by many years. So I'm not quite sure what your point is there. Maybe the "Java" is a reference to some kind of "dumbing down" of C? If anything, the opposite is happening--the compilers are getting cleverer and it seems to me that programmers have a greater need now to understand difficult topics like UB than they did in the past. Even though the UB stuff puts a few things out of bounds, C does still let you access a lot of the power of the underlying hardware.

I learned x86 assembly before I learned C or C++, then I learned a bit of C and then early C++ (back then, C++ was not nearly the abomination of a language that it is today). I didn't really learn Java until my first job after university, where I used it every day for about 4 years. For the last 7 years, it's been C++ on somewhat limited platforms: game consoles. I've spent countless days trying to save 100 kb of code size, or shave a few microseconds off of something that gets executed 30 times a second, or tweak some float code to use vector intrinsics just to avoid a performance-killing load-hit-store. So I appreciate the value in being able to get close to the hardware from inside a high-level language like C or C++, I really do! But portability is also something we must have. Some of my code has to work on 6 different compilers for 4 different CPU architectures, so using gcc-specific extensions or something usually won't fly. And of course it needs to be as fast as possible, and if those compilers will do something extremely clever for me then I'll take it even if the price is that I have to be careful to avoid >191 kinds of undefined behavior or really weird things might occur.
Yeah Bob, no need to insult people's intelligence by saying we can't handle C because we're braindead due to Java. I was programming x86 assembly way before I touched Java. In my day-to-day work I use C++ and C# mostly; on hobby stuff I use x64 asm / C / C++.

Ironically, it looks like it's actually you who can't handle C. You expect it to be a clean safe language where everything is neatly defined, when it has NEVER been that. You pretend it's only recent compilers and standards that make it a hard-to-use language, which is also false.
syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

bob wrote:
syzygy wrote:
Certainly integer overflow on any hardware from the last 30 years, except for Univac, which has been dead for about that long, is not "undefined" at all.
So what is it then? Signed overflow wraps? You have already explained that that is not what you want in the case of x * 2 / 2.
It wraps in ALL cases. The only question is, does it wrap at 32 or more bits? I'd like to see it do the best the hardware can offer. I suppose C COULD just use 16-bit values on x86, since that is all the 286 can do. But there is more in the hardware.
Then you have to learn how to cast.

The C standard is clear about multiplication of 2 unsigned ints giving an unsigned int and not something wider. Programmers must be able to rely on the C standard, so a compiler shouldn't just play smart here.

However, when the ints are signed, the compiler obviously may do whatever it wishes with the overflowing case: signed overflow is undefined, so as far as the standard is concerned, overflow can't happen.
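As a small illustration (a sketch in plain C, not taken from anyone's code in this thread): because signed overflow is undefined, the optimizer is allowed to collapse the thread's toy expression entirely.

```c
/* Sketch: with a plain signed int and no -fwrapv, gcc and clang at -O2
   may compile this to just "return x;" -- the only case where
   x * 2 / 2 != x is an overflow, and signed overflow is UB. */
int half_of_double(int x)
{
    return x * 2 / 2;
}
```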

If you really need the intermediate result of a multiplication of two 32-bit values in 64-bit precision, just cast one of the two to 64-bit first. In other words, tell the compiler exactly what you want.
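A minimal sketch of that cast (the function name is illustrative):

```c
#include <stdint.h>

/* Widen one operand first so the multiplication itself is carried out in
   64-bit precision; without the cast, a and b would be multiplied as
   32-bit ints and the intermediate product could overflow. */
int64_t wide_product(int32_t a, int32_t b)
{
    return (int64_t)a * b;   /* b is converted to int64_t by the usual
                                arithmetic conversions before the multiply */
}
```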

But the very basic mistake you keep making is confusing x86 semantics with C semantics.

A C programmer should only be concerned with C semantics. He can rely on what is defined by the C standard. He can not rely on anything else (unless he restricts himself to a particular compiler that offers additional guarantees). The contract between the C programmer and the C compiler is the C standard.
For a naive or sloppy programmer, that is likely true. However some of us want as much as we can get.
But those some of you are deluded.
syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

wgarvin wrote:[Edit: it probably makes X * 2 / 2 emit those math ops 'as written' too. unless the compiler notices it can fold the 2 / 2 first... I guess I forgot throughout the posts about that toy example, that it could just fold the constant part first.. or else I'm tired and not thinking clearly.. hmm.]
With -fwrapv the compiler isn't allowed to fold 2 / 2 first, because it would not follow the programmer's explicit instruction to first multiply by 2, wrap in case of overflow, then divide by 2.
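A small concrete example of that point, assuming gcc or clang with -fwrapv (values chosen just for illustration):

```c
#include <limits.h>
#include <stdio.h>

/* Compiled with -fwrapv, x * 2 must wrap on overflow, so x * 2 / 2 is
   NOT the identity and the 2 / 2 cannot be folded away. */
int main(void)
{
    int x = INT_MAX;             /* 2147483647 */
    int y = x * 2 / 2;           /* -fwrapv: INT_MAX * 2 wraps to -2, then -2 / 2 == -1 */
    printf("%d -> %d\n", x, y);  /* prints "2147483647 -> -1" under -fwrapv;
                                    without -fwrapv the overflow is undefined behavior */
    return 0;
}
```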
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: How could a compiler break the lockless hashing method?

Post by wgarvin »

syzygy wrote:
wgarvin wrote:[Edit: it probably makes X * 2 / 2 emit those math ops 'as written' too. unless the compiler notices it can fold the 2 / 2 first... I guess I forgot throughout the posts about that toy example, that it could just fold the constant part first.. or else I'm tired and not thinking clearly.. hmm.]
With -fwrapv the compiler isn't allowed to fold 2 / 2 first, because it would not follow the programmer's explicit instruction to first multiply by 2, wrap in case of overflow, then divide by 2.
Oops, yeah. * and / operators associate left-to-right, so the compiler can only emit some other ordering if the result would look "as if" it had been evaluated left to right. With overflow being UB it would be able to reassociate there, but not with -fwrapv turned on.

[edit: I don't use signed types very often, because overflow is UB so I find unsigned types easier to reason about. if X were unsigned, it might emit a single AND instruction. :P With signed X and -fwrapv, I guess it could emit an ADD/LEA then a SAR. ]
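A sketch of the unsigned case from that edit note (the helper name is mine); the signed -fwrapv version would additionally need a small rounding correction before the arithmetic shift, since C integer division truncates toward zero:

```c
#include <stdint.h>

/* For 32-bit unsigned x, x * 2u wraps modulo 2^32, so x * 2u / 2u is
   exactly x with its top bit cleared -- which the compiler can emit as
   a single AND with 0x7FFFFFFF. */
uint32_t mul_then_div_unsigned(uint32_t x)
{
    return x * 2u / 2u;          /* equivalent to: x & 0x7FFFFFFFu */
}
```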
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

wgarvin wrote:
Its design provides constructs that map efficiently to typical machine instructions, and therefore it has found lasting use in applications that had formerly been coded in assembly language, most notably system software like the Unix computer operating system.
Correct. C provides constructs that map efficiently TO the target hardware. Notice what that description does not say: it does not say that every aspect of the target hardware necessarily gets such an efficient mapping from a C construct, or any mapping at all.

Example: Just because you can shift bits into and out of the "sign bit" of a register in x86 assembly language, doesn't mean you can do it portably in C with a signed int.
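A minimal sketch of that point (hypothetical helpers): the unsigned version is fully defined, while the signed one is not.

```c
#include <stdint.h>

/* Setting the top bit through an unsigned type is well defined. */
uint32_t set_top_bit(uint32_t x)
{
    return x | (1u << 31);
}

/* The "same" thing through a 32-bit signed int is not: 1 << 31
   overflows int, which is undefined behavior in C. */
/* int set_sign_bit(int x) { return x | (1 << 31); } */
```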

Example: Just because loading from a zero address will trap on your x86 chip on your favorite operating system/memory model, doesn't mean you can dereference a NULL pointer in C and expect it to trap at that point in the program. Dereferencing NULL is not defined to trap, it is actually UB, and may instead summon the nasal demons.

Example: Several C compilers recognize idioms that they can turn into a 'rotate' instruction. Which is nice and friendly of them, since C has shift operators but no rotate operator of any kind, even though nearly every machine capable of running compiled C code has both shift and rotate instructions.
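For illustration, a sketch of the usual idiom (helper name is mine); gcc, clang and MSVC all recognize this pattern and typically emit a single rotate instruction for it.

```c
#include <stdint.h>

/* Rotate left by n. The (32 - n) & 31 keeps the right-shift count in
   range, so the expression stays well defined even when n == 0. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}
```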

Anyway, C does provide useful constructs that map efficiently to the target hardware, but in the interests of portability and maximum performance, it also left a lot of stuff undefined. And fixing that without tanking the performance is perhaps more difficult than you make it sound.
Notice that it does say "applications formerly written in assembly language." This has been the mantra of C since it was created. But apparently we have gotten quite a ways away from that original intent... to the point where it might even make the Java boys happy.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

wgarvin wrote:
bob wrote:So it looks like MOST of the long-term C programmers have quite a different view from those of you that are apparently a product of the Java abstract-machine crowd. What kind of language would target assembly language programming? One that lets you get to the hardware to do the things the hardware is capable of, WITHOUT getting in the way and being an obstructionist.
You keep saying "Java" and "abstract-machine" together, but the abstract machine stuff in the C language specification pre-dates the existence of Java by many years. So I'm not quite sure what your point is there. Maybe the "Java" is a reference to some kind of "dumbing down" of C? If anything, the opposite is happening--the compilers are getting cleverer and it seems to me that programmers have a greater need now to understand difficult topics like UB than they did in the past. Even though the UB stuff puts a few things out of bounds, C does still let you access a lot of the power of the underlying hardware.

I learned x86 assembly before I learned C or C++, then I learned a bit of C and then early C++ (back then, C++ was not nearly the abomination of a language that it is today). I didn't really learn Java until my first job after university, where I used it every day for about 4 years. For the last 7 years, it's been C++ on somewhat limited platforms: game consoles. I've spent countless days trying to save 100 kb of code size, or shave a few microseconds off of something that gets executed 30 times a second, or tweak some float code to use vector intrinsics just to avoid a performance-killing load-hit-store. So I appreciate the value in being able to get close to the hardware from inside a high-level language like C or C++, I really do! But portability is also something we must have. Some of my code has to work on 6 different compilers for 4 different CPU architectures, so using gcc-specific extensions or something usually won't fly. And of course it needs to be as fast as possible, and if those compilers will do something extremely clever for me then I'll take it even if the price is that I have to be careful to avoid >191 kinds of undefined behavior or really weird things might occur.
My point was that the "abstract machine" for C was, years ago, the machine it was being used on. Java uses an abstract machine that has been dumbed down to the least common denominator of all available machines. Looks like that is where C is headed thanks to the UB optimizer fans...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

rbarreira wrote:
wgarvin wrote:
bob wrote:So it looks like MOST of the long-term C programmers have quite a different view from those of you that are apparently a product of the Java abstract-machine crowd. What kind of language would target assembly language programming? One that lets you get to the hardware to do the things the hardware is capable of, WITHOUT getting in the way and being an obstructionist.
You keep saying "Java" and "abstract-machine" together, but the abstract machine stuff in the C language specification pre-dates the existence of Java by many years. So I'm not quite sure what your point is there. Maybe the "Java" is a reference to some kind of "dumbing down" of C? If anything, the opposite is happening--the compilers are getting cleverer and it seems to me that programmers have a greater need now to understand difficult topics like UB than they did in the past. Even though the UB stuff puts a few things out of bounds, C does still let you access a lot of the power of the underlying hardware.

I learned x86 assembly before I learned C or C++, then I learned a bit of C and then early C++ (back then, C++ was not nearly the abomination of a language that it is today). I didn't really learn Java until my first job after university, where I used it every day for about 4 years. For the last 7 years, it's been C++ on somewhat limited platforms: game consoles. I've spent countless days trying to save 100 kb of code size, or shave a few microseconds off of something that gets executed 30 times a second, or tweak some float code to use vector intrinsics just to avoid a performance-killing load-hit-store. So I appreciate the value in being able to get close to the hardware from inside a high-level language like C or C++, I really do! But portability is also something we must have. Some of my code has to work on 6 different compilers for 4 different CPU architectures, so using gcc-specific extensions or something usually won't fly. And of course it needs to be as fast as possible, and if those compilers will do something extremely clever for me then I'll take it even if the price is that I have to be careful to avoid >191 kinds of undefined behavior or really weird things might occur.
Yeah Bob, no need to insult people's intelligence by saying we can't handle C because we're braindead due to Java. I was programming x86 assembly way before I touched Java. In my day-to-day work I use C++ and C# mostly; on hobby stuff I use x64 asm / C / C++.

Ironically, it looks like it's actually you who can't handle C. You expect it to be a clean safe language where everything is neatly defined, when it has NEVER been that. You pretend it's only recent compilers and standards that make it a hard-to-use language, which is also false.
Didn't say anyone was brain dead. I said Java is brain dead: minimal abstract machine, pointerless, etc. There are way more Java programmers than programmers of any other language. So that mindset has begun to become pervasive. I suppose that next you are going to tell me you want some of the "mistake-prone" things removed from assembly language so that you can't write code there with overflow, etc.?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

wgarvin wrote:
bob wrote:You need to get a refund on your mind-reading training. Have you looked at Crafty? Did you look at the functions to do popcnt, MSB and LSB? DO they appear to be "portable"? When I had an alpha here, I just wrote the correct asm routines for that. When I was using a sun SPARC ditto. For the Cray. Ditto.
I've looked at Crafty before, though not lately. It's a nice program by the way, pretty readable. popcnt, MSB, etc., sure.

Earlier this evening I happened to be reading the SMP thread from 2009 when you discovered an atomicity problem with the hash, unless I am mistaken that was the birth of the XOR hashing trick, right? You did seem to care about portability back then, at least a little bit. I suggested you could atomically store the 16-byte entries using SSE and you rejected the idea because it wasn't portable! :lol:
bob in 2009 wrote:The problem with XMM is that I am using C and do not want to add more x86 assembly beyond the BSF/BSR code already used. And C doesn't let me work directly with the XMM registers to do this portably.
I know, I know... there was also the 32-byte pawn hash to worry about... I can't resist teasing a bit, sorry! :lol:
No, the XOR trick was used on Cray Blitz first. We used locks through the YMP, but when we got to 16 and 32 CPUs, we encountered too much contention (the Cray had exactly 32 one-bit semaphore registers, and the O/S claimed half of them; you COULD use one of those to guard a larger number of individual semaphores done in software, but then there is still contention for that single semaphore bit).

Portability was never an issue; speed was all that counted. We had exactly the same problem on the Cray (8-byte stores per instruction) that we do on x86, so a solution was needed. I simply neglected to include that when I started Crafty because it was specifically targeted toward non-Cray boxes (like the SPARC and PC), where multiple CPUs were not mainstream at the time.

If you talk about SSE, I would have to somehow return to 2009 to understand my concern. Perhaps it was that SSE is not exactly standard across all platforms, being present on some and absent on others. And it would also not solve my problem, because I have a pawn hash table that is much longer than 16 bytes. BSF/BSR are significant speed gains. Storing 16 bytes versus the XOR trick is not a measurable speed gain, which would also make me consider it carefully. I'm willing to do what is needed for speed. The XOR trick provides safety with no measurable speed cost, therefore it was a natural decision.
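For readers who have not seen it, here is a minimal sketch of the XOR trick as described in the Hyatt/Mann lockless transposition table write-up; the field names are illustrative, not Crafty's actual layout.

```c
#include <stdint.h>

/* Each entry is two 64-bit words. The first word is stored as key ^ data,
   so a "torn" entry (key half from one thread's store, data half from
   another's) fails validation on probe instead of returning corrupt data.
   No locks are used; note that a strict reading of the C11 memory model
   still calls these unsynchronized accesses data races, which is exactly
   what this thread is arguing about. */
typedef struct {
    uint64_t key_xor_data;
    uint64_t data;
} tt_entry;

static void tt_store(tt_entry *e, uint64_t key, uint64_t data)
{
    e->key_xor_data = key ^ data;    /* two independent 8-byte stores */
    e->data         = data;
}

static int tt_probe(const tt_entry *e, uint64_t key, uint64_t *data_out)
{
    uint64_t d = e->data;
    if ((e->key_xor_data ^ d) != key)
        return 0;                    /* wrong entry or torn write: reject */
    *data_out = d;
    return 1;
}
```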

BTW I might actually reconsider that decision today, because today I only run on X86 anyway, nothing else, unlike a few years ago.

BTW HOW can a 16 byte store be guaranteed to be done atomically when the bus only transfers 8 bytes at a time? I'm guessing there could be some trickery in the cache controllers, but I have not read anything about that happening.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: How could a compiler break the lockless hashing method?

Post by wgarvin »

bob wrote:If you talk about SSE, I would have to somehow return to 2009 to understand my concern. Perhaps it was that SSE is not exactly standard across all platforms, being present on some and absent on others. And it would also not solve my problem, because I have a pawn hash table that is much longer than 16 bytes. BSF/BSR are significant speed gains. Storing 16 bytes versus the XOR trick is not a measurable speed gain, which would also make me consider it carefully. I'm willing to do what is needed for speed. The XOR trick provides safety with no measurable speed cost, therefore it was a natural decision.

BTW I might actually reconsider that decision today, because today I only run on X86 anyway, nothing else, unlike a few years ago.
I'm just guessing, but I think maybe your objection to it was that there was no convenient/portable way to access it. The necessary intrinsics are better-supported by compilers now, so maybe it's worth trying. But the XOR trick works well enough in practice, so perhaps there's no point anyway.
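As a rough sketch of what the intrinsic version could look like (assuming SSE2 and a 16-byte-aligned entry; as the next posts point out, the architecture does not actually guarantee that this 16-byte store is atomic, so it is illustrative only):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdalign.h>
#include <stdint.h>

typedef struct {
    alignas(16) uint64_t word1;  /* 16-byte alignment required by _mm_store_si128 */
    uint64_t word2;
} hash_entry;

/* Write both 64-bit words of an entry with a single 16-byte store. */
static void store_entry_sse(hash_entry *dst, uint64_t w1, uint64_t w2)
{
    __m128i v = _mm_set_epi64x((long long)w2, (long long)w1);  /* high = w2, low = w1 */
    _mm_store_si128((__m128i *)dst, v);
}
```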
bob wrote: BTW HOW can a 16 byte store be guaranteed to be done atomically when the bus only transfers 8 bytes at a time? I'm guessing there could be some trickery in the cache controllers, but I have not read anything about that happening.
If I ever knew the answer to this question, I have forgotten it. I'm kind of curious about that myself though... no promises but I might dig around in some docs and try to figure it out. If you find the answer somewhere, please do post it.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: How could a compiler break the lockless hashing method?

Post by wgarvin »

wgarvin wrote:
bob wrote: BTW HOW can a 16 byte store be guaranteed to be done atomically when the bus only transfers 8 bytes at a time? I'm guessing there could be some trickery in the cache controllers, but I have not read anything about that happening.
If I ever knew the answer to this question, I have forgotten it. I'm kind of curious about that myself though... no promises but I might dig around in some docs and try to figure it out. If you find the answer somewhere, please do post it.
AHA! I think the answer is that I was wrong back in 2009 for thinking they would be atomic! Apparently there is no such guarantee, and the Intel Architecture Manual Vol. 3A, Section 8.1.1 (May 2011) even warns against relying on it being atomic:
An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked “not present”). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.4.4), such page faults may occur even if all accesses are to the same page.
Emphasis mine. I copied the quote from this stackoverflow thread, not the original document. Anyway, no need to test that idea; it's not safe.