Speaking of the hash table

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

wgarvin wrote:Well, that was sure an entertaining thread to read!
All the sensible people on one side of the argument, and Vincent on the other.
...No prizes for guessing which side is smoking crack.

Storing 16 bytes with two or more instructions is never atomic: you have no control over when the writes reach the cache, main memory, or other cores. Cache misses, paging, or thread switching can make the gap between two reads or two writes arbitrarily long.

Even if it's a single 16-byte write done with an SSE instruction, even one aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing and completely solves the problem.
Don't BS over here. I already posted a reply here weeks ago adding up the probabilities.

You'll see a problem can then only occur on 1 in 10^20 occasions.

A write error of this kind, the Abcd type - the one without consequences, though you guys seem to ignore what I wrote there - happens once in every 10^10.

So I guess most here don't even realize what I mean by AbCD: a bad error that CAN have consequences, which however has odds of 1 in 10^20.
Yet because most guys over here are not reading very well, they seem to ignore all this.

Just add up the probabilities. I already posted this with an example calculation.

http://www.talkchess.com/forum/viewtopi ... 906150f9f9

I guess, however, that you guys who didn't measure at all - if you smoke crack - are not very good at working out the statistical chance that something can occur.

The way to measure this is very simple. Make a small 8-bit CRC of your entry prior to storing, then store all the bytes (so don't put bunches of instructions between each store).

Now you can simply check the CRC at each READ of an entry and record every write error you detect.

Is it so hard to perform that test?

You can measure all this easily - which is what I did.
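
For illustration, a minimal sketch of that kind of instrumentation (the entry layout, the field names, and the xor-folded checksum standing in for a real CRC are all hypothetical, not Diep's actual code):

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t word1;    /* move, score, depth, flags (hypothetical layout) */
    uint64_t word2;    /* hash signature */
    uint8_t  crc;      /* small checksum over the two words */
} Entry;

/* fold both words down to 8 bits; a torn write that mixes two different
   entries will almost always break this checksum */
static uint8_t entry_crc(uint64_t w1, uint64_t w2) {
    uint64_t x = w1 ^ (w2 * 0x9e3779b97f4a7c15ULL);
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    return (uint8_t) x;
}

static unsigned long write_errors;    /* detected torn/interleaved writes */

void store_entry(Entry *slot, uint64_t w1, uint64_t w2) {
    slot->crc   = entry_crc(w1, w2);
    slot->word1 = w1;                 /* store all bytes back to back,    */
    slot->word2 = w2;                 /* no extra work between the stores */
}

int probe_entry(const Entry *slot, uint64_t *w1, uint64_t *w2) {
    *w1 = slot->word1;
    *w2 = slot->word2;
    if (entry_crc(*w1, *w2) != slot->crc) {
        write_errors++;               /* record the write error */
        return 0;
    }
    return 1;
}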

So who's talking BS here?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

bob wrote:
Gerd Isenberg wrote:
wgarvin wrote: ...
Even if it's a single 16-byte write done with an SSE instruction, even one aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing and completely solves the problem.
In the context of SSE 16-byte accesses, the following link provides some sample code and discussion.

http://stackoverflow.com/questions/7646 ... ory-access
That seems to fit exactly with the testing I have done in the past, although I did not try the MMX / SSE instructions... Crafty 22.9 clearly shows there's a problem that is easy enough to solve without resorting to locking...
Crafty 22.9 is dead old.

Where to start...

In your 'hashprobe' routine you're doing an in-between update of the hashtable.

So you write 8 bytes to the hashtable. You don't update the hashkey when doing such an update.

Do you find it weird, then, that you can measure illegal moves in the hashtable?

There are probably another quadrillion bugs, but this should be enough to explain the weird things you see. We didn't even discuss en passant yet - Chrilly once wrote about bugs in en passant - I can only ack Chrilly there...

Code: Select all

In HashProbe()...

 /* refresh the age: keep the low 61 bits of the entry and overwrite the
    top 3 bits with the current search's transposition_id */
 htable->word1 =
          (word1 & 0x1fffffffffffffffULL) | ((BITBOARD) transposition_id << 61);
Then you probably test with some ultra tiny hashtable as well...

Where to start?

You're measuring bugs in crafty - not anything else.
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:Where to start?
Maybe with your initial remark that started this whole subthread:
diep wrote:In case Ed has his hashtable aligned and a multiple of it forms a cacheline, you can prove it cannot happen on a PC.
By now you have admitted that:
- it certainly can happen on PC hardware; and
- it will happen quite often if the programmer does not take special precautions (see crafty 22.9).

I don't think anybody has said that this problem is very serious. Firstly, an erroneous probe here and there normally does not affect the outcome of the search. As long as the program does not crash on illegal moves, things are fine. Secondly, it is possible to eliminate such errors in a relatively cheap way using the xor-trick.
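
For reference, the xor-trick in its usual form looks roughly like this (a sketch with illustrative names, not Crafty's actual code): the signature is stored xored with the data word, so a probe only accepts an entry whose two words came from the same store.

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t word1;    /* data: move, score, depth, age, ... */
    uint64_t word2;    /* hash signature xored with the data word */
} HashEntry;

/* store: two plain 8-byte writes, no lock */
void hash_store(HashEntry *h, uint64_t data, uint64_t signature) {
    h->word1 = data;
    h->word2 = data ^ signature;
}

/* probe: if another thread's store interleaved with ours, the recovered
   signature will (almost certainly) not match, and the corrupted entry
   is simply treated as a miss */
int hash_probe(const HashEntry *h, uint64_t signature, uint64_t *data) {
    uint64_t w1 = h->word1;
    uint64_t w2 = h->word2;
    if ((w1 ^ w2) != signature)
        return 0;
    *data = w1;
    return 1;
}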
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:Where to start?
Maybe with your initial remark that started this whole subthread:
diep wrote:In case Ed has his hashtable aligned and a multiple of it forms a cacheline, you can prove it cannot happen on a PC.
By now you have admitted that:
- it certainly can happen on PC hardware; and
- it will happen quite often if the programmer does not take special precautions (see crafty 22.9).

I don't think anybody has said that this problem is very serious. Firstly, an erroneous probe here and there normally does not affect the outcome of the search. As long as the program does not crash on illegal moves, things are fine. Secondly, it is possible to eliminate such errors in a relatively cheap way using the xor-trick.
Why do you post this blabla nonsense?

Bob writes 8 bytes somewhere into the hashtable in his hashprobe.
And he does this every time the key matches. I would guess that's 10%+ of the cases...

So that's not his hashstore.

It's obvious he has no communication whatsoever with whoever improved Crafty 23.x, as I see that in 23.x they have fixed all this and write 16 bytes, not 8 bytes.

What I did some years ago is measure the number of collisions and write errors under different conditions, and I noticed that with 64 bits I had more collisions than write errors.

When I used somewhere in the 70+ bits, I could no longer measure any collisions happening.

Note that I use a 128-bit hashkey: 64 bits of it index into the hashtable, and part of the other 64 bits I store into the hashtable.
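
In other words, something along these lines (a sketch of the split being described; the names and the exact number of signature bits kept are illustrative):

Code: Select all

#include <stdint.h>

/* a 128-bit key kept as two 64-bit halves */
typedef struct { uint64_t lo, hi; } Key128;

/* one half selects the slot in the table ... */
uint64_t hash_index(Key128 k, uint64_t num_entries) {
    return k.lo % num_entries;
}

/* ... and (part of) the other half is the signature stored in the entry */
uint64_t hash_signature(Key128 k) {
    return k.hi;
}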

All collisions were gone back then.

The AbCD scenario won't ever happen in our lifetime, however.

We have already seen cases here where Uri Blass corrected Bob on how Crafty nowadays does extensions. Bob obviously has no clue what happened in the fixes to his hashtable either.

There is so much that has been modified/fixed in Crafty, and Bob is totally clueless there.

Who has been improving Crafty?

As is obvious from the code, it's more than one person, and none of those persons is Bob. In fact it's totally trivial to see from the past few years, after 2007, that Bob doesn't maintain the source code of Crafty...

Note that if your hashtable entry is aligned on the cache line, you can never have a problem like AbCD, provided of course that you write 16 bytes and not just a few bytes to the hashtable. In the Crafty case of writing 8 bytes which do not contain any hashkey, it's totally trivial that things go wrong.

A dude can mess up in ways that not even a hundred wise men can fix, of course.

Please note that there is also radiation coming from outer space; that's why most supercomputers use ECC. This radiation can cause bit flips.

So if you are patient enough and wait forever, obviously everything *can* get messed up.

Yet we are dealing with statistical chances here. I've said enough on this subject.

The only one who, it appears, measured this in a normal manner is me.

Why don't you just test things yourself instead of posting endless posts here without any logic? You talk, yet you test nothing.
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:Why do you post this blabla nonsense?
Because what I wrote is the point being made in this subthread. With all due respect, the label "blabla nonsense" seems more appropriate for what you write about crafty and bob and all that.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
bob wrote:
Gerd Isenberg wrote:
wgarvin wrote: ...
Even if it's a single 16-byte write done with an SSE instruction, even one aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing and completely solves the problem.
In the context of SSE 16-byte accesses, the following link provides some sample code and discussion.

http://stackoverflow.com/questions/7646 ... ory-access
That seems to fit exactly with the testing I have done in the past, although I did not try the MMX / SSE instructions... Crafty 22.9 clearly shows there's a problem that is easy enough to solve without resorting to locking...
Crafty 22.9 is dead old.

Where to start...

In your 'hashprobe' routine you're doing an in-between update of the hashtable.

So you write 8 bytes to the hashtable. You don't update the hashkey when doing such an update.

Here is the code:

Code: Select all

  /* the primary (depth-preferred) slot is replaced when its entry comes
     from a different search or when the depth test allows it */
  if (age != transposition_id || (depth >= draft)) {
    /* if the primary slot held a different position, move that entry
       into one of the two overflow slots first */
    if (word2 != htable->word2) {
      /* one bit of the displaced entry's signature picks overflow slot 1 or 2 */
      hwhich = (((htable->word2) >> log_hash) & 1) + 1;
      (htable + hwhich)->word1 = htable->word1;
      (htable + hwhich)->word2 = htable->word2;
    }
    htable->word1 = word1;
    htable->word2 = word2;
  } else {
    /* otherwise the new entry goes into an overflow slot, chosen by one
       bit of its own signature */
    hwhich = ((word2 >> log_hash) & 1) + 1;
    (htable + hwhich)->word1 = word1;
    (htable + hwhich)->word2 = word2;
  }
I wrote BOTH words back to back. There are three different choices, but that's irrelevant. Once I update word1 (the score, etc.) and word2 (the sig), I IMMEDIATELY write both to memory, back to back. And they get broken up and interleaved with other processors...

So again, what are you talking about?



Do you find it weird, then, that you can measure illegal moves in the hashtable?

There are probably another quadrillion bugs, but this should be enough to explain the weird things you see. We didn't even discuss en passant yet - Chrilly once wrote about bugs in en passant - I can only ack Chrilly there...
There's no bug in the hash code in that version. It works flawlessly and is easy to follow... EP status is hashed in automatically. Always has been. Just because YOU might have such bugs does NOT mean I do, at least not here.


Code: Select all

In HashProbe()...

 /* refresh the age: keep the low 61 bits of the entry and overwrite the
    top 3 bits with the current search's transposition_id */
 htable->word1 =
          (word1 & 0x1fffffffffffffffULL) | ((BITBOARD) transposition_id << 61);
Then you probably test with some ultra tiny hashtable as well...
Do you know what the above does? It simply replaces the existing entry's AGE with the current age, since I am using an entry that could have been stored in a previous search (different "AGE"). When I hit on an old entry I update the age so that it gets treated like a current entry, since it was just used. What is so unusual about that?? What does that have to do with the current discussion?

I do a probe, I match a sig but with a position from an older search, so I update the search age by itself. So what??? The sig JUST matched. I update the other half to handle the new age.



Tiny hash table? I used 3 GB of RAM in the positions I ran. Here is a snippet from the front of the log file:

hash table memory = 768M bytes.
pawn hash table memory = 64M bytes.
show book statistics
don't display ply-1 moves as they are searched.
resign after 5 consecutive moves with score < -9.
noise level set to 4000000.

Crafty v22.9 (1 cpus)

White(1): hash=4096M
hash table memory = 3072M bytes.
White(1): mt=8
max threads set to 8
White(1): st=120
search time set to 120.00.

.....

which included all of this:

time surplus 0.00 time limit 2:00 (+0.00) (2:00)
depth time score variation (1)
starting thread 1
starting thread 2
starting thread 3
starting thread 4
starting thread 5
starting thread 6
starting thread 7
bad move from hash table, ply=36
bad move from hash table, ply=36
bad move from hash table, ply=36
bad move from hash table, ply=36
bad move from hash table, ply=35
bad move from hash table, ply=34
bad move from hash table, ply=35
bad move from hash table, ply=41
bad move from hash table, ply=41
bad move from hash table, ply=40
bad move from hash table, ply=42
bad move from hash table, ply=42
bad move from hash table, ply=40
bad move from hash table, ply=40


So what, exactly, are you talking about? 3 GB is "tiny"?


Where to start?

You're measuring bugs in crafty - not anything else.
Where to start? You don't have a clue. There are ZERO bugs in that particular part of Crafty.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

syzygy wrote:
diep wrote:Where to start?
Maybe with your initial remark that started this whole subthread:
diep wrote:In case Ed has his hashtable aligned and a multiple of it forms a cacheline, you can prove it cannot happen on a PC.
By now you have admitted that:
- it certainly can happen on PC hardware; and
- it will happen quite often if the programmer does not take special precautions (see crafty 22.9).

I don't think anybody has said that this problem is very serious. Firstly, an erroneous probe here and there normally does not affect the outcome of the search. As long as the program does not crash on illegal moves, things are fine. Secondly, it is possible to eliminate such errors in a relatively cheap way using the xor-trick.
Just for the record, my hash IS aligned. I use four entries per bucket, each entry = 16 bytes, so a bucket is exactly one cache line. Each bucket is forced to start on a 64-byte boundary as well. This is another red herring from Vincent that is completely irrelevant.
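
A bucket laid out that way might look roughly like this (a sketch under those stated sizes, not Crafty's actual declarations):

Code: Select all

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t word1;          /* data word                    */
    uint64_t word2;          /* signature (xored with word1) */
} HashEntry;                 /* 16 bytes                     */

typedef struct {
    HashEntry entry[4];      /* 4 x 16 = 64 bytes = one cache line */
} HashBucket;

/* allocate the table so that every bucket starts on a 64-byte boundary;
   C11 aligned_alloc requires the size to be a multiple of the alignment,
   which sizeof(HashBucket) == 64 guarantees here */
HashBucket *alloc_table(size_t num_buckets) {
    return (HashBucket *) aligned_alloc(64, num_buckets * sizeof(HashBucket));
}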
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
syzygy wrote:
diep wrote:Where to start?
Maybe with your initial remark that started this whole subthread:
diep wrote:In case Ed has his hashtable aligned and a multiple of it forms a cacheline, you can prove it cannot happen on a PC.
By now you have admitted that:
- it certainly can happen on PC hardware; and
- it will happen quite often if the programmer does not take special precautions (see crafty 22.9).

I don't think anybody has said that this problem is very serious. Firstly, an erroneous probe here and there normally does not affect the outcome of the search. As long as the program does not crash on illegal moves, things are fine. Secondly, it is possible to eliminate such errors in a relatively cheap way using the xor-trick.
Why do you post this blabla nonsense?

Bob writes 8 bytes somewhere into the hashtable in his hashprobe.
And he does this every time the key matches. I would guess that's 10%+ of the cases...

So that's not his hashstore.

It's obvious he has no communication whatsoever with whoever improved Crafty 23.x, as I see that in 23.x they have fixed all this and write 16 bytes, not 8 bytes.
I am the ONLY person who has written any of Crafty's hash probing code. And for the record, here is that SAME update from 23.6, just as it has been in 23.5, 23.4, etc...

Code: Select all

  word1 = (word1 & 0x007fffffffffffffull) | ((uint64_t) transposition_age << 55);
  htable->word1 = word1;
  htable->word2 = word1 ^ word2;

So it is still there, and has been for years.

What I did some years ago is measure the number of collisions and write errors under different conditions, and I noticed that with 64 bits I had more collisions than write errors.

When I used somewhere in the 70+ bits, I could no longer measure any collisions happening.

Note that I use a 128-bit hashkey: 64 bits of it index into the hashtable, and part of the other 64 bits I store into the hashtable.
Note that you are nuts. Going beyond 64 bits gives nothing but overhead... # of collisions with 64 bits is vanishingly small, and always has been.




All collisions were gone back then.

The AbCD scenario won't ever happen in our lifetime, however.

We have already seen cases here where Uri Blass corrected Bob on how Crafty nowadays does extensions. Bob obviously has no clue what happened in the fixes to his hashtable either.
What are you talking about here? I wrote the code. I certainly know what it does. And it is there for everyone to plainly see, except, apparently, for you...


There is so much that has been modified/fixed in Crafty, and Bob is totally clueless there.

Who has been improving Crafty?

As is obvious from the code, it's more than one person, and none of those persons is Bob. In fact it's totally trivial to see from the past few years, after 2007, that Bob doesn't maintain the source code of Crafty...
Do you have any idea how ridiculous you look here? I have ALWAYS been the person that maintains the source code. Tracy has done quite a bit of evaluation modification. That is ALL he has modified.

Note that if your hashtable entry is aligned on the cache line, you can never have a problem like AbCD, provided of course that you write 16 bytes and not just a few bytes to the hashtable. In the Crafty case of writing 8 bytes which do not contain any hashkey, it's totally trivial that things go wrong.
If you had looked, you'd notice that my hash bucket IS aligned on a 64-byte boundary, and the four entries I probe ARE in one cache line. Why not look before posting nonsense??




A dude can mess up in ways that not even a hundred wise men can fix, of course.

Please note that there is also radiation coming from outer space; that's why most supercomputers use ECC. This radiation can cause bit flips.

So if you are patient enough and wait forever, obviously everything *can* get messed up.

Yet we are dealing with statistical chances here. I've said enough on this subject.

The only one who, it appears, measured this in a normal manner is me.

Why don't you just test things yourself instead of posting endless posts here without any logic? You talk, yet you test nothing.
I tested everything and posted the results. You have posted ZERO results.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
wgarvin wrote:Well, that was sure an entertaining thread to read!
All the sensible people on one side of the argument, and Vincent on the other.
...No prizes for guessing which side is smoking crack.

Storing 16 bytes with two or more instructions is never atomic: you have no control over when the writes reach the cache, main memory, or other cores. Cache misses, paging, or thread switching can make the gap between two reads or two writes arbitrarily long.

Even if it's a single 16-byte write done with an SSE instruction, even one aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing and completely solves the problem.
Don't BS over here. I already posted a reply here weeks ago adding up the probabilities.

You'll see a problem can then only occur on 1 in 10^20 occasions.

A write error of this kind, the Abcd type - the one without consequences, though you guys seem to ignore what I wrote there - happens once in every 10^10.

So I guess most here don't even realize what I mean by AbCD: a bad error that CAN have consequences, which however has odds of 1 in 10^20.
Yet because most guys over here are not reading very well, they seem to ignore all this.

Just add up the probabilities. I already posted this with an example calculation.

http://www.talkchess.com/forum/viewtopi ... 906150f9f9

I guess, however, that you guys who didn't measure at all - if you smoke crack - are not very good at working out the statistical chance that something can occur.

The way to measure this is very simple. Make a small 8-bit CRC of your entry prior to storing, then store all the bytes (so don't put bunches of instructions between each store).

Now you can simply check the CRC at each READ of an entry and record every write error you detect.

Is it so hard to perform that test?

You can measure all this easily - which is what I did.

So who's talking BS here?

Would you like me to give you the names of a couple of engineers at Intel who can explain the problem to you so that you will understand it? The problem is real. Your math is imaginary. The data I posted is real. The problem was real on the Cray. It was real on the Alpha. It is real on AMD/Intel as well.

Why you don't grasp that is beyond me. Study how cache lines bounce from CPU to CPU with MOESI (AMD) and MESIF (Intel). It is not THAT hard to understand this problem. There are no 128-bit atomic stores.
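
The failure mode is easy to make visible. Here is a deliberately racy little stress test (purely illustrative, nothing from Crafty): two threads store matched 8-byte pairs into the same 16-byte slot with two separate stores, and a reader counts how often it observes a mixed pair.

Code: Select all

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* one 16-byte "entry", written as two separate 8-byte stores */
typedef struct { volatile uint64_t word1, word2; } Slot;
static Slot slot;

static void *writer(void *arg) {
    int flip = (int) (intptr_t) arg;
    for (uint64_t i = 1; i <= 50000000ULL; i++) {
        uint64_t v = flip ? ~i : i;   /* each writer stores a matched pair (v, v) */
        slot.word1 = v;               /* nothing makes these two stores, or the  */
        slot.word2 = v;               /* reads below, one atomic 16-byte access  */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    unsigned long torn = 0;
    pthread_create(&t1, NULL, writer, (void *) (intptr_t) 0);
    pthread_create(&t2, NULL, writer, (void *) (intptr_t) 1);
    for (unsigned long i = 0; i < 100000000UL; i++) {
        uint64_t a = slot.word1;
        uint64_t b = slot.word2;
        if (a != b)
            torn++;                   /* word1 and word2 came from different stores */
    }
    printf("torn reads observed: %lu\n", torn);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}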
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

bob wrote:
diep wrote:
wgarvin wrote:Well, that was sure an entertaining thread to read!
All the sensible people on one side of the argument, and Vincent on the other.
...No prizes for guessing which side is smoking crack.

Storing 16 bytes with two or more instructions is never atomic: you have no control over when the writes reach the cache, main memory, or other cores. Cache misses, paging, or thread switching can make the gap between two reads or two writes arbitrarily long.

Even if it's a single 16-byte write done with an SSE instruction, even one aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing and completely solves the problem.
Don't BS over here. I already posted a reply here weeks ago adding up the probabilities.

You'll see a problem can then only occur on 1 in 10^20 occasions.

A write error of this kind, the Abcd type - the one without consequences, though you guys seem to ignore what I wrote there - happens once in every 10^10.

So I guess most here don't even realize what I mean by AbCD: a bad error that CAN have consequences, which however has odds of 1 in 10^20.
Yet because most guys over here are not reading very well, they seem to ignore all this.

Just add up the probabilities. I already posted this with an example calculation.

http://www.talkchess.com/forum/viewtopi ... 906150f9f9

I guess, however, that you guys who didn't measure at all - if you smoke crack - are not very good at working out the statistical chance that something can occur.

The way to measure this is very simple. Make a small 8-bit CRC of your entry prior to storing, then store all the bytes (so don't put bunches of instructions between each store).

Now you can simply check the CRC at each READ of an entry and record every write error you detect.

Is it so hard to perform that test?

You can measure all this easily - which is what I did.

So who's talking BS here?

Would you like me to give you the names of a couple of engineers at Intel who can explain the problem to you so that you will understand it? The problem is real. Your math is imaginary. The data I posted is real. The problem was real on the Cray. It was real on the Alpha. It is real on AMD/Intel as well.

Why you don't grasp that is beyond me. Study how cache lines bounce from CPU to CPU with MOESI (AMD) and MESIF (Intel). It is not THAT hard to understand this problem. There are no 128-bit atomic stores.
BTW, if I comment out that line that just updates the age on a probe match, here's what the output now looks like:

starting thread 5
starting thread 6
starting thread 7
44 1.55 -3 1. Kb1
bad move from hash table, ply=41
44 1.82 -M 1. Kb1
bad move from hash table, ply=49
44 2.99 0.01 1. Kb1 Kb7 2. Kb2 Ka8 3. Kc2 Kb8 4.
Kd3 Kc7 5. Ke2 Kd7 6. Kf2 Ke8 7. Kg3
Kf7 8. Kh4 Kg6 9. Kh3 Kf6 10. Kh4 Kg6
44-> 3.00 0.01 1. Kb1 Kb7 2. Kb2 Ka8 3. Kc2 Kb8 4.
Kd3 Kc7 5. Ke2 Kd7 6. Kf2 Ke8 7. Kg3
Kf7 8. Kh4 Kg6 9. Kh3 Kf6 10. Kh4 Kg6
45 3.45 +1 1. Kb1!!
45 3.61 +3 1. Kb1!!
bad move from hash table, ply=41
45 3.97 +M 1. Kb1!!
45 4.35 6.74 1. Kb1 Kb7 2. Kc1 Kc7 3. Kd1 Kd8 4.
Kc2 Kc7 5. Kd3 <HT>
45-> 4.35 6.74 1. Kb1 Kb7 2. Kc1 Kc7 3. Kd1 Kd8 4.
Kc2 Kc7 5. Kd3 <HT>
46 4.68 +1 1. Kb1!!
46 5.91 +3 1. Kb1!!
bad move from hash table, ply=45
bad move from hash table, ply=47
bad move from hash table, ply=43
bad move from hash table, ply=45
bad move from hash table, ply=44
bad move from hash table, ply=42
46 7.67 +M 1. Kb1!!
bad move from hash table, ply=44
bad move from hash table, ply=42
bad move from hash table, ply=42
bad move from hash table, ply=44
bad move from hash table, ply=44
bad move from hash table, ply=42
bad move from hash table, ply=44
bad move from hash table, ply=44
bad move from hash table, ply=50
bad move from hash table, ply=48
bad move from hash table, ply=48
bad move from hash table, ply=46
bad move from hash table, ply=46
bad move from hash table, ply=46
bad move from hash table, ply=44
bad move from hash table, ply=42
bad move from hash table, ply=42
bad move from hash table, ply=46
bad move from hash table, ply=46
bad move from hash table, ply=46
bad move from hash table, ply=46
b

As I said, that is NOT the problem. Splitting the two stores is the problem, and there is absolutely nothing that prevents the two stores from being done separately by the CPU. And THAT leads to trouble, as this simple test STILL shows.

But keep waving your hands. When you want to talk to an engineer at Intel, I'll be happy to put you in touch with a couple whom I happen to know there, since you refuse to read their docs, which explain this problem quite clearly.