Speaking of the hash table

Discussion of chess software programming and technical issues.

Moderator: Ras

Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Speaking of the hash table

Post by Michel »

syzygy wrote: Also there are quite precise estimates for the error term pi(x) - (x / log(x)).
Actually, the error terms are for pi(x) - li(x), as x/log(x) is only a crude approximation of li(x) (although it is asymptotically correct).

A bound of O(sqrt(x) log(x)) on this error term is equivalent to the Riemann Hypothesis.

So you can earn a million dollars by counting primes!

http://www.claymath.org/millennium/
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

Michel wrote:
syzygy wrote: Also there are quite precise estimates for the error term pi(x) - (x / log(x)).
Actually, the error terms are for pi(x) - li(x), as x/log(x) is only a crude approximation of li(x) (although it is asymptotically correct).

A bound of O(sqrt(x) log(x)) on this error term is equivalent to the Riemann Hypothesis.

So you can earn a million dollars by counting primes!

http://www.claymath.org/millennium/
The prizes for factoring RSA numbers have been withdrawn, however!
Though it was just $20k or so; RSA Labs withdrew all of that years ago!

The weird thing is that for generating huge primes, for which proving primality already has a very good running time (see the Lucas-Lehmer test), you can get $250k for the next prime over 100k digits.

What's more interesting to society, however, are the 'industry-grade primes' (PRPs).

In that sense math is f'ed up as well: you can get some cash only for a few weird exceptions, while what's actually interesting for society you can make no money with at all!
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: Speaking of the hash table

Post by Rein Halbersma »

syzygy wrote:
AlvaroBegue wrote:
diep wrote: Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.

With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64 - 29) = 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
Could you explain this in some more detail? A 2GB hash table = 2^31 bytes with 2^4 bytes per entry -> 2^27 entries. So how does this translate to 2^4 buckets and the 29 in the exponent of 2^(64-29)? Is your hash table laid out as overlapping buckets?
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

Rein Halbersma wrote:
syzygy wrote:
AlvaroBegue wrote:
diep wrote: Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.

With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64 - 29) = 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
Could you explain this in some more detail? A 2GB hash table = 2^31 bytes with 2^4 bytes per entry -> 2^27 entries. So how does this translate to 2^4 buckets and the 29 in the exponent of 2^(64-29)? Is your hash table laid out as overlapping buckets?
Oops, I messed up after all.

Indeed 2^27 entries, so 2^25 buckets. This means you "lose" 25 bits of uniqueness by picking a bucket. The bucket is filled with 4 = 2^2 positions, each giving you a chance of a false collision, which effectively loses another 2 bits. So 29 should have been 25+2=27, which means that the chance of a collision is 1 in every 2^(64-27) = 137.4 billion nodes. Even closer to Vincent's number.

It's easier to just ignore the buckets: if the hash table stores 2 ^ 27 positions, then a probe (for a position not stored in the hash table) will cause a collision if and only if by chance one of the 2 ^ 27 positions has a hash key identical to the hash key of the position being probed for. The probability is therefore 2^27 / 2^64 (since by construction all positions stored have a different hash key).

Bucket size can make a difference if you don't store the full key. Clearly you don't gain anything by storing the bits used for indexing the bucket, since those bits are the same for all entries of the bucket. So with a bucket size of 1 and 2 ^ 27 buckets, you only need to store 64-27 bits of the hash key to have a probability of 1 in 137.4 billion. With a bucket size of 4 and 2 ^ 25 buckets, you need to store 64-25 bits of the hash key to have the same probability of 1 in 137.4 billion.

My engine only stores 32 bits of the hash key (from the part not used for indexing the bucket). Since my buckets have 4 entries, this increases the probability of collisions by a factor of 4 compared to having buckets of 1 entry each.
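
To make the bucket and partial-key bookkeeping concrete, here is a minimal C sketch of such a scheme; the layout, field names, and sizes are illustrative assumptions rather than any particular engine's actual code.

Code: Select all

#include <stdint.h>
#include <stdlib.h>

#define BUCKET_BITS 25                      /* 2^25 buckets of 4 entries = 2^27 entries */
#define NUM_BUCKETS (1ULL << BUCKET_BITS)

typedef struct {
    uint32_t sig;      /* 32 stored bits of the 64-bit key (not the index bits) */
    uint16_t move;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;
    uint16_t pad;
} Entry;               /* illustrative 12-byte payload */

typedef struct {
    Entry slot[4];     /* bucket of 4 entries */
} Bucket;

static Bucket *table;  /* NUM_BUCKETS buckets, allocated in tt_init() */

static int tt_init(void)
{
    table = calloc(NUM_BUCKETS, sizeof(Bucket));
    return table != NULL;
}

static Entry *tt_probe(uint64_t key)
{
    Bucket  *b   = &table[key & (NUM_BUCKETS - 1)];  /* low 25 bits pick the bucket */
    uint32_t sig = (uint32_t)(key >> 32);            /* 32 of the remaining bits are stored */

    for (int i = 0; i < 4; i++)
        if (b->slot[i].sig == sig)
            return &b->slot[i];
    /* With 32 stored bits and 4 entries per bucket, a probe for a position
       that is not in the table falsely matches with probability about
       4 / 2^32, i.e. 4x the rate of a 1-entry bucket storing the same bits. */
    return NULL;
}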
User avatar
hgm
Posts: 28381
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Speaking of the hash table

Post by hgm »

Indeed, I also always use 32 bits. A collision rate of 1 in a billion is small enough for me. Smaller entries mean more entries, and more entries per bucket (so possibilities for more selective replacement), and this is a sure winner compared to driving the collision rate down further. In Joker I have seven 9-byte entries per bucket (4-byte signature, 2-byte move + flags, 2-byte score, 1-byte depth).
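
A minimal C sketch of a packed entry and bucket along these lines; the field names are hypothetical, and the assumption that a filler byte rounds seven 9-byte entries up to a 64-byte cache line is an illustration, not a quote of Joker's source.

Code: Select all

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t signature;   /* 4-byte signature: 32 bits of the hash key */
    uint16_t moveFlags;   /* 2 bytes: move plus bound/age flags */
    int16_t  score;       /* 2-byte score */
    uint8_t  depth;       /* 1-byte draft */
} PackedEntry;            /* 9 bytes per entry */

typedef struct {
    PackedEntry entry[7]; /* seven 9-byte entries = 63 bytes */
    uint8_t     filler;   /* pads the bucket to 64 bytes */
} PackedBucket;
#pragma pack(pop)

_Static_assert(sizeof(PackedBucket) == 64, "one bucket per 64-byte cache line");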

Needless to say, I did make sure that Joker never crashes on an illegal move. When it makes/unmakes a castling, it also saves and restores the piece captured by the Rook. Just in case...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
bob wrote:
diep wrote:
bob wrote:
diep wrote:
zamar wrote: Multithreading. The typical hash-table entry size is 128 bits.

Thread1.write(64bits)
Thread2.write(64bits)
Thread2.write(64bits)
Thread1.write(64bits)

leads to corrupted entry
First of all, this can only happen when you have a parallel-searching engine.

Did Rebel already get SMP?

Secondly, this happens even more rarely than collisions on PC hardware, as it can only happen with stores :)

Around 1 in 200 billion stores :)

In case Ed has his hash table aligned and a whole number of entries forms a cache line, you can prove it cannot happen on a PC.

Note that an i7 (not Sandy Bridge) presents itself as having a 64-byte cache line; however, the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.

It has 4 memory channels.

Note that when reading it can abort early and read fewer bytes, which is why its latency to the RAM is so low for us.

It can abort after reading 32 bytes.

So the only way it can go wrong is when an entry is at the end of a cache line (in this case 256 bytes) and by accident one cache line gets updated more quickly than another.

The odds of this happening are really, really tiny, especially if your hash table is some gigabytes in size.

So the way you represented it, it cannot happen, as it doesn't write 64 bits; it writes 256 bytes at once.
It can happen far more frequently than that. This has been discussed in the past. With 8 threads (cores) writing pairs of 64-bit words, the ordering gets mangled and you can store a position where the first 8 bytes are from position P1 and the second 8 from position P2, leading to this. I've seen dozens per minute in some positions. With the lockless hashing I use, however, it never happens.
Bob, you're making this story up.

You did not see dozens of write errors a minute.

I've been testing this extensively, and you need dozens of billions of stores to get one.

Of course I have ECC machines here. Maybe you had a bad DIMM during your tests, or your memory serves you badly again?

Also realize the Alpha is an HPC processor. Today we only have x64 CPUs, which is a different architecture.

Between the Alpha and the memory controller there are several steps to connect to other memory. Those steps can sometimes cause the AbCD, which is not possible on x86/x64 hardware as they connect directly to the RAM.

Even then, this happens VERY SELDOM on the Alpha.

At the time I was in contact with the technical design teams of memory controllers to verify how often all this could occur on different processors, among them the Alpha and the R14000, and each time I got the same answers from the engineers.

Even then, on x86 I've tested this extensively, and if I got 1 write error or 1 or 2 collisions, that was a lot in every single test.

This testing ran for months in 2003.

So of all the people here, it seems I'm the only one who really measured the number of write errors and the number of collisions.

P.S. Note that Alphas were even used for serving storage until 2007, at the petabyte level; that's where I worked at the time.
I didn't make ANYTHING up. Please feel free to download Crafty 22.9, which does NOT use lockless hashing at all. Compile it with -DCPUS=8, and then simply run it. Look in next.c for the "bad move from hash" and remove the #if test that disables those two lines. Run with 8 cores and you will see dozens of those errors per minute, up to tens of thousands if you use Fine 70...
I don't know why you're posting this nonsense, Bob.

I've already tested this extensively with a lockless hash table. With a CRC you can measure this far more easily than by completely messing up the bus with nonstop locking.

In Diep there are more collisions than write errors with a 64-bit hash key!

So you're just making this story up.

Note that your normal Crafty is using a lockless hash table! Don't bring in other idiotic factors you're not using in practice! Changing the discussion just to 'save face' is what they do in Asia, Bob; over here you're busted.

Long before you wrote/posted about Tim Mann inventing the XOR manner of doing things lockless, others were already using an 8-bit CRC...
Making what up? I told you the version of Crafty to use: 22.9. I told you how to run it and see the "bad move from hash" errors. You will see that in version 23.0 the lockless hashing was added back in, and the error completely goes away. The broken writes still happen, but now the signatures don't match and those positions become "hash sigs don't match, ignore" cases...

I had this problem on the first multiple-CPU Cray I used. It got progressively worse as we moved on to more and more CPUs. When we hit the XMP with 4 CPUs, we had to stop and find/fix the problem, because at the time we did not know whether it was a serious programming bug. Once we found the problem, we simply used an atomic semaphore on the Cray (very fast) to prevent two threads from accessing the hash table at exactly the same time. When we hit 16 CPUs on the T90, we had to revisit the fix because the single semaphore was becoming a bottleneck; that's when Harry suggested the "xor trick" that is now called "lockless hashing". It is not imaginary. The path through the search doesn't contain THAT many instructions, I'd guess 2500 or so per node based on old measurements. That means the probability of two cores executing the reads/writes at the same time is not vanishingly small, particularly since most hash-table accesses are cache misses, and that tends to "stack" things up at the instructions reading/writing the same entry, improving the odds quite a bit.

It happens. It is not imagination. It has been a known problem for me since 1984, when we first tested on the 4-cpu Cray XMP/4...

I agree that I DO use lockless hashing today. Do you know why? To address the SPECIFIC problem we are talking about. If the problem didn't exist, then lockless hashing would be completely irrelevant in the first place. I pointed you to an available version of Crafty that had the lockless hashing removed. It was removed as I rewrote lots of code in the 22.x versions. I forgot to add it back until I started using 22.9 extensively and, while testing with the -DDEBUG flag set, noticed a lot of "bad move from hash table" errors that do not show up in non-DEBUG mode. You can compare the relevant parts of 22.9 and 23.0 (hash.c) to see that this was the only change...

And then you can verify that 22.9 sees a ton of broken writes. Your claim of 1 in 10^20 is pure nonsense. Crafty uses about 2500 instructions to search a node. In those 2500 instructions you find two points of hash access, once to look up a position and once to store a position: two pairs of instructions in 2500. The probability of two threads executing the same pair of instructions at the same time is roughly (4 / 2500) ^ 2. With 8 threads, multiply by 8. Not big, but not small. Then factor in that when the first thread accesses the SAME hash bucket the second is going to access, the first thread blocks for several thousand clocks while the cache block is filled. If the second thread accesses the same block, it sits and waits, even if it accesses it several hundred instructions AFTER the first one. The first is still waiting on memory, as will be the second. Now they are aligned, and we are ready for a broken write if ANYTHING happens on either of the CPUs: an interrupt, a pipeline stall, you name it.
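
Evaluating that back-of-the-envelope estimate numerically (a sketch of the arithmetic only, not a measured rate):

Code: Select all

#include <stdio.h>

int main(void)
{
    /* two threads hitting the same pair of hash-access instructions */
    double per_pair = (4.0 / 2500.0) * (4.0 / 2500.0);
    double per_node = 8.0 * per_pair;                    /* 8 threads */
    printf("%.2e per node -> about 1 in %.0f nodes\n", per_node, 1.0 / per_node);
    /* ~2.0e-05 per node, i.e. roughly one potential overlap per 50,000
       nodes, before the cache-miss "stacking" effect makes it more likely. */
    return 0;
}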

This is not THAT hard to understand. And it is quite easy to verify...

As far as your Alpha stuff goes, it appears to me you have not run on an Alpha. They are NOTORIOUS for out-of-order memory writes, something x86 guarantees will not happen. They added the memory-barrier instruction to address this on the Alpha, so that you could force the order when it is critical. For example, acquiring a lock, storing the TT entry, then clearing the lock is dangerous on the Alpha, because the write that zeros the lock can be done BEFORE the writes that actually store the TT entry... One puts a memory barrier after the TT write, then clears the lock, and all is well...
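
For reference, a minimal sketch of the XOR "lockless hashing" idea described above; the entry layout and names are illustrative, not Crafty's actual code.

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t key;    /* stores hashkey ^ data, not the raw key */
    uint64_t data;   /* move, score, depth, bound packed into 64 bits */
} TTEntry;

/* Store with the xor trick: each 64-bit store is atomic on x86-64, and if
   two threads interleave their pairs of stores, the mismatch is caught on
   probe instead of producing a bogus "matching" entry. */
static void lockless_store(TTEntry *e, uint64_t key, uint64_t data)
{
    e->key  = key ^ data;
    e->data = data;
}

/* Probe: a torn entry (key half from one position, data half from another)
   fails the xor check and is simply treated as a miss. */
static int lockless_probe(const TTEntry *e, uint64_t key, uint64_t *data)
{
    uint64_t d = e->data;
    if ((e->key ^ d) != key)
        return 0;   /* empty slot, different position, or broken write */
    *data = d;
    return 1;
}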
Last edited by bob on Tue Dec 11, 2012 9:36 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
Rebel wrote: Now that we are discussing the TT, I have a question about a phenomenon I never understood --> illegal moves in the TT. How is this possible?

I have them, not many, less than 0.1 per mille of the hash hits.

It's not because of collisions; I can test with 48- and 64-bit keys and the percentage remains the same.
Another possibility: a DIMM is broken, one that doesn't get used by the OS itself but is where Rebel allocates its hash table.
With ECC DRAM???

Unlikely...
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Speaking of the hash table

Post by wgarvin »

Well, that was sure an entertaining thread to read!
All the sensible people on one side of the argument, and Vincent on the other.
...No prizes for guessing which side is smoking crack.

Storing 16 bytes with two or more instructions is never atomic; you have no control over when the writes reach the cache or main memory, or when they become visible to other cores. Cache misses, paging, or thread switching can make the gap between two reads or two writes arbitrarily long.

Even if it's a single 16-byte write done with an SSE instruction, even aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing, and completely solves the problem.
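
A small test in this spirit, as a sketch (iteration counts and key constants are arbitrary): two threads hammer one two-word entry with plain 64-bit stores and loads; each thread always writes a consistent pair, so any read where the two halves disagree is a torn entry.

Code: Select all

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { volatile uint64_t key, data; } Entry;

static Entry slot;   /* one shared "hash entry", written without any locking */

/* Each thread repeatedly stores a consistent pair (key, ~key) with two
   plain 64-bit stores, then reads the pair back with two plain loads.
   Any read where data != ~key mixes halves from different writes. */
static void *worker(void *arg)
{
    uint64_t base = (uint64_t)(uintptr_t)arg;
    long torn = 0;
    for (uint64_t i = 0; i < 50000000ULL; i++) {
        uint64_t k = base + i;
        slot.key  = k;
        slot.data = ~k;
        uint64_t rk = slot.key;
        uint64_t rd = slot.data;
        if (rd != ~rk)
            torn++;
    }
    return (void *)(uintptr_t)torn;
}

int main(void)
{
    pthread_t t[2];
    void *res;
    long total = 0;

    pthread_create(&t[0], NULL, worker, (void *)(uintptr_t)0x1111111100000000ULL);
    pthread_create(&t[1], NULL, worker, (void *)(uintptr_t)0x2222222200000000ULL);
    for (int i = 0; i < 2; i++) {
        pthread_join(t[i], &res);
        total += (long)(uintptr_t)res;
    }
    printf("torn entries observed: %ld\n", total);
    return 0;
}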
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Speaking of the hash table

Post by Gerd Isenberg »

wgarvin wrote: ...
Even if it's a single 16-byte write done with an SSE instruction, even aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing, and completely solves the problem.
In the context of SSE 16-byte accesses, the following link provides some sample code and discussion.

http://stackoverflow.com/questions/7646 ... ory-access
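
For illustration, a minimal sketch of the kind of aligned 16-byte SSE access under discussion (a sketch of my own, not the code from the link); note that even this single-instruction access is not guaranteed atomic by the x86 architecture.

Code: Select all

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

typedef struct {
    _Alignas(16) uint64_t word[2];   /* one 16-byte hash entry */
} Entry16;

/* Single aligned 16-byte store: one instruction, but still not
   architecturally guaranteed to be atomic. */
static void entry_store(Entry16 *e, uint64_t key, uint64_t data)
{
    __m128i v = _mm_set_epi64x((long long)data, (long long)key); /* high, low */
    _mm_store_si128((__m128i *)e->word, v);
}

/* Single aligned 16-byte load, same caveat. */
static void entry_load(const Entry16 *e, uint64_t *key, uint64_t *data)
{
    __m128i v = _mm_load_si128((const __m128i *)e->word);
    *key  = (uint64_t)_mm_cvtsi128_si64(v);                        /* low 64 bits */
    *data = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v)); /* high 64 bits */
}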
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

Gerd Isenberg wrote:
wgarvin wrote: ...
Even if it's a single 16-byte write done with an SSE instruction, even aligned on a 16-byte boundary, it might not be atomic. Reads of 16 bytes with two or more instructions are obviously also not atomic, and even a 16-byte read done with a single SSE instruction, aligned on a 16-byte boundary, might not be atomic. L1 cache and write-combining buffers do not magically make wide accesses atomic; as Ronald (syzygy) has repeatedly pointed out, some other core could RFO the cache line at any time.

Unless you always read AND write the entire entry with one atomic operation, you are pretty much guaranteed to get screwed by race conditions. Probably sooner rather than later.

Fortunately, as hgm and others have noted, Bob's lockless hashing trick costs almost nothing, and completely solves the problem.
In the context of SSE 16-byte accesses, the following link provides some sample code and discussion.

http://stackoverflow.com/questions/7646 ... ory-access
That seems to fit exactly with the testing I have done in the past, although I did not try the MMX / SSE instructions... Crafty 22.9 clearly shows there's a problem that is easy enough to solve without resorting to locking...