Hash cache

hgm · Post by **hgm** » Mon Oct 12, 2015 7:58 pm

jdart wrote:With NUMA, local caches don't necessarily stay local, since the thread scheduler can migrate the thread to another CPU. You can avoid this by pinning theads to CPUs. But if you don't do this you may find more caching does not help as much as you might think.

They won't do that every micro-second, and when they only do it every 10msec time slice, the overhead is negligible. But if there aren't more threads to run then the CPU has cores, migration of threads should be relatively rare. If not, the OS would really suck.

hgm · Post by **hgm** » Mon Oct 12, 2015 8:05 pm

Joost Buijs wrote:A key of 32 bits is not going to work, even when you add the number of bits from your index this is way to unreliable.

You would have one false hit every 4 billion probes. That doesn't really sound like "not going to work". At 1 Mnps you could play for an hour, and chances are pretty good it wouldn't have happened even once.

You are aware that Stockfish only uses a 16-bit signature? Yet I would hesitate to call that an engine that doesn't work...

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 8:08 pm

mvk wrote:A depth0-only cache-local mini table can be indexed by hash^alpha and store just the returned score, as long as you don't probe it in PV. No need to even clear it. That is how I do it. The only thing to confirm is that your zobrist scheme doesn't collide between ^alpha vs ^(-(alpha+1)) == ^~alpha, for example in the side-to-move term.

This is something I have to think about, I don't get it immediately, I'm not that flexible anymore.

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 8:12 pm

hgm wrote:
Joost Buijs wrote:A key of 32 bits is not going to work, even when you add the number of bits from your index this is way to unreliable.
You would have one false hit every 4 billion probes. That doesn't really sound like "not going to work". At 1 Mnps you could play for an hour, and chances are pretty good it wouldn't have happened even once.

You are aware that Stockfish only uses a 16-bit signature? Yet I would hesitate to call that an engine that doesn't work...

I have no idea, but according to the birthday paradox a false probe can occur once at the square root of the number of different keys, so once in the 65536 probes.

kbhearn · Post by **kbhearn** » Mon Oct 12, 2015 8:37 pm

Joost Buijs wrote:
hgm wrote:
Joost Buijs wrote:A key of 32 bits is not going to work, even when you add the number of bits from your index this is way to unreliable.
You would have one false hit every 4 billion probes. That doesn't really sound like "not going to work". At 1 Mnps you could play for an hour, and chances are pretty good it wouldn't have happened even once.

You are aware that Stockfish only uses a 16-bit signature? Yet I would hesitate to call that an engine that doesn't work...
I have no idea, but according to the birthday paradox a false probe can occur once at the square root of the number of different keys, so once in the 65536 probes.

Incorrect usage of the birthday paradox at fault here. Assuming the 32 bits are not part of the key used for indexing, you have only (# of buckets) / 2^32 chance of a false hit because the previous 60k positions that shared that index are no longer stored and thus can't collide. That said if SF really only uses 16 bits then they do have a pretty high false-positive rate (but i thought they'd pulled that patch out).

Joost Buijs · Post by **Joost Buijs** » Mon Oct 12, 2015 8:50 pm

kbhearn wrote:
Joost Buijs wrote:
hgm wrote:
Joost Buijs wrote:A key of 32 bits is not going to work, even when you add the number of bits from your index this is way to unreliable.
You would have one false hit every 4 billion probes. That doesn't really sound like "not going to work". At 1 Mnps you could play for an hour, and chances are pretty good it wouldn't have happened even once.

You are aware that Stockfish only uses a 16-bit signature? Yet I would hesitate to call that an engine that doesn't work...
I have no idea, but according to the birthday paradox a false probe can occur once at the square root of the number of different keys, so once in the 65536 probes.
Incorrect usage of the birthday paradox at fault here. Assuming the 32 bits are not part of the key used for indexing, you have only (# of buckets) / 2^32 chance of a false hit because the previous 60k positions that shared that index are no longer stored and thus can't collide. That said if SF really only uses 16 bits then they do have a pretty high false-positive rate (but i thought they'd pulled that patch out).

Like I said I have no idea, I never looked at probability theory with respect to this.
I can easily determine it empirically by decreasing my key width to 32 bits and see what percentage of false hits I get.

hgm · Post by **hgm** » Mon Oct 12, 2015 9:15 pm

Joost Buijs wrote:There are some stupid things in my engine, it always calls the evaluation function including a pawn-table probe after a domove when it is not a checking move.
That means the time spent for probing the pawn-table and evaluation is wasted in case of a TT hit.
There is enough room for improvement.

Well, I wondered about that too. Probes to the main TT are pretty expensive, as they come from DRAM. The Pawn hash probably resides permanently in L3, and most probes might actually hit L2, and the eval cache might be L2 too. So doing an evaluation might be pretty fast compared to a hash probe.

This means that if a reasonable fraction of the QS nodes does get a stand-pat cutoff, It might be favorable to test for that first, so that it saves you probing the TT when you stand pat.

jdart · Post by **jdart** » Mon Oct 12, 2015 9:53 pm

It is not the overhead of migration that is of concern, it is the fact that current OSs do not migrate the thread's memory.

--Jon

hgm · Post by **hgm** » Mon Oct 12, 2015 10:35 pm

The private CPU caches simply reloads from memory or the shared L3 when the thread access those memory location from the new CPU, not?

Joost Buijs · Post by **Joost Buijs** » Tue Oct 13, 2015 7:54 am

hgm wrote:
Joost Buijs wrote:There are some stupid things in my engine, it always calls the evaluation function including a pawn-table probe after a domove when it is not a checking move.
That means the time spent for probing the pawn-table and evaluation is wasted in case of a TT hit.
There is enough room for improvement.
Well, I wondered about that too. Probes to the main TT are pretty expensive, as they come from DRAM. The Pawn hash probably resides permanently in L3, and most probes might actually hit L2, and the eval cache might be L2 too. So doing an evaluation might be pretty fast compared to a hash probe.

This means that if a reasonable fraction of the QS nodes does get a stand-pat cutoff, It might be favorable to test for that first, so that it saves you probing the TT when you stand pat.

It is possible, I never measured it.
My evaluation function including the pawn table probe takes between 600 and 700 processor cycles, and I have the feeling that probing the main TT takes less.
But to be sure I will measure it today and let you know the result.

I don't think that any of this cache stuff or the size of a TT entry will have a big influence on the playing strength of an engine, at best a few Elo points.
The quality of the evaluation function is what matters, that is where you can gain hundreds of Elo points.
I left that speed criterion a long time ago.

Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache

Re: Hash cache