Where to enter/read position into hash table in perft?

mvanthoor · Post by **mvanthoor** » Sun Mar 29, 2020 3:27 am

Terje wrote: ↑Sun Mar 29, 2020 3:09 am There are bugs in VICE, but zobrist keys are not one of them. VICE has a separate en passant key for each square (even squares that can never be an en passant square).

You don't need a separate key for "no en passant" as "no key" can represent that. VICE does this.

I have watched this video again and actually read the code line-by-line:

At 6:40 minutes, the code for generating the EP-key is shown.

It uses an "empty" piece, which then has 64 random numbers. (So that's the 13th piece in the init-section somewhat earlier in the video, which I didn't know where he got it from.) I must have completely missed that detail and thus I misunderstood the video. To be honest, I never closely looked at Vice's code, either in the repo or in the video; I mostly listened to the explanation of the concepts.

My bad... it cost me three-quarters of a day of debugging and testing

mvanthoor · Post by **mvanthoor** » Sun Mar 29, 2020 2:25 pm

Zobrist updating in make_move() has been reinstated

Time for calculating Perft 7 dropped from 27 seconds to 18 seconds. Finding the leaf nodes for perft 7 rose from 119 million leaves/sec 170 million leaves/sec

Because I'm using iterative deepening for perft, big parts of the previous calculations don't need to be done anymore when doing the next depth, if the hash table is large enough. I'm having a go at Perft 9, but I think it's going to take some time yet. (Each iteration takes about 10 times the time the previous one did as far as I can see, so I estimate about 37 minutes for Perft 9...)

hgm · Post by **hgm** » Sun Mar 29, 2020 3:04 pm

That is not a very good improvement. In Qperft the time for perft(7) drops from 18.6 to 4.18 sec when I turn on hashing (1M entries in the TT), more than a factor 4 compared to your 33%. That slows to 4.52, 4.80, 5.24, 5.99 sec when I shrink the TT to 512K, 256K , 128K or 64K entries.

mvanthoor · Post by **mvanthoor** » Sun Mar 29, 2020 3:11 pm

Yes, but QPerft has a much more advanced move generator which takes pinned pieces and checks into account, and doesn't do legal move checking in make_move(); so it will already be much faster in perft. At some point, I'll add enhancements such as these.

With regard to hashing, I also don't do anything special. I just overwrite hash entries if there's a collision. Is there a better way to replace hash entries for perft?

hgm · Post by **hgm** » Sun Mar 29, 2020 3:42 pm

I was referring to the improvement you got from hashing, not to the absolute speed.

And indeed, you can do much better than 'always replace'. Both in alpha-beta search and perft results from deep searches are much more valuable than results from shallow searches. A simple replacement scheme is to organize the TT as 'buckets' containing a pair of entries, and have the hashing function only map a position on a bucket (rather than on an individual entry). You then overwrite the entry in that bucket that has the lowest depth ('shallowest of two'). You would always have to check both entries in the bucket for presence of the sought position, on a TT probe.

mvanthoor · Post by **mvanthoor** » Sun Mar 29, 2020 3:57 pm

Hm... so you basically make your hash table twice as big by having the capability to store two hashes in one spot?

So the hash table is not going to store a HashEntry, but a HashBucket, which then has two HashEntry's?

That would allow you to store more entries with less collisions. Does more buckets bring bigger improvemnt?

What I'm thinking about is:

You could have a hash table that has:
- 96 single entries
- Or has 48 entries with 2 buckets
- Or has 24 entries with 4 buckets
- Or even 12 entries with 8 buckets

In all cases the hash table stays the same size, and you start 'stacking' the entries more and more (but have to search more and more buckets). I imangine there being a tipping point somewhere.

I'll try the two-bucket approach first and read up on this some more. Thanks for the pointer

edit: I just tried (again) to only replace a hash entry with a new one if the depth of the new entry is higher than the one already in the table. The results become slower. After dinner this evening I'll implement the buckets and see what this brings, with 2 or even 3 or 4 buckets.

hgm · Post by **hgm** » Sun Mar 29, 2020 4:36 pm

mvanthoor wrote: ↑Sun Mar 29, 2020 3:57 pm Hm... so you basically make your hash table twice as big by having the capability to store two hashes in one spot?

Or keep it the same size and have half the number of spots. Ultimately you only care about the number of entries in your table, not how they are organized in buckets.

So the hash table is not going to store a HashEntry, but a HashBucket, which then has two HashEntry's?

That is how I usually do it. You could also force the hash function to always produce even numbers (by masking away the least significant bit), and use entry N and entry N+1 as if it were a bucket.

That would allow you to store more entries with less collisions.

That would always hold for larger tables, no matter how you organize them.

Does more buckets bring bigger improvemnt?

Yes, it could. Most people use 'shallowest of four'' replacement. One fact that dominates TT design is that modern computer hardware accesses memory in bunces of 64 bytes (a 'cache line'), and that such an access is very expensive (taking about the same time as executing a few hundred instructions!) So it would be detrimental to performance if you had to access more than one cache line for a TT probe. One thus tries to make sure buckets fit in a single cache line. Having 4 entries of 16 bytes each would satisfy that. But some engines use entries of 9 bytes, and cram 7 of those in a bucket. That gives them more entries for the same number of megabytes.

edit: I just tried (again) to only replace a hash entry with a new one if the depth of the new entry is higher than the one already in the table. The results become slower. After dinner this evening I'll implement the buckets and see what this brings, with 2 or even 3 or 4 buckets.

Yes, doing only 'depth-preferred' replacement is known to be a very poor method, even worse than 'always replace'. The point is that, even though very deep searches represent a lot of work, which would be saved when you would retrieve the result from the TT instead of actually doing the search, they will not be requested very frequently. And between store and probe, they are just sitting there, occupying a hash entry. That same entry could also have been used to store thousands of very shallow searches for a short time each, long enough to be successfully probed. As we much more frequently probe for shallow results than for deep results. Each successful probe would have saved only a little bit of work, but because you got so many hits on that same entry, the total savings due to its use can still rival that saved by a deep entry just sitting there.

So it is very important to both retain results from deep searches and shallow searches, but you must be prepared to replace the shallow results much more frequently. This will not really hurt their hit rate; if they are not hit within a certain short time, the search has moved on to an entirely different main branch of the tree, from where they cannot be hit at all. The 'shallowest of two' replacement scheme gives you the best of both worlds, both allowing shallow entries in the table with rapid turn-over, and deep ones with slow turn-over.

Much more can be said about this, if you want to squeeze the last few percent of performance from the hash table. Most engine developers ignore this, however: replacement schemes only have an effect if the TT is small, and in a table that is large enough to hold the entire search tree, there wouldn't be enough replacement going on to observe the effect of it. So even a crummy scheme can be made to work as good as any other by just taking a 16GB TT. But if you would want the engine to also work well with a 1MB TT, you have to pay attention to this. You have to make sure deep results that have become unreachable during progression of the game will be kicked out of the table ('aging'), and there are tricks to get more optimal distribution of the depths in the table ('equi-distributed depth' or 'undercut-replacement').

Terje · Post by **Terje** » Sun Mar 29, 2020 5:36 pm

Or 2 buckets of 3x 10byte entries + 2 filler bytes in one cache line like SF does.

mvanthoor · Post by **mvanthoor** » Sun Mar 29, 2020 9:16 pm

Thanks for your kind explanation, Mr. Mueller. I didn't know that these buckets would cause such a big speed increase. I always saw them as just a more efficient way of storing the hash entries. (A bucket allows you to store 2-3 hash entries at one hash table index, where you'd otherwise need to increase the size of the table to avoid a collision and keep both entries.)

I've implemented buckets now. As you said earlier about QPerft:

hgm wrote: ↑Sun Mar 29, 2020 3:04 pm That is not a very good improvement. In Qperft the time for perft(7) drops from 18.6 to 4.18 sec when I turn on hashing (1M entries in the TT), more than a factor 4 compared to your 33%. That slows to 4.52, 4.80, 5.24, 5.99 sec when I shrink the TT to 512K, 256K , 128K or 64K entries.

With hashing off: 18.6 sec.
With hashing on (1M entries): 4.18 sec.
Time usage hashed vs non-hashed: 22.4%, speedup 77.6%

My own test is below in the code blocks.

Summary:

No hash for perft 7: 113.2 seconds
32 MB hash for perft 7 (1.2M entries, 400K buckets with 3 entries/bucket): 27.0 seconds
Time usage hashed vs non-hashed: 23.85%, speedup 76.15%.

It seems I now get comparable speedup as you do with regard to hashing, and any other speed improvements need to come from optimizing the move generator. But, that is for a later time; I just wanted the hashing working in perft to verify the zobrist keys. I'll actually take it out again for the very first version of the engine, to keep it that version as bare-bones as possible, and then build on it.

Oh, by the way... I'm now achieving this speedup by using hash tables of 32 MB, instead of 24 GB (!).

Keeping the hash at 32 MB and Increasing the number of buckets from 3 to 4, 5 and 6 keeps the time between 26.8 and 27.2 seconds; within the margin of error, IMHO. Increasing bucket size to 7 or above is very detrimental.

All thanks to you for chiming in guys

Code: Select all

No hash.

Results perft 7:

113.2 seconds for perft 7.
Finds 28.2 million leaf nodes/sec

Benchmarking perft 1-7 from starting position...
Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (7 ms, 28183000 leaves/sec)
Perft 5: 4865609 (174 ms, 27963270 leaves/sec)
Perft 6: 119060324 (4264 ms, 27922214 leaves/sec)
Perft 7: 3195901860 (113215 ms, 28228608 leaves/sec)
Total time spent: 117660 ms
Execution speed: 28217188 leaves/second
Finished.

Code: Select all

32 MB Hash
+/- 600.000 buckets
2 entries per bucket
(1.2 million entries total)

Results perft 7:

28.2 seconds (1/4th of time without hash, 75% speedup)
Finds 111.4 million leaf nodes/sec.

Benchmarking perft 1-7 from starting position...
Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (7 ms, 28183000 leaves/sec)
Perft 5: 4865609 (106 ms, 45901971 leaves/sec)
Perft 6: 119060324 (1457 ms, 81716076 leaves/sec)
Perft 7: 3195901860 (28214 ms, 113273618 leaves/sec)
Total time spent: 29784 ms
Execution speed: 111470400 leaves/second
Finished.

Code: Select all

32 MB Hash
+/- 420.000 buckets
3 entries per bucket
(1.26 million entries total)

Results perft 7:

27.0 seconds (~1/4th of time without hash, 76.1% speedup)
Finds 115.4 million leaf nodes/sec.

Benchmarking perft 1-7 from starting position...
Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (10 ms, 19728100 leaves/sec)
Perft 5: 4865609 (109 ms, 44638614 leaves/sec)
Perft 6: 119060324 (1464 ms, 81325357 leaves/sec)
Perft 7: 3195901860 (27077 ms, 118030131 leaves/sec)
Total time spent: 28660 ms
Execution speed: 115842093 leaves/second
Finished.

hgm · Post by **hgm** » Sun Mar 29, 2020 11:00 pm

You can see that the hash probe causes an enormous slowdown in terms of nps. This is normal, and a good illustration of how expensive a memory access is. Your nps drops about a factor 2.5. So that single cach-line read for the hash probe took 1.5 times as much time as generating all moves!

Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?

Re: Where to enter/read position into hash table in perft?