Hybrid replacement strategy worse than always-replace

Discussion of chess software programming and technical issues.


hgm
Posts: 27895
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

You seem to have the strange notion that I should do the testing to validate your engine. But my criticism concerns the testing methodology, and your engine only features in this as an example, which you yourself claim to be representative of the development of 'modern engines'. For the benefit of the reader I illustrate the reasons for my concern with examples, using numbers that are typical where you did not provide actual ones. If you really knew "what goes on under the hood", you would be able to correct those to the numbers that apply to your engine off the top of your head; it beats me how anyone could need two hours to determine how many entries there typically are in the 'current generation' of TT entries.

But let us take a step back rather than losing ourselves in nitpicking over details:

Fact: the shape of your engine's search tree is completely different when you have no hash hits (i.e. no TT) from what it is when you have as many hash hits as you could ever expect (a TT much larger than the search tree, and an advanced replacement scheme). In the former case you report ~4 times fewer nodes for the same nominal depth (which is typically used to indicate the length of the PV in the full-width part of the search).

Fact: By your own reporting, you only ever tested under conditions where the TT is much larger than the search tree.

Fact: it is well known that tree shaping through reductions and pruning can have a dramatic effect on playing strength.

Fact: it would require insanely large amounts of memory (i.e. orders of magnitude more than you would ever need for any other application) to have the same ratio of tree size to hash size during analysis as the rating testers use (e.g. 8MB at 0.2 sec/move would correspond to 144GB for a one-hour analysis; a back-of-the-envelope check follows at the end of this post).

Fact: in real-life applications the quality of an engine is determined by the accuracy of its analysis per dollar spent on hardware; it is not relevant to users that the engine would be the strongest in the world when running on a supercomputer if it sucks on the best hardware they can afford.

I leave it up to the readers to decide whether they consider it a legitimate point of concern that an engine uses a totally different, basically untested search under the conditions in which they need to use it than under those for which it was tested. (A parallel with the car industry comes to mind, though...)
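
To make the 144GB figure easy to check: keeping the testers' ratio of tree size to hash size constant means the hash has to grow in proportion to the time searched. A back-of-the-envelope sketch using only the numbers quoted above (the C++ framing is purely illustrative):

Code: Select all

#include <cstdio>

int main() {
    const double test_hash_mb  = 8.0;     // hash size used by the rating testers
    const double test_move_sec = 0.2;     // time per move during testing
    const double analysis_sec  = 3600.0;  // one hour of analysis

    // Same tree-to-hash ratio => hash scales with nodes, i.e. with time.
    double needed_mb = test_hash_mb * (analysis_sec / test_move_sec);
    printf("hash needed: %.0f MB (= %.0f GB)\n", needed_mb, needed_mb / 1000);
    // prints: hash needed: 144000 MB (= 144 GB)
}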
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Hybrid replacement strategy worse than always-replace

Post by connor_mcmonigle »

hgm wrote: Sun Apr 28, 2024 10:51 am ...

Fact: it would require insanely large amounts of memory (i.e. orders of magnitude more than you would ever need for any other application) to have the same ratio of tree size to hash size during analysis as the rating testers use (e.g. 8MB at 0.2 sec/move would correspond to 144GB for a one-hour analysis).
...
At least for actual tournaments, something on the order of hundreds of GiB of memory is the norm. It's not as preposterous as you seem to think.

It is conceivable that hash pressure has some bearing on replacement scheme optimality. However, you've neglected to provide any evidence supporting that claim. Even if we accept on faith that hash pressure affects replacement scheme optimality, that still leaves us with the problem of choosing some hash pressure target to optimize for. Whether that's 1MiB@0.2s per move or 8MiB@0.2s is not particularly relevant, and does not somehow invalidate the state-of-the-art chess engine testing methodology.

Regardless of what dubious surrogate metric you choose to optimize your TT replacement scheme for, you'll run into the same problem under this assumption.
Viz
Posts: 95
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Re: Hybrid replacement strategy worse than always-replace

Post by Viz »

This is the main problem.
You make vague statements that have zero practical proof of being true - and in fact a lot of them have practical counterexamples.
But you keep standing by them.
"If theory isn't supported by reality - just choose another reality" I guess.
Uri Blass
Posts: 10420
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hybrid replacement strategy worse than always-replace

Post by Uri Blass »

hgm wrote: Wed Apr 24, 2024 8:13 pm How many entries were there in your hash table, and what is the nps of your engine?

Unless you had a hash table very much smaller than 1/10 of the typical number of nodes in a search, your results would be purely noise, no matter how convincing or statistically significant it might seem.

Playing games is a very inefficient method to measure this. Much better is to just take a couple of hundred positions from games, and measure average time to depth. (Again, with a very small hash table.)

It is well known that it is very important to always put a new search result in the hash table. If you have no choice but to overwrite a deep entry (as might be the case in a simple always-replace scheme), you should still do it. Replacing the shallower of two entries should perform better, though, because it at least gives you some leeway in avoiding overwriting of the valuable deep entries.
It seems not to be correct for Stockfish.
I thought, based on your post, that 64 MB or 512 MB hash would clearly be unimportant at bullet time control, because 64 MB is not very much smaller than the typical number of nodes in a search (even with 7 cores and a 70+0.7 time control), but I read the following on the Stockfish development page:
https://abrok.eu/stockfish/

Author: Viren6
Date: Sun May 19 09:37:22 2024 +0200
Timestamp: 1716104242

Revert "Simplify Away Quadruple Extensions"

This reverts commit 4edd1a3

The unusual result of (combined) +12.0 +- 3.7 in the 2 VVLTC simplification SPRTs ran was the result of base having only 64MB of hash instead of 512MB (Asymmetric hash).
Vizvezdenec was the one to notice this.

These are the relevant tests.
https://tests.stockfishchess.org/tests/ ... d1d6b05ffb
https://tests.stockfishchess.org/tests/ ... d1d6b0619d

There are no unbiased rating results because these are SPRT tests, but it is still surprising if hash has a significant effect here.
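
For readers who have not seen such a scheme in code: hgm's quoted advice amounts to a bucket of (at least) two entries per hash index, where a new result is always stored, but the shallower slot is the one sacrificed. A minimal sketch under those assumptions; the names and entry layout are invented for illustration, and this is not Stockfish's (or any other engine's) actual code:

Code: Select all

#include <cstdint>
#include <cstdio>

struct TTEntry {
    uint64_t key   = 0;   // Zobrist key of the position
    int16_t  score = 0;
    int8_t   depth = -1;  // -1 marks an empty slot
    // real entries also hold a move, bound type, age, ...
};

struct TTBucket {
    TTEntry slot[2];

    // Always store the new result; if the position is not already
    // present, overwrite the shallower of the two slots, so the
    // deeper (more expensive to recompute) entry survives.
    void store(uint64_t key, int16_t score, int8_t depth) {
        for (TTEntry &e : slot)
            if (e.key == key) { e.score = score; e.depth = depth; return; }

        TTEntry &victim = slot[0].depth <= slot[1].depth ? slot[0] : slot[1];
        victim = {key, score, depth};
    }
};

int main() {
    TTBucket b;
    b.store(0xABCD, 30, 8);   // deep, valuable entry
    b.store(0x1234, -5, 3);   // fills the other slot
    b.store(0x5678, 12, 2);   // evicts the depth-3 entry, not the depth-8 one
    printf("surviving depths: %d and %d\n", b.slot[0].depth, b.slot[1].depth);
}

A pure always-replace table is the degenerate case of a single slot per index; the two-slot version gives the "leeway" hgm mentions, at the cost of probing one extra entry.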
hgm
Posts: 27895
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

I don't know exactly how Stockfish performs under these conditions, or how its TT is designed. The TC seems to correspond to 1-2 sec/move, and at 2 Mnps per thread and 7 threads that would mean ~20M nodes per search. At 64MB hash and 12 bytes per entry there would be ~5M entries in the TT. So at least you get into the regime where it is not obvious that the entire tree easily fits in the TT.

One thing should be clear: if there is a measurable drop in playing strength by having on the order of a percent fewer hash hits due to less optimal replacement, you must do something that tremendously degrades the search quality in case of a hash miss... When the only consequence would be a slight slowdown because you have to redo the search rather than take a hash cutoff, the effect on Elo would be very small.