Hybrid replacement strategy worse than always-replace

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Uri Blass
Posts: 10378
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hybrid replacement strategy worse than always-replace

Post by Uri Blass »

Pali wrote: Fri Apr 26, 2024 8:44 pm
hgm wrote: Fri Apr 26, 2024 7:46 pm
Pali wrote: Fri Apr 26, 2024 7:10 pmIt's as fragile as the rest of the top 50 engines who use at least a subset of what I use. Each patch is SPRT tested at short and long time controls and the progress is shown by both regression tests I run and the tests run by CCRL, CEGT, IpmanChess and other testers.
That doesn't really answer my question. How large was your hash table during these tests, and what was the TC and nps of the engine?
Your described testing methodology has nothing to do with your suggested hash hits to nodes/nodes to depth methodology. Why are you suggesting methodologies that you do not use?
Oh, I thought you were asking about the simultaneous testing of multiple patches.

Of course I extensively tested many replacement schemes through the described method. That was straightforward, because all my engines did suffer from hash misses, as they then would spend extra nodes to obtain the sought information before continuing as they would have when they got the info for free from the TT. But I stopped doing that when I found the result was basically the same in all engines. Which makes sense, as the goal is always the same: maximize the information flow from the TT to the search, by preserving the entries that contribute the most information (in terms of saved search effort and frequency of use).
That does answer your question: it means that my test results translate over to the testing conditions of various rating lists.
But since you asked:
STC: NPS anchored to 975k/s - 8+0.08 - 8MB
LTC: NPS anchored to 975k/s - 40+0.4 - 64MB

"But I stopped doing that when I found the result was basically the same in all engines"
Then why are you... suggesting this?
I can also tell you with 100% confidence that different replacement schemes gain in different engines. Engine developers try ideas from each other all the time, and replacement schemes are included. We have not converged to an optimum.
I think that this is not what hgm considers
"testing under conditions of severe hash pressure"

I doubt if there is a measurable difference between different hash methods with this big hash.
I also doubt if 64 MB is better than 8 MB at 40+0.4

Did you do an SPRT test to prove that 64 mbytes is better than 8 mbytes for 40+0.4?
User avatar
hgm
Posts: 27838
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

Pali wrote: Fri Apr 26, 2024 8:44 pmBut since you asked:
STC: NPS anchored to 975k/s - 8+0.08 - 8MB
LTC: NPS anchored to 975k/s - 40+0.4 - 64MB
So assuming a game is 60 moves, the first test uses 0.213 sec/move = 200K nodes. With the typical number of transpositions that involves around 20K different positions. Assuming 16 bytes per TT entry, your TT has 500k entries. For the other test that would be 104k positions in 4M entries.
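
For reference, this back-of-the-envelope estimate can be reproduced in a few lines (a rough sketch only; the 60-move game, the ~10x nodes-to-positions factor, the 16-byte entry and MB taken as 10^6 bytes are assumptions from the paragraph above, not measurements):

#include <cstdio>

// Rough hash-fill estimate from time control, NPS, hash size and an
// assumed entry size, following the assumptions stated in the post.
int main() {
    const double moves_per_game = 60.0;
    const double nps            = 975e3;  // anchored nodes per second
    const double entry_bytes    = 16.0;   // assumed TT entry size

    struct Tc { const char* name; double base, inc, hash_mb; };
    const Tc tcs[] = { {"STC", 8.0, 0.08, 8.0}, {"LTC", 40.0, 0.4, 64.0} };

    for (const Tc& tc : tcs) {
        double sec_per_move = tc.base / moves_per_game + tc.inc;
        double nodes        = sec_per_move * nps;
        double positions    = nodes / 10.0;                    // ~10x transpositions
        double entries      = tc.hash_mb * 1e6 / entry_bytes;  // MB taken as 10^6 bytes
        printf("%s: %.0f nodes/move, ~%.0f positions, %.0f entries, fill %.1f%%\n",
               tc.name, nodes, positions, entries, 100.0 * positions / entries);
    }
    return 0;
}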

Hmmm, 4% or 2.6% hash fill. That doesn't really sound like hash pressure. The pre-final iteration, which still has a factor fewer nodes, would almost completely be preserved even with always-replace. Any half-decent replacement scheme with aging (e.g. the quite unusual oldest-of-two) would virtually have no replacement at all, as there would always be a stale entry in the bucket not relevant to the current search.
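
To make the idea concrete, an aging scheme of the "oldest of two" kind could look roughly like this (a minimal sketch with illustrative field names, not code from any engine discussed here):

#include <cstdint>

// Sketch of a two-entry bucket with aging ("oldest of two"): on a store,
// overwrite whichever slot stems from the oldest search, so stale entries
// are recycled before anything from the current search is sacrificed.
struct TTEntry {
    uint64_t key;
    int16_t  score;
    uint8_t  depth;
    uint8_t  age;    // generation counter of the search that wrote the entry
};

struct TTBucket {
    TTEntry slot[2] = {};

    void store(uint64_t key, int16_t score, uint8_t depth, uint8_t current_age) {
        // Same position already in the bucket: just update it in place.
        for (TTEntry& e : slot)
            if (e.key == key) { e.score = score; e.depth = depth; e.age = current_age; return; }

        // Otherwise replace the slot with the oldest age (wrap-around safe);
        // on equal age, sacrifice the shallower of the two entries.
        auto staleness = [&](const TTEntry& e) { return uint8_t(current_age - e.age); };
        TTEntry& victim =
            staleness(slot[0]) != staleness(slot[1])
                ? (staleness(slot[0]) > staleness(slot[1]) ? slot[0] : slot[1])
                : (slot[0].depth <= slot[1].depth ? slot[0] : slot[1]);
        victim = TTEntry{key, score, depth, current_age};
    }
};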

It seems that you optimized the engine for situations where hash moves from the previous iteration would always be available, completely ignoring that it could make disastrously wrong decisions in the case a hash move is not available because it was overwritten. It is very questionable whether any of the patches on which you based this design would pass running with a small TT or with no TT at all.

It is of course perfectly fine to design an engine that can work properly only with a generously over-dimensioned hash size, as is commonly used for the rating lists. After all, it is not like anyone other than a rating tester would want to use yet another Chess engine. But:
- How can you ever distinguish the performance of replacement schemes when testing under conditions where replacement virtually does not take place?
- And why would you care a hoot?
"But I stopped doing that when I found the result was basically the same in all engines"
Then why are you... suggesting this?
Do you think that strange? Other people might want to test for themselves, rather than blindly following instructions for what they should code. If not, they might as well just copy what Stockfish does.
User avatar
Rebel
Posts: 7025
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Hybrid replacement strategy worse than always-replace

Post by Rebel »

pgg106 wrote: Fri Apr 26, 2024 2:20 pm
Rebel wrote: Fri Apr 26, 2024 7:18 am I seldom go to Discord, maybe once a month. It's an obscure, cluttered chat box with a bad search function. I never found anything useful there. Make it an organized forum like here.
I agree that the information density and the search leave much to be desired; the thing is, on Discord you don't have to argue for 5 pages to get a noob to start testing stuff properly. It doesn't matter how well organized talkchess is when bad information gets shared regularly and basic, known-good information meets 5 pages of attrition, even from a mod; you can perfectly index all of this and it's still trash.
Then teach us.
90% of coding is debugging, the other 10% is writing bugs.
Pio
Posts: 335
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Hybrid replacement strategy worse than always-replace

Post by Pio »

Rebel wrote: Fri Apr 26, 2024 11:20 pm
pgg106 wrote: Fri Apr 26, 2024 2:20 pm
Rebel wrote: Fri Apr 26, 2024 7:18 am I seldom go to Discord, maybe once a month. It's an obscure, cluttered chat box with a bad search function. I never found anything useful there. Make it an organized forum like here.
I agree that the information density and the search leave much to be desired; the thing is, on Discord you don't have to argue for 5 pages to get a noob to start testing stuff properly. It doesn't matter how well organized talkchess is when bad information gets shared regularly and basic, known-good information meets 5 pages of attrition, even from a mod; you can perfectly index all of this and it's still trash.
Then teach us.
I guess he will win the Nobel prize in physics in 2024. 2023 was not such a good year for HGM, who just missed out on the Nobel prize (see the following links: https://amolf.nl/events/joint-arcnl-amo ... in-physics and https://loa.ensta-paris.fr/2023/10/04/n ... eneration/).
User avatar
Rebel
Posts: 7025
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Hybrid replacement strategy worse than always-replace

Post by Rebel »

That.
90% of coding is debugging, the other 10% is writing bugs.
Viz
Posts: 66
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Re: Hybrid replacement strategy worse than always-replace

Post by Viz »

Pio wrote: Sat Apr 27, 2024 3:24 pm
Rebel wrote: Fri Apr 26, 2024 11:20 pm
pgg106 wrote: Fri Apr 26, 2024 2:20 pm
Rebel wrote: Fri Apr 26, 2024 7:18 am I seldom go to Discord, maybe once a month. It's an obscure, cluttered chat box with a bad search function. I never found anything useful there. Make it an organized forum like here.
I agree that the information density and the search leave much to be desired; the thing is, on Discord you don't have to argue for 5 pages to get a noob to start testing stuff properly. It doesn't matter how well organized talkchess is when bad information gets shared regularly and basic, known-good information meets 5 pages of attrition, even from a mod; you can perfectly index all of this and it's still trash.
Then teach us.
I guess he will win the Nobel prize in physics in 2024. 2023 was not such a good year for HGM, who just missed out on the Nobel prize (see the following links: https://amolf.nl/events/joint-arcnl-amo ... in-physics and https://loa.ensta-paris.fr/2023/10/04/n ... eneration/).
Johannes Stark won the Nobel Prize in physics; I guess we should take for granted his words about Jews and the theory of relativity in particular (and that was his occupation, not a hobby)?
You can be whoever you want, but if you spout nonsense you spout nonsense, period.
"Look at time to depth" is much worse advice than no advice at all - as you can see in basically 3 different engines "always replace" makes time to depth less, well, and it loses elo in all 3 of them.
"Arguments from authority" can be made of course.
I can make one. I have 150+ Stockfish contributions, hgm has 0, and Stockfish is the strongest chess entity that ever existed, thus I'm always right on any chess engine topic in comparison, I guess, as well as on any chess topic, or on the topics of math and programming, which are more related to chess engine development than physics is. :lol:
Pali
Posts: 27
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

hgm wrote: Fri Apr 26, 2024 10:36 pm
Pali wrote: Fri Apr 26, 2024 8:44 pmBut since you asked:
STC: NPS anchored to 975k/s - 8+0.08 - 8MB
LTC: NPS anchored to 975k/s - 40+0.4 - 64MB
So assuming a game is 60 moves, the first test uses 0.213 sec/move = 200K nodes. With the typical number of transpositions that involves around 20K different positions. Assuming 16 bytes per TT entry, your TT has 500k entries. For the other test that would be 104k positions in 4M entries.

Hmmm, 4% or 2.6% hash fill. That doesn't really sound like hash pressure. The pre-final iteration, which still has a factor fewer nodes, would almost completely be preserved even with always-replace. Any half-decent replacement scheme with aging (e.g. the quite unusual oldest-of-two) would virtually have no replacement at all, as there would always be a stale entry in the bucket not relevant to the current search.

It seems that you optimized the engine for situations where hash moves from the previous iteration would always be available, completely ignoring that it could make disastrously wrong decisions in the case a hash move is not available because it was overwritten. It is very questionable whether any of the patches on which you based this design would pass running with a small TT or with no TT at all.

It is of course perfectly fine to design an engine that can work properly only with a generously over-dimensioned hash size, as is commonly used for the rating lists. After all, it is not like anyone other than a rating tester would want to use yet another Chess engine. But:
- How can you ever distinguish the performance of replacement schemes when testing under conditions where replacement virtually does not take place?
- And why would you care a hoot?
"But I stopped doing that when I found the result was basically the same in all engines"
Then why are you... suggesting this?
Do you think that strange? Other people might want to test for themselves, rather than blindly following instructions for what they should code. If not, they might as well just copy what Stockfish does.
You are... just wrong? Your assumption that I clear the transposition table between moves is wrong, your assumption that an entry is 16 bytes is wrong (it is 12 bytes), and despite underestimating my TT capacity, you severely undershot the actual values.

Here are the actual values calculated by the engine:
In STC games, the first move fills up 13.4% of the transposition table on average with a maximum of 42.5% and a minimum of 4.5%. These changes obviously occur due to time management.

In an STC game, the current generation of entries (entry.age == current_age) can take up to 40% of the 8MB transposition table.
In an LTC game, the current generation of entries can take up to 27% of the 64 MB transposition table.

In an LTC game, it takes around 8-9 moves to fill up 50% of the transposition table, and 14-15 moves to fill up 75%.
In an STC game, it takes around 5-6 moves to fill up 50% of the transposition table, and 9-10 moves to fill up 75%.
Both get to >90% and stay around there after move 30.
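
For reference, fractions like these can be gathered with a cheap sampling pass over the table, similar in spirit to UCI "hashfull" reporting; a sketch with illustrative field names:

#include <cstddef>

// Sketch of gathering such statistics: sample part of the table and count
// entries written by the current search (age == current_age) as well as
// occupied entries overall. Real engines usually sample only a few
// thousand entries, as for UCI "hashfull", to keep this cheap.
struct FillStats { double current_gen_fraction; double occupied_fraction; };

template <typename Entry>
FillStats measure_fill(const Entry* table, std::size_t num_entries,
                       unsigned current_age, std::size_t sample_size = 100000) {
    if (sample_size > num_entries) sample_size = num_entries;
    if (sample_size == 0) return {0.0, 0.0};
    std::size_t cur = 0, occ = 0;
    for (std::size_t i = 0; i < sample_size; ++i) {
        if (table[i].key == 0) continue;              // empty slot
        ++occ;
        if (table[i].age == current_age) ++cur;
    }
    return { double(cur) / double(sample_size), double(occ) / double(sample_size) };
}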

As much as I'd like to provide the number of "useful" entries, it is not possible to measure it, as it would require a method that can prove whether a position is reachable from the current root.

It is possible to speculate, however: a full TT is more useful during endgames, where transpositions are more likely, and less useful during openings, where trades and pawn moves occur more frequently.

On top of all this, the 16 byte -> 12 byte entry change is the second most recent merge, with the most recent being an NNUE update. This is being VERY charitable towards your position, as the assumptions you made for an effectively smaller transposition table don't even hold for the one that is 33% larger.

- How can you ever distinguish the performance of replacement schemes when testing under conditions where replacement virtually does not take place?
It does take place, this is how I was able to test and merge replacement scheme patches.


I kindly ask you to not make up numbers in future conversation as I had to waste 2 hours of my time computing the actual values.

I also kindly ask you not to change topics. This is not about whether my engine works under "severe hash pressure," it's about the testing methods one should use. Many engines directly use TT entries in LMR, singular extensions, null move pruning, internal iterative reductions, move ordering, and history updates. Depth/node comparisons, TT cutoff rates, and TT hit rates automatically break due to these direct interactions. They are not useful metrics; you yourself no longer use them. So yes, I do think it is strange that you are suggesting others use a method you yourself stopped using precisely because it did not work.

- Other people might want to test for themselves, rather than blindly following instructions for what they should code. If not, they might as well just copy what Stockfish does.

No one at any point in this thread has suggested an algorithm to use as far as I am aware, I for sure haven't. The point of promoting good testing methods (i.e. SPRT) is that the person can now actually try something on their own and innovate rather than get lost in metrics that do not correlate to strength.
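
For concreteness, the decision rule behind an SPRT fits in a few lines. Below is a sketch using the common normal approximation of the log-likelihood ratio over per-game scores (simplified to a plain win/draw/loss model; real testers typically use game pairs and a pentanomial model):

#include <cmath>

// Expected score of a player who is 'elo' points stronger (logistic model).
double expected_score(double elo) { return 1.0 / (1.0 + std::pow(10.0, -elo / 400.0)); }

// Sequential probability ratio test on game results.
// H0: true strength gain is elo0, H1: it is elo1.
// Returns +1 to accept H1, -1 to accept H0, 0 to keep playing games.
int sprt(int wins, int draws, int losses,
         double elo0, double elo1, double alpha = 0.05, double beta = 0.05) {
    int n = wins + draws + losses;
    if (n == 0) return 0;
    double mean = (wins + 0.5 * draws) / n;
    double var  = (wins + 0.25 * draws) / n - mean * mean;   // per-game score variance
    if (var <= 0.0) return 0;                                // degenerate sample, keep playing
    double s0 = expected_score(elo0), s1 = expected_score(elo1);
    double llr   = n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var);
    double lower = std::log(beta / (1.0 - alpha));
    double upper = std::log((1.0 - beta) / alpha);
    if (llr >= upper) return +1;   // accept H1: the change gains
    if (llr <= lower) return -1;   // accept H0: no (sufficient) gain
    return 0;                      // evidence inconclusive so far
}

With bounds such as elo0 = 0 and elo1 = 5, games are simply added until the log-likelihood ratio crosses one of the two thresholds.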
Pali
Posts: 27
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

Uri Blass wrote: Fri Apr 26, 2024 10:13 pm
Pali wrote: Fri Apr 26, 2024 8:44 pm
hgm wrote: Fri Apr 26, 2024 7:46 pm
Pali wrote: Fri Apr 26, 2024 7:10 pmIt's as fragile as the rest of the top 50 engines who use at least a subset of what I use. Each patch is SPRT tested at short and long time controls and the progress is shown by both regression tests I run and the tests run by CCRL, CEGT, IpmanChess and other testers.
That doesn't really answer my question. How large was your hash table during these tests, and what was the TC and nps of the engine?
Your described testing methodology has nothing to do with your suggested hash hits to nodes/nodes to depth methodology. Why are you suggesting methodologies that you do not use?
Oh, I thought you were asking about the simultaneous testing of multiple patches.

Of course I extensively tested many replacement schemes through the described method. That was straightforward, because all my engines did suffer from hash misses, as they then would spend extra nodes to obtain the sought information before continuing as they would have when they got the info for free from the TT. But I stopped doing that when I found the result was basically the same in all engines. Which makes sense, as the goal is always the same: maximize the information flow from the TT to the search, by preserving the entries that contribute the most information (in terms of saved search effort and frequency of use).
That does answer your question: it means that my test results translate over to the testing conditions of various rating lists.
But since you asked:
STC: NPS anchored to 975k/s - 8+0.08 - 8MB
LTC: NPS anchored to 975k/s - 40+0.4 - 64MB

"But I stopped doing that when I found the result was basically the same in all engines"
Then why are you... suggesting this?
I can also tell you with 100% confidence that different replacement schemes gain in different engines. Engine developers try ideas from each other all the time, and replacement schemes are included. We have not converged to an optimum.
I think that this is not what hgm considers
"testing under conditions of severe hash pressure"

I doubt if there is a measurable difference between different hash methods with this big hash.
I also doubt if 64 MB is better than 8 MB at 40+0.4

Did you do an SPRT test to prove that 64 mbytes is better than 8 mbytes for 40+0.4?
I have proven that 12 mbytes is better than 8 mbytes for 40+0.4. 64 mbytes is most certainly better.

Edit: I made a mistake calculating the effective TT size. It was in fact 10.6 mbytes that proved better than 8 mbytes.
User avatar
hgm
Posts: 27838
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

Pali wrote: Sat Apr 27, 2024 9:55 pmYou are... just wrong? Your assumption that I clear the transposition table between moves is wrong, your assumption that an entry is 16 bytes is wrong (it is 12 bytes), and despite underestimating my TT capacity, you severely undershot the actual values.
That makes it worse, right? You did give the size in MB, not in entries, and I gave you the benefit of the doubt.
Here are the actual values calculated by the engine:
In STC games, the first move fills up 13.4% of the transposition table on average with a maximum of 42.5% and a minimum of 4.5%. These changes obviously occur due to time management.
That number is not really helpful without knowing how much time it took on average for this first move. (Many engines are programmed to take extra time on the first move out of book.) But 13.4% is still a quite small filling fraction.
In an STC game, the current generation of entries (entry.age == current_age) can take up to 40% of the 8MB transposition table.
In an LTC game, the current generation of entries can take up to 27% of the 64 MB transposition table.
Again, 'up to' isn't really very elucidating. It makes a huge difference if that would happen every two moves or every 100 moves, and whether the other moves then fill for 30% or for 10%. The important parameter here is the fraction of the nodes reported by the search that creates a new entry in the table (rather than hitting upon an entry that was already there). And in ID most of the nodes visited in earlier iterations (which with an EBF of 1.5 would be 2/3 of the total nodes) will be included in the tree of the final iteration.
High filling fractions reached on moves that took far above average time are not really meaningful; the game will more likely be decided in the other moves. It is the average filling of the TT that matters.

In an LTC game, it takes around 8-9 moves to fill up 50% of the transposition table, and 14-15 moves to fill up 75%.
In an STC game, it takes around 5-6 moves to fill up 50% of the transposition table, and 9-10 moves to fill up 75%.
Both get to >90% and stay around there after move 30.
Of course the table will fill up. But useless (unreachable) entries will also stay around, and in 8-9 moves (16-18 ply) a lot will happen. Most of that 50% could very well be useless.
As much as I'd like to provide the number of "useful" entries, it is not possible to measure it, as it would require a method that can prove whether a position is reachable from the current root.
Doesn't the search itself probe exactly that? I don't know how your aging works, but it is in principle possible to measure how many entries of the previous search (or any search before it) are probed by the current one, and how many there were initially, and how many were overwritten before they were probed. That should give you a pretty good idea of the typical number of useful entries.
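
In code, that kind of bookkeeping amounts to a handful of counters; a sketch with illustrative names, not any particular engine's TT:

#include <cstdint>
#include <cstddef>

// Tag each entry with the search that wrote it, mark it when it is probed,
// and count how many entries of earlier searches are (a) present when a
// search starts, (b) actually probed during it, and (c) overwritten before
// ever being probed.
struct Entry {
    uint64_t key    = 0;
    uint8_t  age    = 0;      // generation of the search that wrote the entry
    bool     probed = false;  // probed during the current search?
};

struct TTStats {
    std::size_t old_at_start = 0;              // entries left over from earlier searches
    std::size_t old_probed = 0;                // ... of which the current search actually used
    std::size_t old_overwritten_unprobed = 0;  // ... of which were overwritten before being used
};

struct TT {
    Entry*      table = nullptr;
    std::size_t size  = 0;
    uint8_t     current_age = 0;
    TTStats     stats;

    void new_search() {                          // call once at the start of each root search
        ++current_age;
        stats = {};
        for (std::size_t i = 0; i < size; ++i) {
            table[i].probed = false;
            if (table[i].key != 0) ++stats.old_at_start;
        }
    }
    void on_probe_hit(Entry& e) {                // call on every successful probe
        if (e.age != current_age && !e.probed) ++stats.old_probed;
        e.probed = true;
    }
    void on_replace(const Entry& victim) {       // call just before overwriting an entry
        if (victim.key != 0 && victim.age != current_age && !victim.probed)
            ++stats.old_overwritten_unprobed;
    }
};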
On top of all this, the 16 byte -> 12 byte entry change is the second most recent merge, with the most recent being an NNUE update. This is being VERY charitable towards your position, as the assumptions you made for an effectively smaller transposition table don't even hold for the one that is 33% larger.
Even if the numbers are 33% larger, they are not indicative of any significant hash pressure. Even if the engine did use the same time for the first move as it on average does for the others, the typical number of nodes added in the table by a search would be 1.333*13.4% = 17.9%. Hash pressure is when that is several times larger than 100%. Like it might be when someone wants to analyze a position for one hour.
- How can you ever distinguish the performance of replacement schemes when testing under conditions where replacement virtually does not take place?
It does take place, this is how I was able to test and merge replacement scheme patches.


I kindly ask you to not make up numbers in future conversation as I had to waste 2 hours of my time computing the actual values.
Waste???? You mean you were ignorant of what your engine does under the hood, and consider it a waste of time to learn that?
I also kindly ask you not to change topics. This is not about whether my engine works under "severe hash pressure," it's about the testing methods one should use. Many engines directly use TT entries in LMR, singular extensions, null move pruning, internal iterative reductions, move ordering, and history updates. Depth/node comparisons, TT cutoff rates, and TT hit rates automatically break due to these direct interactions. They are not useful metrics; you yourself no longer use them. So yes, I do think it is strange that you are suggesting others use a method you yourself stopped using precisely because it did not work.
The topic is related. The reason that what I suggested to the OP (and which would most likely work perfectly for him) does not work for your engine is that you do something that strikes me as very strange: in case of hash misses you appear to give up on the branch, and hardly search it, as the reduction of the node count by a factor of 4 when running without TT shows, rather than reconstruct the info that you did not get for free from the TT (which would have cost extra nodes) and then proceed as usual. Such an extreme difference in the depth at which you search side branches cannot be good, or it would basically not matter at all how you prune and reduce. So I wonder how you get away with this. By making the search so sensitive to whether you have a hash move or not, you seem to have tuned it for a very specific miss rate. That opens the possibility that you use the replacement scheme to tune the fraction of misses to some optimal value, rather than just minimizing it.
No one at any point in this thread has suggested an algorithm to use as far as I am aware, I for sure haven't. The point of promoting good testing methods (i.e. SPRT) is that the person can now actually try something on their own and innovate rather than get lost in metrics that do not correlate to strength.
Wasn't that your question then? Why I didn't just give him the algorithm that I found to be universally best in terms of optimizing hash hits, rather than telling him the method I used for finding that algorithm?
Pali
Posts: 27
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

hgm wrote: Sat Apr 27, 2024 11:48 pm
Pali wrote: Sat Apr 27, 2024 9:55 pmYou are... just wrong? Your assumption that I clear the transposition table between moves is wrong, your assumption that an entry is 16 bytes is wrong (it is 12 bytes), and despite underestimating my TT capacity, you severely undershot the actual values.
That makes it worse, right? You did give the size in MB, not in entries, and I gave you the benefit of the doubt.

You were the one who asked for hash size, and hash size is measured in MB. I would have elaborated further if you had asked for the details of my TT implementation.
You didn't give me the benefit of the doubt. You picked a particularly large entry size and then claimed a 4% filling rate, which was massively lower than the actual values.
Here are the actual values calculated by the engine:
In STC games, the first move fills up 13.4% of the transposition table on average with a maximum of 42.5% and a minimum of 4.5%. These changes obviously occur due to time management.
That number is not really helpful without knowing how much time it took on average for this first move. (Many engines are programmed to take extra time on the first move out of book.) But 13.4% is still a quite small filling fraction.
"In STC games", equivalent to wtime 8000 winc 80. The latter parts of this reply explain why this statistic is useless and can't be measured accurately.
The parenthetical does not apply to me; the TM algorithm is strictly position/move agnostic.
Your sense of small, or anyone else's for that matter, is irrelevant to strength.
In an STC game, the current generation of entries (entry.age == current_age) can take up to 40% of the 8MB transposition table.
In an LTC game, the current generation of entries can take up to 27% of the 64 MB transposition table.
Again, 'up to' isn't really very elucidating. It makes a huge difference if that would happen every two moves or every 100 moves, and whether the other moves then fill for 30% or for 10%. The important parameter here is the fraction of the nodes reported by the search that creates a new entry in the table (rather than hitting upon an entry that was already there). And in ID most of the nodes visited in earlier iterations (which with an EBF of 1.5 would be 2/3 of the total nodes) will be included in the tree of the final iteration.
It is useful: it means that, on average, at least one move wrote to 40% of the transposition table. You are not reading; I am measuring the newly created entries. That is what the current generation of entries means. Also, can you please stop with the maths that use made-up numbers? What an EBF of 1.5 does in a hypothetical situation is of zero interest to me.
High filling fractions reached on moves that took far above average time are not really meaningful; the game will more likely be decided in the other moves. It is the average filling of the TT that matters.
They do matter; you are contradicting the point of time management: engines ideally spend more time on moves that are decisive. Average filling does not matter; all TT entries have potential use as long as the half-move clock is not reset. Again, I repeat: the TT is not cleared between moves.
In an LTC game, it takes around 8-9 moves to fill up 50% of the transposition table, and 14-15 moves to fill up 75%.
In an STC game, it takes around 5-6 moves to fill up 50% of the transposition table, and 9-10 moves to fill up 75%.
Both get to >90% and stay around there after move 30.
Of course the table will fill up. But useless (unreachable) entries will also stay around, and in 8-9 moves (16-18 ply) a lot will happen. Most of that 50% could very well be useless.
Entries are only guaranteed to be useless if a capture is made; as for entries that have a pawn on a given square, those only become useless if that pawn is moved. The transposition table is kept between plies for a reason: it is very reusable.
As much as I'd like to provide the number of "useful" entries, it is not possible to measure it, as it would require a method that can prove whether a position is reachable from the current root.
Doesn't the search itself probe exactly that? I don't know how your aging works, but it is in principle possible to measure how many entries of the previous search (or any search before it) are probed by the current one, and how many there were initially, and how many were overwritten before they were probed. That should give you a pretty good idea of the typical number of useful entries.
No, the search doesn't probe that. You can update the age of an entry when it gets probed again, but that is not an accurate measure. If a position is reachable it is potentially useful, and it remains reachable as long as the half-move clock is not reset.
On top of all this, the 16 byte -> 12 byte entry change is the second most recent merge, with the most recent being an NNUE update. This is being VERY charitable towards your position, as the assumptions you made for an effectively smaller transposition table don't even hold for the one that is 33% larger.
Even if the numbers are 33% larger, they are not indicative of any significant hash pressure. Even if the engine did use the same time for the first move as it on average does for the others, the typical number of nodes added in the table by a search would be 1.333*13.4% = 17.9%. Hash pressure is when that is several times larger than 100%. Like it might be when someone wants to analyze a position for one hour.
Yes, they are in fact indicative of even less "hash pressure" than your assumptions suggested and yet your formula still resulted in significantly lower numbers.
If someone is analyzing a position for one hour, they should perhaps consider using a larger Hash size? As I replied to Uri Blass, 10.6 MB gains over 8 MB in LTC conditions; it really is a small price to pay for someone who's willing to spend an hour of compute.
- How can you ever distinguish the performance of replacement schemes when testing under conditions where replacement virtually does not take place?
It does take place, this is how I was able to test and merge replacement scheme patches.


I kindly ask you to not make up numbers in future conversation as I had to waste 2 hours of my time computing the actual values.
Waste???? You mean you were ignorant of what your engine does under the hood, and consider it a waste of time to learn that?
I am not ignorant of what my engine does under the hood; I programmed the entire thing. The waste here was that you didn't care to provide actual calculations, and I had to go through the effort of disproving made-up numbers that do not correspond to anything in terms of Elo.
I also kindly ask you not to change topics. This is not about whether my engine works under "severe hash pressure," it's about the testing methods one should use. Many engines directly use TT entries in LMR, singular extensions, null move pruning, internal iterative reductions, move ordering, and history updates. Depth/node comparisons, TT cutoff rates, and TT hit rates automatically break due to these direct interactions. They are not useful metrics; you yourself no longer use them. So yes, I do think it is strange that you are suggesting others use a method you yourself stopped using precisely because it did not work.
The topic is related. The reason that what I suggested to the OP (and which would most likely work perfectly for him) does not work for your engine is that you do something that strikes me as very strange: in case of hash misses you appear to give up on the branch, and hardly search it, as the reduction of the node count by a factor of 4 when running without TT shows, rather than reconstruct the info that you did not get for free from the TT (which would have cost extra nodes) and then proceed as usual. Such an extreme difference in the depth at which you search side branches cannot be good, or it would basically not matter at all how you prune and reduce. So I wonder how you get away with this. By making the search so sensitive to whether you have a hash move or not, you seem to have tuned it for a very specific miss rate. That opens the possibility that you use the replacement scheme to tune the fraction of misses to some optimal value, rather than just minimizing it.
What you call very weird gains 10 Elo for my engine and is common practice. I suggest you try it in your own engine as well.
"That cannot be good" is an awful argument when put against at least 20 engines having SPRT'd this.

Please provide actual tests that support your points rather than trying to debate your way out of things that have been proven over and over again by different people all with statistically sound methods.

Chess performance is very measurable. If you think IIR "cannot be good", you are welcome to try and simplify it out of Stockfish, Berserk, Ethereal, RubiChess, Caissa, Obsidian, Seer, Alexandria... do I need to go on?
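
For readers following along: in its common form IIR is just a small depth reduction at nodes where the TT offers no move to try first. A minimal sketch (types, threshold and names are illustrative placeholders, not any engine's actual code):

// Internal Iterative Reductions (IIR), illustrative sketch: when the TT
// offers no move to search first, reduce the remaining depth by one ply
// instead of spending full effort without guidance; a later, deeper
// iteration will usually revisit the node with a hash move available.
struct TTEntry { unsigned long long key = 0; int move = 0; };             // move == 0: no hash move
static TTEntry* tt_probe(unsigned long long /*key*/) { return nullptr; }  // stub TT for the sketch

int search(unsigned long long pos_key, int alpha, int beta, int depth) {
    TTEntry* tte     = tt_probe(pos_key);
    int      tt_move = tte ? tte->move : 0;

    if (depth >= 4 && tt_move == 0)
        --depth;                 // IIR: no hash move at a reasonably deep node

    // ... generate moves, search tt_move first, recurse with depth - 1,
    //     store the result back into the TT ...
    (void)beta; (void)depth;
    return alpha;                // stub return; real code returns the search score
}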
No one at any point in this thread has suggested an algorithm to use as far as I am aware, I for sure haven't. The point of promoting good testing methods (i.e. SPRT) is that the person can now actually try something on their own and innovate rather than get lost in metrics that do not correlate to strength.
Wasn't that your question then? Why I didn't just give him the algorithm that I found to be universally best in terms of optimizing hash hits, rather than telling him the method I used for finding that algorithm?
No, the question is why you gave him a metric that serves no purpose for developing an engine.

I will not be replying in this thread any longer unless:
- I am provided with evidence that IIR in fact loses Elo.
- You show that your replacement-scheme testing method has a strong correlation with SPRT testing.

If any of these are provided, I'll be very willing to:
- Discuss why IIR is useful in my testing conditions and causes Elo losses at your testing conditions.
- Experiment with your method on my own engine and share my own findings.

Until then, I am not willing to continue this discussion, as I see no way of interpreting your arguments as good faith from this point on, considering I've done nothing but provide actual numbers while you've done nothing but provide verbal explanations that contradict the findings of many.

edit: please excuse the formatting - I will be fixing it
I gave up on fixing this, I have no idea how this website works.
[moderation] I 'wasted' 10 min to shape it up a bit. ;)