Hybrid replacement strategy worse than always-replace

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

JacquesRW wrote: Fri Apr 26, 2024 12:58 am I was going to point this out in my original post, but didn't see any reason to, because your whole point of doing A + B over SPRT was for patches that interact, and I had hoped that you would act in good enough faith not to point this out as some kind of "gotcha" (I notice that you didn't make any argument for if they do interact, why not?).
No, that was not the whole point, not the main point, and not even one of the important points. The point of doing this would be that it could require fewer games than sequential SPRT on individual patches, by using the same set of games to measure the effect of a number of patches simultaneously. That it also allows you to detect interaction of the patches is just a free bonus, which was only mentioned to rebut the remark that this would not work when patches interact.

BTW, how many entries were there in the TT for the akimbo bench results that you presented?
JacquesRW wrote: Fri Apr 26, 2024 1:15 am I am concerned about it, and so now your advice is to play games? Wouldn't it have been simpler to just play games from the start?
It can be advantageous to separate the issues. Unfortunately I know of no other way to assess move quality than playing games. But it is a very noisy method, and excessively many games are needed to detect small differences. If you could determine an upper bound for the effect of grafting by testing with a generously large TT and a good replacement scheme (e.g. shallowest of four with aging), so that essentially no overwriting takes place at all, you might need very many games, but you would only have to do that once. You could then test various replacement schemes with a small TT by measuring how much they reduce the node count of the same search tree on a set of positions. The reduced number of TT hits you would have in the small table should decrease the effect of grafting on move quality even more. That way you avoid having to play the huge number of games for each replacement scheme that you test.
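For illustration, here is a minimal sketch of what such a "shallowest of four with aging" scheme could look like; the entry layout, the bucket size of four and the exact aging penalty are my own assumptions for the example, not a description of any particular engine.

Code: Select all

/* Sketch of a "shallowest of four with aging" replacement scheme.
   Entry layout, bucket size and aging penalty are illustrative assumptions. */
#include <stdint.h>

typedef struct {
    uint64_t key;    /* position signature (0 = empty slot) */
    int16_t  score;
    uint8_t  depth;  /* draft of the stored search result */
    uint8_t  age;    /* search number in which the entry was last written */
    uint16_t move;
    uint8_t  bound;  /* exact / lower / upper */
} TTEntry;

#define BUCKET 4

/* Pick the slot to overwrite: reuse a hit or an empty slot if possible,
   otherwise replace the entry whose depth, discounted for staleness,
   is the lowest ("shallowest", with stale entries counting as shallower). */
TTEntry *select_slot(TTEntry bucket[BUCKET], uint64_t key, uint8_t cur_age)
{
    TTEntry *victim = &bucket[0];
    int worst = 1 << 30;
    for (int i = 0; i < BUCKET; i++) {
        TTEntry *e = &bucket[i];
        if (e->key == key || e->key == 0)
            return e;
        int staleness = (uint8_t)(cur_age - e->age); /* searches since last write */
        int value = (int)e->depth - 2 * staleness;   /* aging penalty */
        if (value < worst) {
            worst  = value;
            victim = e;
        }
    }
    return victim;
}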

I could be wrong, but I expect the effect of grafting in the middle-game to be very small, based on the observation of Bob Hyatt that he could replace an amazingly large fraction of the evaluation in the leaves (close to a percent) without affecting the move choice in the root. And most grafts would not really cause a dramatic change of the score; the latter usually changes only slightly with depth. Only in cases where the extra depth sees something dramatic, such as a checkmate, is there a huge change. But that would only be the case in a very small fraction of the grafts.
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

pgg106 wrote: Thu Apr 25, 2024 11:56 pm Time and time again devs that have opted for alternative testing solutions (mostly because they didn't know SPRT existed) have ended up with feature-bloated engines that could barely play decent chess, hundreds if not thousands of Elo weaker than they should've been.
If you could prove your testing methodology was somehow as sound as sprt and more efficient (and there are very very serious doubts about this) it would still be far too error prone (compared to just setting up OB and running sprts) to casually suggest using it.
This isn't even about mathematics, this is about the advice being given here on the regular being utter junk, something that should stop.

Edit: it's insane that a conclusion as obvious as "use the testing every engine in the top 100 CCRL uses" needs 3 pages of arguments; this is why talkchess is a cesspit and the only useful interactions happen because actual devs take pity on people being misled.
It is pretty obvious that statistically unsound criteria for accepting patches will hamper the development of an engine. And people that do not understand statistics can be prone to such pitfalls; innocent people have been jailed for murder here because of improper calculation of the likelihood of "reasonable doubt". SPRT is just a particular method for testing, with a variable number of games, whether the difference likely exceeds a threshold; it allows you to stop early when the result is very clear, which necessarily goes at the expense of needing more games when the result is a close call. There is nothing magical about it.

But because it works, it has apparently become a religion for people without a proper understanding of statistics, who consider any suggestion that there could be other methodologies that are sound, and equally or even more efficient, to be blasphemy. It infuriates them, and in blind rage they want to burn the heretics who dare suggest such a thing at the stake.

There is nothing wrong with testing with a fixed number of games, as the OP did, provided you make that number large enough, which on average might mean more games than SPRT would need. But if simplicity of the methodology is an argument, it is the simplest thing to do. What you should not do is rerun the test when there isn't a clear result, until you get one. If you do two tests of N games, not even the combined result of the 2N games has the reliability of a 'monolithic' 2N-game test, because you already built in a possibility of false acceptance after N games (without further testing), and additional testing can only increase that probability. If you want to be able to stop after N games, but continue in case of an 'unclear' result, you must set the threshold for acceptance without further testing much higher than when you would never reconsider the patch. The common mistake is that people do not do that; but there is nothing unsound about the approach if they do. It is in fact what SPRT does, and it is the only thing it does. In the end all that matters is how small a likelihood of false acceptance your testing method provides, not exactly how you achieve that.
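To make that concrete, here is a small Monte Carlo sketch (my own illustration; the match length, draw rate and thresholds are arbitrary choices) that plays matches between two equal engines. It compares the false-acceptance rate of "accept after N games if two sigma ahead, otherwise play N more and accept if two sigma ahead on the combined 2N games" with that of a single monolithic 2N-game test using the same two-sigma bound.

Code: Select all

/* Monte Carlo illustration only: N, draw rate and thresholds are arbitrary.
   Under H0 (equal engines) the "retest when unclear" protocol accepts more
   often than a single monolithic 2N-game test with the same 2-sigma bound. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Play n games between two equal engines; return wins minus losses. */
static int play_match(int n, double draw_rate)
{
    int margin = 0;
    for (int i = 0; i < n; i++) {
        double r = rand() / (RAND_MAX + 1.0);
        if (r < draw_rate)
            continue;                 /* draw */
        margin += (r < draw_rate + (1.0 - draw_rate) / 2.0) ? 1 : -1;
    }
    return margin;
}

int main(void)
{
    const int    N = 1000, trials = 20000;
    const double draw = 0.4;
    /* The margin's standard deviation is sqrt(n * (1 - draw));
       accepting at ~2 sigma gives roughly 2.3% false acceptance per test. */
    const int T  = (int)(2.0 * sqrt(N * (1.0 - draw)));
    const int T2 = (int)(2.0 * sqrt(2.0 * N * (1.0 - draw)));
    int accept_retest = 0, accept_mono = 0;

    for (int t = 0; t < trials; t++) {
        /* Protocol A: fixed N games, retest once if the result is unclear. */
        int m1 = play_match(N, draw);
        if (m1 >= T)
            accept_retest++;
        else if (m1 > -T && m1 + play_match(N, draw) >= T2)
            accept_retest++;

        /* Protocol B: one monolithic test of 2N games. */
        if (play_match(2 * N, draw) >= T2)
            accept_mono++;
    }
    printf("false acceptance, N + retest : %.3f\n", (double)accept_retest / trials);
    printf("false acceptance, single 2N  : %.3f\n", (double)accept_mono / trials);
    return 0;
}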

Anyway, it appears you have already changed your tune from "spreading misinformation" to "casually suggesting a complicated method to the unwary novice". But if you take the trouble to read back you can see that I wasn't really advising this to the OP at all. I was just pointing out that using the SPRT stopping criterion on the game set he had already generated would have led him to the same conclusion, with any plausible bound on the confidence. False claims accompanied by the disclaimer "Don't listen to anyone that tells otherwise" deserve a correction, though. Hence the added casual remark that so upsets you.
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

Pali wrote: Fri Apr 26, 2024 12:33 am"It cannot have search all branches to the same depth"
What you are referring to is called pruning, reductions and extensions. If you consider this lying about depth, sure go ahead. But keep in mind that no engine is honest about depth by your definition and as such, you shouldn't give this advice to anyone without explicitly reviewing their code and making sure that they perform no reductions, pruning or extensions at any point.
Sorry for being unclear; I meant: "cannot have searched all branches to the same depth in the search without TT as it did with the TT". The presented results suggest that there is far more pruning or reduction when running without TT. Searches with TT can search the same tree with fewer nodes than the tree actually has, because they eliminate the duplicate work by passing it on through the TT. But a search without TT can never search a tree with more nodes than the number it reports. The tree it was searching was at least 4 times smaller than that in the search with the always-replace scheme, and if the TT was any help in the latter, more likely at least 10 times smaller.

IMO that makes it very misleading to also present it as a 30-ply search. In engines depth is assumed to be a quality measure for the search result; most advanced replacement schemes rely on that. But if this quality is so strongly dependent on whether you have hash hits or not, it becomes a very unreliable indicator.

And yes, it should be obvious that if you want to compare replacement schemes, you can only do it by comparing how many nodes you save through hash cutoffs on the same tree. Comparing entirely different trees is meaningless.
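A sketch of what such a comparison could look like in practice is below; tt_init(), set_position() and search() are hypothetical engine hooks, and the scheme and position lists are placeholders. Note that this only measures what is described above if the shape of the search tree does not itself depend on the TT contents, i.e. if on a miss the engine does the extra work and then continues as it would have after a hit.

Code: Select all

/* Hypothetical harness: search the same positions to the same fixed depth
   with a deliberately small TT under each replacement scheme and compare
   total node counts. tt_init(), set_position() and search() are assumed
   engine hooks, not a real API. */
#include <stdio.h>
#include <stdint.h>

extern void     tt_init(int megabytes, int scheme); /* allocate TT, pick scheme */
extern void     set_position(const char *fen);
extern uint64_t search(int depth);                  /* returns nodes searched */

static const char *schemes[] = {
    "always-replace", "depth-preferred", "shallowest-of-four+aging"
};
static const char *positions[] = {
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    /* ... more test FENs ... */
};

int main(void)
{
    const int n_schemes = sizeof schemes   / sizeof schemes[0];
    const int n_pos     = sizeof positions / sizeof positions[0];
    const int depth     = 14;

    for (int s = 0; s < n_schemes; s++) {
        uint64_t total = 0;
        tt_init(8, s);                      /* small table: heavy hash pressure */
        for (int p = 0; p < n_pos; p++) {
            set_position(positions[p]);
            total += search(depth);         /* fewer nodes = better grafting */
        }
        printf("%-26s %12llu nodes\n", schemes[s], (unsigned long long)total);
    }
    return 0;
}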
Uri Blass
Posts: 10378
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hybrid replacement strategy worse than always-replace

Post by Uri Blass »

hgm wrote: Fri Apr 26, 2024 9:33 am
Pali wrote: Fri Apr 26, 2024 12:33 am"It cannot have search all branches to the same depth"
What you are referring to is called pruning, reductions and extensions. If you consider this lying about depth, sure go ahead. But keep in mind that no engine is honest about depth by your definition and as such, you shouldn't give this advice to anyone without explicitly reviewing their code and making sure that they perform no reductions, pruning or extensions at any point.
Sorry for being unclear; I meant: "cannot have searched all branches to the same depth in the search without TT as it did with the TT". The presented results suggest that there is far more pruning or reduction when running without TT. Searches with TT can search the same tree with fewer nodes than the tree actually has, because they eliminate the duplicate work by passing it on through the TT. But a search without TT can never search a tree with more nodes than the number it reports. The tree it was searching was at least 4 times smaller than that in the search with the always-replace scheme, and if the TT was any help in the latter, more likely at least 10 times smaller.

IMO that makes it very misleading to also present it as a 30-ply search. In engines depth is assumed to be a quality measure for the search result; most advanced replacement schemes rely on that. But if this quality is so strongly dependent on whether you have hash hits or not, it becomes a very unreliable indicator.

And yes, it should be obvious that if you want to compare replacement schemes, you can only do it by comparing how many nodes you save through hash cutoffs on the same tree. Comparing entirely different trees is meaningless.
First, I suggest using the word 'iteration' rather than 'depth', because with all the extensions and reductions the iteration number does not tell you how many plies the engine actually searched.

Let's take a simple example.
Suppose that your search reduces every move except the hash move by 1 ply.
Without a hash table you reduce all moves by 1 ply, so at iteration 20 you effectively search to depth 10 (because after every move you make a reduction).
With a hash table there are moves that you do not reduce, so it can take more time to reach iteration 20.

I am not saying that it is a good idea to do this, but it is clearly possible that some good idea causes the engine to search more nodes to reach the same iteration with more hash, because the hash table tells the engine about cases where it should not prune, or should extend.
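As a tiny sketch of that hypothetical reduction rule (not taken from any real engine), the depth handed to a child node could look like this:

Code: Select all

/* Hypothetical rule from the example above: every move except the hash move
   is reduced by one extra ply. With no TT there is never a hash move, so all
   moves cost 2 plies of depth and a nominal depth-20 search reaches only
   about 10 plies along every line. */
typedef int Move;
#define MOVE_NONE 0

int child_depth(int depth, Move move, Move hash_move)
{
    int reduction = (move == hash_move && hash_move != MOVE_NONE) ? 0 : 1;
    return depth - 1 - reduction;
}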
Pali
Posts: 27
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

hgm wrote: Fri Apr 26, 2024 9:33 am
Pali wrote: Fri Apr 26, 2024 12:33 am"It cannot have search all branches to the same depth"
What you are referring to is called pruning, reductions and extensions. If you consider this lying about depth, sure go ahead. But keep in mind that no engine is honest about depth by your definition and as such, you shouldn't give this advice to anyone without explicitly reviewing their code and making sure that they perform no reductions, pruning or extensions at any point.
Sorry for being unclear; I meant: "cannot have searched all branches to the same depth in the search without TT as it did with the TT". The presented results suggest that there is far more pruning or reduction when running without TT. Searches with TT can search the same tree with fewer nodes than the tree actually has, because they eliminate the duplicate work by passing it on through the TT. But a search without TT can never search a tree with more nodes than the number it reports. The tree it was searching was at least 4 times smaller than that in the search with the always-replace scheme, and if the TT was any help in the latter, more likely at least 10 times smaller.

IMO that makes it very misleading to also present it as a 30-ply search. In engines depth is assumed to be a quality measure for the search result; most advanced replacement schemes rely on that. But if this quality is so strongly dependent on whether you have hash hits or not, it becomes a very unreliable indicator.

And yes, it should be obvious that if you want to compare replacement schemes, you can only do it by comparing how many nodes you save through hash cutoffs on the same tree. Comparing entirely different trees is meaningless.
Depth is only assumed to be a quality measure under identical conditions. Hash size and replacement scheme directly affect: Singular Extensions, Double Extensions, Triple Extensions, Negative Extensions, Multi-Cut, History updates, Internal Iterative Reductions. You can't possibly compare depth between replacement schemes with all of these being directly affected.

If you want to compare replacement schemes, you do so by playing chess games with the new binary versus the old binary.

Side note: Your testing method is extremely efficient compared to what everyone else is doing. It is efficient to the point that you can test a Transposition Table patch faster with your own hardware than with Fishtest hardware. This makes me doubt your methods work even for your own special case, considering you've cut down testing costs by orders of magnitude compared to everyone else. Where is the engine you've applied these methodologies to?
pgg106
Posts: 25
Joined: Wed Mar 09, 2022 3:40 pm
Full name: . .

Re: Hybrid replacement strategy worse than always-replace

Post by pgg106 »

Rebel wrote: Fri Apr 26, 2024 7:18 am I seldom go to Discord, maybe once a month. It's an obscure, cluttered chat box with a bad search function. I never found anything useful there. Make it an organized forum like here.
I agree that the information density and the search leave much to be desired. The thing is, on Discord you don't have to argue for 5 pages to get a noob to start testing stuff properly. It doesn't matter how well organized talkchess is when bad information gets shared regularly and basic, known-good information meets 5 pages of attrition, even from a mod; you can perfectly index all of this and it's still trash.
Rebel wrote: Fri Apr 26, 2024 7:18 am And BTW, if you keep posting fill in your name.
I am posting because you addressed me directly. I have no intention to add a name; a name doesn't add extra validity to my arguments. I think the rule itself is idiotic and the focus on it myopic, given the much bigger problems this forum currently has (see this entire thread).
Rebel wrote: Fri Apr 26, 2024 7:18 am And out of curiosity, what is your engine called?
The engine is https://github.com/PGG106/Alexandria. It would probably be shunned here for being very "sf-like" (even if somehow clones don't get that treatment), but it was written from scratch and it is standing proof that SPRT testing works, as are the dozens of new, strong (even if at times basic and/or unoriginal) engines that come out.
To anyone reading this post in the future: don't ask for help on talkchess; it's a dead site where you'll only get led astray. The few people talking sense here come from the Stockfish Discord server, so just join it and actual devs will help you.
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

Pali wrote: Fri Apr 26, 2024 1:01 pm Depth is only assumed to be a quality measure under identical conditions. Hash size and replacement scheme directly affect: Singular Extensions, Double Extensions, Triple Extensions, Negative Extensions, Multi-Cut, History updates, Internal Iterative Reductions. You can't possibly compare depth between replacement schemes with all of these being directly affected.
It sounds very fragile. Are you sure you developed all this while testing under conditions of severe hash pressure?
Side note: Your testing method is extremely efficient compared to what everyone else is doing. It is efficient to the point that you can test a Transposition Table patch faster with your own hardware than with Fishtest hardware. This makes me doubt your methods work even for your own special case, considering you've cut down testing costs by orders of magnitude compared to everyone else. Where is the engine you've applied these methodologies to?
I used it a lot when developing Shokidoki, typically by running fixed-length matches and only accepting or rejecting outright if the result was very clear. If the result was less clear I would only 'half accept' the patch, and run the next test, for another patch, with 50% of the games also played by versions that implemented the undecided patch, to get more games to differentiate the two.

This is not very different in spirit from SPRT, actually, where you also add more games as long as the situation is still considered undecided. Except that it doesn't judge that after every game, but only every so many games. And, of course, some of the game clusters were testing multiple patches at once.
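For what it is worth, here is a sketch of such a batched stopping rule: the ordinary SPRT decision is evaluated, but only after each batch of games; the LLR is the common normal (trinomial W/D/L) approximation used by several testing tools, and the Elo hypotheses, error rates and batch size are placeholder values of my own.

Code: Select all

/* Sketch: an SPRT-style stopping rule checked only once per batch of games.
   The LLR is the usual normal approximation over W/D/L counts; elo0/elo1,
   alpha/beta and the batch size are placeholder choices. */
#include <math.h>

static double elo_to_score(double elo)
{
    return 1.0 / (1.0 + pow(10.0, -elo / 400.0));
}

/* Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0). */
static double llr(int w, int d, int l, double elo0, double elo1)
{
    int n = w + d + l;
    if (n == 0) return 0.0;
    double m   = (w + 0.5 * d) / n;               /* mean score */
    double var = (w * (1.0 - m) * (1.0 - m)       /* per-game variance */
                + d * (0.5 - m) * (0.5 - m)
                + l * m * m) / n;
    if (var <= 0.0) return 0.0;
    double s0 = elo_to_score(elo0), s1 = elo_to_score(elo1);
    return (s1 - s0) * (2.0 * m - s0 - s1) / (2.0 * var / n);
}

/* Call after every batch of games: +1 = accept, -1 = reject, 0 = continue. */
int batch_decision(int wins, int draws, int losses)
{
    const double alpha = 0.05, beta = 0.05;        /* error-rate targets */
    const double lower = log(beta / (1.0 - alpha));
    const double upper = log((1.0 - beta) / alpha);
    double v = llr(wins, draws, losses, 0.0, 5.0); /* H0: +0 Elo, H1: +5 Elo */
    if (v >= upper) return +1;
    if (v <= lower) return -1;
    return 0;
}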
Pali
Posts: 27
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

hgm wrote: Fri Apr 26, 2024 6:40 pm
Pali wrote: Fri Apr 26, 2024 1:01 pm Depth is only assumed to be a quality measure under identical conditions. Hash size and replacement scheme directly affect: Singular Extensions, Double Extensions, Triple Extensions, Negative Extensions, Multi-Cut, History updates, Internal Iterative Reductions. You can't possibly compare depth between replacement schemes with all of these being directly affected.
It sounds very fragile. Are you sure you developed all this while testing under conditions of severe hash pressure?
Side note: Your testing method is extremely efficient compared to what everyone else is doing. It is efficient to the point that you can test a Transposition Table patch faster with your own hardware than with Fishtest hardware. This makes me doubt your methods work even for your own special case, considering you've cut down testing costs by orders of magnitude compared to everyone else. Where is the engine you've applied these methodologies to?
I used it a lot when developing Shokidoki, typically by running fixed-length matches and only accepting or rejecting outright if the result was very clear. If the result was less clear I would only 'half accept' the patch, and run the next test, for another patch, with 50% of the games also played by versions that implemented the undecided patch, to get more games to differentiate the two.

This is not very different in spirit from SPRT, actually, where you also add more games as long as the situation is still considered undecided. Except that it doesn't judge that after every game, but only every so many games. And, of course, some of the game clusters were testing multiple patches at once.
It's as fragile as the rest of the top 50 engines who use at least a subset of what I use. Each patch is SPRT tested at short and long time controls and the progress is shown by both regression tests I run and the tests run by CCRL, CEGT, IpmanChess and other testers.

Your described testing methodology has nothing to do with your suggested hash hits to nodes/nodes to depth methodology. Why are you suggesting methodologies that you do not use?
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hybrid replacement strategy worse than always-replace

Post by hgm »

Pali wrote: Fri Apr 26, 2024 7:10 pmIt's as fragile as the rest of the top 50 engines who use at least a subset of what I use. Each patch is SPRT tested at short and long time controls and the progress is shown by both regression tests I run and the tests run by CCRL, CEGT, IpmanChess and other testers.
That doesn't really answer my question. How large was your hash table during these tests, and what was the TC and nps of the engine?
Your described testing methodology has nothing to do with your suggested hash hits to nodes/nodes to depth methodology. Why are you suggesting methodologies that you do not use?
Oh, I thought you were asking about the simultaneous testing of multiple patches.

Of course I extensively tested many replacement schemes through the described method. That was straightforward, because all my engines did suffer from hash misses, as they would then spend extra nodes to obtain the sought information before continuing as they would have if they had got the info for free from the TT. But I stopped doing that when I found the result was basically the same in all engines. Which makes sense, as the goal is always the same: maximize the information flow from the TT to the search, by preserving the entries that contribute the most information (in terms of saved search effort and frequency of use).
Pali
Posts: 27
Joined: Wed Dec 01, 2021 12:23 pm
Full name: Doruk Sekercioglu

Re: Hybrid replacement strategy worse than always-replace

Post by Pali »

hgm wrote: Fri Apr 26, 2024 7:46 pm
Pali wrote: Fri Apr 26, 2024 7:10 pmIt's as fragile as the rest of the top 50 engines who use at least a subset of what I use. Each patch is SPRT tested at short and long time controls and the progress is shown by both regression tests I run and the tests run by CCRL, CEGT, IpmanChess and other testers.
That doesn't really answer my question. How large was your hash table during these tests, and what was the TC and nps of the engine?
Your described testing methodology has nothing to do with your suggested hash hits to nodes/nodes to depth methodology. Why are you suggesting methodologies that you do not use?
Oh, I thought you were asking about the simultaneous testing of multiple patches.

Of course I extensively tested many replacement schemes through the described method. That was straightforward, because all my engines did suffer from hash misses, as they would then spend extra nodes to obtain the sought information before continuing as they would have if they had got the info for free from the TT. But I stopped doing that when I found the result was basically the same in all engines. Which makes sense, as the goal is always the same: maximize the information flow from the TT to the search, by preserving the entries that contribute the most information (in terms of saved search effort and frequency of use).
That does answer your question; it means that my test results translate over to the testing conditions of the various rating lists.
But since you asked:
STC: NPS anchored to 975k/s - 8+0.08 - 8MB
LTC: NPS anchored to 975k/s - 40+0.4 - 64MB

"But I stopped doing that when I found the result was basically the same in all engines"
Then why are you... suggesting this?
I can also tell you with 100% confidence that different replacement schemes gain in different engines. Engine developers try ideas from each other all the time, and that includes replacement schemes. We have not converged to an optimum.