New engine releases & news 2021

mvanthoor · Post by **mvanthoor** » Mon Nov 08, 2021 11:03 pm

algerbrex wrote: ↑Mon Nov 08, 2021 9:24 pm I also occasionally use SPRT testing if I suspect one version of my engine I'm testing will be significantly stronger than the other. SPRT saves time since it'll end tests after a certain number of games when it's fairly confident it has the right idea about the strength difference between the programs. (Marcel Vanthoor has a nice writeup on this here https://rustic-chess.org/progress/sprt_testing.html).

Thanks for the referral. I changed this writeup a few days ago to make it clearer what CuteChess is doing. Press CTRL+F5 in your browser to be certain that you get the latest version. I have received a comment that this page isn't (and wasn't) "fully accurate with regard to how an SPRT-test works". However, it _is_ accurate with regard to what CuteChess is actually doing. I empirically tested this myself, because there has been huge confusion about this ever since CuteChess implemented SPRT. (Its internal help seems to describe a different behavior than what it actually implements.)

So that page describes what CuteChess does when you give it the "-sprt" parameter... but it could still be that Cutechess itself implements a "strange" version of what SPRT actually is. I haven't been able to find definitive information on this topic, and as I said on that page, I'm not a statistician.

algerbrex · Post by **algerbrex** » Mon Nov 08, 2021 11:08 pm

jmcd wrote: ↑Mon Nov 08, 2021 9:53 pm I'm aware of cutechess. Currently I'm using arena and it does the job. I like whatever Brunetti and some of the other testers here are doing more though, because they have info on illegal moves and timeouts which aren't displayed in the final result on arena (though I assume there is some way to do it). I've also seen people talking about benchmark or something like that, and it's on the long list of things that I need to look into.

I don't really guess off a few insignificant games, but my point was that I am not surprised that Hopper exhibits strange behaviour on long time controls because I've never tested it with them.

Cool, glad to hear that. My point was just that you'll often need quite a few games to determine whether some new feature is truly a gain or a wash, and using bullet time controls is what I've found to be the easiest method.

But Blunder isn't very strong at longer time controls either. Partly because of the strictly linear time management as Guenther pointed out, and partly because my evaluation is quite weak (currently it only considers material + PST + basic mobility).

I already know one way I could improve Blunder's time management is by finding a way to prevent it from wasting time on a search it likely won't finish. The hard part of this feature however would be quantifying "likely."

jmcd wrote: ↑Mon Nov 08, 2021 9:53 pm I've always thought that the time check at every single node seems like a waste. I'm not sure how expensive time check functions are (I assume pretty cheap) but still, if your usable resolution is 1ms and you're going through 1 million nodes a second, why check every node? Seems like a waste to me, but perhaps never going over your allocated time pays off. Also I think that threads might solve this problem entirely if I understand their capabilities correctly.

Well, the time check function isn't quite called every node, but usually something like every 2048 or every 4096 nodes. For example, here's where Blunder checks time:

Code: Select all

// Every 2048 nodes, check if our time has expired.
if (search.nodes & 2047) == 0 {
	search.Timer.Check()
}

// If we're told to stop, abort the current search and return 0. This won't
// affect anything, as the previous search's best move will be used, and
// everything from the current search will be discarded.
if search.Timer.Stop {
	return 0
}

Anyhow, I suspect this method probably is wasteful to a degree, and I had the same thoughts when I started writing Blunder. But it seems to work pretty well so far and it keeps Blunder on schedule and consistent, which to me was more important starting out than being efficient. And the time checkup function should be pretty cheap as well, although that is pure intuition and I would need to actually test things to make any firm conclusions.
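
To put a rough number on how often the clock actually gets polled with this kind of masking, here is a small Go sketch. The helper names (shouldCheckTime, countChecks) are made up for illustration and are not from Blunder's source; the only assumption is that the interval is a power of two, which is what makes the bitmask equivalent to a modulo.

```go
package main

// shouldCheckTime reports whether the node counter has reached a multiple of
// interval. The interval must be a power of two, so that interval-1 is a mask
// of the low bits (2048 -> 2047). Hypothetical helper, for illustration only.
func shouldCheckTime(nodes, interval uint64) bool {
	return nodes&(interval-1) == 0
}

// countChecks returns how many times the clock would be polled while the
// search visits totalNodes nodes, counting from node 1.
func countChecks(totalNodes, interval uint64) uint64 {
	var checks uint64
	for n := uint64(1); n <= totalNodes; n++ {
		if shouldCheckTime(n, interval) {
			checks++
		}
	}
	return checks
}
```

With an interval of 2048, a search running at about 1 million nodes per second polls the clock a few hundred times per second, i.e. roughly once every 2 ms, which is in the same ballpark as a 1 ms usable resolution.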

And it seems to me threads would work here as well if that's the approach you wanted to use. Spawn a separate thread for the search and in another thread periodically check if the time is up. If it is, let the search thread know it's time to end. Again though, I'm not an expert on threads either, this is just my initial opinion on the matter.

algerbrex · Post by **algerbrex** » Mon Nov 08, 2021 11:09 pm

mvanthoor wrote: ↑Mon Nov 08, 2021 11:03 pm
algerbrex wrote: ↑Mon Nov 08, 2021 9:24 pm I also occasionally use SPRT testing if I suspect one version of my engine I'm testing will be significantly stronger than the other. SPRT saves time since it'll end tests after a certain number of games when it's fairly confident it has the right idea about the strength difference between the programs. (Marcel Vanthoor has a nice writeup on this here https://rustic-chess.org/progress/sprt_testing.html).
Thanks for the referral. I changed this writeup a few days ago to make it clearer what CuteChess is doing. Press CTRL+F5 in your browser to be certain that you get the latest version. I have received a comment that this page isn't (and wasn't) "fully accurate with regard to how an SPRT-test works". However, it _is_ accurate with regard to what CuteChess is actually doing. I empirically tested this myself, because there has been huge confusion about this ever since CuteChess implemented SPRT. (Its internal help seems to describe a different behavior than what it actually implements.)

So that page describes what CuteChess does when you give it the "-sprt" parameter... but it could still be that Cutechess itself implements a "strange" version of what SPRT actually is. I haven't been able to find definitive information on this topic, and as I said on that page, I'm not a statistician.

Hmm, good points. I'll make a note of them, thanks. I'd still be curious to better understand what cutechess is doing if it isn't standard SPRT testing.

mvanthoor · Post by **mvanthoor** » Mon Nov 08, 2021 11:47 pm

algerbrex wrote: ↑Mon Nov 08, 2021 11:09 pm Hmm, good points. I'll make a note of them, thanks. I'd still be curious to better understand what cutechess is doing if it isn't standard SPRT testing.

-sprt elo0=ELO0 elo1=ELO1 alpha=ALPHA beta=BETA
Use a Sequential Probability Ratio Test as a termination
criterion for the match. This option should only be used
in matches between two players to test if engine A is
stronger than engine B. Hypothesis H1 is that A is
stronger than B by at least ELO0 ELO points, and H0
(the null hypothesis) is that A is not stronger than B
by at least ELO1 ELO points. The maximum probabilities
for type I and type II errors outside the interval
[ELO0, ELO1] are ALPHA and BETA. The match is stopped if
either H0 or H1 is accepted or if the maximum number of
games set by '-rounds' and/or '-games' is reached.

According to CuteChess' help, it connects H1 to elo0, which, according to various sources, is incorrect compared to what it internally does. (I have not checked this myself in the code.) Also, the description above is unclear. You normally wouldn't even have elo0 and elo1 as two parameters.

Normally, you state:
- H1: Engine NEW is at least 10 Elo stronger than OLD.

Then H0 automatically becomes:
- H0: Engine NEW is NOT at least 10 Elo stronger than engine OLD.

So you only have one parameter. So even if the new engine is 7 Elo stronger than the old engine, the test would still fail. (Because it is NOT at least 10 Elo stronger.)

What CuteChess actually _seems_ to be doing is testing against H1 within an elo0 and elo1 range, where elo0 and elo1 define the H1 hypothesis, but only the "elo0" part is negated:

- elo0: 0
- elo1: 10

So this means:

H1: Engine NEW is at least 0 Elo stronger than engine OLD, AND outside a margin of 10 Elo.
H0: Engine NEW is NOT at least 0 Elo stronger than engine OLD, AND outside a margin of 10 Elo.

So if NEW is 15 Elo stronger, then H1 is true: It's at least 0 Elo stronger, and it's outside a margin of 10 Elo.
If NEW is 15 Elo weaker (or "-15 stronger"), then H0 is true: It's NOT at least 0 Elo stronger (it's 15 weaker), AND it's outside a margin of 10 Elo.

This means that Cutechess will keep testing between a -10 and +10 margin.

So yes, it is an SPRT test as far as I can see (as H1 can be as complex as you want, and H0 then automatically becomes "not H1"), but it's not very well described in the help. It may actually be described WRONG in the help, but it's so... unclear to me that I can't even determine if this is true.

Therefore I just tested what Cutechess actually does, and that is what I described on that page.

This also makes it possible to set this:
- Elo0: -5
- Elo1: 7

H1: Engine NEW is expected to be at least -5 Elo stronger than OLD, AND outside a 12 Elo margin.
H0: Engine NEW is NOT expected to be at least -5 Elo stronger than OLD, AND outside a 12 Elo margin.

Thus Cutechess would keep testing if the NEW engine is between -5 and +7. If the engine is +15 Elo, then it is "at least -5 Elo stronger", and it's outside the 12 Elo margin, so CuteChess accepts H1. If the engine is -8 Elo, it is NOT at least -5 Elo stronger (because it's now -8) AND it's outside the 12 Elo margin (+7 - 12 = -5, and -8 is outside it), so CuteChess accepts H0.

This feels very logical, but if the help is intending to describe this, then it does a poor job at it.
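
For reference, here is a minimal Go sketch of how a textbook SPRT can turn elo0, elo1, alpha and beta into a stop/continue decision. This is NOT taken from CuteChess' source and may differ from what it does internally: it assumes the plain logistic Elo model, Wald's standard thresholds, and a normal approximation of the log-likelihood ratio computed from the win/draw/loss counts. All function names are made up for illustration.

```go
package main

import "math"

// expectedScore converts an Elo difference into an expected score using the
// logistic Elo model (assumption: no separate draw model).
func expectedScore(elo float64) float64 {
	return 1.0 / (1.0 + math.Pow(10, -elo/400.0))
}

// sprtBounds returns Wald's lower and upper thresholds for the
// log-likelihood ratio, given the type I and type II error rates.
func sprtBounds(alpha, beta float64) (lower, upper float64) {
	return math.Log(beta / (1 - alpha)), math.Log((1 - beta) / alpha)
}

// llr approximates the log-likelihood ratio of "true difference is elo1"
// versus "true difference is elo0", using a normal approximation based on
// the observed mean score and per-game variance.
func llr(wins, draws, losses int, elo0, elo1 float64) float64 {
	n := float64(wins + draws + losses)
	if n == 0 {
		return 0
	}
	w, d := float64(wins)/n, float64(draws)/n
	mean := w + 0.5*d
	variance := w + 0.25*d - mean*mean
	if variance <= 0 {
		return 0 // not enough variety in the results to say anything yet
	}
	s0, s1 := expectedScore(elo0), expectedScore(elo1)
	return n * (s1 - s0) * (2*mean - s0 - s1) / (2 * variance)
}

// decide returns "H1", "H0" or "continue" for the results so far.
func decide(wins, draws, losses int, elo0, elo1, alpha, beta float64) string {
	lower, upper := sprtBounds(alpha, beta)
	ratio := llr(wins, draws, losses, elo0, elo1)
	switch {
	case ratio >= upper:
		return "H1"
	case ratio <= lower:
		return "H0"
	default:
		return "continue"
	}
}
```

In this formulation the match keeps running for as long as the ratio stays between the two thresholds, which is at least consistent with the "keep testing inside the margin" behavior I observed, but I can't say whether it is the same calculation.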

PS: If someone can definitively prove me wrong AND clearly explains what cutechess does, I'll gladly change that page again, and give proper credits.

Gabor Szots · Post by **Gabor Szots** » Tue Nov 09, 2021 8:24 am

I thought this thread was intended for engine announcements, not engine discussion.

Guenther · Post by **Guenther** » Tue Nov 09, 2021 8:33 am

Gabor Szots wrote: ↑Tue Nov 09, 2021 8:24 am I thought this thread was intended for engine announcements, not engine discussion.

This will be fixed, hopefully in the near future; see my reference in this earlier post:
forum3/viewtopic.php?f=2&t=76209&start=860#p911175

As for now it is probably much too late to split this thread up in several parts (especially for all the months I was away from computerchess),
thus I thought it won't hurt anymore to do what other people did anyway when I was away from talkchess.

Ofc it will work best, if people will remain disciplined ;-)

Brunetti · Post by **Brunetti** » Tue Nov 09, 2021 1:34 pm

jmcd wrote: ↑Mon Nov 08, 2021 5:05 am Hopper 1.8 released

Another good step forward:

Code: Select all

-------------------------------------------------------------------------------------------------------------
Engines in Hopper 1.8 64-bit family
-------------------------------------------------------------------------------------------------------------
Rank  Elo   ±  Engine                                      Score Games  Wins Draws  Loss  Oppo time stal ille
-------------------------------------------------------------------------------------------------------------
   1. 2356  29 Hopper 1.8 64-bit                             48%   438   163    91   184   -15   1%   1%   0%
   2. 2286  21 Hopper 1.7 64-bit                             48%   968   382   164   422   -15   2%   2%   0%
   3. 2240  24 Hopper 1.6 64-bit                             49%   696   273   137   286   -10   1%   6%   0%
   4. 2121  22 Hopper 1.5 64-bit                             49%   867   347   154   366    -3   5%   3%   0%
   5. 2111  28 Hopper 1.4 64-bit                             50%   485   196    96   193    +4  21%   4%   0%
   6. 1987  21 Hopper 1.3 64-bit                             48%   904   363   147   394   -15   1%   1%   0%
   7. 1966  23 Hopper 1.1 64-bit                             47%   755   303    97   355   -30  19%   1%   0%
   8. 1963  21 Hopper 1.2 64-bit                             48%   869   332   169   368   -17   1%   0%   0%
   9. 1877  21 Hopper 20211004 64-bit                        47%   925   356   162   407   -21   1%   0%   0%
  10. 1828  22 Hopper 20211003 64-bit                        49%   861   371   107   383    -3   1%   1%   1%
-------------------------------------------------------------------------------------------------------------

Alex

Tearth · Post by **Tearth** » Tue Nov 09, 2021 11:33 pm

Guenther wrote: ↑Sun Nov 07, 2021 7:05 pm Soon to come:
Inanis (successor of Cosette)
https://github.com/Tearth/Inanis

I've finally managed to register and log in here, so I can talk and reply.

Still a lot of work to do (~2300 Elo at this moment) so I won't do an official release before getting at least 100 Elo more than the last version of Cosette, but hopefully it will happen in the next few months.

algerbrex · Post by **algerbrex** » Wed Nov 10, 2021 8:27 am

Blunder 7.2.0 has been released. It does not represent a notable strength improvement over 7.1.0 (at least from my testing), but does include a polyglot opening book loader among some other tweaks. See the release notes for more details: https://github.com/algerbrex/blunder/re ... tag/v7.2.0

tmokonen · Post by **tmokonen** » Wed Nov 10, 2021 8:56 am

Arasan 23.1 released:
https://www.arasanchess.org/
https://github.com/jdart1/arasan-chess

Pulse 1.7.3:
https://github.com/fluxroot/pulse/releases/tag/1.7.3
