Stockfish 120324 a Disaster

Uri Blass
Posts: 10897
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish 120324 a Disaster

Post by Uri Blass »

Draude wrote: Tue Mar 19, 2024 1:00 pm
People don't do tests against weaker engines as they are not the competitors
Well said! Indeed they are not!
But will you realize when your engine becomes weaker at beating weaker engines
Perhaps not! But as you said, they are not the competitors.
Just try Stockfish-Crafty for a few hundred games - Crafty will get way more draws than is to be expected
Yes, your few hundred games tests are statistically sound, and they bring great insight to chess engine programmers! Perhaps Crafty is stronger than you expected?
I did not run the tests myself, but I understood from Peter's posts in a different forum that at long time controls Rebel does better against Crafty than Stockfish does.

I do not know whether any of these results is statistically significant, but Peter's impression, at least, seems to be that the difference is real.

I have not seen specific results giving the number of games he played in Crafty-Rebel and in Crafty-Stockfish with the same book.

Re: Stockfish 120324 a Disaster

Post by Uri Blass »

Peter Berger wrote: Tue Mar 19, 2024 1:08 pm
Draude wrote: Tue Mar 19, 2024 1:00 pm Yes, your few hundred games tests are statistically sound, and they bring great insight to chess engine programmers! Perhaps Crafty is stronger than you expected?
Actually, probably ten games are enough, as Crafty will get one draw most likely to my experience. Think about, what this means statistically if I am right - a little statistical lesson for the reader. :D

To your other remark: I am no chess engine programmer, so it is not my responsibility to offer great insight, read and think about what I write ( or ignore), just as you see fit. 8-)
One draw in 10 games is not significant, because over a longer run it may turn out to be one in 100.
If you play 100 games of Crafty against Stockfish and 100 games of Crafty against Rebel, and Crafty gets 10 draws and 90 losses against Stockfish but only 1 draw and 99 losses against Rebel, then that looks like a significant difference.
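
To illustrate why one draw in 10 games says so little, here is a minimal Python sketch. It assumes every game is an independent trial with a fixed draw probability, which is a simplification; the function name and the two draw rates are only taken from the example above, not from any real test.

```python
def prob_at_least_one_draw(draw_rate: float, games: int) -> float:
    """Chance of seeing one or more draws in `games` independent games.

    Assumption: each game is an independent trial with the same draw rate.
    """
    return 1.0 - (1.0 - draw_rate) ** games


# Hypothetical true draw rates: 1 in 100 and 1 in 10.
for rate in (0.01, 0.10):
    chance = prob_at_least_one_draw(rate, 10)
    print(f"true draw rate {rate:.2f}: P(at least one draw in 10) = {chance:.3f}")
```

Under these assumptions, a draw turns up in a 10-game match almost one time in ten even if the true rate is only 1 in 100, so a single draw cannot separate the two cases.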

Based on what I remember from your posts, I guess you played many games of Crafty against both engines, but I have no data.

Re: Stockfish 120324 a Disaster

Post by Uri Blass »

I checked, and my intuition was correct: the difference in the example I gave is statistically significant.

Here is the formula to test it.
https://online.stat.psu.edu/stat415/lesson/9/9.4

With n1=100, n2=100, Y1=10, Y2=1, the formula on that page gives
Z=2.791...>1.96, so at the 5% level we reject the hypothesis that the probability of a draw is the same against Stockfish and against Rebel, assuming the real results are what I suggested.
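
For anyone who wants to check the number, the calculation can be sketched in a few lines of Python. This assumes the formula on the linked page is the usual pooled two-proportion z-test; the function names are mine, and the counts are the hypothetical ones from my example, not real results.

```python
import math


def two_proportion_z(y1: int, n1: int, y2: int, n2: int) -> float:
    """Pooled two-proportion z statistic for H0: p1 == p2."""
    p1 = y1 / n1
    p2 = y2 / n2
    pooled = (y1 + y2) / (n1 + n2)  # common draw rate under H0
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / std_err


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts: Crafty draws 10 of 100 against Stockfish, 1 of 100 against Rebel.
z = two_proportion_z(10, 100, 1, 100)
print(f"Z = {z:.3f}, two-sided p = {two_sided_p_value(z):.4f}")
```

This gives Z of about 2.79, above the 1.96 threshold, with a two-sided p-value of roughly 0.005.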