fkarger wrote: ↑Sat Jun 14, 2025 12:19 pm
you could calculate
average values and the like which might be a nice extension.
Pure comparison of the numbers of solutions alone doesn't show the random noise that matters for such tests, Frank. Much more relevant than how many solutions are found in the one run and in the other is the question of how often the same positions have been solved in both runs, you see?
EloStatTS (Frank Schubert) compares position by position and run by run, and it does so each time a new run is added to the ones already stored, comparing all of them against each other.
That's why the error bars get lower and lower the more runs are stored and compared: with each new run, all the old ones get a new rating and ranking. But the biggest lowering of the error bar comes from high-performing runs that are close to each other, because within those the same positions are solved again and again more or less exactly; the time indices only count for positions solved by two runs in common.
That the points are converted to Elo too (of course not to be compared to Elo performances of other tests) isn't the important point to me. I like to use this tool especially because it shows which positions have been solved or not solved run by run, not only how many in sum, but how many times the same positions. That the exact time to solution enters the WDL measurement of the runs is another fine feature. If you want to know exactly how the tool works, read about it here:
https://glarean-magazin.ch/wp-content/u ... bert-1.pdf
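The pairing idea can be sketched roughly like this (a minimal illustration of the principle only, not Frank Schubert's actual implementation; the function names and the exact scoring rule are my assumptions):
Code: Select all
# Rough sketch of per-position pairwise scoring between two runs.
# NOT EloStatTS itself: the scoring rule is only an assumption made
# to illustrate the position-by-position comparison described above.

def score_position(time_a, time_b):
    """Score for run A on one position (1 = win, 0.5 = draw, 0 = loss).
    Each time is the solution time in seconds, or None if unsolved."""
    if time_a is None and time_b is None:
        return None          # unsolved by both: no pairwise information
    if time_b is None:
        return 1.0           # only A solved it
    if time_a is None:
        return 0.0           # only B solved it
    if time_a < time_b:      # both solved: the faster solution "wins"
        return 1.0
    if time_a > time_b:
        return 0.0
    return 0.5               # identical times: draw

def compare_runs(times_a, times_b):
    """Pair two runs position by position; return (points_a, matches)."""
    results = [score_position(a, b) for a, b in zip(times_a, times_b)]
    decided = [r for r in results if r is not None]
    return sum(decided), len(decided)
Under such a rule the "Matches" column would count only positions solved by at least one of the two runs, which would explain why it reads 24 below although the suite has 80 positions (19 + 18 solutions with 13 in common gives 24).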
So the error bars the tool gives when comparing two runs of the same engine with the same settings (hardware, threads, hash and TC) show much more than the mere difference in the numbers of solved positions would.
Here I let SF dev. 250602 run the second 80 positions of the 160 (nr. 81-160) twice, with 8 threads of the 16x3.5GHz CPU, 8G hash and 3'/pos., and let EloStatTS compare the two runs (R1 and R2):
Code: Select all
   Program                  Elo  +/-  Matches   Score  Av.Op.  S.Pos.   MST1    MST2  RIndex
 1 Stockfish250602-8t-R2 : 3503   72       24  50.9 %    3497  19/ 80  45.5s  148.1s    0.68
 2 Stockfish250602-8t-R1 : 3497   71       24  49.1 %    3503  18/ 80  33.3s  147.0s    0.70
MST1 : Mean solution time (solved positions only)
MST2 : Mean solution time (solved and unsolved positions)
RIndex: Score according to solution time ranking for each position
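As a reading aid, the two mean solution times can be reproduced like this (a sketch under my assumption that unsolved positions are charged the full 3' budget in MST2, which matches the numbers above):
Code: Select all
# Sketch of the MST1/MST2 definitions as I read them; it reproduces
# the table above when unsolved positions are charged the full budget.

TC = 180.0  # time budget per position in seconds (3'/pos.)

def mean_solution_times(times):
    """times: solution time in seconds per position, None if unsolved."""
    solved = [t for t in times if t is not None]
    mst1 = sum(solved) / len(solved)               # solved positions only
    mst2 = (sum(solved) + TC * (len(times) - len(solved))) / len(times)
    return mst1, mst2

# Check against R2: 19 solved positions averaging 45.5s, 61 unsolved:
# (19*45.5 + 61*180) / 80 = 148.1s, matching MST2 in the table.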
The point is, there's a difference of only 1 single solution (19 vs. 18), but look at the error bars of 72 and 71! That doesn't come from any big difference in the time indices, which only count for positions solved in both runs; it comes from the lack of positions solved twice, in both runs.
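A rough back-of-the-envelope shows why so few pairwise matches give such a wide bar (assuming, which the tool's documentation would have to confirm, that the bar is about one standard error of the score converted to Elo):
Code: Select all
# Back-of-the-envelope: Elo error bar from only 24 pairwise matches.
# Assumes the bar is roughly one standard error of the score fraction
# propagated through the Elo formula (my assumption, not from the docs).
import math

n, s = 24, 0.5                        # matches and score fraction (~50 %)
sigma_s = math.sqrt(s * (1 - s) / n)  # standard error of the score: ~0.102
# Slope of Elo = 400*log10(s/(1-s)) at s = 0.5:
slope = 400.0 / math.log(10) / (s * (1 - s))  # ~695 Elo per unit of score
print(round(slope * sigma_s))         # ~71 Elo, matching the bars above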
Again I stored the complete per-position solution files for both runs and can send them on demand per mail; I don't want to make this already big post even bigger with them, and uploading wasn't worth it to me.
I hope that makes it clearer what I mean by random noise: it's an intrinsic value connected to the positions of the suite, and of course also to the hardware-TC and the engines, but even comparing the same engine (same version and settings) against itself you get such a high error bar. Compare that to other, not much bigger suites of positions that are hard (for the hardware-TC and engines used):
Code: Select all
   Program               Elo  +/-  Matches   Score  Av.Op.  S.Pos.  MST1   MST2  RIndex
...
31 Monty-250119-6t     : 3357    8     2328  28.7 %    3516  22/128  4.8s  25.7s    0.14
32 Dragon3.3-MCTS-6t   : 3356    8     2333  28.6 %    3515  27/128  9.4s  25.6s    0.09
MST1 : Mean solution time (solved positions only)
MST2 : Mean solution time (solved and unsolved positions)
RIndex: Score according to solution time ranking for each position
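For comparison, the same back-of-the-envelope at the bottom of this list: with 2328 matches at a 28.7 % score, one standard error of the score is about 0.0094, and the slope of the Elo curve at that score is about 849 Elo per unit of score, giving roughly 8 Elo, again matching the bar shown (still under my assumption about how the bar is computed).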
The comparison isn't quite fair, because these two engines are 2 out of 32 runs in the list, which lowers the error bars through the number of runs compared; but it's the end of the list (weakest performances measured), and the two engines still perform rather differently from each other. At the top of this list, the best engines and settings with 8 threads solve 96 and 97 positions, and their error bars are 5 and 6.
I hope I have finally made my point clear enough as to why I wouldn't use this second set of 80 positions out of the 160, at least not together with quite different kinds of positions. They are so special in character that I wouldn't know which others, apart from more just like them, I could usefully combine them with. More of exactly that kind, ok, but it would still be a question of how much hardware time one wants to spend on results so special and so hard to compare to any others.
Not only would they need a very special hardware-TC of their own, depending heavily on the engines and hardware used; their character and the demands they put on the engines are of such a special kind that they simply should not be mixed up with very different (kinds of) other positions, at least not from my personal point of view on positional testing. Regards,
Peter.