Something I was playing around with in my head. Please give your expert opinions. Is this already tried and failed?
For simplification:
Take one test position and assume it has 16 legal moves; we don't know which moves are better or which is best. Now we randomly assign points to these moves, say 1-10,000.
We then run the test position on x different engines for (say) 1 second each and see how the points earned correlate with Elo as determined through CCRL, CEGT, etc. If the correlation is low, rerun the test with other random point assignments until the correlation is at its highest. Maybe some factor analysis could help converge more quickly on the right point assignment. I assume you would need positions with many legal moves, and many engines with diverse playing styles.
Some positions will be better than others, right? Maybe a diverse set, or maybe more positional ones, if those turn out to be better predictors of engine strength; I don't know. Then take lots of positions in your test suite, fine-tune, and maybe you will have something?
Obviously one position is not enough, as any two engines have something like a 50%? chance of returning the same move. But if you had enough positions, maybe five hundred, could you get accurate results? Thanks in advance.
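For what it's worth, the search over point assignments can be sketched in a few lines. Everything below is made up for illustration: the move choices, the Elo numbers, and the 16-move position are all hypothetical, and the "optimization" is just random restarts keeping the best correlation found so far.

```python
import random

# Hypothetical data: for each of 10 engines, the index (0-15) of the move
# it chose in the test position after 1 second, plus its rating-list Elo.
engine_moves = [3, 3, 7, 3, 12, 7, 0, 3, 7, 5]
engine_elos = [3500, 3450, 3300, 3400, 3000, 3250, 2800, 3420, 3280, 3100]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

best_corr, best_points = -1.0, None
for _ in range(10_000):  # random restarts in place of anything clever
    points = [random.randint(1, 10_000) for _ in range(16)]
    scores = [points[m] for m in engine_moves]  # points each engine "earned"
    c = pearson(scores, engine_elos)
    if c > best_corr:
        best_corr, best_points = c, points

print(f"best correlation found: {best_corr:.3f}")
```

Note the obvious overfitting risk: with only a handful of engines and 16 free point values, a high correlation on one position means little, which is why the proposal needs many positions and many engines.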
Better testing method or already failed?
Moderator: Ras
Dann Corbit
- Posts: 12803
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Better testing method or already failed?
Your approach is similar to EPD test sets like:
BS2830.epd
BT2450.epd
BT2630.epd
which do not tend to measure playing strength terribly well, but do tend to measure tactical position solving ability fairly well.
The degree of agreement will vary widely depending on the positions chosen. For instance, most decent chess engines will achieve 280/300 or better on WAC if decent hardware is also used. But a really tough set like LCT II will probably have a much lower agreement ratio for various chess engines. So which ratio should you believe?
I think that the only way to measure playing strength is to play games (and at least 400 games are needed to have any reliability in the answer).
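The 400-game figure can be sanity-checked with a back-of-the-envelope calculation: treat each game score as an i.i.d. random variable, convert the standard error of the match score into an Elo interval via the usual logistic model. This ignores draws (which shrink the variance) and is a rough sketch, not the exact LOS/SPRT math testers actually use.

```python
import math

def elo_margin(games, score=0.5, z=1.96):
    """Approximate 95% Elo error half-width for a match of `games` games,
    modeling per-game score as i.i.d. with variance s*(1-s)."""
    s = score
    se = math.sqrt(s * (1 - s) / games)      # standard error of match score
    lo, hi = s - z * se, s + z * se          # 95% interval on the score
    to_elo = lambda p: -400 * math.log10(1 / p - 1)  # logistic Elo model
    return (to_elo(hi) - to_elo(lo)) / 2     # half-width in Elo points

for n in (100, 400, 1600):
    print(f"{n:5d} games: about +/- {elo_margin(n):.0f} Elo")
```

At 400 games the margin comes out around +/- 35 Elo, which matches the intuition that a few hundred games is the minimum for a meaningful answer; quadrupling the games only halves the margin.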
To misquote Inigo Montoya:
"That test you keep using -- I don't think it means what you think it means."