Joost Buijs wrote: ↑Sat Jun 16, 2018 4:15 pm
Rebel wrote: ↑Wed Jun 13, 2018 9:40 am
Added a lot of new engines. It's amazing to see the old (2010-2012) Robbolito-based clique (Houdini 1.5, Bouquet, Critter) dominate the various lists, while programs rated 250-300 Elo higher, like Komodo and Stockfish, are unable to surpass them.
http://rebel13.nl/mea2.html
http://rebel13.nl/mea3.html
Really, this doesn't surprise me at all. As I already said before, STS is based on analysis with engines from at least 5 years back, and those were indeed engines based on Robbolito and friends. Current results show that it is unreliable to use STS to determine playing strength; maybe it gives a rough indication, but that's it.
Indeed, it should be obvious by now that it's not a playing strength rating list but an STS rating list, and I will re-baptize the page as such. Nevertheless, after doing some experiments, it can be useful for tuning or for discovering a surprising combination of parameter settings.
MRL - The MEA Rating List
Moderator: Ras
-
- Posts: 7282
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: MRL - The MEA Rating List
90% of coding is debugging, the other 10% is writing bugs.
-
- Posts: 12743
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: MRL - The MEA Rating List
10% of the problems in STS no longer have the correct answers. I have finished reanalyzing the data, but I did not compile it into a corrected suite yet.
Ed's test set has about 1/3 wrong answers, according to a test I ran over the weekend on 5 machines.
I guess that a strange way to put it is, "These tests used to be correct."
Put another way, with the depths we used to achieve and with the best available engines at the time, those were the key moves chosen, with values selected as calibrated by the scores returned, depths achieved, etc.
However, the new engines are exponentially stronger and the new hardware is exponentially stronger.
When I started calibration of the STS test suite in early 2009, I had a 32 bit OS, and the strongest available engine was Rybka 3:
http://www.talkchess.com/forum3/viewtopic.php?t=26072
I gave an hour per position for each engine I used (3 of them) but we can reproduce that level of analysis in less than a minute now.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 7282
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: MRL - The MEA Rating List
Good news is always welcome. Nice job Dann! Eagerly awaiting the new stuff.
90% of coding is debugging, the other 10% is writing bugs.
-
- Posts: 4845
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: MRL - The MEA Rating List

You just discovered the ProDeo tuning method. Optimize the eval parameters on the training set (STS) at 3 time controls (TCs), maximizing the score while requiring that the score increases with the TC, that is score_tc3 > score_tc2 and score_tc2 > score_tc1.
average_score = (score_tc1 + score_tc2 + score_tc3)/3
Change a parameter, test again, and get the average. If current_average_score > old_best_average_score, keep the change and set old_best_average_score = current_average_score. Repeat for some number of iterations.
Could be interesting to test at fast tc of say 100ms/200ms/400ms
Running game tests for verification would complete it.
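The loop described above could be sketched roughly as follows. This is only an illustration under my own assumptions: run_suite is a hypothetical callback that runs the test suite with the given parameters at one time control and returns the score, and the random +/- step is just one simple way to "change a param".

```python
import random


def tune(params, run_suite, time_controls=(100, 200, 400),
         iterations=50, step=1, seed=0):
    """Hill-climb eval parameters on a test suite scored at several TCs.

    run_suite(params, tc_ms) is a hypothetical callback (not part of any
    real tool) that returns the suite score at the given time control.
    """
    rng = random.Random(seed)

    def average_if_increasing(candidate):
        scores = [run_suite(candidate, tc) for tc in time_controls]
        # Require score_tc1 < score_tc2 < score_tc3.
        if any(a >= b for a, b in zip(scores, scores[1:])):
            return None
        return sum(scores) / len(scores)

    best = dict(params)
    best_avg = average_if_increasing(best)
    for _ in range(iterations):
        candidate = dict(best)
        name = rng.choice(sorted(candidate))       # pick one parameter
        candidate[name] += rng.choice((-step, step))
        avg = average_if_increasing(candidate)
        # Keep the change only if the average improved.
        if avg is not None and (best_avg is None or avg > best_avg):
            best, best_avg = candidate, avg
    return best, best_avg
```

A candidate whose scores do not rise with the time control is simply rejected here; whether to reject or merely penalize such candidates is a design choice the post leaves open.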
-
- Posts: 7282
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: MRL - The MEA Rating List
Ferdy wrote: ↑Wed Jun 20, 2018 5:16 am
You just discovered the prodeo tuning method. Optimize eval parameters using 3 TC's on the training set (STS) by maximizing the score with increasing score on 3 TC's, that is score_tc3 > score_tc2 and score_tc2 > score_tc1
average_score= (score_tc1 + score_tc2 + score_tc3)/3
Change param, test again, get the average, if current_average_score > old_best_average_score, then old_best_average_score = current_average_score. Repeat for some iterations.
Good idea.
Six would be even better, I agree.
Yep.
90% of coding is debugging, the other 10% is writing bugs.
-
- Posts: 12743
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: MRL - The MEA Rating List
For the Rebel multiple answer test set, I have attached two files.
The file reb-glob.7z contains the raw data points.
Rebel.7z contains two files: rebel.epd and rglt10.epd.
The file rebel.epd contains the output of my score algorithms.
Some of the records have odd scores (found in rglt10.epd).
Contributing factors for strange scores are: all of the scores for every move being negative (and the more negative the best move, the weaker the score), and a move with a big score that is better than the score for the move with the deepest analysis.
So the positions in rglt10.epd will all have to be revisited and probably reanalyzed.
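As a rough illustration of the two conditions named above, positions could be flagged automatically along these lines. This is not the scoring algorithm that produced rebel.epd; it assumes one EPD record per candidate move carrying bm, acd and ce opcodes, and the function names are my own.

```python
from collections import defaultdict


def parse_epd(line):
    """Split an EPD record into (position, {opcode: operand}).

    Assumes operands contain no ';' characters (no quoted comments).
    """
    fields = line.split(None, 4)
    position = " ".join(fields[:4])
    ops = {}
    for op in (fields[4].split(";") if len(fields) > 4 else []):
        op = op.strip()
        if op:
            name, _, operand = op.partition(" ")
            ops[name] = operand.strip()
    return position, ops


def flag_odd_positions(lines):
    """Return {position: [reasons]} for positions that look suspicious."""
    by_pos = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        pos, ops = parse_epd(line)
        by_pos[pos].append((ops["bm"], int(ops["acd"]), int(ops["ce"])))

    flagged = {}
    for pos, moves in by_pos.items():
        reasons = []
        if all(ce < 0 for _, _, ce in moves):
            reasons.append("all scores negative")
        deepest = max(moves, key=lambda m: m[1])   # move with highest acd
        if any(ce > deepest[2] for _, _, ce in moves):
            reasons.append("shallower move outscores deepest move")
        if reasons:
            flagged[pos] = reasons
    return flagged
```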
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 12743
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: MRL - The MEA Rating List
With full data decoration, including Ed's numbers and values attached.
Produced with this query:
SELECT e.Epd + ' ' +
dbo.opcode_format('acd', acd) +
dbo.opcode_format('acs', acs) +
dbo.opcode_format('am', am) +
dbo.opcode_format('bm', bm) +
dbo.opcode_format('c0', c0) +
dbo.opcode_format('c1', c1) +
dbo.opcode_format('c2', c2) +
dbo.opcode_format('c3', c3) +
dbo.opcode_format('c4', c4) +
dbo.opcode_format('c5', c5) +
dbo.opcode_format('c6', c6) +
dbo.opcode_format('cce', round(coef * 444.0,0))+
dbo.opcode_format('ce', ce) +
dbo.opcode_format('dm', dm) +
dbo.opcode_format('id', id) +
dbo.opcode_format('pm', pm) +
dbo.opcode_format('pv', pv) +
dbo.opcode_format('white_wins', white_wins) +
dbo.opcode_format('black_wins', black_wins) +
dbo.opcode_format('draws', draws) +
dbo.opcode_format('Opening', Opening) ,
round(coef * 444.0, 0) AS oce, ce, -round((coef * 444.0 - ce), 0) AS distance,
e.Epd, acd, pv, bm, c3, e.pm, white_wins, black_wins, draws,
(white_wins + black_wins + draws) AS games, acs, acn, id, Opening, dm
FROM Epd e
WHERE c5 LIKE 'rebel.pos.%'
ORDER BY c4 DESC
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 12743
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: MRL - The MEA Rating List
Some analysis logs of the rebel positions using Arena and multi-pv (set to 9) analysis (it is only a fraction of the data used, but was created recently and might be interesting to some):
Engines used:
Stockfish (bleeding edge) {crashed after position 154 of 657}
Houdini 6.1 Tactical
Komodo-12.1.1
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 4845
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: MRL - The MEA Rating List
Dann Corbit wrote: ↑Wed Jun 20, 2018 8:21 pm
For the Rebel multiple answer test set, I have attached two files.
The file reb-glob.7z contains the raw data points.
Rebel.7z contains two files: rebel.epd and rglt10.epd.
The file rebel.epd contains the output of my score algorithms.
Some of the records have odd scores (found in rglt10.epd).
Contributing factors for strange scores is all of the scores for every move being negative (and the more negative the best move, the weaker the score) and also having a move with a big score that is better than the score for the move with the deepest analysis.
So the positions in rglt10.epd will all have to be revisited and probably reanalyzed.
I had been studying how an engine score can be converted to a point system. One method to handle the negative scores is to use a logistic function.
Code: Select all
scoring_rate = 1/[1 + 10 ^(-score_cp/400)]
Code: Select all
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nc6; c3 Nc6; acd 34; ce -3;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nxd5; c3 Nc6; acd 26; ce -37;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Re8; c3 Nc6; acd 26; ce -49;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Be6; c3 Nc6; acd 26; ce -79;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Na6; c3 Nc6; acd 26; ce -85;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm a5; c3 Nc6; acd 26; ce -104;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm h6; c3 Nc6; acd 26; ce -106;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bg4; c3 Nc6; acd 26; ce -122;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bd7; c3 Nc6; acd 26; ce -140;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nc6; c3 Nc6; acd 34; ce -3;
ce = -3
sr (scoring_rate) = 1/(1+10^(-(-3)/400)) = 0.49568
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bd7; c3 Nc6; acd 26; ce -140;
ce = -140
sr = 1/(1+10^(-(-140)/400)) = 0.30876
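For anyone who wants to reproduce the two numbers above, here is a minimal Python sketch of the same formula (the function name is just my choice):

```python
def scoring_rate(score_cp):
    """Logistic conversion of a centipawn score to an expected scoring rate."""
    return 1.0 / (1.0 + 10.0 ** (-score_cp / 400.0))


print(round(scoring_rate(-3), 5))    # 0.49568
print(round(scoring_rate(-140), 5))  # 0.30876
```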
If you want the top to get 10 points,
factor = 10/0.49568 = 20.174
bm Nc6, ce = -3, sr = 0.49568
pt = factor * sr = 10
bm Bd7, ce = -140, sr = 0.30876
pt = factor * sr = 6
Or just apply a factor directly say 100 to get the points.
bm Nc6, ce = -3, sr = 0.49568
pt = 100 * sr = 50
Nc6=50
bm Bd7, ce = -140, sr = 0.30876
pt = 100 * sr = 31
Bd7=31
So different pos may have different max pt depending on the engine score.
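Both point schemes can be sketched in a few lines of Python. The centipawn scores are the ones from the records above; the function names and the rounding to whole points are my assumptions.

```python
def scoring_rate(score_cp):
    return 1.0 / (1.0 + 10.0 ** (-score_cp / 400.0))


def points_scaled_to_top(scores_cp, top_points=10):
    """Scale so that the best move of the position gets top_points."""
    rates = {move: scoring_rate(cp) for move, cp in scores_cp.items()}
    factor = top_points / max(rates.values())
    return {move: round(factor * sr) for move, sr in rates.items()}


def points_direct(scores_cp, factor=100):
    """Apply a fixed factor, so the maximum differs per position."""
    return {move: round(factor * scoring_rate(cp))
            for move, cp in scores_cp.items()}


scores = {"Nc6": -3, "Nxd5": -37, "Re8": -49, "Be6": -79, "Na6": -85,
          "a5": -104, "h6": -106, "Bg4": -122, "Bd7": -140}

scaled = points_scaled_to_top(scores)
direct = points_direct(scores)
print(scaled["Nc6"], scaled["Bd7"])  # 10 6
print(direct["Nc6"], direct["Bd7"])  # 50 31
```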
The one thing I like about this system is that I can see immediately that the move Nc6 has an expected score of about 50% (loosely, its winning chance), while Bd7 has only about 31%.
The general formula can be modified to fit the actual engine (the engine that does the analysis) capability.
sr = 1/(1 + 10 ^(-score_cp/400))
We can introduce a factor K to reflect, for example, what the sr of the engine is when it is given a 1 pawn advantage against the strongest engine available. If the analyzing engine is SF, can SF always win every game if given a 1 pawn advantage in the starting position against itself or Komodo?
sr = 1/(1 + 10 ^(-score_cp*K/400))
This can easily be tested by running a match of, say, 100 games from selected positions at different phases (by material remaining), each starting with a 1 pawn advantage for the side to move.
If SF against itself or Komodo can score 80 wins, 20 draws, and 0 losses, its game scoring rate is (80 + 20/2)/100 = 0.9
Solving for K in,
scoring_rate = 1/(1 + 10 ^(-score_cp*K/400))
when score_cp is 100 and scoring_rate = 0.9
K = -400/score_cp * log10((1/scoring_rate) - 1)
K = -400/100 * log10((1/0.9) - 1)
K = 3.8
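The same calculation as a small Python sketch. The 80/20/0 match result is the hypothetical example from above, not a measured result, and the function names are mine.

```python
import math


def game_scoring_rate(wins, draws, losses):
    """Match score as a fraction: a win counts 1, a draw counts 1/2."""
    return (wins + draws / 2) / (wins + draws + losses)


def solve_k(score_cp, rate):
    """Solve rate = 1/(1 + 10^(-score_cp*K/400)) for K.

    Only defined for 0 < rate < 1 and score_cp != 0.
    """
    return -400.0 / score_cp * math.log10(1.0 / rate - 1.0)


rate = game_scoring_rate(80, 20, 0)
k = solve_k(100, rate)
print(rate)         # 0.9
print(round(k, 1))  # 3.8
```

Note that a 100% or 0% match score gives no finite K, so the match needs at least some draws or losses for this to work.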
So for SF the approximate sr (based only on 1 pawn advantage tests) is,
sr = 1/(1 + 10 ^(-score_cp*3.8/400))
Back to Nc6, ce = -3
sr = 1/(1 + 10 ^(-score_cp*3.8/400))
sr = 1/(1 + 10 ^(-(-3)*3.8/400)) = 0.4836
old_sr = 0.49568
new_sr = 0.4836
Using the new sr of 0.4836, the points by using a factor of 100 would then be,
pt = 100 * new_sr = 100 * 0.4836 = 48
Nc6=48
When applied as a test suite, we take the average points after the tests. If an engine gets an average score of total_pt/total_pos = 35, we can say that this engine may perform at a 35% scoring probability according to SF, the engine that was used to score the test set. This can also be used to test humans, and their average score can be interpreted the same way, i.e. according to SF.
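Putting it together, a sketch of the K-adjusted points and the suite average might look like this. The data layout (a dict of move-to-points per position) and giving 0 points to a move that is not in the table are my assumptions, not something fixed by the method.

```python
def scoring_rate(score_cp, k=1.0):
    return 1.0 / (1.0 + 10.0 ** (-score_cp * k / 400.0))


def move_points(score_cp, k=1.0, factor=100):
    """Points for a move, using the K-adjusted scoring rate."""
    return round(factor * scoring_rate(score_cp, k))


def suite_average(suite, chosen_moves):
    """Average points per position (total_pt / total_pos).

    suite:        {position_id: {move: points}}
    chosen_moves: {position_id: move played by the tested engine or human}
    Moves missing from a position's table score 0 (assumption).
    """
    total = 0
    for pos_id, table in suite.items():
        total += table.get(chosen_moves.get(pos_id), 0)
    return total / len(suite)


print(move_points(-3, k=3.8))  # 48
```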
Yours could be more complicated, as you consider updates, perhaps from stronger engines, or from the same engine searched to a higher depth, among other factors.