MRL - The MEA Rating List

Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: MRL - The MEA Rating List

Post by Rebel »

Joost Buijs wrote: Sat Jun 16, 2018 4:15 pm
Rebel wrote: Wed Jun 13, 2018 9:40 am Added a lot of new engines. It's amazing to see the old (2010-2012) Robbolito-based clique (Houdini 1.5, Bouquet, Critter) dominate the various lists, while programs rated 250-300 Elo higher, like Komodo and Stockfish, are unable to surpass them.

http://rebel13.nl/mea2.html
http://rebel13.nl/mea3.html
Really, this doesn't surprise me at all. As I said before, STS is based on analysis with engines from at least 5 years back, and those were indeed engines based on Robbolito and friends. Current results show that it is unreliable to use STS to determine playing strength; maybe it gives a rough indication, but that's it.
Indeed, it should be obvious by now that it's not a playing-strength rating list but an STS rating list, and I will re-baptize the page as such. Nevertheless, after doing some experiments, I found it can be useful for tuning or for discovering a surprising combination of parameter settings.
90% of coding is debugging, the other 10% is writing bugs.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: MRL - The MEA Rating List

Post by Dann Corbit »

10% of the problems in STS no longer have the correct answers. I have finished reanalyzing the data, but I did not compile it into a corrected suite yet.

Ed's test set has about 1/3 wrong answers, according to a test I ran over the weekend on 5 machines.

I guess that a strange way to put it is, "These tests used to be correct."

Put another way, with the depths we used to achieve and with the best available engines at the time, those were the key moves chosen, with values selected as calibrated by the scores returned, depths achieved, etc.

However, the new engines are exponentially stronger and the new hardware is exponentially stronger.

When I started calibration of the STS test suite in early 2009, I had a 32 bit OS, and the strongest available engine was Rybka 3:
viewtopic.php?t=26072
I gave an hour per position for each engine I used (3 of them) but we can reproduce that level of analysis in less than a minute now.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: MRL - The MEA Rating List

Post by Rebel »

Good news is always welcome. Nice job Dann! Eagerly awaiting the new stuff.
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: MRL - The MEA Rating List

Post by Rebel »

Rewrote the page and gave examples of why STS is (still) useful.

http://rebel13.nl/rebel13/mrl.html
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: MRL - The MEA Rating List

Post by Ferdy »

You just discovered the ProDeo tuning method. Optimize eval parameters using 3 TCs on the training set (STS) by maximizing the score while requiring the score to increase across the 3 TCs, that is, score_tc3 > score_tc2 and score_tc2 > score_tc1.

average_score = (score_tc1 + score_tc2 + score_tc3)/3
Change a param, test again, and get the average. If current_average_score > old_best_average_score, then old_best_average_score = current_average_score. Repeat for some number of iterations.

Could be interesting to test at fast tc of say 100ms/200ms/400ms

Running game tests for verification would complete it.
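
The loop described above can be sketched in Python. This is only one reading of the idea, not ProDeo's actual tuner: the names `tune`, `run_suite` and `is_increasing` are made up for illustration, `run_suite` stands for whatever runs the engine on the training set at one time control and returns its score, and the random one-step parameter change is an assumption, since the post does not say how a param is changed.

```python
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, int]


def average_score(scores: List[float]) -> float:
    """Plain average over the time controls, as in the post."""
    return sum(scores) / len(scores)


def is_increasing(scores: List[float]) -> bool:
    """True when each longer TC scores strictly higher than the previous one."""
    return all(a < b for a, b in zip(scores, scores[1:]))


def tune(
    params: Params,
    run_suite: Callable[[Params, int], float],  # hypothetical: score at one TC
    time_controls_ms: List[int],
    iterations: int,
    step: int = 1,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Keep a parameter change only if it raises the average suite score
    and the scores still increase with the time control."""
    rng = random.Random(seed)
    best = dict(params)
    best_avg = average_score([run_suite(best, tc) for tc in time_controls_ms])
    for _ in range(iterations):
        trial = dict(best)
        name = rng.choice(sorted(trial))          # assumption: change one param
        trial[name] += rng.choice((-step, step))  # assumption: by one step
        scores = [run_suite(trial, tc) for tc in time_controls_ms]
        trial_avg = average_score(scores)
        if is_increasing(scores) and trial_avg > best_avg:
            best, best_avg = trial, trial_avg
    return best, best_avg
```

With the fast time controls suggested above, `time_controls_ms` would be `[100, 200, 400]`. Game tests would still be needed to verify whatever the loop finds.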
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: MRL - The MEA Rating List

Post by Rebel »

Ferdy wrote: Wed Jun 20, 2018 5:16 am

You just discovered the ProDeo tuning method. Optimize eval parameters using 3 TCs on the training set (STS) by maximizing the score while requiring the score to increase across the 3 TCs, that is, score_tc3 > score_tc2 and score_tc2 > score_tc1.

average_score = (score_tc1 + score_tc2 + score_tc3)/3
Change a param, test again, and get the average. If current_average_score > old_best_average_score, then old_best_average_score = current_average_score. Repeat for some number of iterations.
Good idea.
Ferdy wrote: Wed Jun 20, 2018 5:16 am Could be interesting to test at fast tc of say 100ms/200ms/400ms
Six would be even better, I agree.
Ferdy wrote: Wed Jun 20, 2018 5:16 am Running game tests for verification would complete it.
Yep.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: MRL - The MEA Rating List

Post by Dann Corbit »

For the Rebel multiple answer test set, I have attached two files.
The file reb-glob.7z contains the raw data points.
Rebel.7z contains two files: rebel.epd and rglt10.epd.

The file rebel.epd contains the output of my score algorithms.
Some of the records have odd scores (found in rglt10.epd).
Contributing factors for strange scores are: all of the scores for every move being negative (and the more negative the best move, the weaker the score), and a move with a big score that is better than the score for the move with the deepest analysis.

So the positions in rglt10.epd will all have to be revisited and probably reanalyzed.
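
As a rough illustration of the two factors named above, here is a Python sketch that flags positions showing either one. It is not the scoring algorithm that produced rebel.epd or rglt10.epd. It assumes one record per candidate move, in the layout of the reb-glob.epd lines quoted later in this thread (four position fields, then `bm`, `acd` and `ce` opcodes), and the function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_epd(line: str) -> Tuple[str, Dict[str, str]]:
    """Split an EPD record into its 4-field position and an opcode dict."""
    fields = line.split(None, 4)
    position = " ".join(fields[:4])
    ops: Dict[str, str] = {}
    for op in fields[4].split(";"):
        op = op.strip()
        if op:
            name, _, value = op.partition(" ")
            ops[name] = value.strip()
    return position, ops


def flag_odd_positions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return {position: reasons} for positions matching either factor."""
    by_pos = defaultdict(list)
    for line in lines:
        if line.strip():
            pos, ops = parse_epd(line)
            # One (move, depth, centipawn score) tuple per record.
            by_pos[pos].append((ops["bm"], int(ops["acd"]), int(ops["ce"])))

    flagged: Dict[str, List[str]] = {}
    for pos, records in by_pos.items():
        reasons = []
        if all(ce < 0 for _, _, ce in records):
            reasons.append("all scores negative")
        deepest = max(records, key=lambda r: r[1])  # record with highest acd
        if any(ce > deepest[2] for _, _, ce in records):
            reasons.append("shallower move outscores deepest move")
        if reasons:
            flagged[pos] = reasons
    return flagged
```

The sketch only reports candidates; deciding whether a flagged position needs reanalysis is still a manual step.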
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: MRL - The MEA Rating List

Post by Dann Corbit »

With full data decoration, including Ed's numbers and values attached.
Produced with this query:
SELECT e.Epd + ' ' +
dbo.opcode_format('acd', acd) +
dbo.opcode_format('acs', acs) +
dbo.opcode_format('am', am) +
dbo.opcode_format('bm', bm) +
dbo.opcode_format('c0', c0) +
dbo.opcode_format('c1', c1) +
dbo.opcode_format('c2', c2) +
dbo.opcode_format('c3', c3) +
dbo.opcode_format('c4', c4) +
dbo.opcode_format('c5', c5) +
dbo.opcode_format('c6', c6) +
dbo.opcode_format('cce', round(coef * 444.0,0))+
dbo.opcode_format('ce', ce) +
dbo.opcode_format('dm', dm) +
dbo.opcode_format('id', id) +
dbo.opcode_format('pm', pm) +
dbo.opcode_format('pv', pv) +
dbo.opcode_format('white_wins', white_wins) +
dbo.opcode_format('black_wins', black_wins) +
dbo.opcode_format('draws', draws) +
dbo.opcode_format('Opening', Opening) ,
round(coef * 444.0,0) as oce, ce, -round((coef * 444.0 - ce),0) as distance,
e.Epd, acd, pv, bm, c3, e.pm, white_wins, black_wins, draws,
(white_wins+black_wins+draws) as games, acs, acn, id, Opening, dm
FROM Epd e
WHERE c5 like 'rebel.pos.%'
ORDER BY c4 desc
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: MRL - The MEA Rating List

Post by Dann Corbit »

Some analysis logs of the rebel positions using Arena and multi-pv (set to 9) analysis (it is only a fraction of the data used, but was created recently and might be interesting to some):

https://drive.google.com/file/d/1Wdu2IU ... sp=sharing

Engines used:
Stockfish (bleeding edge) {crashed after position 154 of 657}
Houdini 6.1 Tactical
Komodo-12.1.1
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: MRL - The MEA Rating List

Post by Ferdy »

Dann Corbit wrote: Wed Jun 20, 2018 8:21 pm For the Rebel multiple answer test set, I have attached two files.
The file reb-glob.7z contains the raw data points.
Rebel.7z contains two files: rebel.epd and rglt10.epd.

The file rebel.epd contains the output of my score algorithms.
Some of the records have odd scores (found in rglt10.epd).
Contributing factors for strange scores are: all of the scores for every move being negative (and the more negative the best move, the weaker the score), and a move with a big score that is better than the score for the move with the deepest analysis.

So the positions in rglt10.epd will all have to be revisited and probably reanalyzed.
I have been studying how an engine score can be converted to a point system. One method to handle the negative scores is to use a logistic function.

scoring_rate = 1/(1 + 10 ^(-score_cp/400))

Example from reb-glob.epd:

rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nc6; c3 Nc6; acd 34; ce -3;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nxd5; c3 Nc6; acd 26; ce -37;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Re8; c3 Nc6; acd 26; ce -49;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Be6; c3 Nc6; acd 26; ce -79;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Na6; c3 Nc6; acd 26; ce -85;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm a5; c3 Nc6; acd 26; ce -104;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm h6; c3 Nc6; acd 26; ce -106;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bg4; c3 Nc6; acd 26; ce -122;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bd7; c3 Nc6; acd 26; ce -140;
Example calculations:
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nc6; c3 Nc6; acd 34; ce -3;
ce = -3
sr (scoring_rate) = 1/(1+10^(-(-3)/400)) = 0.49568

rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bd7; c3 Nc6; acd 26; ce -140;
ce = -140
sr = 1/(1+10^(-(-140)/400)) = 0.30876

If you want the top to get 10 points,
factor = 10/0.49568 = 20.174

bm Nc6, ce = -3, sr = 0.49568
pt = factor * sr = 10

bm Bd7, ce = -140, sr = 0.30876
pt = factor * sr = 6

Or just apply a factor directly say 100 to get the points.

bm Nc6, ce = -3, sr = 0.49568
pt = 100 * sr = 50
Nc6=50

bm Bd7, ce = -140, sr = 0.30876
pt = 100 * sr = 31
Bd7=31

So different positions may have different max points depending on the engine score.
The one thing that I like about this system is that I can see immediately that the move Nc6 has a 50% chance of winning, and that Bd7 has a winning chance of only 31%.
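
The calculations above are easy to reproduce in Python. This is a small sketch of the formula as posted; the function names `scoring_rate`, `points` and `points_scaled_to_top` are chosen here for illustration, and rounding to whole points is an assumption taken from the 10/6 and 50/31 results shown.

```python
from typing import List


def scoring_rate(score_cp: float) -> float:
    """Logistic conversion of a centipawn score to an expected score in 0..1."""
    return 1.0 / (1.0 + 10.0 ** (-score_cp / 400.0))


def points(score_cp: float, factor: float = 100.0) -> int:
    """Points from a fixed factor, e.g. factor=100 gives a percentage."""
    return round(factor * scoring_rate(score_cp))


def points_scaled_to_top(scores_cp: List[float], top_points: float = 10.0) -> List[int]:
    """Scale so that the best-scoring move gets top_points."""
    factor = top_points / scoring_rate(max(scores_cp))
    return [round(factor * scoring_rate(cp)) for cp in scores_cp]


if __name__ == "__main__":
    print(round(scoring_rate(-3), 5))         # Nc6, ce = -3
    print(round(scoring_rate(-140), 5))       # Bd7, ce = -140
    print(points_scaled_to_top([-3, -140]))   # top move scaled to 10 points
    print(points(-3), points(-140))           # factor of 100
```

Because the conversion works on the raw centipawn score, it handles all-negative positions without any special case.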

The general formula can be modified to fit the actual engine (the engine that does the analysis) capability.
sr = 1/(1 + 10 ^(-score_cp/400))

We can introduce a factor K to see, for example, what the sr of the engine is if given a 1 pawn advantage against the strongest engine available. If the analyzing engine is SF, can SF always win every game if given a 1 pawn advantage in the starting position against itself or Komodo?

sr = 1/(1 + 10 ^(-score_cp*K/400))

This can easily be tested by running a match of, say, 100 games from selected positions at different game phases (by material remaining), where the side to move starts with a 1 pawn advantage.
If SF against itself or Komodo scores 80 wins, 20 draws, and 0 losses, its game scoring rate is (80 + 20/2)/100 = 0.9

Solving for K in,
scoring_rate = 1/(1 + 10 ^(-score_cp*K/400))
when score_cp is 100 and scoring_rate = 0.9

K = -400/score_cp * log10((1/scoring_rate) - 1)
K = -400/100 * log10((1/0.9) - 1)
K = 3.8

So for SF the approximate sr (based only on 1 pawn advantage tests) is,
sr = 1/(1 + 10 ^(-score_cp*3.8/400))

Back to Nc6, ce = -3
sr = 1/(1 + 10 ^(-score_cp*3.8/400))
sr = 1/(1 + 10 ^(-(-3)*3.8/400)) = 0.4836

old_sr = 0.49568
new_sr = 0.4836

Using the new sr of 0.4836, the points by using a factor of 100 would then be,
pt = 100 * new_sr = 100 * 0.4836 = 48
Nc6=48
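
The derivation of K can be checked with a few lines of Python. This is a sketch under the assumptions already stated in the text: the 80/20/0 match result is the hypothetical one used above, not a measured result, and the function names are invented for the example.

```python
import math


def scoring_rate(score_cp: float, k: float = 1.0) -> float:
    """Logistic conversion with the engine-specific factor K."""
    return 1.0 / (1.0 + 10.0 ** (-score_cp * k / 400.0))


def match_scoring_rate(wins: int, draws: int, losses: int) -> float:
    """Game scoring rate: a win counts 1, a draw 1/2."""
    return (wins + draws / 2.0) / (wins + draws + losses)


def solve_k(score_cp: float, rate: float) -> float:
    """Invert the logistic formula for K at a known score and scoring rate."""
    if score_cp == 0:
        raise ValueError("score_cp must be non-zero")
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must be strictly between 0 and 1")
    return -400.0 / score_cp * math.log10(1.0 / rate - 1.0)


if __name__ == "__main__":
    rate = match_scoring_rate(80, 20, 0)   # hypothetical match from the text
    k = round(solve_k(100, rate), 1)       # 1 pawn = 100 cp, K rounded as above
    print(k)
    print(round(scoring_rate(-3, k), 4))   # Nc6 with the new K
```

Note that `solve_k` cannot handle a scoring rate of exactly 0 or 1, so a match that ends 100-0 gives no finite K.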

When applied as a test suite, we take the average points after the tests. If an engine gets an average score of total_pt/total_pos = 35, we can say that this engine may perform at a 35% scoring probability according to SF, the engine that was used to score the test set. This can also be used to test humans, and their average score can be interpreted the same way, i.e. according to SF.
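
A minimal Python sketch of that averaging step, assuming each position carries a table of move-to-points and that a move missing from the table scores 0 (the text does not say how unlisted moves are handled). The name `suite_average` is made up for the example.

```python
from typing import Dict, List


def suite_average(point_tables: List[Dict[str, int]], chosen_moves: List[str]) -> float:
    """Average points per position: total_pt / total_pos."""
    if not point_tables or len(point_tables) != len(chosen_moves):
        raise ValueError("need one chosen move per position")
    total_pt = sum(
        table.get(move, 0)  # assumption: unlisted moves earn no points
        for table, move in zip(point_tables, chosen_moves)
    )
    return total_pt / len(point_tables)


if __name__ == "__main__":
    # Illustrative only: the same Nc6=50 / Bd7=31 table used for two positions.
    tables = [{"Nc6": 50, "Bd7": 31}, {"Nc6": 50, "Bd7": 31}]
    print(suite_average(tables, ["Nc6", "Bd7"]))
```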

Yours could be more complicated, as you consider updates, perhaps from stronger engines, or from the same engine at a higher search depth, and other factors.