Mathematically there seem to be two ways of pinning elo with the Ordo approach:
(1) Assume that the pinned engines have played a large number of games among each other with the expected score.
(2) Drop the elos of the pinned engines from the equations, and likewise drop the equations that match the scores of the pinned engines. In that way you keep the same number of variables and equations.
It is not clear to me if both methods are mathematically equivalent.
For BE there is no problem. BE maximizes a certain function (the likelihood function). Pinning just changes that function.
Ordo v0.7
Moderators: hgm, Rebel, chrisw
-
- Posts: 1539
- Joined: Thu Mar 09, 2006 2:02 pm
Re: Ordo v0.7
Just my 2 cents: 0.x is purely academic. The only "persons" who are looking at ratings are humans, and humans can't distinguish 10 Elo (my personal border), not to talk about 0.1. Even for engine development 1 Elo +/- is more than enough. Besides that, to calculate the Elo internally you can use as many digits as you like and you would be off 0.9 Elo max. Showing 0.x is "comic" (at least to me) as us humans can't see/feel/taste that.
michiguel wrote:
It is not comic if you play 80k games or more. At one point, you may have an error of 2 elo points or so. Generally, it is good to have two significant figures for the error, and the value should have the same number of decimals as the error. For those situations, one decimal is the way to go.
With fewer games, integer numbers will probably suffice, but if I have to choose an output, I pick the one that satisfies most situations, and that is what Ordo has now. I could make this variable and add a switch for it, but it would be overkill. If you really want to manipulate the output, Ordo already gives you the chance. Just select the .csv output (comma-separated values). That format is compatible with Excel or any other spreadsheet. Just double click it and format it any way you like.
Miguel
PS: Some of the issues that make BE alter its scale, in terms of what elo number equals what, were already discussed a lot. That is the origin of the discrepancy (I believe). I think Kai Laskos (and Michel too) is the one that followed this most closely. For Ordo, if you want to see what x rating points equal what probability to win, you can use the switch -T and you will get a table of probabilities (you can alter this if you want).
But, it doesn't hurt. If you want to do it ...
Yes, your winning-probability conversion is (as expected) about 7 Elo per 1% for the first 10 Elo of difference. I usually use that 7 Elo rule of thumb for close engines while a test is running.
Thanks for the tool again, I usually have a look at my games with it!
Bye
Ingo
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Ordo v0.7.
Hello:
I did not expect changes in Ordo ratings between versions 0.6 and 0.7.
Vinvin wrote:
I ran Ordo 0.7 on my latest list ( http://www.talkchess.com/forum/viewtopic.php?t=48738 ):
Comparison to 0.6: 8 points difference at the top and 33 at the bottom.
By the way, I want to ask about this tool for comparing two files. Where can it be downloaded? What is its name? I find it very useful. Thanks in advance.
Sorry for going off-topic.
Regards from Spain.
Ajedrecista.
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: Ordo v0.7.
It's an option in "Total Commander" ( http://www.ghisler.com/ ).
Ajedrecista wrote:
Hello:
I did not expect changes in Ordo ratings between versions 0.6 and 0.7.
Vinvin wrote:
I ran Ordo 0.7 on my latest list ( http://www.talkchess.com/forum/viewtopic.php?t=48738 ):
Comparison to 0.6: 8 points difference at the top and 33 at the bottom.
....
By the way, I want to ask about this tool for comparing two files. Where can it be downloaded? What is its name? I find it very useful. Thanks in advance.
Sorry for going off-topic.
Regards from Spain.
Ajedrecista.
Select 2 files and then go to the menu "Files" -> "Compare by content".
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Ordo v0.7
To pin the Elo rating all you have to do (and it's mathematically sound) is to add or subtract a constant from all engines' ratings. For example, if Shredder comes out at 2750 and you want it to be the reference program at 2800, you would add 50 Elo to all programs.
Michel wrote:Mathematically there seem to be two ways of pinning elo with the Ordo approach
(1) Assume that the pinned engines have played a large number of games
among each other with the expected score.
(2) Drop the elos of the pinned engines from the equations, and likewise drop the equations that match the scores of the pinned engines. In that way you keep the same number of variables and equations.
It is not clear to me if both methods are mathematically equivalent.
For BE there is no problem. BE maximizes a certain function (the Likelihood function). Pinning just changes that function.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: Ordo v0.7
BayesElo can do that, using the "offset" command. But you can only pin one engine, obviously. So I don't really understand Michel's question. How could you pin several engines without inducing some strange distortion in the model?
Don wrote:
To pin the Elo rating all you have to do (and it's mathematically sound) is to add or subtract a constant from all engines. For example if Shredder comes out at 2750 and you want it to be the reference program at 2800 you would add 50 Elo to all programs.
Michel wrote:Mathematically there seem to be two ways of pinning elo with the Ordo approach
(1) Assume that the pinned engines have played a large number of games
among each other with the expected score.
(2) Drop the elos of the pinned engines from the equations, and likewise drop the equations that match the scores of the pinned engines. In that way you keep the same number of variables and equations.
It is not clear to me if both methods are mathematically equivalent.
For BE there is no problem. BE maximizes a certain function (the Likelihood function). Pinning just changes that function.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Ordo v0.7
Got it.
Michel wrote:
Typically you know the elo of the foreign engines you use for testing very accurately. After all, they may have played millions of games. I would simply like to prefeed that elo information to Ordo or BayesElo when running a new test.
I think I guess what you mean, but I am not sure. Could you give an example to illustrate what you need?
Currently I do this by having one large pgn that contains all tests I ever ran with the same set of foreign engines. But this is becoming very unwieldy.
So to give the requested example. Assume that X,Y,Z are foreign engines and a,b,c,d,e,f,g,... are test versions.
The information I have is a pgn with
X,Y,Z,a,b,c,d,e,f,g (*)
I run a test "h versus X,Y,Z"
To get accurate elo information I run say Ordo on
X,Y,Z,a,b,c,d,e,f,g,h
and consult the result.
What I would like to do is to prefeed Ordo the elo of X,Y,Z (known from (*)) and then run Ordo on
X,Y,Z,h
This won't be entirely the same of course, since the elo of X,Y,Z is not completely known (there are some small error bars remaining), but it would be good enough to compare different test versions, which typically have much larger error bars.
This is an old file I have on my computer (plus I added four "fake" games: engine x drew twice against spark, and twice against crafty). The rating is
Code: Select all
# ENGINE : RATING POINTS PLAYED (%)
1 spark : 2355.4 9320.0 16002 58.2%
2 toga-1.4 : 2321.1 8550.0 16000 53.4%
3 Gaviota_ke152 : 2319.4 21115.0 40000 52.8%
4 texel-1.01 : 2318.6 8493.0 16000 53.1%
5 glaurung-2.2 : 2317.7 8473.0 16000 53.0%
6 x : 2315.0 2.0 4 50.0%
7 Gaviota_853-rt2 : 2305.0 20298.5 40000 50.7%
8 fruit-051103 : 2293.4 7919.0 16000 49.5%
9 daydreamer-1.5x : 2291.8 7883.5 16000 49.3%
10 Gaviota_853 : 2288.2 19346.5 40000 48.4%
11 komodo-1.3-4s : 2283.6 7696.5 16000 48.1%
12 scorpio-2.7 : 2279.5 7603.5 16000 47.5%
13 Gaviota_851 : 2275.1 18607.5 40000 46.5%
14 crafty-23.4 : 2274.6 7493.5 16002 46.8%
15 critter-1.0-32-3s : 2261.8 7202.5 16000 45.0%
Code: Select all
# ENGINE : RATING POINTS PLAYED (%)
1 spark : 2350.0 9320.0 16002 58.2%
2 toga-1.4 : 2306.0 8550.0 16000 53.4%
3 Gaviota_ke152 : 2304.3 21115.0 40000 52.8%
4 texel-1.01 : 2303.5 8493.0 16000 53.1%
5 glaurung-2.2 : 2302.6 8473.0 16000 53.0%
6 x : 2300.0 2.0 4 50.0%
7 Gaviota_853-rt2 : 2289.9 20298.5 40000 50.7%
8 fruit-051103 : 2278.2 7919.0 16000 49.5%
9 daydreamer-1.5x : 2276.7 7883.5 16000 49.3%
10 Gaviota_853 : 2273.1 19346.5 40000 48.4%
11 komodo-1.3-4s : 2268.5 7696.5 16000 48.1%
12 scorpio-2.7 : 2264.4 7603.5 16000 47.5%
13 Gaviota_851 : 2260.0 18607.5 40000 46.5%
14 crafty-23.4 : 2250.0 7493.5 16002 46.8%
15 critter-1.0-32-3s : 2246.7 7202.5 16000 45.0%
Code: Select all
# ENGINE : RATING ERROR POINTS PLAYED (%)
1 spark : 2350.0 ---- 9320.0 16002 58.2%
2 toga-1.4 : 2306.0 5.2 8550.0 16000 53.4%
3 Gaviota_ke152 : 2304.3 3.8 21115.0 40000 52.8%
4 texel-1.01 : 2303.5 5.5 8493.0 16000 53.1%
5 glaurung-2.2 : 2302.6 5.0 8473.0 16000 53.0%
6 x : 2300.0 348.0 2.0 4 50.0%
7 Gaviota_853-rt2 : 2289.9 3.7 20298.5 40000 50.7%
8 fruit-051103 : 2278.2 5.8 7919.0 16000 49.5%
9 daydreamer-1.5x : 2276.7 5.0 7883.5 16000 49.3%
10 Gaviota_853 : 2273.1 3.9 19346.5 40000 48.4%
11 komodo-1.3-4s : 2268.5 5.6 7696.5 16000 48.1%
12 scorpio-2.7 : 2264.4 5.0 7603.5 16000 47.5%
13 Gaviota_851 : 2260.0 3.7 18607.5 40000 46.5%
14 crafty-23.4 : 2250.0 ---- 7493.5 16002 46.8%
15 critter-1.0-32-3s : 2246.7 5.8 7202.5 16000 45.0%
This hack was done by inserting at the beginning:
Code: Select all
{
int prefed = 0; // number of pins
int j;
for (j = 0; j < N_players; j++) {
Prefed[j] = FALSE;
if (!strcmp(Name[j], "spark") ) {
Prefed[j] = TRUE;
Ratingof[j] = 2350;
}
if (!strcmp(Name[j], "crafty-23.4") ) {
Prefed[j] = TRUE;
Ratingof[j] = 2250;
}
if (Prefed[j]) prefed++;
}
}
Code: Select all
for (j = 0; j < N_players; j++) {
if (Prefed[j]) continue;
.... // main calculation here, where the ratings are adjusted step by step
}
Miguel
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: Ordo v0.7
I think you should first read the example I have presented in my post.
lucasart wrote:
But you can only pin one engine, obviously. So I don't really understand Michel's question. How could you pin several engines without inducing some strange distortion in the model?
(in short: you know the elo of foreign engines accurately and you don't want to throw that information away for every new test)
For BayesElo there is no theoretical problem. Just fill in the pinned elos as constants in the likelihood function and maximize over the unpinned elos. The way MLE works implies that the matches between pinned engines (if there are such) will simply be ignored.
For Ordo it was initially a bit less clear to me what the correct theoretical solution is (I still need to read Miguel's post below). I was proposing to make the elos of the pinned engines constants and drop the corresponding equations for the scores. This is what you get if you think of the Ordo model as a "drawless" BayesElo model.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: Ordo v0.7
First of all thanks for implementing this!
If I understand correctly, you are pinning these at an elo difference of 100, which is quite far from their measured difference (80 elo) in this pgn.
michiguel wrote:
Then I re-run this, "pinning" spark to 2350 and crafty to 2250.
This would of course only make sense if you had a lot more games somewhere else which indicated that the elo difference was really 80.
Yes, I see! That corresponds to the implementation I have in mind. I think it is theoretically the correct solution if you think of Ordo as implementing a "drawless" MLE estimator.
michiguel wrote:
I can do this just by converting the elo of crafty and spark to a constant, not a parameter. In each iteration, when it comes time to "adjust" crafty's or spark's rating, it just doesn't. That means it is treated as a constant.
Yes, of course. As I said above, pinning spark and crafty at 80 only makes sense if you have other information.
michiguel wrote:
So, spark and crafty will have 2350 and 2250, and the rest of the engines will adjust around these numbers. Engine x behaves as expected (it will be exactly at the average of both engines), but the rest is not that simple: they cannot compress so easily, since they have played each other so many games that it keeps them at a certain distance from each other.
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Ordo v0.7
Yes, I chose something different to observe the effect, particularly on "engine x".
Michel wrote:
First of all, thanks for implementing this!
If I understand correctly, you are pinning these at an elo difference of 100, which is quite far from their measured difference (80 elo) in this pgn.
michiguel wrote:
Then I re-run this, "pinning" spark to 2350 and crafty to 2250.
Yes, it would save reloading and recalculating the whole thing.
Michel wrote:
This would of course only make sense if you had a lot more games somewhere else which indicated that the elo difference was really 80.
Miguel