Continued: Finding equivalent times for different engines

Sven · Post by **Sven** » Sat May 21, 2011 6:35 pm

I create this thread to continue the technical discussion started by Uri Blass in this thread, since I believe the topic fits well into the "Programming and Technical Discussions" subforum.

As a very short summary, the topic is how to "quickly" find a time control like 1 minute for engine A and N minutes for B for which A and B have equal strength over many games. Uri made a good proposal for it using a simple formula, and I added that a formula based on some form of ELO calculation might be better.

Uri Blass wrote:I agree that it is possible to try to calculate rating based on assumptions that doubling of the time give 70 elo or something like that but it does the model more complicated when it is not clear if you get big improvement from it.

In the same post Uri has also made an enhanced proposal.

Further down I will now present my attempt of defining such a model. And yes, that model is more complex than the one proposed by Uri. So one question is, which model is better.

Sven

Sven · Post by **Sven** » Sat May 21, 2011 6:39 pm

The following describes what we know so far, and provides some definitions that are used later on:

Code: Select all

- Two engines E1, E2 with unknown ELO rating are playing a sequence of time-odds chess games. The idea is to find a time control for which E1 and E2 have equal strength over many games.


- E1 is always assigned the same time T0 per game (T0 > 0).


- In game number i, E2 plays with color C(i) which is an element of { W, B }.


- In game number i, E2 is assigned a time of T(i) per game (T(i) > 0).


- The time-odds factor for E2 in game i is therefore: K(i) = T(i) / T0


- To remain fully independent from any "a priori" knowledge about the strength of E1 and E2 and also from any external input, K(1) is set to 1 (alternatively it would also be possible to let K(1) be defined externally, e.g. as user input).


- The result of E2 in game number i is W(i) and is an element of { 0, 0.5, 1 }.


- After N games the total result of E2 is: WT(N) = SUM(i=1..N) (W(i))


- The available record of information after N games is the set S(N) containing N tuples G(i) = { C(i), T(i), W(i) } where each tuple represents one game and i is the game number.


- The goal is to find a function F that takes S(N) as input and returns T(N+1) such that the function P(X) := WT(X)/X converges "sufficiently fast" towards 0.5 for increasing X.


- The quality Q(F,E,M) of a function F is defined in terms of the number of games M needed to converge as described above with a given error rate E, where Q(F,E,M) is higher for smaller M.

  (OPEN ISSUE: how to formally define the convergence condition and the quality Q(F,E,M) ?)


- The reason for a formal definition of Q(F,E,M) is that a program that implements F should be able to terminate automatically if either
    (i)  M        >= M_max (maximum allowed number of games reached), or
    (ii) Q(F,E,M) >= Q_min (minimum desired quality reached).

Sven

Sven · Post by **Sven** » Sat May 21, 2011 6:43 pm

Two proposals for F exist, both derive T(N+1) only from G(N):

Code: Select all

  a) initial proposal by Uri Blass ("F_Uri_initial")

     if W(N) == 0 then T(N+1) := T(N) * (1 + 1/sqrt(N))
     else
     if W(N) == 1 then T(N+1) := T(N) / (1 + 1/sqrt(N))
     else
     /* W(N) == 0.5 */
                       T(N+1) := T(N)

  b) enhanced proposal by Uri Blass ("F_Uri_enhanced")

     if W(N) == 0   and C(N) == B then T(N+1) := T(N) * (1 + 1/sqrt(N+Z1))
     else
     if W(N) == 0   and C(N) == W then T(N+1) := T(N) * (1 + 1/sqrt(N+Z2))
     else
     if W(N) == 0.5 and C(N) == B then T(N+1) := T(N) * (1 + 1/sqrt(N+Z3))
     else
     if W(N) == 1   and C(N) == B then T(N+1) := T(N) / (1 + 1/sqrt(N+Z1))
     else
     if W(N) == 1   and C(N) == W then T(N+1) := T(N) / (1 + 1/sqrt(N+Z2))
     else
     /* W(N) == 0.5 and C(N) == W */   T(N+1) := T(N) / (1 + 1/sqrt(N+Z3))

     with proposed values Z1 := 19, Z2 := 20, Z3 := 200

The question is now, do these two proposed functions have the desired convergence property?

I have not analyzed this yet in depth but a quick simulation attempt did not show clearly the expected convergence, so I have investigated in a different model.

Sven

Sven · Post by **Sven** » Sat May 21, 2011 6:46 pm

The definition of a function "F_Sven_initial" in the given context is developed below.

Code: Select all

- The main idea is to define an estimated ELO rating difference D(N) after N games based on all information available with S(N), and then using D(N) to derive T(N+1).


- D(N) would be defined as R2(N) - R1, where R1 is an arbitrary, fixed ELO rating of E1 and R2(N) is the estimated ELO rating of E2 after the N games described by S(N).


- The key point is that D(N) shall estimate the ELO rating difference for games with equal time control but is based on S(N) which contains time-odds games.


- A further assumption would be that it is possible to estimate the relation between the ELO ratings of engines E2 and E3, where E3 is identical to E2 but runs with double speed compared to E2, by adding a constant rating difference DR:

  R3 := R2 + DR

  where existing proposals for realistic values of DR vary between DR = 70 and DR = 100.


- As a generalization of the latter, I assume that the ELO rating "advantage" of E2 when playing E1 with a time-odds factor K compared to playing E1 with equal time control can be estimated by:

  RA2(K) := R2(K) - R2(1) = DR * log2(K)

  , at least for sufficiently small rating differences between E2 and E1.

  Note that this "advantage" RA2(K) can also be negative for K < 1.


- The estimated ELO rating difference after N games "with time-odds advantage", DTO2(N), can be derived directly from the total result WT(N), by using the logistic curve:

  WT(N) / N = 1 / (1 + 10 ^ -(DTO2(N) / 400))

  so we get:

  DTO2(N) = -400 * log10(N / WT(N) - 1)


- The average "time-odds advantage" ARA2(N) of E2 over all N games could be defined as the arithmetic mean of RA2(K), based on the generalized assumption above:

  ARA2(N) := (DR / N) * sum(i=1..N) (log2(K(i)))


- Now I have a first result:

  D(N) := DTO2(N) - ARA2(N)

  which means that I get the "real" estimated rating difference based on equal time control by subtracting the average "time-odds advantage" from the rating difference that is shown by the game results.


- Since WT(N) / N shall converge towards 0.5 and therefore DTO2(N) shall converge towards 0, D(N) shall converge towards -ARA2(N).


- Similar to the generalized assumption above I can state that the time-odds factor K which would be necessary to equalize the ELO rating difference D(N) could be described by:

  D(N) = -DR * log2(K)

  (note that a positive "time-odds advantage" and therefore K>1 means a negative D(N)) which leads to this formula for the next K to use in game N+1:

  K(N+1) := sqrt(2 ^ ((ARA2(N) - DTO2(N)) / DR))

          = sqrt(2 ^ (  (1 / N) * sum(i=1..N) (log2(K(i)))
                      + (400 / DR) * log10(N / WT(N) - 1)
                     )
                )

  Note also that sqrt("...") is used instead of "..." to avoid jumping values and to allow for a smooth convergence.

Sven

Sven · Post by **Sven** » Sat May 21, 2011 6:47 pm

Open issues include:

- Verification of formal correctness of my model described above

- How to define convergence condition and quality function?

- Verification of convergence and quality for all proposed functions F by simulation and/or by real tests

I hope this makes sense. Qualified comments are welcome!

Sven

Uri Blass · Post by **Uri Blass** » Sat May 21, 2011 7:10 pm

My reply to the last post of Muller in that thread:

The problem is that
I am not sure about the Elo function and there may be diminishing returns so the program does not earn fixed elo from multiplying the time control
by a constant.

I am practically also interested in time handicap that cause the program to be equal and not in elo(I know that elo is dependent on games against many opponents and one opponent may be slightly misleading).

If having one opponent is a weakness in testing then we can learn how much it can be misleading in practice by this type of testing(espacially when usually this type of testing is done by version X+1 against version X when you get result that is close to 50% and we are not interested in the exact elo difference).

Edit:Example if we have a case when
B needs 2 minutes against 1 minutes of A and C needs 3 minutes against 2 minutes of B then we can see how much is wrong the assumption that C needs 3 minutes against 1 minute of A.

Note that based on the FRC CCRL list I find that in almost all cases the program with the higher rating win a match of 100 games so I believe that it is hard to tune a program to win a match against a specific program at least if we talk about 50 elo difference and not about 10 elo difference.

Note that we should use random positions out of many fixed opening positions in every game without books and with books it is easy to construct engines when A beat B, B beat C. and C beat A(regardless of the time control simply because book A leads to winning position against B book B leads to winning position against C and book C leads to winning position against A).

I do not think that random positions is the best idea to converge fast and
it may be better to have a fixed order of positions that are different enough to expose the weakness and the strength of different programs(so an engine that is weak in some type of opening positions get the right percentage of these positions in the first games and the percentage of these positions is not dependent on luck) but it is not an easy task to find the right positions.

Uri Blass · Post by **Uri Blass** » Sat May 21, 2011 7:23 pm

Sven Schüle wrote:Open issues include:

- Verification of formal correctness of my model described above

- How to define convergence condition and quality function?

- Verification of convergence and quality for all proposed functions F by simulation and/or by real tests

I hope this makes sense. Qualified comments are welcome!

Sven

I think that we can simply repeat the test many times when we finish a test only after we get error that is smaller than a specific number(let say 0.05 minutes)

If we get in 95% of the cases that B needs 1.95-2.05 minutes against A with a possible error of 0.05 minute then our error function seems to be a good estimate for the real error.

After having a good error function it is possible to compare different models by the number of games that we need to get a specific error(and smaller number are better).

I think that we probably need O(n*n) games for an error of 1/n.

Sven · Post by **Sven** » Sat May 21, 2011 8:06 pm

Uri Blass wrote:I think that we can simply repeat the test many times when we finish a test only after we get error that is smaller than a specific number(let say 0.05 minutes)

If we get in 95% of the cases that B needs 1.95-2.05 minutes against A with a possible error of 0.05 minute then our error function seems to be a good estimate for the real error.

Your mathematical skills are probably a lot higher than mine, so please excuse my silly questions below

How exactly would you like to decide about termination? How do you calculate that error, given a sequence of N games with their individual results W(i) and their individual time assignments T(i)?
a) Do you want to take the average of all T(i), and see how many of them are close enough to it, or
b) do you want to take the most recent time value T(N) as a reference, instead of the average?
I assume b) but I am not sure about your intention.

In your first model you proposed not to change the time factor in case of a draw, which poses the question when to terminate if there are a lot of draws in a row. (Maybe this would be solved by applying the error function, but the problem described next would remain open.)

In my model I have focussed on letting the total sum of all game results converge towards 0.5 * N, as opposed to your strategy of letting the time converge. A possible problem of letting the time converge could be that you have no clear statement about the question whether the initial requirement "equal strength" has been fulfilled, and I could imagine a case where you stop based on the error function for the time values but the sum of game results differs a lot from 0.5 * N. One could also define an error function based on my convergence model, so what do you think about it?

Sven

Uri Blass · Post by **Uri Blass** » Sat May 21, 2011 8:34 pm

I want to take the most recent time value T(N) as a reference(note that I do not think that my formula to increase the time is the best formula and probably it is better to have a bigger change in the time when the difference between the total result and 50% is bigger and not only when there are small number of games)

The total sum of all results in all games also converge to 50% but it is an indirect result and I do not have a rule to increase the time of the side that scored less than 50% as long as the total result is less than 50% and my target is to increase the time to have expected result of 50% in the next game(assuming colors unknown).

Uri Blass · Post by **Uri Blass** » Sat May 21, 2011 8:41 pm

I did not explain how to calculate the error and it is not something that I can say immediately without thinking about it.

We probably need some model about rating to calculate the distribution of T(N) but there is a also a practical test.

Simply repeat the test for N games many times and you will get different values of T(N)

By looking at the results we can see if our theoretical distribution of T(N)
make sense.

Continued: Finding equivalent times for different engines

Continued: Finding equivalent times for different engines

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine

Re: Continued: Finding equivalent times for different engine