more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: more on engine testing

Post by Carey »

bob wrote:
Carey wrote:
bob wrote:
Bill Rogers wrote:Carey
A chess program without an evaluation routine cannot really play chess in the normal sense. It could only make random moves, no matter how fast it could search. A search routine all by itself only generates moves and does not know which move is better. In fact, it does not even know when it makes a capture or what the piece is worth.
Even the simplest evaluation subroutine, one which understands good captures from bad ones, won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any kind of real chess-playing skill, computer or not.
Bill
This is not quite true. See Don Beal's paper about a simple random evaluation. It played quite reasonable chess. And due to the way the search works, it managed to turn that random evaluation into a strong sense of mobility. I can explain if you want. But it actually does work...
Bob,

If you don't mind, I'd like to hear the explanation.

I don't happen to have that issue of the ICCAJ and I don't see the article on the web.

Who knows, maybe in my next program, I'll just stick a random number generator in there... :P It'd be a heck of a lot easier... :lol:


Seriously though, from what you've been reporting in this and the other tests (ripping out chunks of Crafty's eval, hash errors being irrelevant, etc.), it's really starting to sound like a "just get things in the general area, and let the search do the hard work for you" kind of thing.

Just get the evaluator reasonable, add a few special cases for things the search & eval can't handle if it's a leaf, and just let the search do its thing.

Carey
Sure. Let's say you produce an evaluation function that is pure material, plus a random number between -1/2 pawn and +1/2 pawn. So your search will at least play sane moves and not throw material away. But how does it play decently?

Here's the idea. At a position P (and you can even suppose it is one ply away from leaf positions for simplicity) if you have a large number of legal moves, then each time you make one and evaluate the resulting position, you get a random component in your evaluation. Since there are a large number of legal moves in this position, you have a good chance of getting a "good" random number.

In another position at the same depth, you are in check, so you only have one possible move, and one chance to get a good random evaluation.

What Don found was that branches where you have lots of alternatives give you a better chance to get good random numbers, while branches where you have few alternatives make you "get what you get". And if you follow that, you will note that it is actually a "poor-man's mobility" approximation.

The results were surprising when I first saw them, but after thinking about the explanation, it was one of those "of course..." type realizations...

Hope that helps...
Interesting!

You're right, it's a poor-man's mobility: a first-order approximation based on randomness, driven by the mobility left after the previous move.

And you depend on the search itself to smear out the exact nature of the randomness. The deeper the search, the less direct effect the randomness has on the root. Who cares if the 10th ply move is garbage as long as it results in the 9th ply move being good enough to cause the 8th ply move to be good enough for the 7th ply etc. etc.
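Here is a toy sketch of why that works: the backed-up (negamax) value of a node is the maximum over its children, and the maximum of many random draws tends to be larger than the maximum of a few. The numbers and names below are purely illustrative, not taken from Beal's paper.

Code:

import random

def backed_up_score(num_moves, trials=10000):
    # With a random evaluation at every child, the backed-up value of a node
    # with num_moves legal replies is the maximum of num_moves random draws.
    return sum(max(random.uniform(-0.5, 0.5) for _ in range(num_moves))
               for _ in range(trials)) / trials

for moves in (1, 5, 20, 40):
    print(moves, round(backed_up_score(moves), 3))

# The average climbs with the move count: more legal moves means a higher
# expected backed-up score, i.e. an implicit mobility bonus, exactly the
# "poor-man's mobility" described above.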

Maybe I will put a random generator in my next eval! :lol: It'd be an interesting conversation piece with friends.

Thanks.
Tony

Re: more on engine testing

Post by Tony »

Hi Bob,

did you check whether the different results come from all the engines, or whether the results against one particular engine change more?

i.e., if there is only one engine that behaves in a dependent way, then the whole test would behave that way.

Tony
User avatar
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Uri Blass wrote:Richard wrote
"Bayesian elo rated one version 40 points higher than the other versions"
I understood him to mean that one version got a rating of 2540 based on only 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except in extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.
The rating model of BayesElo is such that a draw effectively is twice as significant as a win or loss. This is because it uses F(x) = 1/(1+exp(x/400)) as the expected score as a function of rating difference x, and it assumes a draw probability F(x+d) - F(x-d), with some fit parameter d set to reproduce the average draw rate. As d is small, this makes the draw probability essentially a derivative, F'(x) ~ exp(x/400)/(1+exp(x/400))^2 = 1/(1+exp(x/400)) * 1/(1+exp(-x/400)) = F(x) * F(-x).

So the probability for a single draw is the same as the probability of a win (~F(x)) and a loss (~F(-x) = 1-F(x)), i.e. 2 games.

That means if you get 0.5 points out of 2 games, against a weak and a strong opponent, you get a higher rating if you drew the strong opponent than if you drew the weak opponent. This is because the rating is exactly the same as if you had one win out of 3 games, but in the former case 2 of the 3 games would be against the strong opponent, while in the latter case they would be against the weaker opponent. So the effective average opponent strength would be higher in the former case, while you achieved the same result.

So it matters where you score your draws: if the draws are against the stronger opponents, BayesElo will grant you a higher rating than when they are against the weak opponents.
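A small numerical check of that claim, using the plain-text formulas from the post above; the rating differences and the value of d below are illustrative, not BayesElo's actual fit. It shows that for small d the draw probability is proportional to F(x)*F(-x), i.e. a draw carries the weight of a win plus a loss.

Code:

import math

def F(x):
    # Expected score as a function of rating difference x, as in the post above.
    return 1.0 / (1.0 + math.exp(x / 400.0))

d = 20.0  # draw-model parameter (illustrative value)
for x in (-200.0, 0.0, 200.0):
    draw = F(x - d) - F(x + d)      # draw probability (sign chosen to be positive with this decreasing F)
    win_times_loss = F(x) * F(-x)
    print(x, round(draw / win_times_loss, 4))

# The ratio is essentially constant (about 2*d/400 = 0.1 here), confirming that
# the draw probability behaves like the product of a win and a loss probability.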
User avatar
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:And if one would recognize that I have already done as much of the above as is possible, we wouldn't be wasting time. Games are run at random times. They are run on random cores. Since the games last for random amounts of time, things are further scrambled. If you'd like, I can certainly run a couple of tests and list which node each match script ran on, to show that it is different every time due to the way the SGE scheduler works. But don't let details interfere with blindness.
Interesting. Because if the games of two long runs between the same opponents were randomly interleaved, the actual act of playing them becomes the same no matter how you interleave them. They would all simply be games between the same two opponents.

So even if the computer on which the games were played would totally doctor the results, e.g. let A win in all games of the first half, and B in the second, or A in all odd games, and B in all even games, it could not affect the difference in result of the long runs. These results would remain distributed with a standard deviation as if the games were totally uncorrelated, as it would just be the equivalent of randomly drawing half the marbles out of a vase with colored marbles. And that is totally insensitive to the algorithm that was used to color the marbles, whether they were colored in groups and with intent, or randomly. The act of drawing them later would totally randomize the result again.

So if you observe a correlation in that case, it cannot be due to the computers that play the games at all. It must be the random selection process that decides which game counts for which run.
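For what it's worth, the marble argument is easy to check numerically. A minimal sketch (illustrative numbers only): take 200 game results "colored" by the computer in the most correlated way imaginable, randomly split them into two runs, and compare with the same totals colored at random.

Code:

import random

def run_split_stats(results, trials=20000):
    # Randomly assign half the games to "run 1" and report the mean and spread
    # of run 1's score over many random splits.
    scores = []
    for _ in range(trials):
        shuffled = random.sample(results, len(results))
        scores.append(sum(shuffled[:len(results) // 2]))
    mean = sum(scores) / trials
    sd = (sum((s - mean) ** 2 for s in scores) / trials) ** 0.5
    return round(mean, 2), round(sd, 2)

doctored = [1] * 100 + [0] * 100              # A wins the entire first half, loses the second
random_colors = random.sample(doctored, 200)  # same totals, ordered at random
print(run_split_stats(doctored))              # roughly (50.0, 3.5)
print(run_split_stats(random_colors))         # essentially the same distribution

Either way the spread of the split is the same, so any correlation between two randomly interleaved runs would have to come from how games were assigned to runs, not from how the computer played them.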

But of course what you say above totally denies what you were saying earlier: you said you started one run after the other finished. So the games of the two runs were not randomly interleaved at all, which they should have been if one of the runs was intended as a background correction for the other, to correct for low-frequency noise in the engine strength. So either you were bullshitting us then, or you are bullshitting us now...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:And if one would recognize that I have already done as much of the above as is possible, we wouldn't be wasting time. Games are run at random times. They are run on random cores. Since the games last for random amounts of time, things are further scrambled. If you'd like, I can certainly run a couple of tests and list which node each match script ran on, to show that it is different every time due to the way the SGE scheduler works. But don't let details interfere with blindness.
Interesting. Because if the games of two long runs between the same opponents were randomly interleaved, the actual act of playing them becomes the same no matter how you interleave them. They would all simply be games between the same two opponents.

So even if the computer on which the games were played would totally doctor the results, e.g. let A win in all games of the first half, and B in the second, or A in all odd games, and B in all even games, it could not affect the difference in result of the long runs. These results would remain distributed with a standard deviation as if the games were totally uncorrelated, as it would just be the equivalent of randomly drawing half the marbles out of a vase with colored marbles. And that is totally insensitive to the algorithm that was used to color the marbles, whether they were colored in groups and with intent, or randomly. The act of drawing them later would totally randomize the result again.

So if you observe a correlation in that case, it cannot be due to the computers that play the games at all. It must be the random selection process that decides which game counts for which run.

But of course what you say above totally denies what you were saying earlier: you said you started one run after the other finished. So the games of the two runs were not randomly interleaved at all, which they should have been if one of the runs was intended as a background correction for the other, to correct for low-frequency noise in the engine strength. So either you were bullshitting us then, or you are bullshitting us now...
What I said above doesn't "totally deny" anything. It might be a bit of a vocabulary issue, but here is what is done. First, let's define a "run" as a complete set of games: 40 positions, 4 games per position, 5 opponents to play against the current Crafty, N repeats. N varies, but it is typically set to 32, so that one opponent plays Crafty 40 * 4 * 32 = 5120 games. And with 5 opponents, that is 5 * 5120 games in one run, or 25,600 total games. That matches the second of the two sets of results I posted. The first set used a number smaller than 32, to produce just 800 games. This is called a run.

To actually perform this test, assuming I am using the entire cluster of 260 processors, I have a shell script that first creates a "command" to play each single position 4 times, alternating colors (I can also produce 4 separate commands, but one command is used for efficiency, which I will explain in a minute). If you do the math, that turns into 25,600 / 4 = 6,400 commands. These are saved in a file. I then run, on the "head node" (not one of the 130 nodes we have), a program that submits N of these commands to the SGE queueing system. Typically N is 300 here, so that there are 260 running and 40 waiting to run when anything finishes. As one of these 4-game mini-matches completes, the program fires off another command from the file, so that there are always "jobs waiting in the queue".

The SGE engine schedules the jobs on random nodes. Whenever one finishes, the grid engine waits until the node shows that nothing is left running (the load average drops to near zero) and then picks the next command in the queue and schedules it on that processor. This continues until all of the commands have been executed and the test is completed. The initial seeding of the commands is completely random, and then as a command finishes, the next one falls into that node and begins. I'd be happy to show you the scheduling log to see what ran where. It is always different, but since all nodes are absolutely identical, that would make no difference. I have the option of rebooting a node before using it. Since these are bare linux nodes, that takes under 30 extra seconds, but I have found no difference between booting and not booting with respect to the variability of the results.

Hopefully, that explains the process clearly and precisely. We have the grid engine configured so that it will _never_ start a command on a node whose load average is not below 0.05, which means nothing is running there. The only downside is that this wastes quite a bit of time, since it takes a while for the load average to settle to near zero after being at 1.0 for a while.
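As a rough sketch of the submission loop described above (purely illustrative: the script names, the contents of each 4-game mini-match, and the queue limit are assumptions; only qsub and qstat are standard SGE commands):

Code:

import subprocess, time

OPPONENTS, REPEATS, POSITIONS = 5, 32, 40     # 5 * 32 * 40 = 6400 four-game commands
MAX_IN_QUEUE = 300                            # roughly 260 running plus 40 waiting

# Hypothetical script names: opponent x, repeat y, position z.
commands = [f"mini_match_{x}_{y}_{z}.sh"
            for x in range(1, OPPONENTS + 1)
            for y in range(1, REPEATS + 1)
            for z in range(1, POSITIONS + 1)]

def jobs_in_queue():
    # qstat lists this user's pending and running jobs; skip the two header lines.
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    return max(0, len(out.splitlines()) - 2)

for cmd in commands:
    while jobs_in_queue() >= MAX_IN_QUEUE:
        time.sleep(30)                        # wait for some mini-matches to finish
    subprocess.run(["qsub", cmd])             # let SGE pick a free node for this mini-match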

Running like this, I can produce 100% repeatable results, so long as I avoid using time as the limiting constraint for the game tree searches. Node count limits work perfectly. Depth limits work perfectly. But all of the programs I am testing against do partial last iterations and time out when they feel like it, and timing jitter makes these games completely non-reproducible.

I assume you saw the post where someone played fruit vs fruit, same starting position, no book, and did not get one duplicate game? That is what I see as well. Some games actually do repeat. They are apparently forced enough that there are no last minute score / best-move changes that would be sensitive to time jitter. But most games, even if they have the same result, have a different sequence of moves at some point...

Clear enough? I have explained that exact set-up many times. It hasn't changed. There's no bullshit coming from me. You might have some in your ears, of course, but I can't help that. The nodes are intentionally assigned randomly at the start, since this cluster can also run MPI jobs, PVM jobs, etc., which depend on network bandwidth/latency as well. This randomness makes the results show "average performance", since the assignment is mixed up every time. My testing does no network activity until after the games are played and I save the PGN in a common location.

Now before you make any more comments, read the above carefully and see if there are any points you don't understand or want clarification on. I have several ways I can modify the testing. For lots of runs, I submit "commands" that play 4 games per processor rather than just one. This is more efficient, as there are fewer of those "pause until the load average drops to near zero" conditions, which waste time. I can crank this up to 8, 16, or whatever I want. Going too far begins to create commands that vary way too much in how long they run, causing load-balancing issues as the run winds down. Running too few creates a lot of idle time waiting on the load average to drop. My "automatic submit" program also looks at what else is running, and if it notices other users with jobs in the queue, it slows down the submission process to use fewer nodes and let those jobs run. As the queue empties, it picks the pace back up. This way I can run as much as I want without starving other users who are also trying to run things (nobody uses all the nodes the way I do, so there is always some major fraction of the cluster available).

any questions or comments???

Edit:

As far as the "which game goes with which run" There are two issues.

(1) One run (a run as defined above) is completed before another is started, so the 25,600 games are not intermingled with another 25,600-game run. As far as the individual games go, each game is assigned a unique "ID". I use this so that, if I choose to create them, Crafty can use this "ID" as the logfile number, so each different game gets a different logfile when I am looking for problems. It is then easy enough to collect the individual game results and "group" them in the same order as the commands were initially produced, so that if I want the first 4 games from each position, I can see those precise games. In fact, the PGN is actually stored like that. The filenames are:

matchX.Y.Z

X is a number between 1 and 5 to indicate which opponent Crafty played, Z is the position number (1-40), and Y is the number of the 4-game mini-match (1-32 in the usual case, but sometimes smaller). So it is easy to figure out the "logical order", if not the physical order, in which the games were played. I use "logical" ordering in everything I do, since the games are collected that way. In general, if you ignore the parallel activity of 260 simultaneous games, the games are actually played in that order, or at least they are started in that order; what happens after a game is started is up to the timing issues already discussed, which can make one of the four supposedly identical games take a much longer (or shorter) time than the rest.
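A tiny helper for that naming scheme, purely illustrative (the function name is made up; only the matchX.Y.Z convention comes from the post above):

Code:

def parse_match_name(name):
    # e.g. "match3.17.5" -> opponent 3, mini-match 17, position 5
    x, y, z = name[len("match"):].split(".")
    return {"opponent": int(x), "mini_match": int(y), "position": int(z)}

print(parse_match_name("match3.17.5"))
# {'opponent': 3, 'mini_match': 17, 'position': 5}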
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Tony wrote:Hi Bob,

did you check whether the different results come from all the engines, or whether the results against one particular engine change more?

i.e., if there is only one engine that behaves in a dependent way, then the whole test would behave that way.

Tony
I have tried (and posted) results against all engines. In fact, if you look at the first post in this thread, I gave the data for each individual engine. The results varied. In fact, the final standings of the engines varied... If you look at the first post, rather than looking at the two odd Elo values for Crafty, look at the individual opponent Elo numbers. Those are only games against Crafty, so for each opponent, change the sign and that would be the Elo of Crafty in that particular set of games against that specific opponent.
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

hgm wrote:
Uri Blass wrote:Richard wrote
"Bayesian elo rated one version 40 points higher than the other versions"
I understood him to mean that one version got a rating of 2540 based on only 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except in extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.
The rating model of BayesElo is such that a draw effectively is twice as significant as a win or loss. This is because it uses F(x) = 1/(1+exp(x/400)) as the expected score as a function of rating difference x, and it assumes a draw probability F(x+d) - F(x-d), with some fit parameter d set to reproduce the average draw rate. As d is small, this makes the draw probability essentially a derivative, F'(x) ~ exp(x/400)/(1+exp(x/400))^2 = 1/(1+exp(x/400)) * 1/(1+exp(-x/400)) = F(x) * F(-x).

So the probability for a single draw is the same as the probability of a win (~F(x)) and a loss (~F(-x) = 1-F(x)), i.e. 2 games.

That means if you get 0.5 points out of 2 games, against a weak and a strong opponent, you get a higher rating if you drew the strong opponent than if you drew the weak opponent. This is because the rating is exactly the same as if you had one win out of 3 games, but in the former case 2 of the 3 games would be against the strong opponent, while in the latter case they would be against the weaker opponent. So the effective average opponent strength would be higher in the former case, while you achieved the same result.

So it matters where you score your draws: if the draws are against the stronger opponents, BayesElo will grant you a higher rating than when they are against the weak opponents.
I do not know how BayesElo works, but it seems to me that it is based on wrong assumptions.

If 2 players score the same number of points against the same opponents, then it does not seem logical to me to give a higher rating to one of them.

I see that bayeselo finds the maximum-likelihood ratings based on the following link

http://remi.coulom.free.fr/Bayesian-Elo/

"bayeselo finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the Links section below."

It does not seem logical to me.
If the rating has a 51% probability of being 1500 and a 49% probability of being 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple; of course, in practice a rating is a continuous variable, not a discrete variable).
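To spell out the arithmetic in that toy example: the expected value is 0.51 * 1500 + 0.49 * 1600 = 765 + 784 = 1549, whereas the maximum-likelihood estimate simply picks the single most probable value, 1500.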

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
hgm wrote:
Uri Blass wrote:Richard wrote
"Bayesian elo rated one version 40 points higher than the other versions"
I understood him to mean that one version got a rating of 2540 based on only 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except in extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.
The rating model of BayesElo is such that a draw effectively is twice as significant as a win or loss. This is because it uses F(x) = 1/(1+exp(x/400)) as the expected score as a function of rating difference x, and it assumes a draw probability F(x+d) - F(x-d), with some fit parameter d set to reproduce the average draw rate. As d is small, this makes the draw probability essentially a derivative, F'(x) ~ exp(x/400)/(1+exp(x/400))^2 = 1/(1+exp(x/400)) * 1/(1+exp(-x/400)) = F(x) * F(-x).

So the probability for a single draw is the same as the probability of a win (~F(x)) and a loss (~F(-x) = 1-F(x)), i.e. 2 games.

That means if you get 0.5 points out of 2 games, against a weak and a strong opponent, you get a higher rating if you drew the strong opponent than if you drew the weak opponent. This is because the rating is exactly the same as if you had one win out of 3 games, but in the former case 2 of the 3 games would be against the strong opponent, while in the latter case they would be against the weaker opponent. So the effective average opponent strength would be higher in the former case, while you achieved the same result.

So it matters where you score your draws: if the draws are against the stronger opponents, BayesElo will grant you a higher rating than when they are against the weak opponents.
I do not know how BayesElo works, but it seems to me that it is based on wrong assumptions.

If 2 players score the same number of points against the same opponents, then it does not seem logical to me to give a higher rating to one of them.

I see that bayeselo finds the maximum-likelihood ratings based on the following link

http://remi.coulom.free.fr/Bayesian-Elo/

"bayeselo finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the Links section below."

It does not seem logical to me.
If the rating has a 51% probability of being 1500 and a 49% probability of being 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple; of course, in practice a rating is a continuous variable, not a discrete variable).

Uri
So you should _never_ compute a player's rating until after he has retired? Otherwise you are computing the rating after each event? The real world is after each _game_, which is how the Elo system was defined. So of course it makes a difference when you play someone, because the "when" affects their rating at the time of the game...

I don't see why you think this is an issue, or why yet another "unfounded assumption" gets posted, when this is not a new or surprising detail. Remi went to great trouble to make the game-by-game rating calculations as accurate as possible by factoring in the "white bias", whereas normal Elo ratings are based just on the two opponents' ratings and the game result. This factors in that White has a slight advantage to start with, so winning with White is not quite as good as winning with Black when calculating a rating.
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
hgm wrote:
Uri Blass wrote:Richard wrote
"Bayesian elo rated one version 40 points higher than the other versions"
I understood him to mean that one version got a rating of 2540 based on only 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except in extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.
The rating model of BayesElo is such that a draw effectively is twice as significant as a win or loss. This is because it uses F(x) = 1/(1+exp(x/400)) as the expected score as a function of rating difference x, and it assumes a draw probability F(x+d) - F(x-d), with some fit parameter d set to reproduce the average draw rate. As d is small, this makes the draw probability essentially a derivative, F'(x) ~ exp(x/400)/(1+exp(x/400))^2 = 1/(1+exp(x/400)) * 1/(1+exp(-x/400)) = F(x) * F(-x).

So the probability for a single draw is the same as the probability of a win (~F(x)) and a loss (~F(-x) = 1-F(x)), i.e. 2 games.

That means if you get 0.5 points out of 2 games, against a weak and a strong opponent, you get a higher rating if you drew the strong opponent than if you drew the weak opponent. This is because the rating is exactly the same as if you had one win out of 3 games, but in the former case 2 of the 3 games would be against the strong opponent, while in the latter case they would be against the weaker opponent. So the effective average opponent strength would be higher in the former case, while you achieved the same result.

So it matters where you score your draws: if the draws are against the stronger opponents, BayesElo will grant you a higher rating than when they are against the weak opponents.
I do not know how BayesElo works, but it seems to me that it is based on wrong assumptions.

If 2 players score the same number of points against the same opponents, then it does not seem logical to me to give a higher rating to one of them.

I see that bayeselo finds the maximum-likelihood ratings based on the following link

http://remi.coulom.free.fr/Bayesian-Elo/

"bayeselo finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the Links section below."

It does not seem logical to me.
If the rating has a 51% probability of being 1500 and a 49% probability of being 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple; of course, in practice a rating is a continuous variable, not a discrete variable).

Uri
So you should _never_ compute a player's rating until after he has retired? Otherwise you are computing the rating after each event? The real world is after each _game_, which is how the Elo system was defined. So of course it makes a difference when you play someone, because the "when" affects their rating at the time of the game...

I don't see why you think this is an issue, or why yet another "unfounded assumption" gets posted, when this is not a new or surprising detail. Remi went to great trouble to make the game-by-game rating calculations as accurate as possible by factoring in the "white bias", whereas normal Elo ratings are based just on the two opponents' ratings and the game result. This factors in that White has a slight advantage to start with, so winning with White is not quite as good as winning with Black when calculating a rating.
Real world is for humans who learn from previous games and not for computer programs with no learning.

The discussion here is about chess programs with no learning, so the best that you can do for them is to calculate the rating based on all of the history and recalculate when the history changes.

Edit: I do not say that what Remi did is bad, but if Remi's rating calculation is meant for humans, then it is better to use a different formula for programs.

It is logical to give a human who loses 2 games and afterwards wins 2 games against the same opponent a higher rating than the opponent, because I assume that the human learned from the losses.

It is not logical to do the same if we have 2 chess programs with no learning; in this case there is no reason to think that one program is better, so there is no reason to give one program a higher rating.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
hgm wrote:
Uri Blass wrote:Richard wrote
"Bayesian elo rated one version 40 points higher than the other versions"
I understood him to mean that one version got a rating of 2540 based on only 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except in extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.
The rating model of BayesElo is such that a draw effectively is twice as significant as a win or loss. This is because it uses F(x) = 1/(1+exp(x/400)) as the expected score as a function of rating difference x, and it assumes a draw probability F(x+d) - F(x-d), with some fit parameter d set to reproduce the average draw rate. As d is small, this makes the draw probability essentially a derivative, F'(x) ~ exp(x/400)/(1+exp(x/400))^2 = 1/(1+exp(x/400)) * 1/(1+exp(-x/400)) = F(x) * F(-x).

So the probability for a single draw is the same as the probability of a win (~F(x)) and a loss (~F(-x) = 1-F(x)), i.e. 2 games.

That means if you get 0.5 points out of 2 games, against a weak and a strong opponent, you get a higher rating if you drew the strong opponent than if you drew the weak opponent. This is because the rating is exactly the same as if you had one win out of 3 games, but in the former case 2 of the 3 games would be against the strong opponent, while in the latter case they would be against the weaker opponent. So the effective average opponent strength would be higher in the former case, while you achieved the same result.

So it matters where you score your draws: if the draws are against the stronger opponents, BayesElo will grant you a higher rating than when they are against the weak opponents.
I do not know how BayesElo works, but it seems to me that it is based on wrong assumptions.

If 2 players score the same number of points against the same opponents, then it does not seem logical to me to give a higher rating to one of them.

I see that bayeselo finds the maximum-likelihood ratings based on the following link

http://remi.coulom.free.fr/Bayesian-Elo/

"bayeselo finds the maximum-likelihood ratings, using a minorization-maximization (MM) algorithm. A description of this algorithm is available in the Links section below."

It does not seem logical to me.
If the rating has a 51% probability of being 1500 and a 49% probability of being 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple; of course, in practice a rating is a continuous variable, not a discrete variable).

Uri
So you should _never_ compute a player's rating until after he has retired? Otherwise you are computing the rating after each event? The real world is after each _game_, which is how the Elo system was defined. So of course it makes a difference when you play someone, because the "when" affects their rating at the time of the game...

I don't see why you think this is an issue, or why yet another "unfounded assumption" gets posted, when this is not a new or surprising detail. Remi went to great trouble to make the game-by-game rating calculations as accurate as possible by factoring in the "white bias", whereas normal Elo ratings are based just on the two opponents' ratings and the game result. This factors in that White has a slight advantage to start with, so winning with White is not quite as good as winning with Black when calculating a rating.
Real world is for humans who learn from previous games and not for computer programs with no learning.
So you are proposing to re-define the meaning of the term "Elo rating", it would seem? Which I would probably agree with, except that we need a new term to avoid confusion with one that has been in existence for many years.

I don't believe the current Elo system applies very well to programs, not just for the reason you gave, but for others as well, because for any position you encounter in computer chess, there is a random element that contributes to the outcome, quite unlike humans.

But here we are, and have been, talking specifically about Elo ratings, which have only one way for us to calculate them.

The discussion here is about chess programs with no learning, so the best that you can do for them is to calculate the rating based on all of the history and recalculate when the history changes.

Edit: I do not say that what Remi did is bad, but if Remi's rating calculation is meant for humans, then it is better to use a different formula for programs.

It is logical to give a human who loses 2 games and afterwards wins 2 games against the same opponent a higher rating than the opponent, because I assume that the human learned from the losses.
Not logical to me. In the last tournament I played in, years ago, a friend of mine got sick and lost the last 3 rounds pretty badly. He came back the next event and blew everyone away (he was the only 2200+ player in the second event, and he easily beat the three opponents he had lost to, among others). So why does that have to be learning? And not simply the result of feeling better? Humans have their quirks.

It is not logical to do the same if we have 2 chess programs with no learning; in this case there is no reason to think that one program is better, so there is no reason to give one program a higher rating.

Uri
However, to do this one would need to re-input _all_ the old games each time a new game is added, so that each game can be weighted equally. Elo weights older games more to give some stability to the rating. Seems impractical.