IPON ratings calculation

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Frank Quisinsky
Posts: 7098
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: IPON ratings calculation

Post by Frank Quisinsky »

Hi Don,

that is the point most people don't understand.
More important than the number of games is how many opponents an engine has.

And this is very easy to check with a database simulation.

In SWCR, with 39 participants, 375 games have been played so far. Have a look at the results now, then have a look at the results after 1,600 games. It is always the same.
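A quick way to see the effect is a small Monte Carlo simulation in the spirit of such a database check (a toy sketch only, not the actual SWCR method; the 30-Elo per-opponent "matchup bias" and every other number below are invented for illustration, and draws are ignored):

Code: Select all

import math, random

def expected(diff):
    """Logistic Elo expectancy for a rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def rating_spread(n_opponents, total_games, bias_sd=30.0, trials=1000):
    """SD (in Elo) of the pooled performance of a nominally equal engine
    when every pairing carries a random stylistic matchup bias."""
    per_opp = total_games // n_opponents
    estimates = []
    for _ in range(trials):
        points = 0.0
        for _ in range(n_opponents):
            p = expected(random.gauss(0.0, bias_sd))  # biased true score vs this opponent
            points += sum(random.random() < p for _ in range(per_opp))
        score = min(max(points / (per_opp * n_opponents), 1e-6), 1.0 - 1e-6)
        estimates.append(400.0 * math.log10(score / (1.0 - score)))
    mean = sum(estimates) / trials
    return math.sqrt(sum((e - mean) ** 2 for e in estimates) / trials)

for k in (4, 8, 16, 32):
    print(k, "opponents:", round(rating_spread(k, 1600), 1), "Elo spread")

In this toy model the spread with 4 opponents comes out noticeably larger than with 32, even though the total number of games is identical - the per-opponent biases only average out as the opponent pool grows.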

Perhaps in 2012 different new Elo calculation programs will become available. That could be very interesting. Then we can create a rating list of Elo calculation programs :-)

But all in all it's right.
List X with AMD hardware (many engines are optimized for Intel), list Y with a mix of SSE and non-SSE builds, or list Z with a mix of AMD and Intel hardware. And for the next one the games are not available :-)

So many different things!
No rating list is perfect, but what is interesting is the combination of all the available results. They should agree within +/-15 Elo.

Have a good new year, Don.
I like the new Komodo and now ... the playing style too!! Much more important than 3020 or 3019 points :-)

Best
Frank
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

IWB wrote:
Houdini wrote: ...
I think IPON is fine as it is, there's little point in bringing down the random error to below 15 points when the systematic error is probably larger than that.
Hi Robert,

I agree that the 50 positions are artificial and there is room for improvement (in 2012), but I doubt that the systematic error is that big.
If I compare my results with the rating lists which use books (which have different problems that might even be bigger), the differences from my list are on average much smaller than 15 Elo ... and yes, for an individual engine my test set might be a bad or a perfect choice. Not on purpose, of course, but the possibility is there, no doubt.

Anyhow, there is no 100% result in rating-list statistics, but I try to come close :-)

Bye
Ingo
Although I have some complaints, I like your test most of all. Something really bothers me about a bunch of people testing programs on all sorts of different hardware under not-so-strict conditions. The result is that any given program will perform best if it is played more on hardware it "likes." Of course, the positive aspect of this is that they can test a huge variety of programs and get very large samples. So it's always a trade-off.
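For scale, the "random error" under discussion shrinks with the square root of the number of games. A back-of-the-envelope sketch (normal approximation near a 50% score; the 35% draw rate is just an assumed typical engine value):

Code: Select all

import math

def elo_margin(games, draw_ratio=0.35):
    """95% error margin (in Elo) of a measured rating near a 50% score."""
    sigma = math.sqrt((1.0 - draw_ratio) / 4.0)  # per-game score SD
    slope = 400.0 / (math.log(10.0) * 0.25)      # Elo per unit of score at 50%
    return 1.96 * slope * sigma / math.sqrt(games)

for n in (400, 1000, 2000, 5000):
    print(n, "games: +/-", round(elo_margin(n), 1), "Elo")

By this estimate it takes roughly 2,000 games to push the 95% margin down to the 12-15 Elo range being discussed, and four times as many games to halve it again.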
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

Houdini wrote:
IWB wrote: ... and yes, for an individual engine my test set might be a bad or a perfect choice. Not on purpose, of course, but the possibility is there, no doubt.
Ingo, that's exactly what I mean: your opening positions could very well create a 10 to 20 Elo difference for some engines.
Also your choice of hardware (Phenom II X6) could easily make a 5 to 10 Elo difference for some engines.
However, ANY choice of hardware might make a 5-10 Elo difference. So this is not a very big consideration. I assume your point of view is that the most popular chip is the Intel i5/i7 and that results based on it are more valid, but a LOT of people are big AMD fans.

I could argue that the time control can make a difference, as can the hardware, the openings used, the adjudication rules, even the tester himself, but there is no real standard that we have to go by. It's probably a good thing that there is a variety of different testers using different testing conditions.

All this means that there's no point in bringing down the random error to 10 Elo or below; it's insignificant when you look at the larger picture.

Robert
I think there is a big point - this is a variable that is directly under his control. Given that he tests at a certain time control on a specific machine, we can at least get a more reliable sample. Presumably, if Ingo chooses to do this at some point, he would also increase the number of openings, which is something you would probably endorse.
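Robert's worry about opening positions can be put into a crude formula: if each position contributes an independent per-engine bias, the resulting systematic error shrinks only with the number of positions, never with the number of games. A sketch (the 50-Elo per-position figure is invented, purely to show the scaling):

Code: Select all

import math

def residual_bias(n_positions, per_position_sd=50.0):
    """Systematic Elo error left after averaging over the opening set."""
    return per_position_sd / math.sqrt(n_positions)

for k in (50, 100, 200, 500):
    print(k, "positions: ~", round(residual_bias(k), 1), "Elo")

That is exactly why enlarging the opening set helps where simply playing more games from the same 50 positions cannot.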
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Don wrote: ...

Although I have some complaints, I like your test most of all. Something really bothers me about a bunch of people testing programs on all sorts of different hardware under not-so-strict conditions. The result is that any given program will perform best if it is played more on hardware it "likes." Of course, the positive aspect of this is that they can test a huge variety of programs and get very large samples. So it's always a trade-off.
First of all, thanks!

"Different people" could be a problem but I do not see any problem with the existing lists regarding that matter. MY biggest concern and main issue to start my list was the use of different hardware with time controls which are adapted (IF they are adapted) by just one benchmark - which means they have to be wrong for nearly every engine except Crafty.
Different people using different books might shift a result - even if I have no doubt that this is not the intension of a single person there and that is true for my selection of opening positions as well - no intension, but possible!

After running my lsit for a while I have to admit that I was suprised how similar my results where with the CEGT 40/20. Basicaly the main difference is, that the IPON is a bit faster with the top engines. Thats it!
(And btw, I compare all lists and regardless of the time control there is no major difference for ANY engine. AS long as the time control is not too short, and all list starting from 40/3 upwards are long enough nowadays an extraordinary increase in playing strength for a particular engine is not visible - I think this "more time and my prefered engine will get better" is a myth (when propper tested)! Of course the quality of the game IS getting better, but that is true for all engines.

Bye
Ingo
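For readers unfamiliar with the practice Ingo is criticizing: some lists equalize different machines by scaling the time control with a single engine benchmark, traditionally Crafty. A sketch of the idea (all speeds below are invented for illustration):

Code: Select all

CRAFTY_NPS_REFERENCE = 1_000_000  # Crafty speed on the list's reference machine
CRAFTY_NPS_THIS_BOX = 1_500_000   # Crafty speed on the tester's machine

def scaled_minutes(reference_minutes):
    """Scale the time control so Crafty gets the same node budget."""
    return reference_minutes * CRAFTY_NPS_REFERENCE / CRAFTY_NPS_THIS_BOX

# An engine that happens to run 2.0x faster on this box (not Crafty's 1.5x)
# effectively receives 2.0 / 1.5 = 1.33x its intended thinking time.
print(scaled_minutes(40.0))  # 40 minutes on the reference -> ~26.7 here

Any engine whose speed ratio between the two machines differs from Crafty's is, as Ingo says, effectively playing at a different time control than intended.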
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

IWB wrote:
Don wrote: ...

Although I have some complaints, I like your test most of all. Something really bothers me about a bunch of people testing programs on all sorts of different hardware under not-so-strict conditions. The result is that any given program will perform best if it is played more on hardware it "likes." Of course, the positive aspect of this is that they can test a huge variety of programs and get very large samples. So it's always a trade-off.
First of all, thanks!

"Different people" could be a problem but I do not see any problem with the existing lists regarding that matter. MY biggest concern and main issue to start my list was the use of different hardware with time controls which are adapted (IF they are adapted) by just one benchmark - which means they have to be wrong for nearly every engine except Crafty.
Different people using different books might shift a result - even if I have no doubt that this is not the intension of a single person there and that is true for my selection of opening positions as well - no intension, but possible!

After running my lsit for a while I have to admit that I was suprised how similar my results where with the CEGT 40/20. Basicaly the main difference is, that the IPON is a bit faster with the top engines. Thats it!
(And btw, I compare all lists and regardless of the time control there is no major difference for ANY engine. AS long as the time control is not too short, and all list starting from 40/3 upwards are long enough nowadays an extraordinary increase in playing strength for a particular engine is not visible - I think this "more time and my prefered engine will get better" is a myth (when propper tested)! Of course the quality of the game IS getting better, but that is true for all engines.

Bye
Ingo
People get "anal" about all sorts of things, and it's true that various factors can make a +/- 5-10 Elo difference, but in practice there is not much we can do about that, and most of these things make only a small difference in the big picture. It's a matter of semantics which "reference" point you consider to be the normal case. Does certain hardware make you look bad, or is it the "other" hardware that makes you look good? Semantics.

1. Time control - it seems not to be a major issue as long as it's "long enough." There is a very clear difference between programs once you go below 2 or 3 minutes on modern hardware, but that tapers off. I think going up to really long time controls makes a difference of 5-10 Elo or more as well, but that is very difficult to prove and the curve is very gentle.

2. Book - I think that is a bigger issue. We test with a huge opening book with thousands of openings culled from master play, only to ply 10, i.e. 5 moves per side. We want Komodo to play most of the game itself against other engines. However, I have seen good opening books seemingly play most of the game FOR the engine. I don't really know how deep the books the testers use are - but since I am an engineer and the idea is to make a strong overall program, I want the book to get out of my way.

3. Ponder - hotly debated, but I don't think it's as important as most of the other things on this list. If you have the resources, it's better to have ponder on than off. Pondering is going to help the stronger program more than the weaker one, so it might stretch the spread from low to high a little bit.

4. Hardware - of course each program responds differently to different hardware. We must also not forget that how a program is compiled is a very similar issue. I think most compilers have settings to make different trade-offs to optimize for specific hardware - the classic example being the Intel compiler, specifically designed to make AMD look bad. I don't know whether that is still an issue or not.

When we speak of hardware and compiles, SSE4 (or more specifically ABM, which stands for Advanced Bit Manipulation) makes a huge difference in some programs, such as Komodo, and less in others. So if you do not have SSE4 hardware, Komodo is crippled - or, if you prefer, Komodo has an advantage if you do.


It's amazing that the lists mostly agree within a few ELO given all these factors and probably many more.
Frank Quisinsky
Posts: 7098
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: IPON ratings calculation

Post by Frank Quisinsky »

Don wrote: It's amazing that the lists mostly agree within a few ELO given all these factors and probably many more.
+1

I have read such comments ever since the Fruit sources became available. Fabien gave us the information .... programming can be much easier.

Understanding ratings can be easy too, and I think we make the topic too complicated.

And the proof is all the available rating lists showing more or less the same results under different conditions. 50 Elo more or less - nobody can see it. We are slaves of our computers, and often the computers play a wicked game with us.

That is the situation at the start of the computer chess year 2012.
Frank Quisinsky
Posts: 7098
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: TalkChess can be fantastic if ...

Post by Frank Quisinsky »

it is closed to non-chess players!

If non-chess players read TalkChess they would think ...

What on earth kind of problems do all these people have?

Elo?
3020?
What does Komodo have to do with chess, and what does 2980 mean if Houdini has 3020? Houdini? Elo? I know EURO ...

Honestly, this forum should be closed to our wives, or we will be forever alone with all our problems.
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: IPON ratings calculation

Post by lkaufman »

pohl4711 wrote: Hi Larry,

in my NEBB-Rankinglists Komodo 4 is 42 Elo better than Komodo 3 in Blitz (4'+2'')...

http://talkchess.com/forum/viewtopic.ph ... 53&t=41677

Greetings - Stefan
Only 30 below Houdini 2 - and this without SSE (which is worth much more to Komodo than to others) - is a very good result for blitz. Once Don gets finished with MP and we can focus on improvements, I'm confident that we will close the gap.

Larry
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: IPON ratings calculation

Post by lkaufman »

Sven Schüle wrote:
lkaufman wrote:
Sven Schüle wrote:
lkaufman wrote:
Michel wrote:
Albert Silver wrote:I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?
The calculation method of BayesElo is explained here:

http://remi.coulom.free.fr/Bayesian-Elo/#theory

The Elos are the result of a maximum likelihood calculation seeded with a prior (AFAICS this can only be theoretically justified in a Bayesian setting).

The actual algorithm is derived from this paper

http://www.stat.psu.edu/~dhunter/papers/bt.pdf
I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.
Part of the problem is that "match performance" is an almost irrelevant number, and also that you can't take the arithmetic average of it due to non-linearity of the percentage expectancy curve. See also the other thread where this has been discussed (link was provided above).

Sven
I know all about the averaging problem, but if an engine had a 3000 performance against every other engine, it implies that it would neither lose nor gain rating points if the games were all rated at once starting from a 3000 rating. So that should be the performance rating if thousands of games have been played, so that the prior has no significance. I believe that EloStat would give a 3000 rating in this case. EloStat is wrong when the performance ratings differ, but if they are the same it should be right, I think.
The basic error is to look at "match performance" numbers at all, as if they would make any sense. A "match performance", in the world of chess engine ratings, is inherently misleading and has no value in my opinion. Total ratings of engines are derived from a whole pool of game results and have zero relation to "match performances", the latter are at best some by-product of the overall ratings that have already been calculated at that point, and you can't draw any conclusions from these numbers.

So you can at most blame the way a match performance is calculated, and the fact that it is published at all.

Human player ratings are a totally different animal, for the very reason that the rating principle is completely different. Here you have current ratings for each player, then the next tournament event appears and affects the current ratings of its participants, so ratings evolve over time, and you have an incremental rating process where the most recent events have highest weight while the oldest events are fading out. Calculating match performance makes some sense here. Engine rating is done at once for the whole pool of games, though, so a "match performance" in this case can only be derived similar to a rating of a human player from a set of games against unrated opponents.


Regarding your remarks about EloStat and "prior", either I have misunderstood you or there is some inconsistency in your statement. The program that uses a "prior" is BayesElo, not EloStat. And AFAIK the final IPON ratings that are published were calculated with BayesElo. But nevertheless I believe that the "prior" has little to no impact on the final ratings when considering the number of games involved here.

Sven
I know that the "prior" is only in BayesElo. I did make a mistake in my statement about EloStat, but it has nothing to do with the prior. I believe EloStat would also get this situation (all 3000 performances) wrong, but for a different reason than BayesElo.

I have not studied the BayesElo literature, but some simple tests will show whether there is a problem or not. Suppose program X scores 760-240 against a 2800 engine (performance about 3000), and 909-91 against a 2600 engine (performance about 3000). Assume both the 2600 and 2800 ratings are based on many thousands of games vs. other engines, and these are the only results for X. What will BayesElo give for X? If it is not very close to 3000, something is wrong. I can't run it myself now as I'm in a chess tournament.
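Larry's experiment can be sketched without the full BayesElo program. Below is a minimal Bradley-Terry maximum-likelihood fit using the MM iteration from the Hunter paper linked above, with the two opponents' ratings held fixed and draws ignored; note that the real BayesElo model also fits draw and white-advantage parameters, so this is only a rough stand-in:

Code: Select all

import math

def gamma(elo):
    """Bradley-Terry strength corresponding to an Elo rating."""
    return 10.0 ** (elo / 400.0)

# Larry's hypothetical: X scores 760/1000 vs a fixed 2800 engine and
# 909/1000 vs a fixed 2600 engine (both performances about 3000).
opponents = [(gamma(2800), 1000, 760.0),  # (strength, games, points)
             (gamma(2600), 1000, 909.0)]

def fit_x(opps, prior_draws=0.0, prior_elo=2800.0):
    """ML rating of X vs fixed opponents; optional virtual prior draws."""
    opps = list(opps)
    if prior_draws > 0.0:
        # crude stand-in for a Bayesian prior: a few virtual drawn games
        opps.append((gamma(prior_elo), prior_draws, prior_draws / 2.0))
    g = 1.0
    for _ in range(2000):  # fixed point of g = points / sum(n / (g + g_j))
        total_points = sum(pts for _, _, pts in opps)
        g = total_points / sum(n / (g + g_j) for g_j, n, _ in opps)
    return 400.0 * math.log10(g)

print(round(fit_x(opponents), 1))                   # about 3000, as expected
print(round(fit_x(opponents, prior_draws=2.0), 1))  # shifts well under 1 Elo

In this simplified model X does land at about 3000, and a couple of virtual prior draws barely move it - which, for what it is worth, is consistent with Sven's view that with thousands of games the prior alone should not explain a 25-point gap.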
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Don wrote: ...

1. Time control - it seems not to be a major issue as long as it's "long enough." There is a very clear difference between programs once you go below 2 or 3 minutes on modern hardware, but that tapers off. I think going up to really long time controls makes a difference of 5-10 Elo or more as well, but that is very difficult to prove and the curve is very gentle.
....
What can I say - we basically agree. Especially the "long enough" point is something I support fully. Even if I set the boundary lower than you do on modern hardware, the important point is that there is a lower boundary, and beyond it nothing serious happens anymore.

Taking this to its logical conclusion, the downside for many is that testing at long time controls is a waste of time and money if you just want a rating/ranking. It only makes sense if someone is looking at the games (but then I have doubts that humans can understand modern computer chess; they need the help of computers to get an idea ...)

Bye
Ingo