IPON ratings calculation

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Not realistic!

Post by bob »

Don wrote:
bob wrote: I subscribe to the philosophy of "test like you plan on running". If you are testing yourself and only want to measure playing skill improvements, then ponder=off is perfectly OK. Might not give you the same final Elo number as with ponder=on, but if something helps with PON, it should help with POFF, unless you are changing the basic pondering code (or time allocation is different with PON and POFF).

I'm not going to twiddle and tune my wife's car and then take my son's Mustang to the drag strip on Saturday night. Nor would I "practice" with a nitrous system turned off, then race with it on.
I agree with you in theory but not in practice. In real tournaments a human operates the machine, but I'm sure you don't test your program by manually entering the moves, do you? Of course not, because it's not workable in practice. In theory that is how you play, but in practice you would never get 100,000 games that way.

So I'm afraid that you have to pick and choose which concessions you make for the sake of practicality. We try to pick them in order of how much we believe they are relevant.

Here is a list of concessions that most of us make - there are probably a few exceptions, such as in your case where you have a major hardware testing infrastructure, but you probably make some of the same concessions too:

1. Time control A
2. Time control B
3. Ponder vs No ponder
4. Book
5. Hardware
6. Opponents

One at a time:

1. There are 2 issues with time control. The first is playing with the same style of time control, with the same ratio of base time to increment (if increments are used) or of moves to time. For example, 40/2 classic should scale to 20/2 if you want to speed up the test. If you want to play at 5 minutes + 5 seconds then you should test at 1 minute + 1 second, preserving the same ratio (see the sketch after this list).

2. The other time control issue is actually playing the exact time control of the tournament you are playing in. If you want Crafty to play well at 40/2 then do you test only at 40/2 ???

3. Ponder vs No ponder. I test with a 6-core i7-980x, and it's not much - it's a huge bottleneck for us. Larry has a bit more than I do, but it's still a huge bottleneck. We test with ponder off. If we tested with ponder ON we would have to either cut our sample sizes in half or double our testing time to get the same number of games. We cannot afford to do this just to be anal retentive about this issue.

See: http://en.wikipedia.org/wiki/Anal_retentiveness

4. Book. Does Crafty use the same book that it will compete with? You would have to in order to follow your principle of testing the same as you will play.

5. Hardware. Does Crafty use the same exact hardware and configuration for your big 20,000-game samples that you intend to compete with? I doubt it.

6. Opponents. When you test Crafty I'm sure you don't test against the same players and versions you will compete with in tournaments. This is not possible anyway since you don't know who will be there and what they will bring and what hardware they will use.
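
As a rough illustration of the ratio-preserving scaling mentioned in point 1 - a sketch with made-up numbers, not anything from Don's or Bob's actual testing setups - dividing base time and increment by the same factor keeps the base-to-increment ratio intact:

Code: Select all

#include <stdio.h>

/* hypothetical helper: scale a base+increment control down for fast testing */
struct time_control { double base_s; double inc_s; };

static struct time_control scale_tc(struct time_control target, double factor) {
    struct time_control fast = { target.base_s / factor, target.inc_s / factor };
    return fast;
}

int main(void) {
    struct time_control target = { 300.0, 5.0 };        /* 5 minutes + 5 seconds */
    struct time_control fast   = scale_tc(target, 5.0); /* 1 minute  + 1 second  */
    printf("test at %.0fs + %.0fs (base/inc ratio %.0f, same as the target's %.0f)\n",
           fast.base_s, fast.inc_s,
           fast.base_s / fast.inc_s, target.base_s / target.inc_s);
    return 0;
}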

As you see, it is not even CLOSE to possible to "test like you plan on running." I don't mean to be critical about this but I don't understand why people latch on to what is probably the LEAST important factor in the list above and make it seem like a major blunder, as if there is absolutely no correlation between how a program will do with ponder vs not pondering - when all major testing is done with an opening book that does not resemble in any way, shape, or form what a program will use in a serious competition. Which do you think is the greater issue?

Larry and I decided long ago that testing with ponder on, although better in some idealist sense, is a trade-off in the wrong direction, because sample size means so much more.

So if the principle is "test like you plan on running", how would you justify most of these concessions? Do you think ponder is more important than time control, or using your tournament book, or running the exact same hardware?

When it comes to things like this, a good engineer knows the difference between the lower-order bits and the higher-order bits. The truth of the matter is that hardly anyone has the luxury of "testing like we plan on running", but we have a very good sense of what the tradeoffs are. We know that if the program improves, the improvement will probably still show up at a different time control if it's not ridiculously different.

Let me ask you this: if we make an improvement to the evaluation function which shows a definite improvement with no ponder, do you think the result is invalid because we did not test with ponder on? I don't think so .....

You might find this interesting: I'm a lot more anal retentive about stuff like this than Larry is, but compared to you I'm not disciplined at all!
I have played Crafty in a few human events over the past 5 years. I DO enter the moves by hand, and play in "console mode".

As for your last question, you are missing the point. The rating lists produce RATINGS. It is not about whether version A is better than version B. It is about the RATING of each program... If you change your eval, testing either way, done consistently, should tell you reliably whether the change was good or not. But it might not reliably tell you the Elo of your program so that it can be compared to others.

And there, there IS a difference between ponder=on and ponder=off. Simple example: broken pondering, so you always ponder the wrong move. In a ponder=off match, you get rating X. In a ponder=on match, you get a rating of X-70 or something similar, because your opponent always predicts you correctly, saves time, and searches deeper, while you never get a ponder hit and never save time. Time == Elo. I even allocate time differently with ponder=on vs off, because I know I will save time by pondering, and I want to use some of that time "before" it is saved, during an important part of the game (early middlegame).
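
A minimal sketch of the kind of ponder-aware time allocation described above - the formula and the 50% predicted-hit rate are assumptions for illustration, not Crafty's actual code:

Code: Select all

#include <stdio.h>

#define EXPECTED_PONDER_HIT 0.5   /* assumed fraction of correct predictions */

/* budget for the next move; with ponder on, spend part of the expected
   savings up front, since a ponder hit gives back the opponent's think time */
static double target_time(double remaining_s, int moves_to_go, int ponder_on) {
    double per_move = remaining_s / moves_to_go;
    if (ponder_on)
        per_move *= 1.0 + 0.5 * EXPECTED_PONDER_HIT;
    return per_move;
}

int main(void) {
    printf("ponder off: %.1f s/move\n", target_time(600.0, 40, 0));
    printf("ponder on : %.1f s/move\n", target_time(600.0, 40, 1));
    return 0;
}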
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Not realistic!

Post by Adam Hair »

IWB wrote:
Adam Hair wrote: ...
If your total focus is testing the top engines and you are fortunate enough to have multiple computers available for testing, then why not use ponder on? It is easy enough when you only have to stay on top of a couple of dozen engines. If you can afford to do it, then do it.
Yes, I do - for the reasons named above!
I would too!
IWB wrote:
Adam Hair wrote: However, when you try to maintain multiple lists containing 200 to 300 engines (and adding more all of the time), ponder off makes a lot of sense.
Here we go, you say it yourself, and I repeat it in the words of my initial posting: the only people who are doing this are a few rating lists! Do you see what I wanted to say?
Adam Hair wrote: In addition, when you compare the results of ponder off testing with ponder on testing, it is hard to discern much difference. Given the differences in focus between IPON and CEGT/CCRL and the lack of truly demonstrative proof that ponder off is less accurate in practice than ponder on ....
That is a different discussion. I still believe that there are differences in playing style AND in rating, though I have to admit that the difference might be very small and hard to prove. I once had a good example (Shredder 12/Naum 4) but no one cares for Naum 4 anymore ... and again, that has nothing to do with the relevance of "ponder off" for chess!
(And regarding the difference between POFF and PON: I know of 3 engines which use completely different timing with ponder on, as they simply assume a ponder hit and therefore allocate more time. I am 100% sure that more engines are doing this. There IS a difference between PON and POFF!)
I'm not claiming there is no difference between PON and POFF. For example, it appears Spark does worse, relative to the other top engines, with PON. Though I don't know if that is really due to PON or if it is related to the 50 positions used in your testing. For the most part, I have not seen many differences between PON and POFF in practice. What I really should do, instead of trying to compare different rating lists, is test PON and POFF for myself.

Relevance? As you stated, the top engines are used for basically 2 things:
analysis, and to play against (OTB and server). In the case of determining the best engines to use for analysis, PON does not matter. In the case of OTB, very few people in the world can do anything against the engines you are testing, so whether an engine is 30 Elo better or worse with PON is irrelevant. In computer chess competitions, the limited number of games and the use of prepared books limit the relevance of any list in predicting the results, regardless of whether it uses PON or POFF. Certainly PON is relevant to games played on a server. But how relevant is a rating list for that sort of thing? The engine (and its book) will have its rating ultimately determined from the results of its play on the server.

Frank hit upon the best reason for PON: better games. Not because the ratings produced from PON are more significant in general, but because the games produced are better (though perhaps a longer time control would produce the same quality).
IWB wrote:
Adam Hair wrote: , I find the statement "I consider Ponder off as completly artifical and useless, sorry." to be off the mark.
Actually, you might not like it, and obviously it offends you (I hope only a little, and I apologize again), but that is what I think (reasoning above) about POFF testing/lists!

Regards
Ingo
I'm not offended :) . It takes more than that to offend me.

The reason why I responded is that various people say that POFF is useless. They use the fact that humans always play with PON, and argue that chess engines should therefore be tested in conditions comparable to how humans play chess. For me, I thought that was why people conducted matches with chess engines on servers. Rating lists are better suited for narrower goals, such as determining the relative strengths of various engines and tracking improvements in strength of various versions.
IWB wrote: PS: I write this with all respect for the programmers, and you have to believe me that I know how much work it is to make a list (more than outsiders think). But your statement about "hundreds of engines" is a bit exaggerated, or better, "showy" ;-). Yes, there are hundreds of engines, but who is really interested in them? I once had a mistake in my list for an engine ranked between 15 - 20 (individual list, so best of its kind). The error was there for months and nobody found it! The vast majority of people care about the Top 10, a few about the top 20, and then it ends. With the engines below that you have the programmer and maybe a handful of people worldwide being interested! Again, I am full of respect for the work it takes to make such a list, and I have even more respect for the programmers making these engines (as I can't do it), but I see the relevance as well!
Another disadvantage of testing hundreds of engines is that if I look at the "huge" rating lists I see more holes than testing. I know that not many people look closely at the conditions or at the way a list is done, but those holes are a good part of why I started my list! I personally think that the POFF list should be more focused. And again, no offence meant! If you want to discuss this in detail, send me a PM.
The statement about "200 to 300 engines" has nothing to do with being "showy", but was meant to emphasize the different intent. The larger lists are for the programmers as much as for the end users. We have different points of view. I know that many people only care about the top engines. Many of those same people could not give a rat's ass for most of the programmers or the work they have put into this hobby. Many do not care that their favorite author most likely made use of the pool of ideas provided by other authors, and that they received encouragement and support from those other authors when they started out. Those people only care about the end result. I am not lumping you into that group of people. But it appears that those are the people whose opinions you care about.

Regardless, let me state again something that I have said before. IPON is the best list for determining the relative strengths of the best engines.

Adam
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: IPON ratings calculation

Post by Adam Hair »

Don wrote:
IWB wrote:
Houdini wrote: ...
I think IPON is fine as it is, there's little point in bringing down the random error to below 15 points when the systematic error is probably larger than that.
Hi Robert,

I agree that the 50 positions are artificial and there is room for improvement (in 2012), but I doubt that the systematic error is that big.
If I compare my results with the rating lists which are using books (which have different problems that might even be bigger), the differences to my list are on average much smaller than 15 Elo ... and yes, for an individual engine my test set might be a bad or a perfect choice. Not on purpose, of course, but the possibility is there, no doubt.

Anyhow, there is no 100% result in rating list statistics, but I try to come close :-)

BYe
Ingo
Although I have some complaints, I like your test most of all. Something really bothers me about a bunch of people testing programs on all sorts of different hardware under not-so-strict conditions. The result is that any given program will perform best if it's played more on hardware it "likes." Of course the positive aspect of this is that they can test a huge variety of programs and get very large samples. So it's always a trade-off.
Using the same hardware improves precision but can introduce bias. If a particular program does not do well on the unified hardware, then the results, while perhaps being more reproducible, could be less accurate (in terms of how well that program will do on a randomly chosen computer).

Meanwhile, using the results from games played on different computers gives less precision, but perhaps gives a better estimate of how well a program will do on a randomly chosen computer.

As you say, there are tradeoffs.
lucasart
Posts: 3242
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: IPON ratings calculation

Post by lucasart »

Albert Silver wrote: I can only assume it is I who lack the proper understanding of how the ratings are calculated
Clearly!
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

Adam Hair wrote: Using the same hardware improves precision but can introduce bias. If a particular program does not do well on the unified hardware, then the results, while perhaps being more reproducible, could be less accurate (in terms of how well that program will do on a randomly chosen computer).

Meanwhile, using the results from games played on different computers gives less precision, but perhaps gives a better estimate of how well a program will do on a randomly chosen computer.

As you say, there are tradeoffs.
We have experience testing on different hardware, and it's almost useless unless you standardize the test. What happens is that program X plays better on one hardware at a given time control and program Y plays better on another. At fast time controls the effect is ridiculous. Critter is an absolute monster at fast time controls, and you can double the level and it will drop 20 Elo. This means you cannot compare results at different time controls, or even on different hardware normalized, without accepting a significant amount of error.

The solution if you want to combine results is to "fix" the testing conditions and the exact number of games and really everything about the test - then you can use it for comparison purposes. If you don't do that, the results are highly dependent on who decides to run a given test.

The way it is structured now with some of the rating agencies is that people run whatever matches they want. They might play some gauntlets, some round robins, or some head-to-head matches (which is technically a gauntlet too), and they each use any book they choose and any set of programs they choose.

One good thing is that they adjust the level to the hardware. I'm not saying the results are worthless - they are not. But what I am saying is that there is a kind of random bias that adds significantly to the error.

Elo ratings are not particularly well behaved when you set up matches differently from one test to another, either. If you play program X vs Y in a long match you will get a different result than if you play each program against a wide range of opponents with different strengths. The rating of the stronger player will be inflated. This is semantics, however; you could also take the point of view that if you play a wide variety of opponents the ratings will be deflated.
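
A hedged illustration of that pool effect, using the standard logistic Elo model with made-up ratings: the same 3000-strength player measured head-to-head against one close opponent comes out around 3000, while averaging its expected scores over a mixed 2700-2950 pool maps back to a lower performance rating, because the score-to-Elo conversion is nonlinear.

Code: Select all

#include <math.h>
#include <stdio.h>

/* expected score of a player rated r against an opponent rated opp */
static double expected_score(double r, double opp) {
    return 1.0 / (1.0 + pow(10.0, (opp - r) / 400.0));
}

/* performance rating from average opponent rating and score fraction */
static double perf_rating(double avg_opp, double score) {
    return avg_opp - 400.0 * log10(1.0 / score - 1.0);
}

int main(void) {
    double r = 3000.0;
    double s1 = expected_score(r, 2950.0);            /* long head-to-head match */
    double opps[] = { 2700.0, 2800.0, 2900.0, 2950.0 };
    double s2 = 0.0, avg = 0.0;
    for (int i = 0; i < 4; i++) { s2 += expected_score(r, opps[i]); avg += opps[i]; }
    s2 /= 4.0;
    avg /= 4.0;
    printf("head-to-head performance: %.0f\n", perf_rating(2950.0, s1)); /* ~3000 */
    printf("mixed-pool performance  : %.0f\n", perf_rating(avg, s2));    /* ~2989 */
    return 0;
}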

This is also the reason computer chess rating pools tend to be spread out compared to human rating pools. Most testers will focus on matches between very closely matched opponents, and often between just the top few.

Anyway, that is why I like Ingo's test - despite any perceived flaws (such as the limited number of openings and too small a sample), the results are more consistent and his test is constructed in a more scientific way.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Don wrote: ...
The solution if you want to combine results is to "fix" the testing conditions and the exact number of games and really everything about the test - then you can use it for comparison purposes. If you don't do that, the results are highly dependent on who decides to run a given test.
...
That is exactly how the IPON was formed. I do beta testing, and an engine test has to be reproducible! At a certain point I realized that this makes a proper ranking as well if I play other engine combinations. The IPON can be and still is used for beta testing - other lists can't do that with their conditions.
Don wrote: Anyway, that is why I like Ingo's test - despite any perceived flaws (such as the limited number of openings and too small sample) the results are more consistent and his test is constructed in a more scientific way.
Thanks, I like to hear that and maybe the problems will ease over time ...

Happy new year to everyone btw.
Ingo
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: IPON ratings calculation

Post by Uri Blass »

What I dislike in all the tests is that I usually see no matches between version X and version X+1 of the same program, and the result is that comparisons between programs suffer.

When people compare Houdini 2.0 and Houdini 1.5, I see no matches between Houdini 2 and Houdini 1.5.

Many correspondence players use Houdini 1.5 for their correspondence games, and they can get no idea from the rating list whether using Houdini 2 is better (note that they are not interested in the performance of Houdini 2 against weak programs, but in the performance of Houdini 2 against Houdini 1.5).

It seems that testers assume that the performance of X+1 against X is misleading because programmers tuned X+1 to be better against X. But I think that is not easy to do, and if there is a tuning problem, then maybe it is rather that programmers tuned their program for fast time controls and the advantage at fast time controls disappears at longer time controls, as seems to happen with Komodo 4, which seems not to be better than Komodo 3 at CCRL 40/40.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

Uri Blass wrote: What I dislike in all the tests is that I usually see no matches between version X and version X+1 of the same program, and the result is that comparisons between programs suffer.

When people compare Houdini 2.0 and Houdini 1.5, I see no matches between Houdini 2 and Houdini 1.5.

Many correspondence players use Houdini 1.5 for their correspondence games, and they can get no idea from the rating list whether using Houdini 2 is better (note that they are not interested in the performance of Houdini 2 against weak programs, but in the performance of Houdini 2 against Houdini 1.5).

It seems that testers assume that the performance of X+1 against X is misleading because programmers tuned X+1 to be better against X. But I think that is not easy to do, and if there is a tuning problem, then maybe it is rather that programmers tuned their program for fast time controls and the advantage at fast time controls disappears at longer time controls, as seems to happen with Komodo 4, which seems not to be better than Komodo 3 at CCRL 40/40.
IPON has the full list too, where you can compare:

Code: Select all

Full list:

     Name                      Elo    +    -   games score oppo.  draws

   1 Houdini 2.0 STD          3016   13   13  2900   79%  2784   25% 
   2 Houdini 1.5a             3009   11   11  4000   79%  2777   26% 
   3 Critter 1.4 SSE42        2977   13   13  2400   77%  2769   32% 
   4 Komodo 4 SSE42           2975   13   13  2500   76%  2777   30% 
   5 Komodo64 3 SSE42         2965   12   12  2800   74%  2780   31% 
   6 Deep Rybka 4.1 SSE42     2956   10   10  3700   72%  2794   37% 
   7 Deep Rybka 4             2954    9    9  4900   74%  2772   33% 
   8 Critter 1.2              2952   11   11  3100   72%  2788   37% 

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: IPON ratings calculation

Post by Sven »

Don wrote:
Uri Blass wrote: What I dislike in all the tests is that I usually see no matches between version X and version X+1 of the same program, and the result is that comparisons between programs suffer.

When people compare Houdini 2.0 and Houdini 1.5, I see no matches between Houdini 2 and Houdini 1.5.

Many correspondence players use Houdini 1.5 for their correspondence games, and they can get no idea from the rating list whether using Houdini 2 is better (note that they are not interested in the performance of Houdini 2 against weak programs, but in the performance of Houdini 2 against Houdini 1.5).

It seems that testers assume that the performance of X+1 against X is misleading because programmers tuned X+1 to be better against X. But I think that is not easy to do, and if there is a tuning problem, then maybe it is rather that programmers tuned their program for fast time controls and the advantage at fast time controls disappears at longer time controls, as seems to happen with Komodo 4, which seems not to be better than Komodo 3 at CCRL 40/40.
IPON has the full list too, where you can compare:

Code: Select all

Full list:

     Name                      Elo    +    -   games score oppo.  draws

   1 Houdini 2.0 STD          3016   13   13  2900   79%  2784   25% 
   2 Houdini 1.5a             3009   11   11  4000   79%  2777   26% 
   3 Critter 1.4 SSE42        2977   13   13  2400   77%  2769   32% 
   4 Komodo 4 SSE42           2975   13   13  2500   76%  2777   30% 
   5 Komodo64 3 SSE42         2965   12   12  2800   74%  2780   31% 
   6 Deep Rybka 4.1 SSE42     2956   10   10  3700   72%  2794   37% 
   7 Deep Rybka 4             2954    9    9  4900   74%  2772   33% 
   8 Critter 1.2              2952   11   11  3100   72%  2788   37% 

Correct, but Uri meant that there are no games between different Houdini versions contributing to that rating list, which I can confirm after checking the PGN file that can be downloaded.

Sven
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Sven Schüle wrote: ....
Correct, but Uri meant that there are no games between different Houdini versions contributing to that rating list, which I can confirm after checking the PGN file that can be downloaded.
...
I don't do this, as I consider it useless. Usually a newer version will play a lot of draws vs the predecessor, and the games it wins go straight at the weak points of the older version (as the programmer 'hopefully' worked on those). To prove a difference you need a lot of games. NO list plays enough games between individual engines; the IPON plays just 100, which is by far not enough to have any statistical relevance (and 100 is already a lot compared to other lists). If I played a higher number I could not include the games in the list, as it would distort the overall rating ...
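
A back-of-the-envelope sketch of that error bar - the 50% score and 35% draw rate are assumed numbers, not IPON data - showing that 100 head-to-head games leave roughly a +/-55 Elo 95% interval, far too wide to separate two close versions:

Code: Select all

#include <math.h>
#include <stdio.h>

int main(void) {
    int games = 100;
    double score = 0.50, draws = 0.35;               /* assumed match outcome   */
    double wins = score - draws / 2.0;
    double losses = 1.0 - wins - draws;
    /* per-game variance of the score around its mean */
    double var = wins   * pow(1.0 - score, 2)
               + losses * pow(0.0 - score, 2)
               + draws  * pow(0.5 - score, 2);
    double se = sqrt(var / games);                   /* std. error of the score */
    double lo = score - 1.96 * se, hi = score + 1.96 * se;
    /* translate the 95% score interval into Elo via the logistic model */
    printf("95%% Elo interval after %d games: %+.0f to %+.0f\n", games,
           -400.0 * log10(1.0 / lo - 1.0), -400.0 * log10(1.0 / hi - 1.0));
    return 0;
}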

So, I can understand Uri's request, and I think it is a valid argument for the given scenario, but I fear it is something a rating list, with its limited number of games per pairing, can't help with.

Bye and happy new year
Ingo