IPON will change ...

Matthias Gemuh · Post by **Matthias Gemuh** » Sat Jan 28, 2012 10:26 am

Graham Banks wrote:
Matthias Gemuh wrote:Fire, Ivanhoe, RobboLito are not in the list ?

Why then are Strelka, Houdini, Rybka, Protector, etc. in it ?
Protector is not a controversial engine as far as I'm aware.

Protector is not a clone. I recall some Fruit/Toga-related talk, though.

IWB · Post by **IWB** » Sat Jan 28, 2012 10:27 am

Moin moin,

Matthias Gemuh wrote: ...
you are being misled by your own rules.
The only thing that determines how useful a technically correct rating list is, is the extent to which it covers engines in common use.
If commonly used engines are missing on a rating list, the list has holes.
Rules cannot fill the holes.

I disagree with that conclusion!

If I were to make a list of engines "in common use", I would have to throw engines out, not put more in!

And again, if that concept is so misleading, something better will replace the older lists. I have no problem with that, as that is usually the way it goes (and the IPON will for sure go sooner or later).

Bye
Ingo

IWB · Post by **IWB** » Sat Jan 28, 2012 10:31 am

I will not start a new thread for this.

You can find the "bigger" IPON now on my web site, including a few explanations and some data.

http://www.inwoba.de

For those who don't want to go there, this is the essence of it:

   1 Houdini 2.0 STD           3017   11   11  4000     +2
   2 Critter 1.4a SSE42        2975   11   11  3500     =
     Komodo 4 SSE42            2975   11   11  3600     +3
   4 Deep Rybka 4.1 SSE42      2952    9    9  4800     -2
   5 Stockfish 2.2.2 JA SSE42  2950   11   11  3400     -2
   6 Chiron 1.1a               2830   10   10  3700     -3
   7 Naum 4.2                  2826    7    7  7900     =
   8 Fritz 13 32b              2819   11   11  2800     (not tested)
   9 Deep Shredder 12          2800    7    7  9000     =
  10 Gull 1.2                  2792    9    8  4900     -3
  11 Deep Sjeng c't 2010 32b   2789    8    8  5900     +1
  12 Spike 1.4 32b             2786    9    9  5000     +1
  13 Protector 1.4.0           2761    9    9  5100     +2
  14 spark-1.0 SSE42           2759    8    8  5600     +4 (!)
  15 Hannibal 1.1              2755    9    9  4400     -3
  16 HIARCS 13.2 MP 32b        2749    9    9  5400     +1
  17 Deep Junior 12.5          2731    9    9  4700     -1
  18 Zappa Mexico II           2717    6    7 10300     +1
  19 Deep Onno 1-2-70          2684    8    8  6900     =
  20 Strelka 2.0 B             2668    9    9  5000     -3

At the end of each line you see the change in Elo.

Bye
Ingo

ernest · Post by **ernest** » Sat Jan 28, 2012 5:47 pm

IWB wrote:Hello Vincent
Vinvin wrote:Thanks, Ingo ! But why not Strelka 5.1 instead of Strelka 2.0 B ?

... and Strelka 5.1 does not work in Ponder mode in Fritz GUI or Shredder GUI

Don · Post by **Don** » Sun Jan 29, 2012 5:42 am

IWB wrote:... and as a preparation for that I played a tourney with the top 19 engines (Top 20 - F13) with a set of 25 new openings.

That is a new set of 8550 games.

The result of that individual tourney might interest a few:

So the IPON will grow; the 'secret' is that it will be played with 75 openings and not with 50 (old 50 + 25 new), and therefore "just" with the top 20 engines.

Exact ratings and details will follow later this weekend, but I can already say that nothing serious changes. Basically all engines came out well within their error ranges of the "old" IPON. One or the other engine might change rank if they are very (!) close together, but the ratings will come out very similar ...

Bye
Ingo

Hi Ingo,

I think this is a good change and definite improvement.

Please don't test several variations of the same program even if the authors come forward. The Houdini version of Ippolit is enough.

IWB · Post by **IWB** » Sun Jan 29, 2012 9:43 am

Don wrote:
Hi Ingo,

I think this is a good change and definite improvement.

Please don't test several variations of the same program even if the authors come forward. The Houdini version of Ippolit is enough.

Thx Don,

Don't worry, right now I do not see any good reason to add a Litto.

I have a question for a programmer (and others may answer too!):

Every new engine that presumably belongs in the top 20 I intend to play against the full set of engines and openings. That is 2850 games. For a new engine that is no problem (and it would throw out Strelka 2.0; Tornado and Nemo were VERY close!), but what (e.g.) about a new Komodo? Should it play against its "parent" as well?
On the one hand I have the feeling that playing this match is a kind of inbreeding and might distort the rating because of either a high draw rate or a lot of wins, as it plays against the weaknesses of the old version. On the other hand the "parent" IS part of the top 20, and scientifically I have to play it.
What happens if a newer version ends up behind the parent because of a high draw rate? According to my rules I would have to stick with the older version and throw the new one out ...? If I don't play this match I am missing 150 games; if I play it I might have 150 useless games ...
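The game counts in this thread can be checked with a few lines of Python. This is only a sketch: it assumes that every opening is played once with each colour against every opponent, which is not stated explicitly here but is consistent with all three figures (2850, 150 and the 8550 of the 25-opening tourney). The function names are made up for illustration.

```python
from math import comb


def games_for_newcomer(opponents: int, openings: int) -> int:
    """Games one new engine plays: each opening with both colours per opponent (assumed)."""
    return opponents * openings * 2


def games_round_robin(engines: int, openings: int) -> int:
    """Games in a full round robin: each pair plays each opening with both colours (assumed)."""
    return comb(engines, 2) * openings * 2


print(games_for_newcomer(19, 75))  # 2850, the full run for a new engine
print(games_for_newcomer(1, 75))   # 150, the single parent-child match
print(games_round_robin(19, 25))   # 8550, the tourney with the 25 new openings
```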

It is not easy, but what is your opinion about this "parent - child" comparison in an environment of 19 other engines?

Bye
Ingo

Don · Post by **Don** » Sun Jan 29, 2012 2:10 pm

IWB wrote:
Don wrote:
Hi Ingo,

I think this is a good change and definite improvement.

Please don't test several variations of the same program even if the authors come forward. The Houdini version of Ippolit is enough.
Thx Don,

Don't worry, right now I do not see any good reason to add a Litto.

I have a question for a programmer (and others may answer too!):

Every new engine that presumably belongs in the top 20 I intend to play against the full set of engines and openings. That is 2850 games. For a new engine that is no problem (and it would throw out Strelka 2.0; Tornado and Nemo were VERY close!), but what (e.g.) about a new Komodo? Should it play against its "parent" as well?
On the one hand I have the feeling that playing this match is a kind of inbreeding and might distort the rating because of either a high draw rate or a lot of wins, as it plays against the weaknesses of the old version. On the other hand the "parent" IS part of the top 20, and scientifically I have to play it.

I can easily produce 20 versions of Komodo, each of which is different and at least 18 or 19 of them would be top 20. So I don't think there is a good reason to "include" previous versions of the same program in your list. Otherwise the programs that release the most often would proliferate in your top 20.

Here is my suggestion on how you might structure this:

When you get a new program, whether a new version of an existing program or an entirely new one, you have to decide whether it's a top 20 program. I would suggest that you always play a new program against the current top 20 (even its own parent) to get its rank. This will give you 21 players. After doing this, keep the top 20. If a player has a parent, keep the stronger one and remove the other from the list and from play against future contestants. If the parent is stronger you can make a note in the naming by giving it a combined name or an asterisk (especially if the rankings of the two are the same). Here is an example:

1. Suppose I release Komodo 5.
2. Play it against all 20 programs, including Komodo 4.
3. Let's say Komodo 5 comes out slightly weaker.
4. Also let's say they come out at 2nd and 3rd place
5. On your table put "Komodo 4/5" for the name in second place.
6. Remove Komodo 5 from future tests.
7. At this point Komodo 4 represents the "Komodo" family.
8. When Komodo 6 comes out, it plays against Komodo 4 and the other top 20
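The steps above can be sketched in Python. This is a minimal illustration, not anybody's actual tooling: the `admit` helper, the family labels and the Komodo 5 ratings are hypothetical, and the example uses a list of size 3 instead of 20 to stay short. A tie between parent and child keeps the parent, which is an assumption.

```python
def admit(ratings, family, newcomer, new_rating, new_family, size=20):
    """Return the ranked list after testing one newcomer (hypothetical helper).

    ratings: dict name -> Elo, family: dict name -> family label.
    Only the strongest member of each family survives; ties keep the parent.
    """
    ratings = dict(ratings)
    family = dict(family)
    ratings[newcomer] = new_rating
    family[newcomer] = new_family

    best = {}  # family label -> strongest engine seen so far
    for name, elo in ratings.items():
        fam = family[name]
        if fam not in best or elo > ratings[best[fam]]:
            best[fam] = name

    ranked = sorted(best.values(), key=lambda n: ratings[n], reverse=True)
    return ranked[:size]


ratings = {"Houdini 2.0": 3017, "Komodo 4": 2975, "Deep Rybka 4.1": 2952}
family = {"Houdini 2.0": "Houdini", "Komodo 4": "Komodo", "Deep Rybka 4.1": "Rybka"}

# Komodo 5 tests slightly weaker (made-up rating): Komodo 4 keeps representing the family.
print(admit(ratings, family, "Komodo 5", 2970, "Komodo", size=3))
# Komodo 5 tests stronger (made-up rating): it replaces its parent.
print(admit(ratings, family, "Komodo 5", 2990, "Komodo", size=3))
```

The combined name from step 5 ("Komodo 4/5") is left out here; it would only change the label shown in the table, not who stays in the list.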

The "inbreeding" issue is not something you have to worry about. First of all, the effect is very minor anyway. A head-to-head match between 2 similar programs may distort the difference by 5 ELO, but let's for the sake of argument pretend that it's enormous and makes a 20 ELO distortion. This one match is given 1/20 of the weight, so at most you will get a 1 ELO distortion. The impact is split between the two programs in question, so really EACH program is distorted by 10 ELO, which is given 1/20 weight. So one program could be 1/2 point under-rated and the other 1/2 point over-rated. Keep in mind that your error margin is FAR GREATER than that, so there is no chance that you will even be able to see the effect of this.

The best way to fight inbreeding is to encourage diversity, something you are already doing by letting only one program represent a "family" of programs.
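The back-of-the-envelope estimate above can be written out as follows. It is only a sketch of that rough reasoning (split the distortion between the two programs, then weight the one match by 1/20); a real rating tool such as Bayeselo does not compute ratings by this kind of simple averaging. The function name is made up.

```python
def per_engine_distortion(match_distortion_elo: float, matches: int) -> float:
    """Rough per-engine rating shift caused by one distorted head-to-head match.

    The distortion is split evenly between the two engines, and the match
    counts as one of `matches` equally weighted matches.
    """
    return (match_distortion_elo / 2) / matches


print(per_engine_distortion(20, 20))  # 0.5, the pessimistic 20 ELO case
print(per_engine_distortion(5, 20))   # 0.125, the more realistic 5 ELO case
```

Both values are far below the error margins of 6 to 11 points shown in the list.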

What happens if a newer version ends up behind the parent because of a high draw rate? According to my rules I would have to stick with the older version and throw the new one out ...? If I don't play this match I am missing 150 games; if I play it I might have 150 useless games ...

After re-ranking you could "throw out" all the results of the program that is being ejected from the top 20 list. It's not likely this would change anything, but there is a small chance that it would. For example, the 20th player might end up with a lower rating than the previous 20th player, because the games you throw out could change things just enough to make it ambiguous. I don't think you need to obsess about this, however; to be scientific, just state the conditions that you will use to determine the top 20. You can either:

A) View the 20 player match as a pre-test to determine if the new program deserves to be in the top 20. Once accepted, eject the previous 20th player (or the other version of the same program) and remove its results with no apologies. Re-rate the games and you have a pure list of just the 20 players playing each other.

B) Keep the ratings and results of the 21 players, only display the top 20 players and consider player 21 as out of future competitions.

Either has pros and cons, but I favor A because it's a "pure" list in the sense that it is composed of 20 players with ONLY games against each other. There are good arguments for option B too.
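In terms of bookkeeping, the two options differ only in what happens to the games of the ejected player. A minimal sketch, assuming games are stored as (white, black, result) tuples and standings as (name, Elo) pairs; both layouts, the function names and the sample results are made up for illustration, and the actual re-rating would still be done by the rating tool.

```python
def option_a(games, ejected):
    """Option A: drop every game involving the ejected player, then re-rate the rest."""
    return [g for g in games if ejected not in (g[0], g[1])]


def option_b(standings, size=20):
    """Option B: keep all games and ratings, but display only the top `size` players."""
    return sorted(standings, key=lambda s: s[1], reverse=True)[:size]


games = [
    ("Komodo 4", "Strelka 2.0 B", "1-0"),       # made-up result
    ("Houdini 2.0", "Komodo 4", "1/2-1/2"),     # made-up result
    ("Strelka 2.0 B", "Houdini 2.0", "0-1"),    # made-up result
]
print(option_a(games, "Strelka 2.0 B"))  # only the Houdini vs. Komodo game remains

standings = [("Komodo 4", 2975), ("Strelka 2.0 B", 2668), ("Houdini 2.0", 3017)]
print(option_b(standings, size=2))  # Houdini and Komodo shown, Strelka hidden
```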

It is not easy, but what is your opinion about this "parent - child" comparison in an environment of 19 other engines?

Bye
Ingo

lucasart · Post by **lucasart** » Sun Jan 29, 2012 2:33 pm

kranium wrote: you are promoting and testing clearly GPL plagiarized engines like Houdini, Rybka, Strelka, etc.,
while simultaneously blacklisting a clean, unique, and incredibly innovative engine like IvanHoe?

+1

Comrades, let's all boycott the IPON capitalist rating list!

IWB · Post by **IWB** » Sun Jan 29, 2012 3:22 pm

Thx for the suggestions Don, I will ponder a bit about it.

Just one remark. I intend to do A AND B. I just can't do it right now, as I don't have a full set of F13 because the CB tourney interface is so clumsy. I hesitate to do that job and want to wait for DF13 ...

Regarding stopping the testing of a version: Rybka 4.1 or Onno 1.2.7 (to name a few) were a little weaker than their predecessors according to Bayes or Elostat at their release. But looking into the details I saw that against the same opponents with the same openings they were better. The problem was the good performance of the two predecessors against older engines. In both cases I knew that I only had to play more games and R4.1 and O127 would come out better than their parents. If I had stopped back then, they would be out. The problem continues: today both newer engines are behind their parents again, because they played against a lot of stronger engines (Komodo/Critter/Houdini ...), which dropped their rating, while their parents were stopped at their peak rating ....

I think I have to be very careful with the decisions about what is in and out, and have to rely on some common sense, as this is impossible to put into rules (for me!).

Anyhow, thx again, I have something to think about!

Bye and a nice remaining Sunday
Ingo

Don · Post by **Don** » Sun Jan 29, 2012 3:36 pm

IWB wrote:Thx for the suggestions Don, I will ponder a bit about it.

Just one remark. I intend to do A AND B. I just can't do it right now, as I don't have a full set of F13 because the CB tourney interface is so clumsy. I hesitate to do that job and want to wait for DF13 ...

Regarding stopping the testing of a version: Rybka 4.1 or Onno 1.2.7 (to name a few) were a little weaker than their predecessors according to Bayes or Elostat at their release. But looking into the details I saw that against the same opponents with the same openings they were better. The problem was the good performance of the two predecessors against older engines. In both cases I knew that I only had to play more games and R4.1 and O127 would come out better than their parents. If I had stopped back then, they would be out. The problem continues: today both newer engines are behind their parents again, because they played against a lot of stronger engines (Komodo/Critter/Houdini ...), which dropped their rating, while their parents were stopped at their peak rating ....

I think I have to be very careful with the decisions about what is in and out, and have to rely on some common sense, as this is impossible to put into rules (for me!).

Anyhow, thx again, I have something to think about!

Bye and a nice remaining Sunday
Ingo

I have another very simple set of rules to ponder which will make your life much simpler:

1. If an author releases a new version, remove the old one and substitute the new version. It's the author's decision which is his latest and greatest.

2. If a NEW engine is released, test it against the top 19, not the top 20.

Idea 2 is very logical, because we basically ASSUME that it will be top 20, and the only time you have to resolve anything is if it tests in last place. Then you have to determine whether the previous position-20 engine or the new one is stronger. But that is trivial to resolve since BOTH programs played the same 19 players. Just keep the one with the higher ELO.
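Rule 2 boils down to a single comparison. A minimal sketch, assuming each engine is a (name, Elo) pair and that both the old number 20 and the newcomer were rated against the same 19 opponents; the `resolve_last_place` helper, the "NewEngine" entry and its rating are hypothetical, and a tie keeps the old engine by assumption.

```python
def resolve_last_place(top19, old20, newcomer):
    """Keep whichever of old20 / newcomer rated higher against the same 19 engines.

    top19: list of (name, elo); old20 and newcomer: (name, elo).
    Returns the new list, sorted by Elo, strongest first.
    """
    keep = newcomer if newcomer[1] > old20[1] else old20
    return sorted(top19 + [keep], key=lambda e: e[1], reverse=True)


# Shortened example: two engines stand in for the top 19.
top = [("Deep Onno 1-2-70", 2684), ("Zappa Mexico II", 2717)]
old_last = ("Strelka 2.0 B", 2668)
new_engine = ("NewEngine", 2675)  # hypothetical newcomer and rating

print(resolve_last_place(top, old_last, new_engine))
```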

Don
