Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

ThatsIt · Post by **ThatsIt** » Mon Jun 02, 2014 11:47 am

Hi to all !

The testrun has begun:
http://cegt.forumieren.com/t153-testing ... ish-50-x64

Best wishes,
G.S.
(CEGT team)

ThatsIt · Post by **ThatsIt** » Tue Jun 03, 2014 9:18 pm

Update: 500 games have been played.

http://cegt.forumieren.com/t153-testing ... ish-50-x64

Best wishes,
G.S.
(CEGT team)

Wolfgang · Post by **Wolfgang** » Wed Jun 04, 2014 3:53 pm

750 games played now, +37 to Stockfish DD and -6 to Houdini 4.0

Uri Blass · Post by **Uri Blass** » Wed Jun 04, 2014 4:12 pm

I think that if you also test 1-core engines against 4-core engines, your rating list will be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid I cannot trust rating lists when programs score clearly more than 50%. It is important to make an effort to keep scores closer to 50% where possible.

lkaufman · Post by **lkaufman** » Wed Jun 04, 2014 4:45 pm

Uri Blass wrote: I think that if you also test 1-core engines against 4-core engines, your rating list will be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid I cannot trust rating lists when programs score clearly more than 50%. It is important to make an effort to keep scores closer to 50% where possible.

The above IPON rating inversion is due to using BayesElo rather than Ordo. The real problem is that an engine that draws fewer games (i.e. Houdini) will have an artificially high rating when most of the matches are mismatches. Your solution has a couple of problems, namely that fewer games can be run in the same time, and that the four-core ratings will be even more distorted. My solution (other than using Ordo) is simply to play an additional RR of the top 4 or so engines, with enough games to equal the number played in the original RR, and rate all the games together. There would then be far more close pairings than mismatches, and this solves the problem.
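
To make the draw-rate effect concrete, here is a minimal Python sketch. It assumes a win/draw/loss model of the kind BayesElo is generally described as using (a logistic curve shifted by a draw parameter) and compares it with a plain score-based logistic fit. The `elo_draw` value of 100, the function names, and the W/D/L records are all made up for illustration; this is not the actual BayesElo or Ordo code.

```python
import math

def wdl_probs(delta, elo_draw):
    """Win/draw/loss probabilities for a rating gap `delta` (in Elo).

    Assumed model: a logistic curve shifted by `elo_draw` for wins and losses,
    with the draw probability as the remainder.
    """
    p_win = 1.0 / (1.0 + 10.0 ** ((elo_draw - delta) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo_draw + delta) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def fit_gap_with_draw_model(wins, draws, losses, elo_draw=100.0):
    """Maximum-likelihood rating gap, found by brute force over whole Elo values."""
    def log_likelihood(delta):
        p_win, p_draw, p_loss = wdl_probs(delta, elo_draw)
        return (wins * math.log(p_win)
                + draws * math.log(p_draw)
                + losses * math.log(p_loss))
    return max(range(-1000, 1001), key=log_likelihood)

def fit_gap_from_score(wins, draws, losses):
    """Rating gap implied by the score fraction alone (a draw counts as half a point)."""
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return 400.0 * math.log10(score / (1.0 - score))

# Hypothetical records: two mismatches with the same 80% score,
# and two even matches with the same 50% score.
records = {
    "80% score, few draws": (70, 20, 10),
    "80% score, many draws": (62, 36, 2),
    "50% score, few draws": (45, 10, 45),
    "50% score, many draws": (30, 40, 30),
}

for label, (w, d, l) in records.items():
    draw_aware = fit_gap_with_draw_model(w, d, l)
    score_only = fit_gap_from_score(w, d, l)
    print(f"{label}: draw-aware gap {draw_aware:+d}, score-only gap {score_only:+.1f}")
```

In this sketch the two 80% records get the same score-only gap, but the draw-aware fit rates the record with fewer draws higher. At a 50% score the draw rate makes no difference, which is consistent with the distortion coming from mismatches rather than from close pairings.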

Modern Times · Post by **Modern Times** » Wed Jun 04, 2014 5:25 pm

Wolfgang wrote:750 games played now, +37 to Stockfish DD and -6 to Houdini 4.0

Thanks for the update ! So Houdini 4 still leads the list by a small margin, but that could yet change. Within the error bars, too close to call.

Leto · Post by **Leto** » Wed Jun 04, 2014 6:03 pm

Keep in mind that Houdini 4 has a default contempt setting of 1, which explains why it scores fewer draws against weaker engines. Also, it's not yet certain that Houdini 4 still leads the list, because any change to one engine's rating (for example Houdini 4's) changes the ratings of all engines.

Dr.Wael Deeb · Post by **Dr.Wael Deeb** » Wed Jun 04, 2014 11:22 pm

Uri Blass wrote: I think that if you also test 1-core engines against 4-core engines, your rating list will be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid I cannot trust rating lists when programs score clearly more than 50%. It is important to make an effort to keep scores closer to 50% where possible.

I don't see any logical explanation behind your proposal, Uri.
Dr.D

ThatsIt · Post by **ThatsIt** » Fri Jun 06, 2014 10:50 am

Update: 1250 games have been played.

http://cegt.forumieren.com/t153-testing ... ish-50-x64

Best wishes,
G.S.
(CEGT team)

Uri Blass · Post by **Uri Blass** » Fri Jun 06, 2014 11:16 am

Dr.Wael Deeb wrote:
Uri Blass wrote: I think that if you also test 1-core engines against 4-core engines, your rating list will be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid I cannot trust rating lists when programs score clearly more than 50%. It is important to make an effort to keep scores closer to 50% where possible.
I don't see any logical explanation behind your proposal, Uri.
Dr.D

The logical explanation is that the rating you get is biased when you test only against weaker opponents, and I prefer a rating that is less dependent on the opponents.

Suppose Houdini 4 now has a higher rating than Stockfish 5 when both have played only against weaker opponents, simply because Houdini takes risks and plays objectively bad moves that help it win against weak opponents. If some years later, when both of them play against stronger opponents, we find that Stockfish 5 has the higher rating, then the rating is not a reliable measure of playing strength.

If you want the rating to be a more reliable tool for measuring playing strength, then you need to make sure that programs score something close to 50%. The assumption that your opponents are going to be weaker is simply an arrogant assumption that has no basis if we think about the future, and I can easily prove it (for example, Fritz 5.32, which was once the SSDF leader, now scores only 38% in its games in the SSDF list).

http://ssdf.bosjo.net/long.txt
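
As a rough illustration of why scores far from 50% are harder to convert into ratings, the sketch below uses the standard logistic Elo formula to show how many Elo points a one-percentage-point change in score is worth at different score levels. The formula is an assumption here (the rating lists discussed in this thread use their own tools), and the function names are hypothetical.

```python
import math

def elo_diff(score):
    """Elo gap implied by a score fraction (0 < score < 1) under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_shift_per_point(score, delta=0.01):
    """Change in the implied Elo gap when the score fraction rises by `delta`."""
    return elo_diff(score + delta) - elo_diff(score)

# Compare an even match with increasingly lopsided ones.
for s in (0.50, 0.65, 0.80, 0.90):
    print(f"score {s:.0%}: implied gap {elo_diff(s):6.1f} Elo, "
          f"+1% score is worth {elo_shift_per_point(s):4.1f} Elo")
```

Under this formula, the same one-point change in score moves the implied rating gap more the further the score is from 50%, so small systematic effects (such as an engine's draw tendency against weak opponents) are amplified in lopsided pairings.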
