Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

ThatsIt
Posts: 992
Joined: Thu Mar 09, 2006 2:11 pm

Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by ThatsIt »

Hi to all !

The testrun has begun:
http://cegt.forumieren.com/t153-testing ... ish-50-x64

Best wishes,
G.S.
(CEGT team)
ThatsIt
Posts: 992
Joined: Thu Mar 09, 2006 2:11 pm

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by ThatsIt »

Update: 500 games have been played.

http://cegt.forumieren.com/t153-testing ... ish-50-x64

Best wishes,
G.S.
(CEGT team)
Wolfgang
Posts: 989
Joined: Sat May 13, 2006 1:08 am

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by Wolfgang »

750 games played now, +37 to Stockfish DD and -6 to Houdini 4.0
Best
Wolfgang
CEGT-Team
www.cegt.net
www.cegt.forumieren.com
Uri Blass
Posts: 11153
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by Uri Blass »

I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when the programs score clearly more than 50%. It is important to make an effort to have scores closer to 50% where possible.
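
One part of the concern about scores far from 50% can be illustrated with a minimal sketch. It assumes the plain logistic Elo curve and games without draws, which is not the exact model used by BayesElo or Ordo; the function name `elo_std_error` is made up for this illustration.

```python
import math


def elo_std_error(score: float, games: int) -> float:
    """Approximate 1-sigma error (in Elo) of a rating difference estimated
    from a match score, via the delta method on the logistic Elo curve.

    Assumption: every game is a win or a loss, so the score fraction has
    binomial variance score * (1 - score) / games. Draws would reduce the
    variance, so this is only a rough upper-bound illustration.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    if games <= 0:
        raise ValueError("games must be positive")
    # Slope of elo = -400 * log10(1 / score - 1) with respect to score.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    score_std = math.sqrt(score * (1.0 - score) / games)
    return slope * score_std


if __name__ == "__main__":
    for s in (0.50, 0.65, 0.80, 0.95):
        err = elo_std_error(s, 1000)
        print(f"score {s:.0%}: about +/- {err:.1f} Elo (1 sigma, 1000 games)")
```

Under these assumptions the error for a fixed number of games is smallest at a 50% score and grows as the score becomes more lopsided. This covers only statistical noise, not the model-dependent bias discussed in this thread.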
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by lkaufman »

Uri Blass wrote: I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when the programs score clearly more than 50%. It is important to make an effort to have scores closer to 50% where possible.
The above IPON rating inversion is due to using BayesElo rather than Ordo. The real problem is that an engine that draws fewer games (i.e. Houdini) will have an artificially high rating when most of the matches are mismatches. Your solution has a couple of problems, namely that fewer games can be run in the same time, and that the four-core ratings will be even more distorted. My solution (other than using Ordo) is simply to play an additional round robin (RR) of the top 4 or so engines, with enough games to equal the number played in the original RR, and to rate all the games together. There should be far more close pairings than mismatches, and this solves the problem.
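
A rough sketch of how the extra round robin shifts the mix of pairings for a top engine, under one reading of the proposal: each top engine plays as many additional games against the other top engines as it played in the original RR. The field size of 30 and the name `close_game_fraction` are hypothetical and not taken from any rating list.

```python
from typing import Tuple


def close_game_fraction(field_size: int, top_size: int,
                        games_per_engine: int) -> Tuple[float, float]:
    """Fraction of a top engine's games played against other top engines,
    before and after adding an extra round robin among the top engines.

    Assumptions (hypothetical): the original RR spreads games evenly over
    all opponents, and the extra RR gives each top engine the same number
    of games again, all against the other top engines.
    """
    if not 1 < top_size <= field_size:
        raise ValueError("need 1 < top_size <= field_size")
    if games_per_engine <= 0:
        raise ValueError("games_per_engine must be positive")
    opponents = field_size - 1
    peers = top_size - 1
    close_before = games_per_engine * peers / opponents
    before = close_before / games_per_engine
    after = (close_before + games_per_engine) / (2 * games_per_engine)
    return before, after


if __name__ == "__main__":
    before, after = close_game_fraction(field_size=30, top_size=4,
                                        games_per_engine=1000)
    print(f"close pairings before: {before:.1%}, after: {after:.1%}")
```

With these made-up numbers, the share of close pairings for a top engine goes from a small minority to more than half of its games.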
Modern Times
Posts: 3803
Joined: Thu Jun 07, 2012 11:02 pm

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by Modern Times »

Wolfgang wrote:750 games played now, +37 to Stockfish DD and -6 to Houdini 4.0
Thanks for the update! So Houdini 4 still leads the list by a small margin, but that could yet change. The gap is within the error bars, so it is too close to call.
Leto
Posts: 2139
Joined: Thu May 04, 2006 3:40 am
Location: Dune

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by Leto »

Keep in mind that Houdini 4 has a contempt setting of 1 by default, which explains why it scores fewer draws against weaker engines. Also, it's not yet certain whether Houdini 4 still leads that list, because any change in one engine's rating (for example Houdini 4's) changes the ratings of all engines.
Dr.Wael Deeb
Posts: 9773
Joined: Wed Mar 08, 2006 8:44 pm
Location: Amman,Jordan

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by Dr.Wael Deeb »

Uri Blass wrote: I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when the programs score clearly more than 50%. It is important to make an effort to have scores closer to 50% where possible.
I don't see any logical explanation behind your proposal, Uri.
Dr.D
No one can hit as hard as life. But it ain’t about how hard you can hit. It’s about how hard you can get hit and keep moving forward. How much you can take and keep moving forward…
ThatsIt
Posts: 992
Joined: Thu Mar 09, 2006 2:11 pm

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by ThatsIt »

Update: 1250 games have been played.

http://cegt.forumieren.com/t153-testing ... ish-50-x64

Best wishes,
G.S.
(CEGT team)
Uri Blass
Posts: 11153
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on

Post by Uri Blass »

Dr.Wael Deeb wrote:
Uri Blass wrote: I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.

For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.

After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when the programs score clearly more than 50%. It is important to make an effort to have scores closer to 50% where possible.
I don't see any logical explanation behind your proposal, Uri.
Dr.D
The logical explanation is that the rating you get is biased when you test only against weaker opponents, and I prefer a rating that is less dependent on the opponents.

Suppose Houdini 4 now has a higher rating than Stockfish 5, when both have played only against weaker opponents, merely because Houdini takes risks and plays objectively bad moves that help it win against weak opponents. If some years later, when both play against stronger opponents, we find that Stockfish 5 has the higher rating, then the rating is not a reliable measure of playing strength.

If you want the rating to be a more reliable tool for measuring playing strength, then you need to make sure that programs score something close to 50%. The assumption that your opponents are always going to be weaker is simply an arrogant one that has no basis once we think about the future, and I can easily prove it: for example, Fritz 5.32, which was once the SSDF leader, now scores only 38% in its games in the SSDF list.

http://ssdf.bosjo.net/long.txt
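
To put the 38% figure in rating terms, here is a minimal sketch that converts a score fraction into an Elo difference. It assumes the plain logistic Elo curve, which is not necessarily the exact formula the SSDF list uses; the function names are made up for this illustration.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic curve.

    Negative values mean the player is rated below its average opposition.
    """
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def expected_score(elo_diff: float) -> float:
    """Inverse mapping: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    diff = elo_diff_from_score(0.38)
    print(f"38% score -> {diff:.0f} Elo relative to the opposition")
    print(f"round trip: {expected_score(diff):.2f}")
```

Under this assumption a 38% score corresponds to roughly 85 Elo below the average opponent, so a former list leader now plays mostly against stronger programs.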