Hi all!
The test run has begun:
http://cegt.forumieren.com/t153-testing ... ish-50-x64
Best wishes,
G.S.
(CEGT team)
Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Moderator: Ras
-
ThatsIt
- Posts: 992
- Joined: Thu Mar 09, 2006 2:11 pm
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Update: 500 games have been played.
http://cegt.forumieren.com/t153-testing ... ish-50-x64
Best wishes,
G.S.
(CEGT team)
-
Wolfgang
- Posts: 989
- Joined: Sat May 13, 2006 1:08 am
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
750 games played now: +37 Elo compared to Stockfish DD and -6 Elo compared to Houdini 4.0.
-
Uri Blass
- Posts: 11153
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.
For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.
After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when programs score clearly more than 50%, so it is important to make an effort to get scores closer to 50% where possible.
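The point about scores far from 50% can be illustrated with the standard logistic Elo formula. This is only a minimal sketch: the function names are made up for illustration, and the tools mentioned in this thread (BayesElo, Ordo) fit their own models, so this is not how the IPON or CEGT lists are actually computed.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for a player rated elo_diff above the opponent,
    using the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_diff_from_score(score: float) -> float:
    """Inverse of expected_score: the rating difference implied by a score."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_shift_per_point(score: float, step: float = 0.01) -> float:
    """How much the implied Elo difference moves when the score rises by
    `step` (one percentage point by default)."""
    return elo_diff_from_score(score + step) - elo_diff_from_score(score)


if __name__ == "__main__":
    # Hypothetical score levels, chosen only to show the shape of the curve.
    for s in (0.50, 0.75, 0.90, 0.95):
        print(f"score {s:.0%}: implied diff {elo_diff_from_score(s):7.1f} Elo, "
              f"+1% score moves it by {elo_shift_per_point(s):5.1f} Elo")
```

Under this formula, one extra percentage point is worth about 7 Elo near a 50% score but about 40 Elo near 95%, so ratings inferred from lopsided matches are much more sensitive to small differences in results (and to modelling choices such as how draws are treated).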
-
lkaufman
- Posts: 6284
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Uri Blass wrote: "I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.
For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.
After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when programs score clearly more than 50%, so it is important to make an effort to get scores closer to 50% where possible."
The above IPON rating inversion is due to using BayesElo rather than Ordo. The real problem is that an engine that draws fewer games (i.e. Houdini) will have an artificially high rating when most of the matches are mismatches. Your solution has a couple of problems, namely that fewer games can be run in the same time, and that the four-core ratings will be even more distorted. My solution (other than using Ordo) is simply to play an additional round robin (RR) of the top 4 or so engines, with enough games to equal the number played in the original RR, and rate all the games together. There should be far more close pairings than mismatches, and this solves the problem.
-
Modern Times
- Posts: 3803
- Joined: Thu Jun 07, 2012 11:02 pm
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Wolfgang wrote: "750 games played now, +37 to Stockfish DD and -6 to Houdini 4.0"
Thanks for the update! So Houdini 4 still leads the list by a small margin, but that could yet change. Within the error bars, too close to call.
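To give a feel for the size of such error bars, here is a rough sketch that turns a win/draw/loss record into an Elo difference with an approximate 95% interval. It assumes a simple normal approximation on the per-game score and the standard logistic Elo curve; the function name and the game counts are hypothetical, and CEGT's own rating tools may compute their margins differently.

```python
import math


def elo_with_error(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high): the Elo difference implied by a W/D/L record
    and an approximate confidence interval (z = 1.96 for roughly 95%)."""
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean (outcomes 1, 0.5, 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    se = math.sqrt(var / n)

    def to_elo(score: float) -> float:
        # Clamp so the logarithm stays finite for extreme scores.
        score = min(max(score, 1e-6), 1.0 - 1e-6)
        return -400.0 * math.log10(1.0 / score - 1.0)

    return to_elo(mean), to_elo(mean - z * se), to_elo(mean + z * se)


if __name__ == "__main__":
    # Hypothetical record over 750 games, not the actual CEGT result.
    elo, low, high = elo_with_error(wins=300, draws=300, losses=150)
    print(f"{elo:+.1f} Elo (about {low:+.1f} to {high:+.1f})")
```

With a few hundred games the interval is typically tens of Elo wide, so a lead of a handful of points falls well inside it. Quadrupling the number of games roughly halves the width.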
-
Leto
- Posts: 2139
- Joined: Thu May 04, 2006 3:40 am
- Location: Dune
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Keep in mind that Houdini 4 has a default contempt setting of 1, which explains why it scores fewer draws against weaker engines. Also, it's not certain yet whether Houdini 4 still leads that list, because any change to one engine's rating (for example Houdini 4's) changes the ratings of all engines.
-
Dr.Wael Deeb
- Posts: 9773
- Joined: Wed Mar 08, 2006 8:44 pm
- Location: Amman,Jordan
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Uri Blass wrote: "I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.
For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.
After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when programs score clearly more than 50%, so it is important to make an effort to get scores closer to 50% where possible."
I don't see any logical explanation behind your proposal, Uri.
Dr.D
No one can hit as hard as life. But it ain’t about how hard you can hit. It’s about how hard you can get hit and keep moving forward. How much you can take and keep moving forward…
-
ThatsIt
- Posts: 992
- Joined: Thu Mar 09, 2006 2:11 pm
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Update: 1250 games have been played.
http://cegt.forumieren.com/t153-testing ... ish-50-x64
Best wishes,
G.S.
(CEGT team)
-
Uri Blass
- Posts: 11153
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish 5.0 x64 1CPU @ CEGT 5'+3" pb=on
Uri Blass wrote: "I think that if you also test 1 core against 4 cores, your rating list is going to be more reliable.
For example, you can test Stockfish 5 1 CPU and Houdini 4 1 CPU against Komodo 7a 4 CPU.
After seeing that Stockfish scored more points yet got a lower rating in one version of the IPON rating list, I am afraid that I cannot trust rating lists when programs score clearly more than 50%, so it is important to make an effort to get scores closer to 50% where possible."
Dr.Wael Deeb wrote: "I don't see any logical explanation behind your proposal, Uri. Dr.D"
The logical explanation is that the rating you get is biased when you test only against weaker opponents, and I prefer a rating that is less dependent on the opponents.
Suppose Houdini 4 now has a higher rating than Stockfish 5, with both having played only against weaker opponents, merely because Houdini takes risks and plays objectively bad moves that help it win against weak opponents. If some years later, when both play against stronger opponents, we find that Stockfish 5 has the higher rating, then the rating is not a reliable measure of playing strength.
If you want the rating to be a more reliable tool for measuring playing strength, you need to make sure that programs score something close to 50%. The assumption that your opponents are going to be weaker is simply an arrogant assumption that has no basis if we think about the future, and I can easily prove it (for example, Fritz 5.32, which was once the SSDF leader, now scores only 38% in its games in the SSDF list):
http://ssdf.bosjo.net/long.txt