Komodo bugfix results

lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Komodo bugfix results

Post by lkaufman »

A fairly serious bug was found by one of our alert beta-testers in the beta version of Komodo 5.1 MP. It has been fixed in time for Monday's release. Here are new results with the bugfixed version, all against Stockfish 3.0 at 3' + 2", 256 meg hash, Noomen test set, sse4 machines.

On 4 cores, score now is 60-49 (+34 elo)
On 12 cores, score now is 50-28 (+100 elo).

This difference is so large that even given the small samples the likelihood of superiority for the second result must be quite high, so much to our surprise it seems we scale better than SF with more cores. Normally in view of the small samples I would combine the results and quote the average elo difference, but in view of the huge disparity between the two results this might not be appropriate. Both results are better than with the previous version, the 12 core one dramatically so.
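For readers who want to check the Elo figures quoted above, here is a minimal sketch of the usual logistic conversion from a match score to an Elo difference. The function name `elo_from_score` is hypothetical, and it is an assumption that the quoted numbers were produced this way; draws enter only through the point totals.

```python
import math

def elo_from_score(points_for: float, points_against: float) -> float:
    """Elo difference implied by a match score under the logistic Elo model."""
    # Expected score p = points_for / total, and Elo = 400 * log10(p / (1 - p)),
    # which simplifies to 400 * log10(points_for / points_against).
    return 400.0 * math.log10(points_for / points_against)

four_core = elo_from_score(60, 49)    # 4-core match, 109 games
twelve_core = elo_from_score(50, 28)  # 12-core match, 78 games

print(f"4 cores:  {four_core:+.1f} Elo")
print(f"12 cores: {twelve_core:+.1f} Elo")
```

Under this formula both results land within a point or so of the +34 and +100 quoted in the post.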
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Attn: too much error bar in the test results (n/t)

Post by Daniel Shawul »

..proves Mark's point..
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Komodo bugfix results

Post by michiguel »

Hi Larry,

Sounds like if I do not make a comment in any thread related to statistical significance, I cannot make it somewhere else. So, here I go. I do not think that you can make those conclusions with the number of games you played (particularly with the scaling 4 to 12 cores). Some people can help you with BayesElo. I can help you with Ordo if you are using it to get an idea of the errors.

Miguel
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Komodo bugfix results

Post by lkaufman »

I don't know how to calculate the likelihood that on 12 cores Komodo is stronger relative to Stockfish than on four cores, but my sense is that it is around 90% from the data given. If someone knows how to do this calculation, please do so here. If it is around 90%, it would be reasonable to say that Komodo "probably" scales better than Stockfish with more cores. I also had a better result on 12 cores than on 4 for the buggy version, so there is some extra reason to believe this conclusion. By the way, I have no commercial reason to make this claim, as very few buyers have more than 6 cores.
Mike S.
Posts: 1480
Joined: Thu Mar 09, 2006 5:33 am

Re: Komodo bugfix results

Post by Mike S. »

12 cores data indeed is of academic interest, only. Nevertheless, certainly more than a dozen power users will be interested, as it seems to indicate the effectiveness of the multiprocessing implementation. It looks good!

It seems that if we count engine series only, Komodo 5.1 will be on second rank. But the tough question is now: In typical MP (e.g. 4 cores/4 threads or even 8 threads on Intel), is it stronger than Houdini 1.5a?

As for me, this question is moot though as I am now very satisfied with Komodo CCT on my dualcore CPU. Thanks! :mrgreen:
Regards, Mike
Ajedrecista
Posts: 2177
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Komodo bugfix results.

Post by Ajedrecista »

Hello Larry:
I am glad to see that you are in good form again. :)

I agree with other posters that the number of games is somewhat small to draw an early conclusion, so trying to raise the number of played games is a good start.

Regarding Statistics... well, I am not the best here, of course! I did some Fortran 95 programmes that can be downloaded for free through the link of my signature. Just take the results with tons of care. Inside this pack, LOS_and_Elo_uncertainties_calculator is the programme you should run (it is valid between matches of only two engines).

Here is a thread that might be of help with your request:

Math Test 4 All

There are some interesting answers. Miguel explained there how to do a simulation with Ordo. I also did my clumsy math in a 'trial and error' mode. I hope that you find valuable answers in that topic.

------------------------

If you want to calculate LLR (Log Likelihood Ratio) in the same way as the SF Testing Framework does, here is another programme of mine:

LLR_calculator_for_chess.rar (0.6 MB)

(This link will die 30 days after the last download.) It is no more than a copy of part of a Python file from FishTest, which I credit in both the Readme and the source code files. Once again, take the given results with care.

My tools only work in Windows (sorry to Linux and other OS users).

Good luck with the imminent release!

Regards from Spain.

Ajedrecista.
Werewolf
Posts: 2064
Joined: Thu Sep 18, 2008 10:24 pm

Re: Komodo bugfix results

Post by Werewolf »

I just want to say that as a user who has 16 and 12 core machines, I very much appreciate your efforts to work on this equipment.

My hardware is fairly tied up but if I can help run a quick test for you let me know.
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Komodo bugfix results

Post by lkaufman »

I completed 200 games against Stockfish 3 on four cores for the release version, and the result is 108.5 to 91.5, which is +30 elo for Komodo. The difference between this and the +100 elo result on 12 cores after 78 games must be highly significant now.
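As a rough check on this claim, the sketch below applies the 480/sqrt(N) rule of thumb for a 2SD error bar that Laskos gives elsewhere in this thread. The rule is borrowed as-is and not derived here, the variable names are hypothetical, and the number of draws is ignored, so treat the outcome as an estimate only.

```python
import math

# Elo differences implied by the two match scores (logistic model).
elo_4 = 400.0 * math.log10(108.5 / 91.5)   # 200 games on 4 cores
elo_12 = 400.0 * math.log10(50.0 / 28.0)   # 78 games on 12 cores

# Rule-of-thumb 2SD error bars, 480/sqrt(N), as quoted in this thread.
two_sd_4 = 480.0 / math.sqrt(200)
two_sd_12 = 480.0 / math.sqrt(78)

# Errors of independent matches add in quadrature.
two_sd_diff = math.hypot(two_sd_4, two_sd_12)
gap = elo_12 - elo_4

print(f"gap = {gap:.1f} Elo, 2SD of the gap = {two_sd_diff:.1f} Elo")
print("gap exceeds 2SD:", gap > two_sd_diff)
```

With these inputs the gap comes out somewhat above the 2SD bound, which is consistent with calling the difference significant, though the 78-game match still dominates the uncertainty.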
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Komodo bugfix results

Post by Laskos »

You have a good sense with your 90%. A rule-of-thumb calculation (it is crude; for a proper one I would need the number of draws) is as follows:

2SD for first match is 480/sqrt(109) points ~= 46 points
2SD for second match is 480/sqrt(78) points ~= 54 points

2SD for their difference is ~= sqrt(46^2+54^2) ~= 70 points

You have a 66-point difference, which is a bit smaller than the 2SD error of 70 points, so the LOS is a bit smaller than 95%, say 90%, as your excellent sense told you. Maybe Miguel can show that in Ordo, or you can feed BayesElo the PGN to get the LOS matrix.
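The rule-of-thumb arithmetic above can be sketched in a few lines. This assumes the 480/sqrt(N) figure from this post, independent matches, and a normal approximation; the helper names are hypothetical, and the draw counts are still missing, so the probabilities are approximate.

```python
import math

def elo_from_score(points_for: float, points_against: float) -> float:
    """Elo difference implied by a match score (logistic model)."""
    return 400.0 * math.log10(points_for / points_against)

def two_sd_rule_of_thumb(games: int) -> float:
    """2SD error bar in Elo, using the 480/sqrt(N) rule quoted in the post."""
    return 480.0 / math.sqrt(games)

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

elo_4 = elo_from_score(60, 49)     # 109 games on 4 cores
elo_12 = elo_from_score(50, 28)    # 78 games on 12 cores

# Independent errors combine in quadrature.
two_sd_diff = math.hypot(two_sd_rule_of_thumb(109), two_sd_rule_of_thumb(78))

# z-score of the observed gap (one SD is half of the 2SD figure).
z = (elo_12 - elo_4) / (two_sd_diff / 2.0)
one_sided = normal_cdf(z)          # P(12-core edge > 4-core edge)
two_sided = 2.0 * one_sided - 1.0  # central coverage at |z|

print(f"2SD of difference = {two_sd_diff:.1f} Elo, z = {z:.2f}")
print(f"one-sided = {one_sided:.3f}, two-sided = {two_sided:.3f}")
```

Under these assumptions the two-sided coverage lands a little below 95%, in line with the estimate above, while the one-sided probability, which is closer to how LOS is usually defined, comes out higher. Feeding the actual PGN to BayesElo or Ordo would account for draws properly.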
yanquis1972
Posts: 1766
Joined: Wed Jun 03, 2009 12:14 am

Re: Komodo bugfix results

Post by yanquis1972 »

larry, i am not sure if you or don (or either) would be the one to ask, but do you think this drastic result on manycore systems is a result of your access to such hardware? do you have any theories as to how such a drastic improvement is possible? would something like Stockfish's default split depth come into play?