Peculiarity of Komodo 5.1MP

Laskos · Post by **Laskos** » Sun Jun 23, 2013 2:15 am

bob wrote:
I have not denied anything.

I simply said (a) the data at the beginning of this thread has absolutely no useful information content in it, and says nothing whatsoever about parallel performance of any kind.

The data at the beginning of this thread has the useful information that time-to-depth is not the way to see the effectiveness of the SMP implementation.

Suppose that Elo gain came at the expense of running 8x slower? The given test would not show that. So concluding anything at all about Elo given the way that test was run is completely pointless.

(b) for modest numbers of processors, the most significant gain is going to be depth-related...

I'd hope that is clear enough. If someone gets more from widening (as opposed to deepening) the search, that suggests an issue. Nothing more, nothing less. But it is an indicator of an issue.

If Komodo can be modified to one ply less or one ply more without changing the strength, I think the "issue" is only theoretical, practically, with this widening, it could have reasonable, or good, or even the best SMP implementation.

syzygy · Post by **syzygy** » Sun Jun 23, 2013 3:07 am

bob wrote:1. Kai's data does NOT show anything relative to parallel search Elo gain.

Did I write that it did? Let's see:

syzygy wrote:I think the problem is that you have not bothered to read what this thread is about.

What Kai showed is ONLY that Komodo's SMP behaviour is different from SMP behaviour of other engines. This does not mean that Komodo's SMP implementation is any good or any bad. It does mean that it is different.

So no, I did not write that it did.

2. There has been ZERO evidence to show that such a "wider search" is stronger.

Hasn't there been? Do you realise that the only reason that we assume Komodo's search is wider is that Kai's experiment has shown Komodo's 4-core search to be stronger than its 1-core search at the same depth?

If this happened, does it not DIRECTLY show that the sequential search could be improved?

First of all: so what? Second: no, it does not.

Multiplexing works.

No, multiplexing does not work. Multiplexing costs time. Try to switch context every other node. You mess up your caches. You mess up everything.

Multiplexing is a theoretical way of showing that an O(f(n)) algorithm using N cores implies an O(f(n))-algorithm using 1 core. The O() consumes an arbitrary constant here.

More importantly: nobody is saying that Komodo's 4-core search is better than a 4x faster 1-core search. Where did you get that from?

bob · Post by **bob** » Mon Jun 24, 2013 6:57 am

Laskos wrote:
bob wrote:
I have not denied anything.

I simply said (a) the data at the beginning of this thread has absolutely no useful information content in it, and says nothing whatsoever about parallel performance of any kind.
The data at the beginning of this thread has the useful information that time-to-depth is not the way to see the effectiveness of the SMP implementation.

How so? Just disable LMR. You get a BIG elo jump at fixed depth. Does that tell us anything we don't already know?

The data at the beginning of this thread simply shows that Komodo does something different. Not whether it is better, worse, or anything... Nothing about the SMP search either. Can't tell whether it is better or worse, because real games are timed.

If you want to post elo, just play 1 cpu against a gauntlet and then repeat with 4 cpus. Then you get actual Elo. If you want to look at SMP efficiency only, look at time to depth, only, 1 cpu vs 4 cpus.

Suppose that Elo gain came at the expense of running 8x slower? The given test would not show that. So concluding anything at all about Elo given the way that test was run is completely pointless.

(b) for modest numbers of processors, the most significant gain is going to be depth-related...

I'd hope that is clear enough. If someone gets more from widening (as opposed to deepening) the search, that suggests an issue. Nothing more, nothing less. But it is an indicator of an issue.
If Komodo can be modified to one ply less or one ply more without changing the strength, I think the "issue" is only theoretical, practically, with this widening, it could have reasonable, or good, or even the best SMP implementation.

Or bad. The data doesn't show ANYTHING about Elo gain/loss when using 4 cores...

bob · Post by **bob** » Mon Jun 24, 2013 7:08 am

syzygy wrote:
bob wrote:1. Kai's data does NOT show anything relative to parallel search Elo gain.
Did I write that it did? Let's see:
syzygy wrote:I think the problem is that you have not bothered to read what this thread is about.

What Kai showed is ONLY that Komodo's SMP behaviour is different from SMP behaviour of other engines. This does not mean that Komodo's SMP implementation is any good or any bad. It does mean that it is different.
So no, I did not write that it did.

2. There has been ZERO evidence to show that such a "wider search" is stronger.
Hasn't there been? Do you realise that the only reason that we assume Komodo's search is wider is that Kai's experiment has shown Komodo's 4-core search to be stronger than its 1-core search at the same depth?

It should be intuitively obvious to anyone familiar with computer chess that a wider search is more accurate, when using fixed depth. It is also slower. There is no way to measure whether it is better or worse at fixed depth, because chess is timed. Ergo, the data at the beginning of this thread shows nothing other than suggesting that the komodo parallel search looks at significantly more nodes for a given depth than a serial search. It doesn't show whether it is better or not, because the time loss from the extra width could cost a ply or two, which is not accounted for. Ergo <zero information>

If this happened, does it not DIRECTLY show that the sequential search could be improved?
First of all: so what? Second: no, it does not.

Multiplexing works.
No, multiplexing does not work. Multiplexing costs time. Try to switch context every other node. You mess up your caches. You mess up everything.

Eh? No context switches whatsoever. Inside the program you just search a move on subtree 1, then a move on subtree 2, and repeat. Same program, no cache issues, no nothing. This is how I debugged much of my early parallel search when the parallel machines were only rarely available for testing (Cray XMP serial number 1, for example).

Multiplexing is a theoretical way of showing that an O(f(n)) algorithm using N cores implies an O(f(n))-algorithm using 1 core. The O() consumes an arbitrary constant here.

More importantly: nobody is saying that Komodo's 4-core search is better than a 4x faster 1-core search. Where did you get that from?

Did you bother to read the first post??? I quote:

So, time to depth is an incorrect way of calculating Komodo's MP efficiency.

SMP efficiency is ALWAYS defined as "time (1cpu) / time (Ncpus)"

ALWAYS.

I can provide citations if you want. The book I use in my parallel programming course this summer has it in chapter 2.

So the data does NOT show anything about "SMP efficiency" whatsoever. It doesn't show anything about Elo since it is not testing Elo correctly (one program has a significant time advantage over the other which makes the comparison invalid).

So, WHAT does it actually show? Not a thing I can see other than Komodo is doing something different and that searching a bigger tree is better if time is not measured. Not anything new there I can see.

syzygy · Post by **syzygy** » Mon Jun 24, 2013 11:01 am

bob wrote:
syzygy wrote:
bob wrote:1. Kai's data does NOT show anything relative to parallel search Elo gain.
Did I write that it did? Let's see:
syzygy wrote:I think the problem is that you have not bothered to read what this thread is about.

What Kai showed is ONLY that Komodo's SMP behaviour is different from SMP behaviour of other engines. This does not mean that Komodo's SMP implementation is any good or any bad. It does mean that it is different.
So no, I did not write that it did.

2. There has been ZERO evidence to show that such a "wider search" is stronger.
Hasn't there been? Do you realise that the only reason that we assume Komodo's search is wider is that Kai's experiment has shown Komodo's 4-core search to be stronger than its 1-core search at the same depth?
It should be intuitively obvious to anyone familiar with computer chess that a wider search is more accurate, when using fixed depth. It is also slower. There is no way to measure whether it is better or worse at fixed depth, because chess is timed. Ergo, the data at the beginning of this thread shows nothing other than suggesting that the komodo parallel search looks at significantly more nodes for a given depth than a serial search. It doesn't show whether it is better or not, because the time loss from the extra width could cost a ply or two, which is not accounted for.

This is what I wrote. So we agree. That Komodo's SMP implementation is effective in terms of playing strength / elo gain per core follows from other threads.

Ergo <zero information>

Nope. It does show that Komodo's SMP implementation is different. Komodo's 4-core search to a particular (reported) depth is clearly of higher quality than Komodo's 1-core search to the same (reported) depth.

If Komodo had been Rybka, this would have put the reported depth in doubt. Given that Komodo is not Rybka, it seems very reasonable to assume that the reported depths are accurate.

Eh? No context switches whatsoever. Inside the program you just search a move on subtree 1, then a move on subtree 2, and repeat. Same program, no cache issues, no nothing.

Switches of search context, not the same program. Of course this decreases the effectiveness of caches and brings all kinds of other overhead that work out to a constant > 1. You know this.

bob wrote:
syzygy wrote:More importantly: nobody is saying that Komodo's 4-core search is better than a 4x faster 1-core search. Where did you get that from?
Did you bother to read the first post??? I quote:

So, time to depth is an incorrect way of calculating Komodo's MP efficiency.
SMP efficiency is ALWAYS defined as "time (1cpu) / time (Ncpus)"

ALWAYS.

First: how does that line from the first post contradict what I wrote? It does not.

Second: it should be clear that the majority of posters in this thread consider increased playing strength from using more core THE measure of SMP efficiency. How much stronger is a 4-core search compared to a 1-core search at the same time control.

Kai's test does NOT show that Komodo gains more from 4 cores than other engines at the same time control. Kai's test does NOT show that Komodo's 4-core search is better than a 4x faster 1-core search.

From other threads we know that Komodo's elo gain from going from 1 to 4 cores is at least comparable to the gain in other engines.

Kai's test shows that part of this gain for Komodo does not come from increased depth in the same time, but from a higher quality search at the same depth. Most likely (but Kai's test does not show this), another part of the gain comes from increased depth.

So, WHAT does it actually show? Not a thing I can see other than Komodo is doing something different and that searching a bigger tree is better if time is not measured. Not anything new there I can see.

Combine it with the fact observed in other threads that Komodo's SMP implementation is competitive with that of other engines...

Laskos · Post by **Laskos** » Mon Jun 24, 2013 11:47 am

bob wrote:
Laskos wrote:
bob wrote:
I have not denied anything.

I simply said (a) the data at the beginning of this thread has absolutely no useful information content in it, and says nothing whatsoever about parallel performance of any kind.
The data at the beginning of this thread has the useful information that time-to-depth is not the way to see the effectiveness of the SMP implementation.
How so? Just disable LMR. You get a BIG elo jump at fixed depth. Does that tell us anything we don't already know?

The data at the beginning of this thread simply shows that Komodo does something different. Not whether it is better, worse, or anything... Nothing about the SMP search either. Can't tell whether it is better or worse, because real games are timed.

The data and me was not saying anything about the effectiveness of SMP implementation of Komodo. Other threads have shown that this effectiveness is reasonable. What I am saying is that whatever the effectiveness of SMP in Komodo, time-to-depth is not the way to calculate the effectiveness of Komodo's SMP implementation. Time-to-depth is an artifact applicable to engines which don't show differences to the same depth on 1 to many cores.

Now, isn't it trivial that time-to-depth is not applicable to Komodo? Say you get time-to-depth equal on 4 cores and 1 core, you will say then that the gain from 1 to 4 cores is 0 points? Or 80 points from my test?

If you want to post elo, just play 1 cpu against a gauntlet and then repeat with 4 cpus. Then you get actual Elo. If you want to look at SMP efficiency only, look at time to depth, only, 1 cpu vs 4 cpus.

Suppose that Elo gain came at the expense of running 8x slower? The given test would not show that. So concluding anything at all about Elo given the way that test was run is completely pointless.

(b) for modest numbers of processors, the most significant gain is going to be depth-related...

I'd hope that is clear enough. If someone gets more from widening (as opposed to deepening) the search, that suggests an issue. Nothing more, nothing less. But it is an indicator of an issue.
If Komodo can be modified to one ply less or one ply more without changing the strength, I think the "issue" is only theoretical, practically, with this widening, it could have reasonable, or good, or even the best SMP implementation.
Or bad. The data doesn't show ANYTHING about Elo gain/loss when using 4 cores...

Laskos · Post by **Laskos** » Mon Jun 24, 2013 11:56 am

bob wrote:
SMP efficiency is ALWAYS defined as "time (1cpu) / time (Ncpus)"

ALWAYS.

I can provide citations if you want. The book I use in my parallel programming course this summer has it in chapter 2.

You seem to be confused by "time (1cpu) / time (Ncpus)". The definition "time (1cpu) / time (Ncpus)" is correct in the sense that time ratio needed for 1cpu to get THE STRENGTH of Ncpus is defining the effectiveness of SMP implementation. Not time-to-depth, time-to-depth in general is useless.

Joerg Oster · Post by **Joerg Oster** » Mon Jun 24, 2013 3:48 pm

bob wrote: If you want to post elo, just play 1 cpu against a gauntlet and then repeat with 4 cpus. Then you get actual Elo.

OK. Here it is.

Both gauntlets with TC = 15+0.2 sec, 128 MB Hash each, 999 games. Elo computed with Bayeselo.

Gauntlet 1 (Komodo 5.1 MP with 1 Thread)

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 Critter1.6a    13   23   23   333   51%     6   45% 
   2 Komodo-1T       6   13   13   999   51%    -2   42% 
   3 Houdini1.5a     6   23   23   333   50%     6   40% 
   4 Stockfish     -25   23   23   333   45%     6   41%

Gauntlet 2 (Komodo 5.1 MP with 4 Threads)

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 Komodo-4T     107   14   14   999   72%   -36   36% 
   2 Critter1.6a   -29   24   24   333   29%   107   37% 
   3 Houdini1.5a   -30   24   25   333   30%   107   35% 
   4 Stockfish     -49   24   25   333   27%   107   37%

Both gauntlets combined

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 Komodo-4T     112   15   15   999   72%   -30   36% 
   2 Critter1.6a   -19   18   18   666   40%    45   41% 
   3 Komodo-1T     -22   14   14   999   51%   -30   42% 
   4 Houdini1.5a   -23   18   18   666   40%    45   37% 
   5 Stockfish     -48   18   18   666   36%    45   39%

So approximately +130 Elo.
I hope you, and others as well, find this useful.

Don · Post by **Don** » Mon Jun 24, 2013 3:51 pm

Thanks Joerg,

I appreciate that run.

Don

Joerg Oster wrote:
bob wrote: If you want to post elo, just play 1 cpu against a gauntlet and then repeat with 4 cpus. Then you get actual Elo.
OK. Here it is.

Both gauntlets with TC = 15+0.2 sec, 128 MB Hash each, 999 games. Elo computed with Bayeselo.

Gauntlet 1 (Komodo 5.1 MP with 1 Thread)
Code: Select all
Rank Name          Elo    +    - games score oppo. draws 
   1 Critter1.6a    13   23   23   333   51%     6   45% 
   2 Komodo-1T       6   13   13   999   51%    -2   42% 
   3 Houdini1.5a     6   23   23   333   50%     6   40% 
   4 Stockfish     -25   23   23   333   45%     6   41%
Gauntlet 2 (Komodo 5.1 MP with 4 Threads)
Code: Select all
Rank Name          Elo    +    - games score oppo. draws 
   1 Komodo-4T     107   14   14   999   72%   -36   36% 
   2 Critter1.6a   -29   24   24   333   29%   107   37% 
   3 Houdini1.5a   -30   24   25   333   30%   107   35% 
   4 Stockfish     -49   24   25   333   27%   107   37%
Both gauntlets combined
Code: Select all
Rank Name          Elo    +    - games score oppo. draws 
   1 Komodo-4T     112   15   15   999   72%   -30   36% 
   2 Critter1.6a   -19   18   18   666   40%    45   41% 
   3 Komodo-1T     -22   14   14   999   51%   -30   42% 
   4 Houdini1.5a   -23   18   18   666   40%    45   37% 
   5 Stockfish     -48   18   18   666   36%    45   39%
So approximately +130 Elo.
I hope you, and others as well, find this useful.

Laskos · Post by **Laskos** » Mon Jun 24, 2013 8:50 pm

Joerg Oster wrote:
bob wrote: If you want to post elo, just play 1 cpu against a gauntlet and then repeat with 4 cpus. Then you get actual Elo.
OK. Here it is.

Both gauntlets with TC = 15+0.2 sec, 128 MB Hash each, 999 games. Elo computed with Bayeselo.

Gauntlet 1 (Komodo 5.1 MP with 1 Thread)
Code: Select all
Rank Name          Elo    +    - games score oppo. draws 
   1 Critter1.6a    13   23   23   333   51%     6   45% 
   2 Komodo-1T       6   13   13   999   51%    -2   42% 
   3 Houdini1.5a     6   23   23   333   50%     6   40% 
   4 Stockfish     -25   23   23   333   45%     6   41%
Gauntlet 2 (Komodo 5.1 MP with 4 Threads)
Code: Select all
Rank Name          Elo    +    - games score oppo. draws 
   1 Komodo-4T     107   14   14   999   72%   -36   36% 
   2 Critter1.6a   -29   24   24   333   29%   107   37% 
   3 Houdini1.5a   -30   24   25   333   30%   107   35% 
   4 Stockfish     -49   24   25   333   27%   107   37%
Both gauntlets combined
Code: Select all
Rank Name          Elo    +    - games score oppo. draws 
   1 Komodo-4T     112   15   15   999   72%   -30   36% 
   2 Critter1.6a   -19   18   18   666   40%    45   41% 
   3 Komodo-1T     -22   14   14   999   51%   -30   42% 
   4 Houdini1.5a   -23   18   18   666   40%    45   37% 
   5 Stockfish     -48   18   18   666   36%    45   39%
So approximately +130 Elo.
I hope you, and others as well, find this useful.

And this +130 Elos going from 1 to 4 cores compares rather well with 90-110 Elos shown by "typical" engines like Houdini, Critter, Stockfish, Crafty, etc. on CCRL and CEGT 40/4. I don't have right now my 4 core i7 available, but I bet time-to depth of Komodo from 1 to 4 cores is around factor of 1.5 to 2, compared to 3-3.2 of typical engines, and Bob would conclude that SMP efficiency of Komodo is bad. The hard data shows that SMP efficiency of Komodo is very good while having much lower time-to-depth factor.

Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP