Threads test incl. Stockfish 5 and Komodo 8


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

Adam Hair wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points for doubling time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1 + (Ncpus - 1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up plot for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementation. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable (linear) SMP performance, will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well-tested only through 16? Where I have specifically stated that it is also architecturally dependent since multi-core chips have a bottleneck that single-core multi-chip machines do not. NO formula will predict SMP performance with two decimal place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect them to do so.

Those that actually know how to measure speedup would do the following, something you have not done, probably were not aware of needing to do, etc.

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-thread search and measure the NPS. If you add the N NPS values together, it might or might not be equal to N times the NPS from 1.

Now you know the hardware scaling limit. Assume a REALLY good implementation of the hardware so that for 2 you really do get N * number 1, which means raw speed scales perfectly.

3. Run a set of positions to fixed depth using 1 and N threads. For the N thread version, run them several times and average. Divide 1 thread total time by N thread total time. You have an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, and he was complaining that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1,2,4 and 8 processors (no multi-core) and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.
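As a concrete illustration of the three measurements above, a minimal sketch; every number in it is a made-up placeholder, to be replaced with your own measured values:

Code: Select all

#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Step 1: NPS of a single thread with nothing else running (placeholder).
    double nps_single = 2.0e6;

    // Step 2: NPS of each of N simultaneous single-thread searches (placeholders).
    std::vector<double> nps_concurrent = {1.8e6, 1.8e6, 1.7e6, 1.8e6};
    double n = static_cast<double>(nps_concurrent.size());
    double nps_sum = std::accumulate(nps_concurrent.begin(), nps_concurrent.end(), 0.0);
    double hw_scaling = nps_sum / (n * nps_single);   // 1.0 would be perfect raw scaling

    // Step 3: total time to a fixed depth, 1 thread vs. N threads, averaged
    // over several runs of the same positions (placeholder seconds).
    double time_one_thread = 1000.0;
    double time_n_threads  = 290.0;
    double smp_speedup = time_one_thread / time_n_threads;   // the actual SMP speedup

    std::cout << "hardware NPS scaling: " << hw_scaling * 100.0 << "% of ideal\n";
    std::cout << "time-to-depth SMP speedup: " << smp_speedup << "x\n";
    return 0;
}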

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possible, since my search (not the parallel part) has changed a lot with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. And one day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, and an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute that in their head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
On 1) and 3), there is Andreas' data:
http://www.talkchess.com/forum/viewtopic.php?p=570955

As for my data being bogus because I am using Elo: it's completely non-bogus if you consider that doubling in time at that TC means 90-100 Elo points; with 80 it would go superlinear. And it is consistent with other empirical data. And it doesn't matter too much; Crafty will be on top by far on 16 threads anyway. The question is: is the SMP implementation in Crafty so much superior?


Do you believe doubling in time is worth 90-100 Elo for ALL programs? In fact, that seems to be quite a bit higher than the number most have been using for years, namely 50-70.
At the very least it is approximately true for these engines at the OP's time control and computer.
Agreeing on at least 90 points per doubling at this TC on this hardware (Dual AMD Opteron 6376, 2x16 cores) for Komodo 8, it's "easy" for Andreas to check whether Bob's linear formula, which gives an ~85 Elo point gain from 16 to 32 threads, or my log formula, which gives an ~38 Elo point gain, is correct.
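As a quick arithmetic check of the two predictions, assuming exactly 90 Elo per doubling and using the ~1.34 effective speed-up figure quoted further down the thread for the log formula:

Code: Select all

#include <cmath>
#include <iostream>

int main() {
    auto linear = [](double n) { return 1.0 + (n - 1.0) * 0.7; };
    const double elo_per_doubling = 90.0;

    double linear_ratio = linear(32.0) / linear(16.0);   // 22.7 / 11.5 ~ 1.97
    double log_ratio    = 1.34;                          // claimed effective speed-up 16 -> 32

    std::cout << "linear formula: " << elo_per_doubling * std::log2(linear_ratio)
              << " Elo from 16 to 32 threads\n";          // ~88 Elo
    std::cout << "log estimate:   " << elo_per_doubling * std::log2(log_ratio)
              << " Elo from 16 to 32 threads\n";          // ~38 Elo
    return 0;
}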

Bob, let's have a bet, whoever in his prediction is closer, wins it.

Andreas, could you check, in the same conditions, 32 threads vs. 1 thread for Komodo 8? I know it's not easy to run 3,000 games with an engine on 32 threads, but Bob has made some claims here.

Bob stated that on a 64-core Itanium box he got an effective speed-up of 32 with Crafty, without even tuning the engine. That is a VERY high number for an effective speed-up, and I begin to wonder about the exceptionality of Crafty.

I hope Andreas sees this post.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, does it not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs, so no idea. A couple of versions of Crafty did, when I first started on the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did, as we had to simulate testing while waiting on Cray to get the SMP libraries and Fortran compiler changes completed.

The only sensible reason for multiplexing is debugging when you don't have a decent debugger that can handle threads. gdb does this quite nicely today, making multiplexing unnecessary. The multiplexing concept, used to shoot down consistent super-linear speedups and such, is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple and easy-to-understand methodology to debunk the occasional bogus claims that surface here and there.
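For reference, this is the kind of gdb session that makes threaded debugging workable without multiplexing (the binary name and breakpoint location are placeholders):

Code: Select all

gdb ./engine                        # any threaded engine binary (placeholder)
(gdb) run                           # start it; interrupt with Ctrl-C
(gdb) info threads                  # list all search threads
(gdb) thread apply all bt           # backtrace of every thread at once
(gdb) thread 3                      # switch context to one particular thread
(gdb) break search.c:1234 thread 3  # placeholder location; stop only in thread 3
(gdb) set scheduler-locking on      # step one thread while the others stay frozen
(gdb) continue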


bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

syzygy wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
Not really. The "simulate N threads by a single thread" argument firstly does not apply to this case, and secondly does not take into account the loss of efficiency due to simulation. It's just a plain stupid argument.

The more concrete argument regarding what Komodo is likely doing (extending/reducing differently at split nodes) also does not cut it. The efficiency cost involved in forcing Komodo to behave exactly the same at split nodes as in regular nodes may simply be prohibitive. This is not a bug at all.
Only thing stupid here is you. Here's a simple example of multiplexing, from an actual discussion years ago. Someone pointed out that if your move ordering is bad, you can produce a larger tree than necessary, and that a parallel search could consistently produce a super-linear speedup in that case. Which it would. But you could get the SAME speedup (yes, a REAL speedup, as multiplexing is NOT anywhere near as complex as you seem to think) by multiplexing so that the two pseudo-threads work in an interleaved parallel manner, and now the move-ordering advantage the parallel search enjoyed because the second move was better disappears completely. Of course the REAL solution is to improve move ordering if that kind of super-linear speedup happens consistently. It will always happen here and there in chess, but it had better not happen consistently, or the search has a significant bug that could be fixed.
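A back-of-the-envelope toy of that scenario (invented subtree sizes, fixed node slices standing in for the interleaving; a real search would interleave at split points):

Code: Select all

#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
    // Bad move ordering: the move searched first has a huge subtree, the move
    // searched second refutes the position quickly and produces the cutoff.
    const std::int64_t subtree_A = 1000000;   // first move (invented size)
    const std::int64_t subtree_B = 10000;     // second move, gives the cutoff
    const std::int64_t slice     = 1000;      // nodes per multiplexing time slice

    // Sequential search in the given (bad) order: all of A, then B.
    std::int64_t seq_nodes = subtree_A + subtree_B;

    // Two real threads: B's cutoff arrives after subtree_B nodes and A is
    // aborted, so the wall-clock cost is roughly subtree_B node-times.
    std::int64_t parallel_wall = subtree_B;

    // One thread multiplexing A and B in fixed slices: by the time B's cutoff
    // arrives, A has consumed about the same number of nodes as B.
    std::int64_t slices_needed     = (subtree_B + slice - 1) / slice;
    std::int64_t multiplexed_nodes = slices_needed * slice
                                   + std::min(subtree_A, slices_needed * slice);

    std::cout << "sequential nodes:   " << seq_nodes << "\n";
    std::cout << "2-thread wall cost: " << parallel_wall << " ("
              << double(seq_nodes) / double(parallel_wall) << "x apparent speedup)\n";
    std::cout << "multiplexed nodes:  " << multiplexed_nodes << " ("
              << double(seq_nodes) / double(multiplexed_nodes) << "x on ONE core)\n";
    return 0;
}

The "speedup" the two real threads show is recovered almost entirely by the single multiplexed thread, which is the point: it came from the move ordering, not from parallelism.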

Doing the same thing at split and non-split nodes is NOT "prohibitive". It has very little computational cost at all as there are not very many split nodes in a parallel search compared to the size of the tree. Do you understand any of this stuff or do you just jump in for the sake of arguing?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
Adam Hair wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points for doubling time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1 + (Ncpus - 1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up plot for Crafty, Komodo 8 and Stockfish 5]

It seems all top engines are very weak in their SMP implementation. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable (linear) SMP performance, will be the strongest engine around. Good job, Bob.

Congratulations for improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well-tested only through 16? Where I have specifically stated that it is also architecturally dependent since multi-core chips have a bottleneck that single-core multi-chip machines do not. NO formula will predict SMP performance with two decimal place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect them to do so.

Those that actually know how to measure speedup would do the following, something you have not done, probably were not aware of needing to do, etc.

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-thread search and measure the NPS. If you add the N NPS values together, it might or might not be equal to N times the NPS from 1.

Now you know the hardware scaling limit. Assume a REALLY good implementation of the hardware so that for 2 you really do get N * number 1, which means raw speed scales perfectly.

3. Run a set of positions to fixed depth using 1 and N threads. For the N thread version, run them several times and average. Divide 1 thread total time by N thread total time. You have an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, and he was complaining that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1,2,4 and 8 processors (no multi-core) and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possible, since my search (not the parallel part) has changed a lot with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. And one day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, and an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute that in their head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
On 1) and 3), there is Andreas' data:
http://www.talkchess.com/forum/viewtopic.php?p=570955

As for my data being bogus because I am using Elo: it's completely non-bogus if you consider that doubling in time at that TC means 90-100 Elo points; with 80 it would go superlinear. And it is consistent with other empirical data. And it doesn't matter too much; Crafty will be on top by far on 16 threads anyway. The question is: is the SMP implementation in Crafty so much superior?


Do you believe doubling in time is worth 90-100 Elo for ALL programs? In fact, that seems to be quite a bit higher than the number most have been using for years, namely 50-70.
At the very least it is approximately true for these engines at the OP's time control and computer.
Agreeing on at least 90 points per doubling at this TC on this hardware (Dual AMD Opteron 6376, 2x16 cores) for Komodo 8, it's "easy" for Andreas to check whether Bob's linear formula, which gives an ~85 Elo point gain from 16 to 32 threads, or my log formula, which gives an ~38 Elo point gain, is correct.

Bob, let's have a bet, whoever in his prediction is closer, wins it.

Andreas, could you check, in the same conditions, 32 threads vs. 1 thread for Komodo 8? I know it's not easy to run 3,000 games with an engine on 32 threads, but Bob has made some claims here.

Bob stated that on a 64-core Itanium box he got an effective speed-up of 32 with Crafty, without even tuning the engine. That is a VERY high number for an effective speed-up, and I begin to wonder about the exceptionality of Crafty.

I hope Andreas sees this post.
You are so far behind in parallel search topics I don't know where to start. 50% speedup is not "very high". You might want to check some of Cozzie's numbers on large numbers of processors and then report back. I've never considered 50% as acceptable. You can find my 1994-era 16 cpu results with Cray Blitz. And some 32 cpu results on the T932 we used in 1994 as well. And with 32 cpus we were NOT searching less than 16x faster.

Here are the 1990-ish DTS numbers from the C90-16 we used around that time frame:

Code: Select all

          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results
Those are Cray Blitz numbers that I happen to have handy (from the DTS paper on my web page, in fact). I can only report what I have actually measured. But apparently there are a LOT of parallel results floating around from quite a few programs, results which you have no idea about.

Certainly the graph I questioned was wrong, because it shows a super-linear property that is impossible. I.e., NOBODY is going to get a bigger gain going from 8 to 16 than when going from 4 to 8. That's the poster child for "absurd" and is exactly the kind of nonsense that needs debunking when it happens. And before you harp on my linear approximation again, I will continue to remind you that (a) it was always claimed to be a simple approximation that is pretty close, and (b) it was NEVER claimed to extrapolate beyond 32 processors. There's not a parallel search person on the planet who has ever claimed that there was no diminishing return with alpha/beta. I wrote a paper to explain exactly why there is such a problem, in fact. All caused by the alpha/beta algorithm itself. Nothing about the tree grows linearly, most especially the search overhead. It is all exponential, and always has been.
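For what it's worth, here is the linear estimate laid next to those measured Table 3 numbers (my arithmetic):

Code: Select all

#include <cstdio>

int main() {
    const int    procs[] = {1, 2, 4, 8, 16};
    const double dts[]   = {1.0, 2.0, 3.7, 6.6, 11.1};   // measured, Table 3 above

    std::printf("procs  measured  1+(N-1)*0.7\n");
    for (int i = 0; i < 5; ++i)
        std::printf("%5d  %8.1f  %11.1f\n",
                    procs[i], dts[i], 1.0 + (procs[i] - 1) * 0.7);
    return 0;
}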
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, does it not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs, so no idea. A couple of versions of Crafty did, when I first started on the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did, as we had to simulate testing while waiting on Cray to get the SMP libraries and Fortran compiler changes completed.

The only sensible reason for multiplexing is debugging when you don't have a decent debugger that can handle threads. gdb does this quite nicely today, making multiplexing unnecessary. The multiplexing concept, used to shoot down consistent super-linear speedups and such, is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple and easy-to-understand methodology to debunk the occasional bogus claims that surface here and there.
So, the only sensible reason for it is debugging, not production code. Thank you.

Miguel

bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
Xann
Posts: 132
Joined: Sat Jan 22, 2011 7:14 pm
Location: Lille, France
Full name: Fabien Letouzey

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Xann »

bob wrote:Only thing stupid here is you. ...
Just a quick reminder that personal attacks are not tolerated on this forum.

Fabien.
syzygy
Posts: 5728
Joined: Tue Feb 28, 2012 11:56 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by syzygy »

bob wrote:if that kind of super-linear speedup happens consistently.
For the Nth time, nobody is claiming a super-linear speedup. Your whole argument is a non-starter to begin with.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

bob wrote:
Laskos wrote: Agreeing on at least 90 points per doubling at this TC on this hardware (Dual AMD Opteron 6376, 2x16 cores) for Komodo 8, it's "easy" for Andreas to check whether Bob's linear formula, which gives an ~85 Elo point gain from 16 to 32 threads, or my log formula, which gives an ~38 Elo point gain, is correct.

Bob, let's have a bet, whoever in his prediction is closer, wins it.

Andreas, could you check, in the same conditions, 32 threads vs. 1 thread for Komodo 8? I know it's not easy to run 3,000 games with an engine on 32 threads, but Bob has made some claims here.

Bob stated that on a 64-core Itanium box he got an effective speed-up of 32 with Crafty, without even tuning the engine. That is a VERY high number for an effective speed-up, and I begin to wonder about the exceptionality of Crafty.

I hope Andreas sees this post.
You are so far behind in parallel search topics I don't know where to start. 50% speedup is not "very high". You might want to check some of Cozzie's numbers on large numbers of processors and then report back. I've never considered 50% as acceptable. You can find my 1994-era 16 cpu results with Cray Blitz. And some 32 cpu results on the T932 we used in 1994 as well. And with 32 cpus we were NOT searching less than 16x faster.

Here are the 1990-ish DTS numbers from the C90-16 we used around that time frame:

Code: Select all

          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results
Those are Cray Blitz numbers that I happen to have handy (from the DTS paper on my web page, in fact). I can only report what I have actually measured. But apparently there are a LOT of parallel results floating around from quite a few programs, results which you have no idea about.

Certainly the graph I questioned was wrong, because it shows a super-linear property that is impossible. I.e., NOBODY is going to get a bigger gain going from 8 to 16 than when going from 4 to 8. That's the poster child for "absurd" and is exactly the kind of nonsense that needs debunking when it happens. And before you harp on my linear approximation again, I will continue to remind you that (a) it was always claimed to be a simple approximation that is pretty close, and (b) it was NEVER claimed to extrapolate beyond 32 processors. There's not a parallel search person on the planet who has ever claimed that there was no diminishing return with alpha/beta. I wrote a paper to explain exactly why there is such a problem, in fact. All caused by the alpha/beta algorithm itself. Nothing about the tree grows linearly, most especially the search overhead. It is all exponential, and always has been.
I was not talking about DTS and that sort of machine architecture. Just Komodo's gain from 16 to 32 cores on Andreas' machine, at that time control. Also, are you sure you are not conflating NPS speed-up with effective speed-up? Huge NPS speed-ups I have seen all around; huge effective speed-ups, never. You claim a ~1.97 effective speed-up from 16 to 32 cores, which is about an 85-90 Elo gain; I am claiming a ~1.34 effective speed-up, which is about a 38-40 Elo gain, on Andreas' machine at that time control. If only Andreas would bother to test Komodo 32 threads versus Komodo 1 thread in 3,000 games. Or directly Komodo 32 threads versus Komodo 16 threads, to minimize the Elo error.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Adam Hair »

bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, does it not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs, so no idea. A couple of versions of Crafty did, when I first started on the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did, as we had to simulate testing while waiting on Cray to get the SMP libraries and Fortran compiler changes completed.
I asked the question because you stated that "more than one already has this capability".
The only sensible reason for multiplexing is debugging when you don't have a decent debugger that can handle threads. gdb does this quite nicely today, making multiplexing unnecessary. The multiplexing concept, used to shoot down consistent super-linear speedups and such, is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple and easy-to-understand methodology to debunk the occasional bogus claims that surface here and there.


bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
x3
Posts: 13
Joined: Mon Sep 08, 2014 6:54 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by x3 »

SF5 8-16 cores: +37 Elo, 3200 games
K8 8-16 cores: +45 Elo, 1600 games
SF5 16-24 cores: +7 Elo, 1600 games
K8 16-24 cores: +18 Elo, 1480 games

3m+1s
8-move repeating book
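Converting those gains into effective speed-ups with the 90-Elo-per-doubling figure assumed earlier in the thread (my arithmetic, and only as good as that assumption at this longer TC):

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    const char*  runs[] = {"SF5  8->16", "K8   8->16", "SF5 16->24", "K8  16->24"};
    const double elo[]  = {37.0, 45.0, 7.0, 18.0};
    const double elo_per_doubling = 90.0;

    for (int i = 0; i < 4; ++i)
        std::printf("%s cores: +%2.0f Elo  ~  %.2fx effective speed-up\n",
                    runs[i], elo[i], std::pow(2.0, elo[i] / elo_per_doubling));
    return 0;
}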

x3