Threads test incl. Stockfish 5 and Komodo 8

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

bob wrote:
Where do "I claim 1.9x speedup from 16 cores to 32 cores?" Never have, never will. Unless you talk about my simple approximation, which has NEVER been a claim of anything, just a simple estimate. You misuse the word "claim" badly here. I claim that is just a rough estimate. and I've made the claim much weaker for 16-32 and beyond as well.
So, you say that formula doesn't apply anywhere near 16 or 32 threads. Couldn't you just remember 3 numbers for doublings? "Approximation" a la Hyatt.

Now guy, you said that you got x32 effective speedup (time to strength) on 64 core Itanium box. Then smirked when I said it IS very high effective speed-up, with you citing some papers from 80es for DTS on completely different architecture. You know what x32 _effective_speed_up means in terms of _average_ effective speed-up for doubling time to 64 threads?

1.78, per each doubling, 6 doublings.

That is VERY HIGH effective speed-up for 6 doublings. Either hardware is very different or Crafty is exceptionally, almost linearly scaling engine. What V. Rajlich reported years ago for effective speed-up of Rybka was a series of doublings to 16 cores looking like that:
1.8*1.7*1.6*1.5
The total is 7.3, already below half of 16 cores used. That's why I consider x32 for Crafty on 64 core box very high.

Would you stand for a bet:

If Andreas bothers to run 32 cores test on his Dual Opteron, I claim Komodo will gain 1.34 effective speed-up, or 38 Elo points, with log model, your claim is that it will gain 1.78 in effective speed-up from 16 to 32 threads (from your Itanium 64 core tests with claimed x32 effective speed-up), or 75 Elo points on that hardware at Andreas' 60''+0.05'' time control. Whoever is closer in numbers wins the bet, whoever loses has to post here "I AM AN IDIOT".

I may open a plea thread to ask Andreas to perform such a test, either Komodo 32 threads against Komodo 1 thread, or even better, Komodo 32 threads against Komodo 16 threads.

Agreed on such a bet?
syzygy
Posts: 5722
Joined: Tue Feb 28, 2012 11:56 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by syzygy »

Laskos wrote:Then smirked when I said it IS very high effective speed-up, with you citing some papers from 80es for DTS on completely different architecture.
Those numbers were based on 24 positions searched up to 6 ply or so, so have to be taken cum grano salis. (Not to mention that the node counts listed are not real but reconstructed.)
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

syzygy wrote:
Laskos wrote:Then smirked when I said it IS very high effective speed-up, with you citing some papers from 80es for DTS on completely different architecture.
Those numbers were based on 24 positions searched up to 6 ply or so, so have to be taken cum grano salis. (Not to mention that the node counts listed are not real but reconstructed.)
Ok.
I have a feeling Bob will not accept the bet, and will divagate on superlinear and papers on DTS from 80es for Cray.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Adam Hair »

Since Andreas has shared his results for Crafty, I will add Crafty's plot to my graph. Again, this is a plot of the Wilo (as opposed to Elo) differences from 1 core for each engine as the number of cores increase. This allows us to compare each engine's SMP performance on roughly the same scale.

Image
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Laskos »

Adam Hair wrote:Since Andreas has shared his results for Crafty, I will add Crafty's plot to my graph. Again, this is a plot of the Wilo (as opposed to Elo) differences from 1 core for each engine as the number of cores increase. This allows us to compare each engine's SMP performance on roughly the same scale.

Image
Wilos will revolutionize chess engine ratings. They get rid of strength differences, the time control and hardware.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
In paper. Only in paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" in single core is silly. Even in paper the "fix" may not give any improvement as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling sustainably better than others. Whether widening the search is a cause or correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in stockfish, where they do a "pseudo-paralle-search" to debug, by doing the usual multiplexing operation.
Never heard of this and it does not make sense for debugging. You lose all the randomness. Maybe a SF developer can clarify, but I do not believe it. Even if this exists, does not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers: this directly from stockfish search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs so no idea. A couple of versions of Crafty did, when I first started the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did as we had to simulate testing while waiting on Cray to get the SMP libraries and fortran compiler changes completed.

The only sensible reason for multiplexing is debugging when you don't have a decent debugger than can handle threads. gdb does this quite nicely today making multiplexing unnecessary. The multiplexing concept, to shoot down consistent super-linear speedups and such is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple and easy to understand methodology to debunk the occasional bogus claims that surface here and there.
So, the only sensible reason for it is debugging, not production code. Thank you.

Miguel

bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
Yes. If it improves your program, you have something to work on. It is simply a way of explaining why some things simply do not happen as claimed. Super-linear parallel speedups have a specific cause - poor move ordering. That can be solved by multiplexing, but there are much better solutions (actually improving the order rather than searching the moves out of order to solve the issue).
right, the only sensible reason for it is debugging or explaining things, not production code.

If a "wider" tree helps a parallel searcher, that same wider tree would also help the non-parallel searcher. multiplexing would show the same gain, but it is not the best solution. It just illustrates the problem. Which in this case would be excessive pruning that could be toned down.
Not if the widening takes advantage of parallel inefficiencies of the "non-widening" approach. For instance, you search extra nodes instead of remaining idle in certain threads.

Miguel
PS: You keep mentioning super linear speed up like it has anything to do with this.
Super-linear speedup is in the same family of nonsense as widening.

If you search extra nodes, the tree grows. It would be BETTER to simply avoid searching those and drive the search deeper. If searching those extra nodes helps performance, that suggests a problem that can be fixed.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

lkaufman wrote:It seems to me that you are saying that the optimum LMR reduction schedule for a single core engine should also be the optimum for a 16 core (for example) SMP version. Do you believe this to be true, and if so can you prove it? If the 16 core version got a 16 to one speedup that would seem to follow, but if it only gets a ten to one speedup (for example), I don't see why it would have to be true. Note that I am not saying that this is what Komodo does.
What if you just get a single-cpu that is 16 x faster. Would you continue with the current search approach, or would you back off on the reduction stuff and search a tree that is less deep and more bushy? I can't imagine anyone saying "yes."

Given that, why would you do the SAME thing with a parallel search? The proof has been discussed at length. If you believe that searching EXTRA nodes helps, then one can do that in a sequential search just as easily and it should help there as well. The parallel search does not offer ANYTHING new except for a faster search. Now if you can't write a parallel search that runs faster on 16 cores, then yes, I suppose one might give up and just write a search that looks at a 16x larger tree and hope for fewer reduction/pruning errors. But I would expect most programs to be well-tuned regarding reductions and pruning, leaving going deeper as the most logical path for the parallel search.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

mvk wrote:
Adam Hair wrote:What other programs are capable of multiplexing?
The simplest way to multiplex on a single cpu is to launch more than one thread or process. The OS then does the multiplexing for you by means of preemptive task-switching.
Works perfectly fine, but not with debugging in mind, where you don't have a good thread-aware debugger, as I mentioned. I've not done any multiplexing since the first 2-3 months of parallel search in Crafty, because by that time, parallel machines were pretty common and gdb worked reasonably well (and even better today).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

syzygy wrote:
mvk wrote:
Adam Hair wrote:What other programs are capable of multiplexing?
The simplest way to multiplex on a single cpu is to launch more than one thread or process. The OS then does the multiplexing for you by means of preemptive task-switching.
True, this would be an easy way to do a very coarse-grained multiplexing. I was thinking of fine-grained multiplexing to simulate what is really happening in a multithreaded search. Note that Bob has in the past claimed that one could multiplex between two trees one node at a time resulting in the identical tree the parallel search would grow.

The SF4 FakeSplit debugging option is not an example of multiplexing. It merely uses SF's smp data structures to split a node, then anyway searches all moves serially and in sequence. I guess it may be somewhat useful for debugging on a system that lacks a debugger capable of handling multiple threads.

Btw, note how Bob now keeps clinging to "super-linear speedup = bug" when nobody has made a claim of super-linear speedup for Komodo or any other "widening" smp implementation.
I'm not "clinging" to anything. super-linear speedup and effective "widening" are just two mythical ideas that point out a flaw in the basic search algorithm, not really related to the parallel search at all.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
In paper. Only in paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" in single core is silly. Even in paper the "fix" may not give any improvement as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling sustainably better than others. Whether widening the search is a cause or correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in stockfish, where they do a "pseudo-paralle-search" to debug, by doing the usual multiplexing operation.
Never heard of this and it does not make sense for debugging. You lose all the randomness. Maybe a SF developer can clarify, but I do not believe it. Even if this exists, does not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers: this directly from stockfish search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs so no idea. A couple of versions of Crafty did, when I first started the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did as we had to simulate testing while waiting on Cray to get the SMP libraries and fortran compiler changes completed.

The only sensible reason for multiplexing is debugging when you don't have a decent debugger than can handle threads. gdb does this quite nicely today making multiplexing unnecessary. The multiplexing concept, to shoot down consistent super-linear speedups and such is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple and easy to understand methodology to debunk the occasional bogus claims that surface here and there.
So, the only sensible reason for it is debugging, not production code. Thank you.

Miguel

bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
Yes. If it improves your program, you have something to work on. It is simply a way of explaining why some things simply do not happen as claimed. Super-linear parallel speedups have a specific cause - poor move ordering. That can be solved by multiplexing, but there are much better solutions (actually improving the order rather than searching the moves out of order to solve the issue).
right, the only sensible reason for it is debugging or explaining things, not production code.

If a "wider" tree helps a parallel searcher, that same wider tree would also help the non-parallel searcher. multiplexing would show the same gain, but it is not the best solution. It just illustrates the problem. Which in this case would be excessive pruning that could be toned down.
Not if the widening takes advantage of parallel inefficiencies of the "non-widening" approach. For instance, you search extra nodes instead of remaining idle in certain threads.

Miguel
PS: You keep mentioning super linear speed up like it has anything to do with this.
Super-linear speedup is in the same family of nonsense as widening.
Not at all.

If you search extra nodes, the tree grows. It would be BETTER to simply avoid searching those and drive the search deeper.
It would be better, if you can.

If searching those extra nodes helps performance, that suggests a problem that can be fixed.
Not if you got those nodes for free. They may be "free" because they were searched while some threads were idle. In other words, if you get those extra nodes by increasing nps.

Miguel