SMP at fixed depth revisited


SMP at fixed depth revisited

Post by Laskos »

Using cutechess-cli with an SPRT stopping rule (alpha = beta = 0.05), I had several well-known engines play themselves at fixed depth, 4 threads versus 1 thread. There are clearly two categories of engines: non-widening engines, for which time-to-depth (TTD) is a good measure of SMP efficiency, and widening engines, for which SMP efficiency is harder to measure. H0 here is always an Elo difference of 0; H1 is some specified Elo difference.
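
For reference, the setup can be reproduced with a cutechess-cli command along these lines; the engine paths, the fixed depth value, and the opening book are placeholders, since they are not given here:

cutechess-cli \
    -engine cmd=./engine name=4threads option.Threads=4 \
    -engine cmd=./engine name=1thread option.Threads=1 \
    -each proto=uci tc=inf depth=12 \
    -openings file=book.pgn format=pgn order=random \
    -games 2 -rounds 5000 -repeat \
    -sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 \
    -pgnout smp_fixed_depth.pgn

The -sprt option stops the match as soon as one of the hypotheses is accepted, which is why the game counts below differ so much from engine to engine.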


Not widening, TTD is the criterion:

SF 11042014:
H0=0
H1=10
Score of SF 4 threads vs SF 1 thread: 437 - 442 - 1455 [0.499] 2334
ELO difference: -1
SPRT: H0 was accepted
Finished match

Houdini 4
H0=0
H1=10
Score of H4 4 threads vs H4 1 thread: 533 - 549 - 2310 [0.498] 3392
ELO difference: -2
SPRT: H0 was accepted
Finished match

Crafty 23.8
H0=0
H1=10
Score of Crafty 4 threads vs Crafty 1 thread: 1004 - 1013 - 1484 [0.499] 3501
ELO difference: -1
SPRT: H0 was accepted
Finished match

Critter 1.6
H0=0
H1=10
Score of Critter 4 threads vs Critter 1 thread: 614 - 633 - 1164 [0.496] 2411
ELO difference: -3
SPRT: H0 was accepted
Finished match

Gull 2.8
H0=0
H1=10
Score of Gull 4 threads vs Gull 1 thread: 229 - 262 - 565 [0.484] 1056
ELO difference: -11
SPRT: H0 was accepted
Finished match


Widening, TTD is wrong to use:

Komodo TCEC
H0=0
H1=30
Score of Komodo 4 threads vs Komodo 1 thread: 83 - 54 - 134 [0.554] 271
ELO difference: 37
SPRT: H1 was accepted
Finished match

Rybka 4
H0=0
H1=20
Score of Rybka 4 threads vs Rybka 1 thread: 225 - 176 - 417 [0.530] 818
ELO difference: 21
SPRT: H1 was accepted
Finished match

Hannibal 1.4
H0=0
H1=30
Score of Hannibal 4 threads vs Hannibal 1 thread: 138 - 100 - 171 [0.546] 409
ELO difference: 32
SPRT: H1 was accepted
Finished match

Junior 13
H0=0
H1=20
Score of Junior 4 threads vs Junior 1 thread: 74 - 40 - 94 [0.582] 208
ELO difference: 57
SPRT: H1 was accepted
Finished match

Re: SMP at fixed depth revisited

Post by syzygy »

Interesting. I was actually wondering some days ago if more recent versions of Komodo (than the one you tested earlier) still exhibited this behaviour. One theory was that widening is mainly the result of "limitations" in the smp implementation. (E.g., it might be easier to skip some reductions at split nodes.) But if later versions still do it...

Re: SMP at fixed depth revisited

Post by Laskos »

syzygy wrote: Interesting. I was actually wondering some days ago if more recent versions of Komodo (than the one you tested earlier) still exhibited this behaviour. One theory was that widening is mainly the result of "limitations" in the smp implementation. (E.g., it might be easier to skip some reductions at split nodes.) But if later versions still do it...
Yes, Komodo's fattening is pretty much the same. I was surprised to see Junior fattening, because IIRC it was scaling from 1 to 4 cores very well, among the best on the CCRL list. So an SMP implementation with widening can be competitive.

Another thing to note is that SMP efficiency, or effective speed-up, is not a constant: it depends on time or depth, and it increases with them. So assigning a single number, say 3.1 on 4 cores, to the effective speed-up is wrong; it can be 2.8 at depth 16 and 3.3 at depth 25. This was observed by Miguel, IIRC.
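
A minimal sketch of reporting the effective speed-up per depth instead of as a single figure; the data layout (seconds-to-depth per position, collected separately for 1 and 4 threads) is assumed:

# ttd_1thr / ttd_4thr: dict mapping depth -> list of seconds-to-depth
# over many positions, for the 1-thread and 4-thread searches.
def speedup_by_depth(ttd_1thr, ttd_4thr):
    speedups = {}
    for depth in sorted(ttd_1thr):
        t1 = sum(ttd_1thr[depth]) / len(ttd_1thr[depth])
        t4 = sum(ttd_4thr[depth]) / len(ttd_4thr[depth])
        speedups[depth] = t1 / t4   # > 1.0 means 4 threads reach this depth faster
    return speedups

# A result like {16: 2.8, 20: 3.0, 25: 3.3} is exactly the depth dependence
# that a single "3.1 on 4 cores" figure would hide.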

Re: SMP at fixed depth revisited

Post by bob »

Laskos wrote:
syzygy wrote: [...]
Yes, Komodo's fattening is pretty much the same. I was surprised to see Junior fattening, because IIRC it was scaling from 1 to 4 cores very well, among the best on the CCRL list. So an SMP implementation with widening can be competitive.

Another thing to note is that SMP efficiency, or effective speed-up, is not a constant: it depends on time or depth, and it increases with them. So assigning a single number, say 3.1 on 4 cores, to the effective speed-up is wrong; it can be 2.8 at depth 16 and 3.3 at depth 25. This was observed by Miguel, IIRC.
There are a dozen things about this topic I don't like. And the test can give some badly skewed results.

One simple example from my Ph.D. dissertation is to order moves worst-first. Now the search to a fixed depth takes much longer, but due to hash grafting you actually get slightly more accurate scores. In real terms, though, it is clearly weaker. So if a parallel search has somewhat worse move ordering, it will play stronger when searching to a fixed depth.

The reason for this is well known from several of us studying Fine #70 way back. That position requires 26 non-capture moves before a pawn is won, i.e. a 26-ply search. Yet most programs solve it at depths less than 26. How?

Suppose you search best moves first for white but worst moves first for black. Black puts up little resistance, you win a pawn much more quickly, and you save everything in the TT. Now you search better moves for black and find you can't directly win a pawn, but you do discover that you can forcibly reach a position where the ttable (filled when move ordering was worse) says you won a pawn. So you know you can win a pawn without the full 26-ply non-capture search, because some of those 26 plies come from the ttable.
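
A toy transposition-table probe shows the grafting mechanism; the entry layout and replacement rule are generic illustrations, not any particular engine's code. A score stored with a large draft during the quickly won, badly ordered lines can later satisfy a node reached by a completely different move order, so the effective search is deeper than the nominal fixed depth:

from collections import namedtuple

TTEntry = namedtuple("TTEntry", "draft score bound")    # draft = depth searched below this node

tt = {}

def tt_probe(key, remaining_depth, alpha, beta):
    e = tt.get(key)
    if e is None or e.draft < remaining_depth:
        return None                                      # nothing deep enough stored
    # The stored score came from a search at least as deep as we need now,
    # no matter which move order originally produced it: this is the graft.
    if e.bound == "exact":
        return e.score
    if e.bound == "lower" and e.score >= beta:
        return e.score
    if e.bound == "upper" and e.score <= alpha:
        return e.score
    return None

def tt_store(key, draft, score, bound):
    old = tt.get(key)
    if old is None or draft >= old.draft:                # simple depth-preferred replacement
        tt[key] = TTEntry(draft, score, bound)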

That is just one flaw.

A program might prune poorly at a split point out of the programmer's laziness, because pruning becomes more complicated when different threads are searching different moves, particularly with any sort of "late move" criterion. So you search a larger tree and perhaps, by pure luck, no longer prune a critical move. You search a larger tree, but time doesn't matter at fixed depth, so the larger tree is better. The simplest illustration is to just turn off LMR. Since time is not a factor, the non-LMR search will beat the regular search consistently. LMR is a trade-off: you lose some accuracy, but you gain some depth. If you throw out the gain (depth) with a fixed-depth search, then avoiding the lost accuracy looks better.
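
A bare-bones negamax sketch with a switchable LMR step makes the trade-off concrete; the move ordering, evaluation and reduction rule are generic placeholders, not any engine's actual code:

INF = 10**9
LMR_ENABLED = True            # set to False for the "wider" fixed-depth search

def search(pos, depth, alpha, beta):
    if depth <= 0:
        return evaluate(pos)                        # hypothetical static/quiescence eval
    best = -INF
    for i, move in enumerate(ordered_moves(pos)):   # hypothetical move ordering
        child = pos.make(move)                      # hypothetical make-move
        # Late moves get a reduced depth: a deeper-but-narrower tree overall.
        reduction = 1 if (LMR_ENABLED and i >= 4 and depth >= 3) else 0
        score = -search(child, depth - 1 - reduction, -beta, -alpha)
        # Standard re-search at full depth when the reduced search beats alpha.
        if reduction and score > alpha:
            score = -search(child, depth - 1, -beta, -alpha)
        if score > best:
            best = score
        if best > alpha:
            alpha = best
        if alpha >= beta:
            break                                   # beta cutoff
    return best

# At a fixed nominal depth, LMR_ENABLED=False searches more of the tree and
# pays only in time, which a fixed-depth test never charges for.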

My intuition says that if a program plays better by going wider (with a parallel search) rather than by going deeper, the SAME should apply to the non-parallel search as well. Perhaps it prunes or reduces excessively and that should be tuned. I'm not a believer in this "wider" stuff for that reason. Unless you just don't believe you can do a decent parallel search and produce a reasonable speedup, I don't see why going wider would be better.

Just because some do so does NOT make them correct or intelligent. Widening the non-parallel search ought to produce the same improvement. This is related to the super-linear parallel speedup issue: it happens, but only by accident.

Re: SMP at fixed depth revisited

Post by michiguel »

bob wrote: [...] My intuition says that if a program plays better by going wider (with a parallel search) rather than by going deeper, the SAME should apply to the non-parallel search as well. Perhaps it prunes or reduces excessively and that should be tuned. I'm not a believer in this "wider" stuff for that reason. [...] Widening the non-parallel search ought to produce the same improvement. This is related to the super-linear parallel speedup issue: it happens, but only by accident.
Kai did not say it is better, he said it happens. They seem at least comparable, though.

Many times in a search, going deeper or wider could be the same in terms of Elo (for a single core). For instance, you change a pruning parameter and it gives you no improvement or decrease in Elo: one setting is wider and the other is deeper. But one may parallelize or scale better.

Miguel

Re: SMP at fixed depth revisited

Post by bob »

michiguel wrote: Kai did not say it is better, he said it happens. They seem at least comparable, though.

Many times in a search, going deeper or wider could be the same in terms of Elo (for a single core). For instance, you change a pruning parameter and it gives you no improvement or decrease in Elo: one setting is wider and the other is deeper. But one may parallelize or scale better.
My point was simply that "wider = better" is almost guaranteed to be accidental. Why would one scale better than the other, assuming a reasonable parallel search implementation? I don't see a significant difference between Crafty's speedup today and Crafty's speedup 20 years ago, yet today's trees are MUCH narrower and deeper.

The only time I would buy this argument is when someone actually runs up against some sort of impenetrable barrier in terms of SMP speedup. There the extra CPUs COULD reduce the risk by going wider. But with 4-8-12-16 cores? That doesn't seem likely to me at all.

There are other interpretations of such data. For example, perhaps the SMP version is actually performing WORSE than the author intends, but the extra width accidentally helps enough to offset part of that poor performance. The basic parallel search is all about going deeper. YBW addresses that and nothing but that.

My simple claim would be this. For a small number N of processors (16 being small here), a program that shows this kind of performance can most likely be better tuned in the one-processor form to obtain a similar Elo gain. And once that is done, the parallel "widening" will be a loss, not a gain, as we would expect.

I would also like to see far more games played. Even 1,000 games has a very large variance. If you run a batch of such tests, I would expect some to look better just due to the randomness parallel search introduces.
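
As a back-of-the-envelope check, assuming a 50% score and roughly 50% draws (both assumptions, just to get the scale):

import math

def elo_stderr(n_games, score=0.5, draw_ratio=0.5):
    # Variance of a single game result scored 0 / 0.5 / 1.
    win_ratio = score - draw_ratio / 2
    var = win_ratio + draw_ratio / 4 - score ** 2
    se_score = math.sqrt(var / n_games)
    # Convert the score error to Elo via the slope of the logistic curve at this score.
    d_elo_d_score = 400.0 / (math.log(10) * score * (1.0 - score))
    return se_score * d_elo_d_score

print(round(elo_stderr(1000), 1))    # about 7.8, i.e. roughly +/- 15 Elo at 95% confidence

So 1,000 games resolves differences of only 15 Elo or so, and separating a few Elo from zero takes several thousand.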

In any case, I can't imagine someone actually working on this approach actively until they reach the point where more processors gain absolutely nothing more in speedup. I've been doing some testing (not online) on a 24-core box; 24 certainly does not reach that threshold for my program, at least.

Re: SMP at fixed depth revisited

Post by Laskos »

bob wrote:
My point was simply that "wider = better" is almost guaranteed to be accidental.
That's almost surely so, and I was surprised to get widening with Junior, given its apparently good scaling from 1 to 4 cores in the rating lists.

Take LMR. Say it wrecks the rating at fixed depth by 200 points but gives a rating boost of 50 points at fixed _time_. In this way one can vary the fixed-_depth_ rating widely without disturbing the fixed-time rating much. That is, "widening" might be a "limitation" of the SMP implementation, as Ronald said, one which doesn't harm much, maybe some 5 points at fixed time for a widening of 30 points at fixed depth. And this "limited" implementation can still be competitive.

Re: SMP at fixed depth revisited

Post by syzygy »

Laskos wrote: That is, "widening" might be a "limitation" of the SMP implementation, as Ronald said, one which doesn't harm much, maybe some 5 points at fixed time for a widening of 30 points at fixed depth. And this "limited" implementation can still be competitive.
It may be a limitation, it may be intentional; I won't claim to know why certain engines do what they do. After all, I don't even know what they do. But clearly Komodo is competitive, and it might even improve more than other top engines when going to 16 cores (impossible to be sure, but there is certainly nothing pointing to the contrary).

Re: SMP at fixed depth revisited

Post by Laskos »

On a side note related to SMP, I tested the usefulness of hyperthreading for one recent Stockfish on 4,800 neutral positions (1,200 repeated 4 times), assuming that time-to-depth (TTD) is the right way to measure the SMP efficiency of SF (which seems to be a correct assumption in light of the OP results):

[Chart: total time-to-depth, in minutes, for 4 to 8 threads]

4 cores, HT on (8 logical cores), 2 GB hash.
It seems HT helps TTD by about 8% going from 4 to 8 threads on 4 physical cores, and by 7% going to 6 threads. The standard deviation is about 0.3 minutes (or 0.6%). HT benefits at different numbers of threads can only be tested on literally thousands of positions; otherwise the variance is too high.
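
A sketch of the aggregation behind those numbers, assuming a hypothetical results file with one "threads,position,seconds" line per search (the file name and layout are made up):

from collections import defaultdict

totals = defaultdict(float)                  # threads -> summed time-to-depth in seconds
with open("ttd_results.csv") as f:
    for line in f:
        threads, _position, seconds = line.strip().split(",")
        totals[int(threads)] += float(seconds)

base = totals[4]                             # 4 threads on 4 physical cores as the baseline
for t in sorted(totals):
    gain = 100.0 * (base / totals[t] - 1.0)
    print(f"{t} threads: {totals[t] / 60:.1f} min, {gain:+.1f}% TTD gain vs 4 threads")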

Re: SMP at fixed depth revisited

Post by bob »

The first thing that jumps out at me is the inconsistency. Worse with 7 threads than with 6? That tells me more positions or more time per position is needed. For all of my SMP results I generally use at LEAST one minute per position, and that is always with max cores, meaning fewer cores are going to run a LOT longer.

I have only seen one machine where HT helps Crafty, my MacBook, because the process scheduler has no idea that when I run two threads they ought to run on two different physical cores. If I watch what is happening, I see 1-2 busy, 1-3 busy, 1-4 busy, 2-3 busy, 2-4 busy, or 3-4 busy; however, half of those combinations use only one physical core. With 4 threads I at least get to use both cores fully.

Not exactly the best justification for using hyper threading...
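
For what it's worth, a minimal Linux-only sketch of forcing the placement described above, pinning an engine process to one logical CPU per physical core; the CPU numbering is an assumption, the real mapping has to be read from lscpu or /proc/cpuinfo:

import os

# Assumed topology: logical CPUs 0-3 sit on four distinct physical cores,
# with their hyperthread siblings on 4-7. Pinning to 0-3 keeps two search
# threads from ever sharing a physical core.
engine_pid = 12345                                  # hypothetical PID of the running engine
os.sched_setaffinity(engine_pid, {0, 1, 2, 3})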