Measure of SMP scalability

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Measure of SMP scalability

Post by michiguel »

AlvaroBegue wrote:
michiguel wrote: The question of how much faster the 1x engine should be, must be answered with an extrapolation from at least two points, one faster and one slower than what is required.
The word you are looking for is "interpolation".
Correct. When I was writing, I was thinking that it should not be an extrapolation but an interpolation, and I typed the wrong word. Thanks for catching it!

Miguel
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

Sven Schüle wrote:Would it help to define something like an "effective SMP scaling factor" as follows:

- Start with single-CPU version of a candidate engine, play games against a fixed set of opponents with a given time control and obtain an ELO rating.
- Repeat with N CPUs for the candidate engine but with unchanged number of CPUs for the opponents, and measure the ELO improvement X.
- Now find out how much faster the 1 CPU version would need to be for the same ELO improvement X. (You would usually give more time to the candidate engine than to its opponents to simulate that.) The speed factor you need to reach X is M.
- Then M/N would be the "effective SMP scaling factor" of the candidate engine for N CPUs.

Does that make sense to anyone?

Sven
Yes, that's a good measure, but hard to apply in practice. Games need to be at some normal time control, say 40/4', and getting enough precision out of this is a pain. One cannot do that with ultra-bullet games because of the various scalings with time at ultra-short TC.
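Sven's procedure above can be sketched numerically. The only modeling assumption needed is an Elo-per-doubling value at the chosen time control (the 70 used below is illustrative, not a measured constant): the Elo gain X maps to an equivalent speed factor M = 2^(X/elo_per_doubling), and M/N is the effective SMP scaling factor.

```python
import math

def effective_smp_factor(elo_gain, n_cpus, elo_per_doubling=70.0):
    """Convert an Elo gain measured with n_cpus into the equivalent
    single-CPU speed factor M, then normalize by the CPU count.
    elo_per_doubling is an assumption and depends on the time control."""
    m = 2.0 ** (elo_gain / elo_per_doubling)  # speed factor matching the Elo gain
    return m / n_cpus

# Example: +70 Elo from 4 CPUs at 70 Elo/doubling -> M = 2, factor = 0.5
print(effective_smp_factor(70.0, 4))  # 0.5
```

A factor of 1.0 would mean the engine converts N cores into a full Nx effective speedup; real engines land below that.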
abulmo
Posts: 151
Joined: Thu Nov 12, 2009 6:31 pm

Re: Measure of SMP scalability

Post by abulmo »

michiguel wrote: Unfortunately, it is not that simple, because the Amdahl law (which is what you imply) does not apply to chess engines. That assumes the same work to be done for the parallel and single thread programs. We do not have that. The parallel program explores different branches, many times increasing the work adding extra inefficiency, and other times finding "short cuts".

That is the reason why the data does not fit well the Amdahl's formula.
In my experiments, I got pretty good results. I use a set of positions and average several runs to increase accuracy, but the variation between runs is small (< 1%). On sufficiently deep problems, the R² is > 99.9% and the points are well aligned. To me this is a good fit.
Richard
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Edsel Apostol wrote:There seems to be no agreement yet regarding how to measure SMP scalability in computer chess. I can think of 3 things that could be used to measure scalability.

NPS-wise
This seems to be the simplest and most obvious one. It simply measures the CPU utilization of the algorithm used. The problem with this is that it doesn't take into account whether the work being measured is just redundant, as in the case of a shared hash table implementation.

Time to depth
This measures the capability of the SMP algorithm to search efficiently. This is a good measure, better than NPS alone, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. Modern chess engines' SMP algorithms don't search the same tree as the serial search; some prune more and some prune less in SMP, for example.

In some cases, in order to gain better performance, one can widen the search a bit when using SMP to compensate for excessive pruning/reduction. This improves performance because at higher depths there is not much return from searching deeper; it is better to search wider and not miss critical variations.

Performance/Elo based
This seems to be the most logical measurement to me, since the SMP algorithm in modern chess engines is too complex to be measured by just "Time to depth" or "NPS-wise".
Just to clarify, I have only heard of ONE program that claims to search a different tree with the parallel search than what is searched in the normal serial search. Time-to-depth is still the best indicator of parallel search performance. Komodo appears to be different; whether it is better or not remains to be seen.

NPS is worthless by itself, other than that one really wants NPS scaling as close to perfect as possible. If your NPS speedup is only 7x on an 8-core box, your SMP efficiency is limited to 87.5% as an upper bound...

Elo is certainly a good way to measure, but it does require some significant hardware to play enough games to produce statistically useful results...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. This is a good measure, better than NPS alone, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. Modern chess engines' SMP algorithms don't search the same tree as the serial search; some prune more and some prune less in SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C
where n is the number of threads and t(n) is the time spent.
Once you have measured t(n) for various numbers of threads n, you can compute W and C, and get an idea of the scalability of your program. Indeed:
W + C = t(1)
C = t(infinity)
(W + C)/C = theoretical upper limit of achievable acceleration.
Unfortunately, it is not that simple, because the Amdahl law (which is what you imply) does not apply to chess engines. That assumes the same work to be done for the parallel and single thread programs. We do not have that. The parallel program explores different branches, many times increasing the work adding extra inefficiency, and other times finding "short cuts".

That is the reason why the data does not fit well the Amdahl's formula.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs. Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
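abulmo's model quoted above can be fit with ordinary least squares: substituting x = 1/n turns t(n) = W/n + C into a straight line t = W·x + C. A minimal sketch (the timings below are synthetic, generated from W=100, C=5, just to show the fit recovers the parameters):

```python
def fit_amdahl(ns, ts):
    """Least-squares fit of t(n) = W/n + C, i.e. a line in x = 1/n."""
    xs = [1.0 / n for n in ns]
    k = len(xs)
    mx = sum(xs) / k
    mt = sum(ts) / k
    w = sum((x - mx) * (t - mt) for x, t in zip(xs, ts)) / \
        sum((x - mx) ** 2 for x in xs)
    c = mt - w * mx
    return w, c  # W: parallelizable part, C: serial part

# Synthetic timings from W=100, C=5 (t = 100/n + 5).
ns = [1, 2, 4, 8]
ts = [100.0 / n + 5.0 for n in ns]
w, c = fit_amdahl(ns, ts)
print(round(w, 6), round(c, 6))   # 100.0 5.0
print(round((w + c) / c, 2))      # max speedup (W+C)/C = 21.0
```

On real engine data the points will not be this clean, which is Miguel's objection: the parallel search does different work, so the residuals carry more than timing noise.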
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. This is a good measure, better than NPS alone, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. Modern chess engines' SMP algorithms don't search the same tree as the serial search; some prune more and some prune less in SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C
where n is the number of threads and t(n) is the time spent.
Once you have measured t(n) for various numbers of threads n, you can compute W and C, and get an idea of the scalability of your program. Indeed:
W + C = t(1)
C = t(infinity)
(W + C)/C = theoretical upper limit of achievable acceleration.
Unfortunately, it is not that simple, because the Amdahl law (which is what you imply) does not apply to chess engines. That assumes the same work to be done for the parallel and single thread programs. We do not have that. The parallel program explores different branches, many times increasing the work adding extra inefficiency, and other times finding "short cuts".

That is the reason why the data does not fit well the Amdahl's formula.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs. Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
One cannot rely on time-to-depth. Besides Komodo and Rybka, I found another program, Hiarcs 14, which widens and has a poor time-to-depth factor, although it appears to scale very well 1->4 cores Elo-wise in the CEGT 40/4' list.
It appears that there are two factors: time-to-depth and widening from 1 to 4 cores. If the widening gives N Elo points at the same depth, and a doubling in time is worth 100 Elo points at fast controls, then the formula would be:
(time-to-depth factor) * 2^(N/100).

I agree that playing games to find the gain is practically impossible for normal users. One needs thousands of multi-threaded games at a not-very-short time control, say 40/4', a task which is too much even for developers with large machines.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

Laskos wrote:
bob wrote: For the record, Amdahl's law applies to ALL parallel programs. Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
One cannot rely on time-to-depth. Besides Komodo and Rybka, I found another program, Hiarcs 14, which widens and has a poor time-to-depth factor, although it appears to scale very well 1->4 cores Elo-wise in the CEGT 40/4' list.
It appears that there are two factors: time-to-depth and widening from 1 to 4 cores. If the widening gives N Elo points at the same depth, and a doubling in time is worth 100 Elo points at fast controls, then the formula would be:
(time-to-depth factor) * 2^(N/100).

I agree that playing games to find the gain is practically impossible for normal users. One needs thousands of multi-threaded games at a not-very-short time control, say 40/4', a task which is too much even for developers with large machines.
Applying the above formula, I find [time(1CPU)/time(4CPU)] for equal _strength_, which gives the quality of the SMP implementation:

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05
Stockfish 3         2.91            0             2.91
Rybka 4.1           2.32           24             2.74
Komodo 5.1          1.68           75             2.83
Hiarcs 14           2.10           42             2.81
Gaviota 0.86        2.74            0             2.74
Komodo is the most extreme in deviating from the time-to-depth rule.
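The table above can be reproduced from Laskos's formula: time-to-strength = (time-to-depth factor) * 2^(N/100), where N is the widening in Elo and 100 Elo per time-doubling is his stated assumption for fast controls. A sketch:

```python
def time_to_strength(ttd, widening_elo, elo_per_doubling=100.0):
    """Effective 1->4 core speedup combining the time-to-depth factor
    with widening (Elo gained at the same depth), per Laskos's formula.
    100 Elo per time-doubling is the assumption from the post."""
    return ttd * 2.0 ** (widening_elo / elo_per_doubling)

# Rows copied from the table in the post: (time-to-depth, widening Elo)
table = {
    "Houdini 3":    (3.05,  0),
    "Stockfish 3":  (2.91,  0),
    "Rybka 4.1":    (2.32, 24),
    "Komodo 5.1":   (1.68, 75),
    "Hiarcs 14":    (2.10, 42),
    "Gaviota 0.86": (2.74,  0),
}
for name, (ttd, n) in table.items():
    print(f"{name:13s} {time_to_strength(ttd, n):.2f}")
# Komodo 5.1: 1.68 * 2^0.75 = 2.83, matching the table
```

Engines with zero widening keep their time-to-depth factor unchanged, which is why Houdini's and Stockfish's columns are identical.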
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
Laskos wrote:
bob wrote: For the record, Amdahl's law applies to ALL parallel programs. Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
One cannot rely on time-to-depth. Besides Komodo and Rybka, I found another program, Hiarcs 14, which widens and has a poor time-to-depth factor, although it appears to scale very well 1->4 cores Elo-wise in the CEGT 40/4' list.
It appears that there are two factors: time-to-depth and widening from 1 to 4 cores. If the widening gives N Elo points at the same depth, and a doubling in time is worth 100 Elo points at fast controls, then the formula would be:
(time-to-depth factor) * 2^(N/100).

I agree that playing games to find the gain is practically impossible for normal users. One needs thousands of multi-threaded games at a not-very-short time control, say 40/4', a task which is too much even for developers with large machines.
Applying the above formula, I find [time(1CPU)/time(4CPU)] for equal _strength_, which gives the quality of the SMP implementation:

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05
Stockfish 3         2.91            0             2.91
Rybka 4.1           2.32           24             2.74
Komodo 5.1          1.68           75             2.83
Hiarcs 14           2.10           42             2.81
Gaviota 0.86        2.74            0             2.74
Komodo is the most extreme in deviating from the time-to-depth rule.
How is widening measured???

EVERY parallel search widens to some extent due to overhead.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Measure of SMP scalability

Post by michiguel »

bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. This is a good measure, better than NPS alone, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. Modern chess engines' SMP algorithms don't search the same tree as the serial search; some prune more and some prune less in SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C
where n is the number of threads and t(n) is the time spent.
Once you have measured t(n) for various numbers of threads n, you can compute W and C, and get an idea of the scalability of your program. Indeed:
W + C = t(1)
C = t(infinity)
(W + C)/C = theoretical upper limit of achievable acceleration.
Unfortunately, it is not that simple, because the Amdahl law (which is what you imply) does not apply to chess engines. That assumes the same work to be done for the parallel and single thread programs. We do not have that. The parallel program explores different branches, many times increasing the work adding extra inefficiency, and other times finding "short cuts".

That is the reason why the data does not fit well the Amdahl's formula.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.

Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

bob wrote:
Laskos wrote:
Laskos wrote:
bob wrote: For the record, Amdahl's law applies to ALL parallel programs. Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
One cannot rely on time-to-depth. Besides Komodo and Rybka, I found another program, Hiarcs 14, which widens and has a poor time-to-depth factor, although it appears to scale very well 1->4 cores Elo-wise in the CEGT 40/4' list.
It appears that there are two factors: time-to-depth and widening from 1 to 4 cores. If the widening gives N Elo points at the same depth, and a doubling in time is worth 100 Elo points at fast controls, then the formula would be:
(time-to-depth factor) * 2^(N/100).

I agree that playing games to find the gain is practically impossible for normal users. One needs thousands of multi-threaded games at a not-very-short time control, say 40/4', a task which is too much even for developers with large machines.
Applying the above formula, I find [time(1CPU)/time(4CPU)] for equal _strength_, which gives the quality of the SMP implementation:

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05
Stockfish 3         2.91            0             2.91
Rybka 4.1           2.32           24             2.74
Komodo 5.1          1.68           75             2.83
Hiarcs 14           2.10           42             2.81
Gaviota 0.86        2.74            0             2.74
Komodo is the most extreme in deviating from the time-to-depth rule.
How is widening measured???

EVERY parallel search widens to some extent due to overhead.
As the Elo gain at fixed depth. The depth is chosen such that the games are not extremely fast, but fast enough to measure Elo with reasonable precision in a reasonable amount of time. Many engines do not show widening at all (as an Elo gain); I just checked Shredder 12, and it does not.
Time-to-depth is measured on 150 positions (30 positions repeated 5 times), running for an hour or two on a single core (and less than that on 4 cores).
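The widening number described here, an Elo gain at fixed depth, comes out of a match score via the standard logistic Elo model. A small sketch (the 60% score below is illustrative, not from the measurements in this thread):

```python
import math

def elo_from_score(score):
    """Elo difference implied by a match score fraction,
    using the standard logistic model: E = 1/(1 + 10^(-d/400))."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Example: a 60% fixed-depth score implies roughly +70 Elo of widening
print(round(elo_from_score(0.60), 1))  # 70.4
```

With the usual binomial error bars, pinning such a number down to a few Elo takes on the order of thousands of games, which is why the fixed-depth measurement is the practical route.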