Measure of SMP scalability

Discussion of chess software programming and technical issues.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Measure of SMP scalability

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than NPS alone, but it is most accurate only when the parallel algorithm traverses the same search tree as the serial one. Modern chess engines' SMP searches do not traverse the same tree as the serial search: some prune more and some prune less when running SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C.
n is the number of threads;
t(n) is the time spent.
Once you have measured a few t(n) for various number of n (threads), you can compute W and C, and get an idea of the scalability of your program. Indeed:
W+C = t(1)
C = t(infinity)
(W+C)/C = theoretical upper limit of achievable acceleration.
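The fit abulmo describes can be sketched in a few lines of Python. Everything below is illustrative: the thread counts and times are invented numbers constructed to lie exactly on t(n) = W/n + C with W = 96 and C = 4; real data would come from timing searches to a fixed depth.

```python
threads = [1, 2, 4, 8, 16]
times = [100.0, 52.0, 28.0, 16.0, 10.0]  # seconds to depth (made up)

# t(n) = W/n + C is linear in x = 1/n, so ordinary least squares on
# (1/n, t) recovers W as the slope and C as the intercept.
xs = [1.0 / n for n in threads]
mx = sum(xs) / len(xs)
mt = sum(times) / len(times)
W = sum((x - mx) * (t - mt) for x, t in zip(xs, times)) / \
    sum((x - mx) ** 2 for x in xs)
C = mt - W * mx

print(f"W = {W:.1f}, C = {C:.1f}")                    # W = 96.0, C = 4.0
print(f"speedup limit (W+C)/C = {(W + C) / C:.1f}")   # 25.0
```

With real measurements the points will not sit exactly on the line, and a poor fit is itself informative about whether this two-parameter model describes the engine at all.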
Unfortunately, it is not that simple, because the Amdahl law (which is what you imply) does not apply to chess engines. That assumes the same work to be done for the parallel and single thread programs. We do not have that. The parallel program explores different branches, many times increasing the work adding extra inefficiency, and other times finding "short cuts".

That is the reason why the data does not fit well the Amdahl's formula.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
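The rule of thumb above can be written as a tiny helper. The 50-Elo-per-doubling constant is the thread's rough figure, not a universal law, and `elo_gain` is a name invented here for illustration.

```python
import math

def elo_gain(speedup, elo_per_doubling=50.0):
    """Expected Elo gain from a pure speed multiplier, using the
    thread's rule of thumb of ~50 Elo per doubling of speed."""
    return elo_per_doubling * math.log2(speedup)

print(f"{elo_gain(2.0):.0f} Elo")   # 50 Elo for a clean doubling
print(f"{elo_gain(3.1):.0f} Elo")   # a 3.1x SMP speedup -> about 82 Elo
```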
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.
For EVERY problem, this size can remain constant.
It does not for chess engines, and you know it.
You are mangling the word "size". In chess, in a parallel search, we search a larger tree due to overhead. THAT is why we don't get perfectly linear speedup.
That is why it deviates from Amdahl.
That is NOT a "deviation". The goal is to get to depth N as quickly as possible. That the nature of alpha/beta impedes that is simply a PART of the algorithm definition. And it is why chess programs don't get perfect linear speedup, and it is why Amdahl's law fits perfectly, still.
Who cares? Those extra nodes, for almost every program on the planet, do not give any additional information,
Not actually true, because it fills the hash table with information that is sometimes useful and occasionally generates bigger speedups.

"sometimes". And sometimes it hurts. And the net effect is "no gain". As verified by the 1 vs 4 cpu tests to fixed depth. Have you read all the previous discussions? If there is no gain then there is no gain.



they are simply searched due to alpha/beta considerations and imperfect move ordering / splitting.

Can we return to the regularly scheduled discussion now?
No, we were discussing this in this sub-branch.

Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input.
No, the work done needs to be the same.
work done = search to fixed depth. Seems clear enough to me.
Wrong.

It is the same depth, same path, same tree, same everything. Since a parallel search with a chess engine does not do that, Amdahl's law does NOT apply.

Doesn't matter whether the serial program uses alpha/beta and the parallel program uses pure minimax. Time to depth is "the same task".

No.


Chess certainly does that for fixed depth measuring time to depth. Since Elo has ALWAYS been directly correlated with speed, one could probably measure Elo, extrapolate to speedup, and Amdahl's law would still apply just fine...
You can measure the speed and plot it vs. the number of processors. Does it fit an equation y = a * x / (b + x)? With the data we have seen over the years, it doesn't. Does it fit today?

Miguel
Which program?
Any you choose. If you can show data that fit y = a * x / (b + x), where y = speed and x = number of processors, then Amdahl's law _may_ apply for that particular program. Otherwise, it won't.

Miguel
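The check Miguel proposes can be sketched as follows. The form y = a*x/(b+x) linearizes to 1/y = (b/a)*(1/x) + 1/a, so an ordinary least-squares fit on the reciprocals recovers a (the asymptotic speedup) and b. The speedup numbers below are fabricated to sit exactly on the curve with a = 25 and b = 24, purely to illustrate the procedure.

```python
procs = [1, 2, 4, 8, 16]
# Illustrative speedups generated from y = 25*x/(24 + x); real values
# would be measured. With real data, large residuals after the fit
# would mean the Amdahl-style form does not describe the program.
speedups = [25 * x / (24 + x) for x in procs]

# Linearize: 1/y = (b/a)*(1/x) + 1/a, then ordinary least squares.
u = [1.0 / x for x in procs]
v = [1.0 / y for y in speedups]
mu = sum(u) / len(u)
mv = sum(v) / len(v)
slope = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / \
        sum((ui - mu) ** 2 for ui in u)
intercept = mv - slope * mu
a = 1.0 / intercept        # asymptotic (maximum) speedup
b = slope * a
print(f"a = {a:.1f}, b = {b:.1f}")   # a = 25.0, b = 24.0
```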

We are discussing Komodo, which seems to be significantly different from what I see with Crafty, Houdini, Stockfish, or the rest of the "traditional parallel searchers".




Miguel

In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...



Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
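The traditional time-to-depth measurement described above might be summarized like this. The per-position timings are invented, and the geometric mean is one common convention for averaging ratios, not necessarily what any particular tester uses.

```python
import math

# times_1[i] and times_8[i]: seconds for the 1-thread and 8-thread
# builds to reach the same fixed depth on test position i (made up).
times_1 = [64.0, 120.0, 42.0, 95.0]
times_8 = [11.0, 23.0, 8.0, 17.0]

ratios = [t1 / t8 for t1, t8 in zip(times_1, times_8)]
# The geometric mean keeps one outlier position from dominating the
# average, which matters because per-position speedups vary a lot.
speedup = math.exp(sum(map(math.log, ratios)) / len(ratios))
print(f"time-to-depth speedup: {speedup:.2f}")   # about 5.46
```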
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Measure of SMP scalability

Post by mvk »

michiguel wrote:
bob wrote:
work done = search to fixed depth. Seems clear enough to me.
Wrong.

It is the same depth, same path, same tree, same everything. Since a parallel search with a chess engine does not do that, Amdahl's law does NOT apply.
Indeed, Gustafson's Law applies to the domain of heuristic search.
syzygy
Posts: 5858
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:The program has a bug that the author has fixed but not released. It eliminates much of that "widening". So yes, his numbers are hypothetical since the REAL program won't be searching like that. Are we reading the same posts?
So any program that has a bug is not REAL? Are you kidding?

The 'buggy' Hannibal plays real chess, so provides a real data point. Whether its behaviour corresponds to the author's intention is immaterial.

If a future 'debugged' Hannibal has better time-to-reported-depth but an effective gain smaller than 2.27, that would indicate a regression in its SMP implementation.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

mvk wrote:
michiguel wrote:
bob wrote:
work done = search to fixed depth. Seems clear enough to me.
Wrong.

It is the same depth, same path, same tree, same everything. Since a parallel search with a chess engine does not do that, Amdahl's law does NOT apply.
Indeed, Gustafson's Law applies to the domain of heuristic search.
Yes, but a bit modified from what the Wikipedia article says. The non-parallelizable portion of the problem grows with problem size like O(log n), so the speedup is proportional to n/log(n), not to n (the number of processors) as presented there. I think that is what we observe in computer chess. And sure, the whole philosophy of heuristic search obeys Gustafson's law.
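The modified scaling claimed here can be tabulated against linear scaling. The base-2 logarithm and the implicit normalization (the curve happens to pass through a speedup of 2 at 2 cores) are assumptions made only to get concrete numbers; the post does not specify the proportionality constant.

```python
import math

# Speedup shapes for n processors: linear n vs. the claimed n/log(n).
for n in [2, 4, 8, 16, 32, 64]:
    nlogn = n / math.log2(n)   # assumed normalization: log base 2
    print(f"{n:3d} cores: linear {n:3d}, n/log n {nlogn:6.2f}")
```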
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

michiguel wrote:
[...]

It is the same depth, same path, same tree, same everything. Since a parallel search with a chess engine does not do that, Amdahl's law does NOT apply.

[...]

Any you choose. If you can show data that fit y = a * x / (b + x), where y = speed and x = number of processors, then Amdahl's law _may_ apply for that particular program. Otherwise, it won't.

Miguel

You can repeat that as often as you want. Yet Amdahl's law fits perfectly. The only mitigating circumstance is a program that modifies what it INTENDS to search based on the number of cores. That's an unusual (and generally unintentional) approach.

It might not apply for you, for whatever reason. But it has applied for the rest of us for years. NO parallel application does EXACTLY the same work as the serial version. The minute you add a mutex lock, you just changed what was done. Ditto for barriers and such. Ditto for dynamic scheduling / load balancing. Ditto for cache interactions. Etc. Yet nobody counts those as "different work." This argument is silly.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
bob wrote:The program has a bug that the author has fixed but not released. It eliminates much of that "widening". So yes, his numbers are hypothetical since the REAL program won't be searching like that. Are we reading the same posts?
So any program that has a bug is not REAL? Are you kidding?

The 'buggy' Hannibal plays real chess, so provides a real data point. Whether its behaviour corresponds to the author's intention is immaterial.

If a future 'debugged' Hannibal has better time-to-reported-depth but an effective gain smaller than 2.27, that would indicate a regression in its smp implementation.
Any program that has a bug that will be fixed should NOT be measured. What's the point? Isn't it time to stop the silly arguments?

Using a program with a bug to make a point is beyond ridiculous...
syzygy
Posts: 5858
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:The program has a bug that the author has fixed but not released. It eliminates much of that "widening". So yes, his numbers are hypothetical since the REAL program won't be searching like that. Are we reading the same posts?
So any program that has a bug is not REAL? Are you kidding?

The 'buggy' Hannibal plays real chess, so provides a real data point. Whether its behaviour corresponds to the author's intention is immaterial.

If a future 'debugged' Hannibal has better time-to-reported-depth but an effective gain smaller than 2.27, that would indicate a regression in its smp implementation.
Any program that has a bug that will be fixed should NOT be measured. What's the point? Isn't it time to stop the silly arguments?

Using a program with a bug to make a point is beyond ridiculous...
I'm sure Houdini has a bug that will be fixed in the next version. Should we not measure the current version?

Wishing away real data points by arguing "bug, this was not intended" is rather unscientific. The data point is real. How are you, in any intellectually honest way, going to deny that it exists?
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Measure of SMP scalability

Post by mvk »

syzygy wrote:Wishing away real data points by arguing "bug, this was not intended" is rather unscientific. The data point is real. How are you in any intellectual honest way going to deny that it exists?
Many great discoveries are based on data points that should have been thrown away because they were "wrong": penicillin, Teflon, microwaves, rubber, dynamite, saccharin, Viagra... Some important discoveries were delayed because of data filtering, most notably the hole in the ozone layer.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:The program has a bug that the author has fixed but not released. It eliminates much of that "widening". So yes, his numbers are hypothetical since the REAL program won't be searching like that. Are we reading the same posts?
So any program that has a bug is not REAL? Are you kidding?

The 'buggy' Hannibal plays real chess, so provides a real data point. Whether its behaviour corresponds to the author's intention is immaterial.

If a future 'debugged' Hannibal has better time-to-reported-depth but an effective gain smaller than 2.27, that would indicate a regression in its smp implementation.
Any program that has a bug that will be fixed should NOT be measured. What's the point? Isn't it time to stop the silly arguments?

Using a program with a bug to make a point is beyond ridiculous...
I'm sure Houdini has a bug that will be fixed in the next version. Should we not measure the current version?

Wishing away real data points by arguing "bug, this was not intended" is rather unscientific. The data point is real. How are you, in any intellectually honest way, going to deny that it exists?
As I said, you ONLY want to argue. Go for it, if that makes you happy. The author said this "widening" was DIRECTLY caused by a bug he had already fixed. So using this data to show "widening" seems wrong. If not to you, fine. To me, it is clearly a bogus data point.
syzygy
Posts: 5858
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:As I said, you ONLY want to argue. Go for it, if that makes you happy. The author said this "widening" was DIRECTLY caused by a bug he had already fixed. So using this data to show "widening" seems wrong. If not to you, fine. To me, it is clearly a bogus data point.
How is it a bogus data point if the program plays real chess?

Can you explain this?

If the measurements were performed incorrectly, then the data point is certainly bogus. But there is no indication that anything was wrong with the measurement.

This is really not difficult, I would think. Just be intellectually honest.

You could simply agree it's a data point. It's not a very interesting one for all I care, but to say it's bogus is just, well... like the Black Knight saying it's just a flesh wound...