Measure of SMP scalability

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

michiguel wrote:
bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
bob wrote:As I said, you ONLY want to argue. Go for it, if that makes you happy. The author said this "widening" was DIRECTLY caused by a bug he had already fixed. So using this data to show "widening" seems wrong. If not to you, fine. To me, it is clearly a bogus data point.
How is it a bogus data point if the program plays real chess?

Can you explain this?

If the measurements were performed incorrectly, then the data point is certainly bogus. But there is no indication that anything was wrong with the measurement.

This is really not difficult, I would think. Just be intellectually honest.

You could simply agree it's a data point. It's not a very interesting one for all I care, but to say it's bogus is just, well... like the Black Knight saying it's just a flesh wound...
How simple is it to follow this logic:

1. Someone creates a data point that supports some argument.

2. The author of the program says "that data point is invalid, it is the direct result of a bug that I have fixed," which means the data point won't be reproduced if the program is tested again.
Actually, Edsel stated that he expected Kai's result for Hannibal (a time-to-depth speedup of 1.33). You are the one who has claimed it to be invalid. Whether it is an unintentional bug or not, it is still a valid data point in the context of this discussion. And to reiterate, the general discussion has been about what is the most relevant measure of the effectiveness of the SMP implementation in a chess engine.

You have claimed that time to depth is a reasonable alternative to measuring Elo gain when determining that effectiveness. What Kai has pointed out with his data is that time to depth is not a good indicator of SMP effectiveness for some engines: some engines gain more Elo with additional cores than their time-to-depth data would seem to indicate. And the explanation for the Elo increase is that the search, while lacking in depth, is wider.

Let's look at Hannibal 1.3 for a moment. It has a bug that limits its SMP speedup for 4 cores to ~1.3. Given the typical gain of 50 to 100 Elo per doubling (the actual gain is related to the average depth being reached), Hannibal should gain something on the order of 19 to 39 Elo when using 4 cores. However, according to the CEGT 40/20 list, the gain is 58 Elo.

For Komodo, the expected increase would be 38 to 75 Elo. According to the CEGT 40/20 list, Komodo 5.1 4cpu is 3112. Now, the 1cpu version is not on this list, but it is known to be between Komodo CCT and Komodo 5.0 in strength. So, the rating for Komodo 5.1 1cpu would likely lie between 2994 and 3014. Thus, the increase is ~ 98 to 118 Elo.

All of these numbers are hazy for various reasons. But I do know from my tests and from others that the expected increase, when using those speedup numbers, would be towards the lower end of the given ranges when using the 40/20 (or equivalent) time control. So, these two engines are most likely gaining Elo over and above what the time-to-depth data predicts. Thus, their SMP implementations are more effective than the usual measurement suggests.
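
For concreteness, a rough sketch of the doubling arithmetic used above. The 50 to 100 Elo-per-doubling range is the thread's rule of thumb, and the 1.7 speedup in the second example is only an illustrative assumption, not a measured Komodo figure.

Code:

import math

def expected_elo_gain(speedup, elo_per_doubling=(50, 100)):
    """Rough Elo gain predicted from a time-to-depth speedup, assuming each
    effective doubling of speed is worth a fixed range of Elo."""
    doublings = math.log2(speedup)
    return tuple(round(doublings * elo) for elo in elo_per_doubling)

print(expected_elo_gain(1.3))  # Hannibal-like 4-core speedup of ~1.3 -> (19, 38) Elo
print(expected_elo_gain(1.7))  # an assumed 4-core speedup of ~1.7 -> (38, 77) Elo
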
What are you talking about? He SPECIFICALLY said "this was a serious bug..." And that he had fixed it. How can you use data from a program with a KNOWN serious bug to support something that the bug caused?

I just fixed a new version of Crafty where I wrongly handled reductions and pruning at split points and blew the tree up to over 2x its normal size. Hell of a "widening example," wouldn't you say? And one that was caused by my not keeping up with "moves searched" at the split ply correctly. Should we use THAT data as well? Of course not.

When someone says "that version has a serious bug," no serious scientist on the planet would use that version to produce data and then report on it to support their argument.
That is what experimental scientists do all the time. They pick a protein, gene, organism, etc., and make a "mutant" (== a bug) to study an altered behavior, in order to correlate structure with function (or to study causation). Many times scientists do not make the mutants themselves; they observe what nature offers. Same thing. For instance, we would not have advanced so much in biochemistry without the ability to study nature's defects, which are the equivalent of "bugs".

So, this is a very interesting data point that should attract the attention of the curious.

Miguel

This is ridiculousness at its worst. If you want to use that data, feel free. I don't think anyone "serious" will consider it valid, however. It's bad enough that our data has unknown bugs lurking inside; we certainly don't want it to have a known significant bug included... At least I don't...

Not worth arguing further, however... This seems to be an argument held just for the sake of arguing...

The Komodo data is interesting. The Hannibal data is useless.
This is "interesting"? The idea that if you reduce the pruning and stuff, a program can play stronger? Absolutely nothing surprising there. The question is: if you introduce overhead, is it more important than searching deeper with a normal parallel search? I personally don't believe it, and haven't seen anything yet that helps with this, pro or con...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

syzygy wrote:
michiguel wrote:So, this is a very interesting data point that should attract the attention of the curious.
The curious, also known as the scientifically inclined.
Sort of funny what this place considers insulting nowadays, eh? OK to call someone unscientific it seems. In any case, waste all the time you want investigating this (I assume you HAVE done a parallel search.) Nothing new to one that has been working on this for a long while, however... So far, absolutely nothing to show that going wider is better than going deeper, or even as good.
Mark
Posts: 216
Joined: Thu Mar 09, 2006 9:54 pm

Re: Measure of SMP scalability

Post by Mark »

bob wrote:
michiguel wrote:
The Komodo data is interesting. The Hannibal data is useless.
This is "interesting"? The idea that if you reduce the pruning and stuff, a program can play stronger? Absolutely nothing surprising there. The question is: if you introduce overhead, is it more important than searching deeper with a normal parallel search? I personally don't believe it, and haven't seen anything yet that helps with this, pro or con...
What's interesting about the Komodo data is that it gains the expected Elo going from 1 to 4 cores without gaining the expected increase in depth. Bob, from your discussions, the only reasonable explanation is that the serial version could gain additional Elo by introducing the same "overhead" as the parallel version.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

bob wrote:
syzygy wrote:
michiguel wrote:So, this is a very interesting data point that should attract the attention of the curious.
The curious, also known as the scientifically inclined.
Nothing new to one that has been working on this for a long while, however... So far, absolutely nothing to show that going wider is better than going deeper, or even as good.
The new thing is that time-to-depth, which you claimed to be the universal speedup measure of parallelization, is not working for some engines, "buggy" or not. Nobody claimed that going wider is better than going deeper, but as shown by Komodo, Rybka and Hiarcs, going wider is as good as going deeper, and comparable to purely deepening engines in effective speedup. As for Hannibal, buggy or not, it's a chess engine playing chess, and it gains Elo from parallelization in an unorthodox fashion, mostly from widening. It's immaterial whether the authors of Hannibal have new ideas for their SMP implementation; for now we have an engine called Hannibal which widens its search, has a somewhat smaller effective speedup from 1 to 4 cores than other engines, and has a speedup that cannot be gauged by time-to-depth. So, before measuring speedup with the empirical time-to-depth procedure, which is not universal, one has to check that the engine is not doing other tricks like widening (or narrowing, why not).
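
The "effective speedup" referred to here can be sketched as the inverse of the doubling arithmetic: convert the measured Elo gain back into an equivalent single-core time multiplier and compare it with the time-to-depth ratio. The numbers below are hypothetical, and 70 Elo per doubling is just an illustrative midpoint of the usual 50 to 100 range.

Code:

import math

def effective_speedup(elo_gain, elo_per_doubling=70):
    """Equivalent single-core time multiplier implied by a measured Elo gain."""
    return 2 ** (elo_gain / elo_per_doubling)

# An engine gaining ~90 Elo from 1 to 4 cores behaves like a ~2.4x faster
# single-core engine, even if its time-to-depth ratio is much lower (widening).
print(round(effective_speedup(90), 2))  # ~2.44
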
syzygy
Posts: 5850
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:So far, absolutely nothing to show that going wider is better than going deeper, or even as good.
Did anyone actually claim that it was better? That was never the issue here.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
bob wrote:
syzygy wrote:
michiguel wrote:So, this is a very interesting data point that should attract the attention of the curious.
The curious, also known as the scientifically inclined.
Nothing new to one that has been working on this for a long while, however... So far, absolutely nothing to show that going wider is better than going deeper, or even as good.
The new thing is that time-to-depth, which you claimed to be the universal speedup measure of parallelization, is not working for some engines, "buggy" or not. Nobody claimed that going wider is better than going deeper, but as shown by Komodo, Rybka and Hiarcs, going wider is as good as going deeper, and comparable to purely deepening engines in effective speedup. As for Hannibal, buggy or not, it's a chess engine playing chess, and it gains Elo from parallelization in an unorthodox fashion, mostly from widening. It's immaterial whether the authors of Hannibal have new ideas for their SMP implementation; for now we have an engine called Hannibal which widens its search, has a somewhat smaller effective speedup from 1 to 4 cores than other engines, and has a speedup that cannot be gauged by time-to-depth. So, before measuring speedup with the empirical time-to-depth procedure, which is not universal, one has to check that the engine is not doing other tricks like widening (or narrowing, why not).
What you have identified does NOT show that time to depth is not the correct measure. So far, nothing says that going wider is better. In one case, it was due to a program bug. In another case (Rybka) it seems to be related to a not-necessarily-good characteristic of how the parallel search was designed/implemented. I don't have any problems whatsoever in doing the same exact extensions, reductions and pruning, in a parallel search as I do in the serial search. Whether going "wider" is good or not is, as of right now, unknown.

As you add cores, the parallel search efficiency drops. Always has. Going wider rather than deeper is a choice. But it STILL uses the same poorly scaling parallel search algorithm to traverse a tree that is wider. I've not yet discovered any real benefit, after running a few million test games on my cluster. Slightly relaxing reductions or pruning (very slightly else things blow up) just hurts performance thru 8 cores. I can't test beyond that easily at the moment. And the absolute max I can test is 12, where the parallel speedup is still quite good...

If it is not "better" or at least "as good as" then what's the entire point of this discussion? Surely not that one can simply write a bad parallel search that blows the tree up? That's certainly easy enough to do. SO there must be a gain somewhere or it is an exercise in futility.
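
For readers unfamiliar with what "relaxing reductions" means in practice, here is a minimal, hypothetical late-move-reduction knob. The formula and the "widen" parameter are illustrative only and are not Crafty's (or any other engine's) actual code.

Code:

import math

def lmr_reduction(depth, move_number, widen=0.0):
    """Generic late-move reduction: later moves at higher depth are reduced more.
    A positive 'widen' value shrinks the reduction, so more moves are searched
    near full depth, giving a wider (bushier) tree that reaches less depth per second."""
    if depth < 3 or move_number < 4:
        return 0
    reduction = math.log(depth) * math.log(move_number) / 2.0 - widen
    return max(0, int(reduction))

print(lmr_reduction(12, 20, widen=0.0))  # 3 plies of reduction for a late move
print(lmr_reduction(12, 20, widen=1.0))  # 2 plies: the same move is searched deeper
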
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Mark wrote:
bob wrote:
michiguel wrote:
The Komodo data is interesting. The Hannibal data is useless.
This is "interesting"? The idea that if you reduce the pruning and stuff, a program can play stronger? Absolutely nothing surprising there. The question is: if you introduce overhead, is it more important than searching deeper with a normal parallel search? I personally don't believe it, and haven't seen anything yet that helps with this, pro or con...
What's interesting about the Komodo data is that it gains the expected Elo going from 1 to 4 cores without gaining the expected increase in depth. Bob, from your discussions, the only reasonable explanation is that the serial version could gain additional Elo by introducing the same "overhead" as the parallel version.
That's been my point from the get-go here. If going wider is good for the parallel search, I see absolutely no reason to believe it would not be better for the serial search. That leaves two possibilities...

(1) The SMP search efficiency is simply not very good. If so, making the tree wider seems to be a loser, since the same inefficient parallel search now simply has to search a bushier tree, with the same level of inefficiency it had to start with;

(2) The SMP search efficiency is normal, but internal design decisions make the tree bushier (whether this is intentional or not is not so easy to discern, based on the Rybka discussions among others). Now you are REALLY trading depth for width, intentionally. If it works here, why not in the serial search as well???

Increasing the effective branching factor reduces the depth reached in a fixed amount of time. The normal parallel search (4 cores) will gain about 1.5 plies (3x+ faster than 1 core, EBF = 2.0). If giving up part of that 1.5 plies is more than offset by the benefit of the wider search, that would seem to be applicable whether the search is parallel or serial. So far, nothing contradicts that...
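
The 1.5-ply figure follows from the usual relation between speed and depth. A small sketch, using only the numbers quoted in the post above (a 3x speedup and an EBF of 2.0, with 2.2 as an illustrative "widened" EBF):

Code:

import math

def extra_plies(speedup, ebf):
    """Additional iterations reachable in the same wall-clock time, assuming
    each extra ply multiplies the search time by the effective branching factor."""
    return math.log(speedup) / math.log(ebf)

print(round(extra_plies(3.0, 2.0), 2))  # ~1.58 plies from a 3x speedup at EBF 2.0
print(round(extra_plies(3.0, 2.2), 2))  # ~1.39 plies if widening pushes the EBF to 2.2
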
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

bob wrote:
What you have identified does NOT show that time to depth is not the correct measure.
It shows exactly that, and I wrote this in the first post of the former thread.
So far, nothing says that going wider is better.
Nobody is saying this. But there are engines (Komodo, Rybka, Hiarcs) which widen when going parallel and have effective speedups comparable to those of non-widening engines. If such engines exist, time-to-depth is not a universal measure. Even for the buggy Hannibal, time-to-depth is gibberish when used to calculate its effective speedup.
In one case, it was due to a program bug. In another case (Rybka) it seems to be related to a not-necessarily-good characteristic of how the parallel search was designed/implemented. I don't have any problems whatsoever in doing the same exact extensions, reductions and pruning, in a parallel search as I do in the serial search. Whether going "wider" is good or not is, as of right now, unknown.
Again, I was not claiming that going wider is better. The fact is some are going wider, and time-to-depth is not applicable.
As you add cores, the parallel search efficiency drops. Always has. Going wider rather than deeper is a choice. But it STILL uses the same poorly scaling parallel search algorithm to traverse a tree that is wider. I've not yet discovered any real benefit, after running a few million test games on my cluster. Slightly relaxing reductions or pruning (very slightly else things blow up) just hurts performance thru 8 cores. I can't test beyond that easily at the moment. And the absolute max I can test is 12, where the parallel speedup is still quite good...
The scaling behavior of the widening Komodo, Rybka and Hiarcs at high thread counts is not known to be much less effective than that of the non-widening Houdini or Crafty. We can only speculate.
If it is not "better" or at least "as good as" then what's the entire point of this discussion? Surely not that one can simply write a bad parallel search that blows the tree up? That's certainly easy enough to do. SO there must be a gain somewhere or it is an exercise in futility.
My part of the discussion was mainly stated in the first post: Komodo widens, therefore time-to-depth is not a correct measure of its effective speedup. At the same time, the effective speedup of Komodo from 1 to 4 cores is reasonable, comparable to that of engines which don't widen, as shown by rating lists and by other threads and posts on this topic.
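
As a reference, the "time-to-depth empirical procedure" amounts to comparing how long the 1-core and N-core searches take to reach the same fixed depth over a set of positions; the node counts at that depth are a natural cross-check for widening. The measurements below are hypothetical numbers, not data from this thread.

Code:

from statistics import median

def ttd_speedup(times_1cpu, times_ncpu):
    """Time-to-depth speedup: median ratio of time to reach the same fixed
    depth on 1 core versus N cores, over a set of test positions."""
    return median(t1 / tn for t1, tn in zip(times_1cpu, times_ncpu))

def widening_factor(nodes_1cpu, nodes_ncpu):
    """Nodes searched to reach that same depth. A ratio well above 1.0 means
    the N-core tree is wider, so time-to-depth alone understates the speedup."""
    return median(nn / n1 for n1, nn in zip(nodes_1cpu, nodes_ncpu))

# Hypothetical measurements for three positions (seconds, nodes):
t1, t4 = [30.0, 42.0, 55.0], [22.0, 33.0, 40.0]
n1, n4 = [9.0e8, 1.3e9, 1.6e9], [2.1e9, 3.0e9, 3.8e9]
print(round(ttd_speedup(t1, t4), 2))      # ~1.36: looks like a poor speedup...
print(round(widening_factor(n1, n4), 2))  # ~2.33: ...but the 4-core tree is much wider
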
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
bob wrote:
What you have identified does NOT show that time to depth is not the correct measure.
It shows exactly that, and I wrote this in the first post of the former thread.
So far, nothing says that going wider is better.
Nobody is saying this. But there are engines (Komodo, Rybka, Hiarcs) which widen when going parallel and have effective speedups comparable to those of non-widening engines. If such engines exist, time-to-depth is not a universal measure. Even for the buggy Hannibal, time-to-depth is gibberish when used to calculate its effective speedup.
In one case, it was due to a program bug. In another case (Rybka) it seems to be related to a not-necessarily-good characteristic of how the parallel search was designed/implemented. I don't have any problems whatsoever in doing the same exact extensions, reductions and pruning, in a parallel search as I do in the serial search. Whether going "wider" is good or not is, as of right now, unknown.
Again, I was not claiming that going wider is better. The fact is some are going wider, and time-to-depth is not applicable.
As you add cores, the parallel search efficiency drops. Always has. Going wider rather than deeper is a choice. But it STILL uses the same poorly scaling parallel search algorithm to traverse a tree that is wider. I've not yet discovered any real benefit, after running a few million test games on my cluster. Slightly relaxing reductions or pruning (very slightly else things blow up) just hurts performance thru 8 cores. I can't test beyond that easily at the moment. And the absolute max I can test is 12, where the parallel speedup is still quite good...
The scaling behavior of the widening Komodo, Rybka and Hiarcs at high thread counts is not known to be much less effective than that of the non-widening Houdini or Crafty. We can only speculate.
If it is not "better" or at least "as good as" then what's the entire point of this discussion? Surely not that one can simply write a bad parallel search that blows the tree up? That's certainly easy enough to do. SO there must be a gain somewhere or it is an exercise in futility.
My part of the discussion was mainly stated in the first post: Komodo widens, therefore time-to-depth is not a correct measure of its effective speedup. At the same time, the effective speedup of Komodo from 1 to 4 cores is reasonable, comparable to that of engines which don't widen, as shown by rating lists and by other threads and posts on this topic.
Your graph shows that a couple of programs (2, as best I can tell, although you have not cited Rybka specifically) don't show very good time to depth. That doesn't show it is not the CORRECT way to measure performance and evaluate the design, however. The jury is still out on that determination.

My interest is "is this the best way to do things?" I do not currently believe so, and have seen nothing (yet) to change my mind. Increasing the EBF is harmful since it increases time. The fact that it is not as harmful with a parallel search doesn't mean it is not harmful, however, when compared to the really good parallel searches.

This "story" is far from complete...
syzygy
Posts: 5850
Joined: Tue Feb 28, 2012 11:56 pm

Re: Measure of SMP scalability

Post by syzygy »

bob wrote:Doesn't show that is not the CORRECT way to measure performance and evaluate the design however.
We're not talking about "evaluating the design" (which you obviously would like to equate with "if time-to-depth is not great, the design is wrong").

We are talking about measuring the performance of the smp implementation. We are talking about measuring effective smp speedup. Effective in the sense of "how well does it play chess", which in the end is the only thing that matters.

The one and only point of the whole discussion is this: time-to-depth is not a complete measure of effective smp speedup.

There is no other point being made here. In particular, we are not discussing what Crafty should do in its next release.

I will now stop feeding, because this has been going on long enough. Everybody else is in complete agreement. You, of course, are not. Good for you.