Clustering etc. thread

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Rolf wrote:
bob wrote:
Vasik Rajlich wrote:
Uri Blass wrote:
Dirt wrote:
Vasik Rajlich wrote:
bob wrote: I don't buy the "this hurts Rybka" idea, because the cluster Rybka is a joke. And a poor joke at that. There have been some decent cluster-based programs. But Rybka is simply not one of them.
Where did that come from ??

Vas
There is something of an explanation here.
I read this post and I can say 2 things.

1) I think that it is impossible to know the algorithm Rybka is using based on output from a single position.

It is possible that something similar, but not exactly the same, is used when some illogical moves that lose the queen are analyzed, but this is not the whole story; the algorithm is based partly on "split only at the root" and partly on another idea.

2) I remember that Vas said 100 Elo based on testing at a fast time control, and I suspect that at fast time controls you get clearly more than 50 Elo per doubling, so in practice 5 nodes do not give a 4:1 speed improvement but clearly less than that (maybe 2.5:1).

Uri
The effective speedup is probably somewhere between 2.5:1 and 3:1 for 5 nodes, which is what Lukas had when he tested all of this.

Now he's up to 9 nodes BTW :)

Vas
Can we stay in the real world? Splitting at the root cannot produce a 2.5x speedup when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone who has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
It happens frequently that older experts argue that something wouldn't or couldn't work because it never worked or was never done. There are some other excuses. If that were true, progress would go up in smoke.
Feel free to develop a mathematical system (base 10) where 2+2 = 5, then come back and continue the discussion. I wrote a paper in the Journal of Parallel Computing about 20 years ago that discussed this idea in depth, both practically and theoretically, and there are no mistakes in it. Splitting at the root was bad for old programs. With today's EBF at around 2, it is way beyond "bad", all the way to "completely ineffective", unless you consider a speedup of 1.1 out of 3-4 processors "good".

As far as progress goes, fortunately most know to look at past results _first_ to see which direction appears most promising or most pointless. Serendipity works at times, but not here, because the theory is rock-solid. It starts with the Knuth/Moore paper ("An Analysis of Alpha-Beta Pruning") and ends with the paper I wrote ("A Parallel Alpha-Beta Tree Searching Algorithm"), which completes Knuth/Moore's analysis for parallel search (they only considered sequential search). Those computations do not lie.
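
As a rough illustration of bob's point, the Knuth/Moore minimal-tree formula can be used to estimate how much of the search sits under the first root move. Below is a small Python sketch over a uniform tree; the uniform branching model, the depth, and the assumption that the first root move's subtree is itself a minimal tree of depth d-1 are illustrative, not measurements from Crafty or Cray Blitz.

Code:

import math

def minimal_tree_leaves(b: int, d: int) -> float:
    # Knuth/Moore: leaf count of the minimal alpha-beta tree for
    # uniform branching factor b and depth d.
    return b ** math.ceil(d / 2) + b ** math.floor(d / 2) - 1

def first_move_share(b: int, d: int) -> float:
    # Model assumption: the first root move's subtree is itself a
    # minimal tree of depth d - 1 (a PV subtree).
    return minimal_tree_leaves(b, d - 1) / minimal_tree_leaves(b, d)

for b in (2, 3, 10):
    print(f"branching factor {b:2d}: first root move is "
          f"~{first_move_share(b, 20):.0%} of the minimal tree")

With a branching factor of 2 the first move is about 75% of the minimal tree, consistent with the claim that the best move takes way over 50% of the total search time; at a branching factor of 10 it is closer to 55%.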
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: An idea for a new WCCC format - what do you think?

Post by Gian-Carlo Pascutto »

bob wrote:
Can we stay in the real world? Splitting at the root cannot produce a 2.5x speedup when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone who has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
Please split this discussion into a separate thread.
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

bob wrote:
Can we stay in the real world? Splitting at the root cannot produce a 2.5x speedup when the best move at the root takes _way_ over 50% of the total search time. There is theory. There is practice. And there is nonsense. For the event I am talking about, this claim is "nonsense". You might get the uninformed to buy this stuff, but not someone who has been doing it for 30+ years now (my first parallel search played its first ACM event in 1978....)
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing :)

Vas
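
Vas's "effective speedup" can be tied to the Elo figures quoted above with one standard conversion: if a doubling of speed is worth E Elo, a measured gain of G Elo implies a time handicap of 2^(G/E). A minimal Python sketch; the 100 Elo figure is Uri's recollection above, and the Elo-per-doubling values are illustrative assumptions.

Code:

def effective_speedup(gain_elo: float, elo_per_doubling: float) -> float:
    # A gain of gain_elo is worth 2**(gain_elo / elo_per_doubling)
    # in time, by the definition of "Elo per doubling".
    return 2.0 ** (gain_elo / elo_per_doubling)

for epd in (50, 70, 100):
    print(f"100 Elo at {epd} Elo per doubling -> "
          f"{effective_speedup(100, epd):.2f}:1")

At 50 Elo per doubling, 100 Elo implies 4:1; at 70, about 2.7:1; at 100, 2:1. That is exactly Uri's argument: the more each doubling is worth at fast time controls, the smaller the speedup implied by the same Elo gain.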
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
bob wrote:
I _did_ an algorithm like that in 1983. I put it together in under 2 weeks and won the 1983 WCCC championship with it. And I _know_ how it works and how the PVs look.
Actually, I'd be curious to hear more about it. How did you split the root moves? Which moves got PV searches and which ones got scout (null-window) searches? What beta did you use for the scout searches?

I couldn't find anything useful in the public domain about splitting work without shared memory, except for the Cluster Toga algorithm, which wasn't really suitable for Rybka (among other things, it doesn't apply to analysis and would be useless for users).

Vas
What we did back then was more dynamic than that. I had the root move list sorted just as I do today (by previous-iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything would now cut off where it didn't when we started. We then searched the root moves one at a time; whenever a processor finished a move, it grabbed another (dynamic load balancing) until the list was gone.

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by Belle and ourselves as well as others), which is (I assume) what you mean by "scout" (not quite the same as PVS, but close enough for discussion).

I should add that the Univac 1100 and the Cray were both shared-memory machines. I'm looking at cluster-based search right now because of this 70-node, 8-core-per-node cluster we have. I can search about 20M nodes per second on one node, which gives the potential for something faster than Deep Blue (we could possibly hit 1400M nodes per second, although there are great issues to overcome). But Hsu and I did talk a lot about his parallel search, and I like his two-level approach (split at early plies among the nodes; each node splits at late plies among its local processors)...
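
For concreteness, here is a minimal Python sketch of the dynamic root splitting described above: a sorted root move list, workers grabbing the next unsearched move as they finish, and a shared bound updated whenever someone finds a new best move. The toy scoring table and the search_move stub are illustrative stand-ins, not Cray Blitz code; a real engine would run a full alpha-beta search of each subtree.

Code:

import threading
from queue import Empty, Queue

# Toy stand-in: pretend each root move has a fixed score.
TOY_SCORES = {"e4": 30, "d4": 28, "Nf3": 12, "c4": 10, "g3": -5}

best_lock = threading.Lock()
best = {"score": -10_000, "move": None}    # shared root bound

def search_move(move: str, alpha: int) -> int:
    # A real search would use alpha to cut off early.
    return TOY_SCORES[move]

def worker(work: Queue) -> None:
    while True:
        try:
            move = work.get_nowait()       # dynamic load balancing:
        except Empty:                      # grab the next unsearched move
            return
        with best_lock:
            alpha = best["score"]          # pick up bounds improved elsewhere
        score = search_move(move, alpha)
        with best_lock:
            if score > best["score"]:      # report a new best move and bound
                best["score"], best["move"] = score, move

def split_at_root(root_moves, n_workers: int = 2):
    work = Queue()
    for m in root_moves:                   # sorted by prior-iteration nodes
        work.put(m)
    threads = [threading.Thread(target=worker, args=(work,))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return best["move"], best["score"]

print(split_at_root(list(TOY_SCORES)))     # ('e4', 30)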
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An idea for a new WCCC format - what do you think?

Post by bob »

Vasik Rajlich wrote:
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing :)

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. The more changes, the better.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

If both hold, you can get a speedup of N, where N is the number of times you actually change the best move at the root. Typically that is zero in roughly 80% of searches, according to the analysis in the "deep" projects from years ago: "Crafty Goes Deep" and then Heinz's "DarkThought Goes Deeper", or whatever it was called. So with luck, 20% of the time (actually more like 15%) you will change your mind at the root once and get a 2x speedup. 80% of the time you get no speedup to speak of.

The bad part is that as you add more nodes/processors, you get nothing for them unless you change your mind at least once per node, so that all those moves have to be searched completely with no early exits due to cutoffs.

In 1978 when we started this, our branching factor was closer to 10 than to 2, so we potentially gained more then than we would now. Today, with the EBF hovering around 2.0, the first move takes >50% of the time, which means all you can speed up is the remaining 50%, and that is not going to accomplish much performance-wise. For example, say one processor takes 20 minutes to complete a search. Two will take 15 minutes. Four will take 12.5 minutes. An infinite number could complete the search in 10 minutes at minimum, and even that assumes all of those branches produce trees no larger than the first move's, which is unlikely.


I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either. :)
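
bob's arithmetic here is a straightforward Amdahl-style bound and easy to check. A minimal Python sketch, assuming the first move costs 10 minutes, the remaining moves 10 minutes in total, and only the remainder parallelizes perfectly.

Code:

def search_time(n: float) -> float:
    # 10 minutes for the first move (serial) plus the remaining
    # 10 minutes of work split across n processors (optimistic).
    return 10 + 10 / n

for n in (1, 2, 4, float("inf")):
    t = search_time(n)
    print(f"{n} processors: {t:.1f} min, speedup {search_time(1) / t:.2f}x")

This reproduces the 20 / 15 / 12.5 / 10 minute figures, i.e. speedups of 1.33x, 1.6x and 2x, with 2x as the hard ceiling no matter how many processors are added.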
Uri Blass
Posts: 10801
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An idea for a new WCCC format - what do you think?

Post by Uri Blass »

bob wrote:
Vasik Rajlich wrote:
By effective speedup I mean the time handicap you could give to the original entity and score 50%. So, even if you do nothing other than split at the root and if the first move typically takes 50% of your search time, you could still get an effective speedup of >2. Not that that's what Rybka is doing :)

Vas
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. The more changes, the better.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.

If both hold, you can get a speedup of N, where N is the number of times you actually change the best move at the root. Typically that is zero in roughly 80% of searches, according to the analysis in the "deep" projects from years ago: "Crafty Goes Deep" and then Heinz's "DarkThought Goes Deeper", or whatever it was called. So with luck, 20% of the time (actually more like 15%) you will change your mind at the root once and get a 2x speedup. 80% of the time you get no speedup to speak of.

The bad part is that as you add more nodes/processors, you get nothing for them unless you change your mind at least once per node, so that all those moves have to be searched completely with no early exits due to cutoffs.

In 1978 when we started this, our branching factor was closer to 10 than to 2, so we potentially gained more then than we would now. Today, with the EBF hovering around 2.0, the first move takes >50% of the time, which means all you can speed up is the remaining 50%, and that is not going to accomplish much performance-wise. For example, say one processor takes 20 minutes to complete a search. Two will take 15 minutes. Four will take 12.5 minutes. An infinite number could complete the search in 10 minutes at minimum, and even that assumes all of those branches produce trees no larger than the first move's, which is unlikely.


I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either. :)
What you can achieve with an infinite number of processors by splitting at the root is one extra ply, and one ply can be equivalent to more than a 2:1 speed improvement.

The branching factor is irrelevant, because one extra ply on top of a depth n search may be practically better than a normal search to depth n+1, since the depth n+1 search may do more pruning.

The question is what the effective speed difference is between searching to depth n and searching to depth n+1 when none of the root moves are pruned or reduced (neither by null move nor by late move reductions).

In practice the depth is not constant, but you can test it yourself: play version A against version B, where version A uses 2 seconds per move and version B searches for 1 second after every root move (version B should use something like 100 processors, to be practically sure that the number of root moves is smaller than the number of processors). The interesting question is whether version B can win the match. If it wins, you can then increase version A's time to find the effective speedup, i.e. the point where version A scores near 50% against version B.

Uri
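
Uri's measurement procedure can be written down directly: hold version B fixed and adjust version A's time until the match is even; the final time ratio is the effective speedup. A minimal Python sketch; play_match is a hypothetical match runner, and the toy model inside it (B plays like A at 3 seconds, 70 Elo per doubling) exists only so the example runs.

Code:

import math

def play_match(time_a: float) -> float:
    # Hypothetical match runner: version A's score in [0, 1] against B.
    # Toy model: B plays like A would at 3.0 s/move, 70 Elo/doubling.
    elo_diff = 70 * math.log2(time_a / 3.0)
    return 1 / (1 + 10 ** (-elo_diff / 400))

def effective_speedup(base_time: float = 2.0) -> float:
    # Bisect on A's time per move until A scores about 50%.
    lo, hi = base_time, 32 * base_time
    while hi - lo > 0.01:
        mid = (lo + hi) / 2
        if play_match(mid) < 0.5:
            lo = mid                       # A losing: give A more time
        else:
            hi = mid                       # A winning: give A less time
    return lo / base_time

print(f"effective speedup ~{effective_speedup():.2f}:1")   # ~1.50:1 here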
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

bob wrote:
What we did back then was more dynamic than that. I had the root move list sorted just as I do today (by previous-iteration node count). Processor 1 searched the first move with a normal aspiration window. Processor 2 did the same. When one of them reported back with a new best move, the alpha/beta value was shipped to the other processor, which would take a short "break", update all the bounds, and then check to see if anything would now cut off where it didn't when we started. We then searched the root moves one at a time; whenever a processor finished a move, it grabbed another (dynamic load balancing) until the list was gone.

We always did an aspiration-type search in CB, where the root window was very narrow, something like +/- 0.25 pawns, for the first move. At the time we were using PVS (this was something from the late 70's and was used by Belle and ourselves as well as others), which is (I assume) what you mean by "scout" (not quite the same as PVS, but close enough for discussion).

I should add that the Univac 1100 and the Cray were both shared-memory machines. I'm looking at cluster-based search right now because of this 70-node, 8-core-per-node cluster we have. I can search about 20M nodes per second on one node, which gives the potential for something faster than Deep Blue (we could possibly hit 1400M nodes per second, although there are great issues to overcome). But Hsu and I did talk a lot about his parallel search, and I like his two-level approach (split at early plies among the nodes; each node splits at late plies among its local processors)...
Aha, thanks. Like many things, this won't work without shared memory. Without shared memory, if "processor A" (to use your terminology) takes move x at some iteration, it needs to take the same move at the next iteration as well.

Basically, without shared memory, you want to have the same units processing the same parts of the tree throughout the entire search. I agree with Hsu's comment - without shared memory, work needs to be split at or near the root.

Vas

ps. I use the term "scout search" to refer to any search with a zero-width window.
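
For readers following the terminology: below is a minimal PVS sketch in Python showing where the zero-width "scout" searches fit, per Vas's definition. The node interface (children(), evaluate()) is a hypothetical stand-in for illustration, not any engine's actual API.

Code:

def pvs(node, depth: int, alpha: int, beta: int) -> int:
    if depth == 0 or not node.children():
        return node.evaluate()
    first = True
    for child in node.children():
        if first:
            # PV move: searched with the full (alpha, beta) window.
            score = -pvs(child, depth - 1, -beta, -alpha)
            first = False
        else:
            # Scout: zero-width window (alpha, alpha + 1) asks only
            # "is this move better than what we already have?"
            score = -pvs(child, depth - 1, -alpha - 1, -alpha)
            if alpha < score < beta:
                # Scout failed high: re-search with the real window.
                score = -pvs(child, depth - 1, -beta, -alpha)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                          # beta cutoff
    return alpha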
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

bob wrote:
Here's what you need to make that happen.

(1) you need to change your mind at least once at the root during the last couple of iterations. The more changes, the better.
Sure. If you're not changing your mind, it doesn't matter what kind of speedup you have.

(2) you have to hope that the hash information from the first move does not affect any other move. Fine 70 is a good example of where this can be a problem.
That's just life without shared memory. Any cluster implementation is going to have problems in a position like that.

I think you'd be very lucky to get a speedup of 1.5x with any number of processors, which is not zero of course, but it is not something that will make me quake in my boots either. :)
With an infinite # of processors and splitting only at the root, you will get a lot more than 1.5x.

Vas
Vasik Rajlich
Posts: 75
Joined: Mon Nov 27, 2006 6:49 am

Re: An idea for a new WCCC format - what do you think?

Post by Vasik Rajlich »

Uri Blass wrote:
What you can achieve with an infinite number of processors by splitting at the root is one extra ply, and one ply can be equivalent to more than a 2:1 speed improvement.

The branching factor is irrelevant, because one extra ply on top of a depth n search may be practically better than a normal search to depth n+1, since the depth n+1 search may do more pruning.

The question is what the effective speed difference is between searching to depth n and searching to depth n+1 when none of the root moves are pruned or reduced (neither by null move nor by late move reductions).

In practice the depth is not constant, but you can test it yourself: play version A against version B, where version A uses 2 seconds per move and version B searches for 1 second after every root move (version B should use something like 100 processors, to be practically sure that the number of root moves is smaller than the number of processors). The interesting question is whether version B can win the match. If it wins, you can then increase version A's time to find the effective speedup, i.e. the point where version A scores near 50% against version B.

Uri
That's right. The way to say this is that having to work without shared memory is going to have a big impact on the proper shape of your search tree.

Vas