Cluster Rybka

jswaff · Post by **jswaff** » Fri Nov 14, 2008 2:26 pm

Ah, that would make sense. That was splitting at the root only using completely random move ordering...

Dirt wrote:
bob wrote:...But on one of your tests you sent, I though I saw a drop in nodes at say cpus=2, compared to 1, then a jump back up later, but I might well be remembering wrong. But you actually can search fewer nodes with a parallel search at times, that is what produces the super-linear speedups we had discussed elsewhere.
Maybe you were thinking about this data that James posted earlier in the thread:
Code: Select all
CPUS  Time    Speedup Nodes   NPS
1     1.00    1.00    1.00    1.00
2     0.49    2.03    0.98    1.94
3     0.37    2.68    1.06    2.77
4     0.32    3.16    1.14    3.58
5     0.28    3.54    1.22    4.30
6     0.27    3.76    1.31    4.95
7     0.26    3.90    1.39    5.50
8     0.25    3.95    1.49    6.00

bob · Post by **bob** » Fri Nov 14, 2008 5:02 pm

Dirt wrote:
bob wrote:...But on one of your tests you sent, I though I saw a drop in nodes at say cpus=2, compared to 1, then a jump back up later, but I might well be remembering wrong. But you actually can search fewer nodes with a parallel search at times, that is what produces the super-linear speedups we had discussed elsewhere.
Maybe you were thinking about this data that James posted earlier in the thread:
Code: Select all
CPUS  Time    Speedup Nodes   NPS
1     1.00    1.00    1.00    1.00
2     0.49    2.03    0.98    1.94
3     0.37    2.68    1.06    2.77
4     0.32    3.16    1.14    3.58
5     0.28    3.54    1.22    4.30
6     0.27    3.76    1.31    4.95
7     0.26    3.90    1.39    5.50
8     0.25    3.95    1.49    6.00

Could be. That was also in the email he sent to me. That 0.98 nodes caught my eye for 2 cpus although that is hardly unheard of. I am not sure if that is a single position or a set of positions. If a set, it would be problematic, if a single position, it is not that unusual...

Zach Wegner · Post by **Zach Wegner** » Fri Nov 14, 2008 10:38 pm

jswaff wrote:I'm glad to read this... I was wondering how you arrived at that conclusion myself.

My concern was (is) that you've stated roughly 30% increase in node count with each additional processor, with nearly perfect NPS scaling. I'm seeing a much lower overhead, and much worse NPS scaling. The NPS scaling doesn't concern me that much (not that it doesn't, but I can "explain" it). It's the fact that my search overhead seems to be so much lower than yours... it makes me suspect "bug" as well.

That said, on every test I've run Prophet has produced the same answers with eight processors as it has with one, just faster. Also, I've gone as far as splitting min-max style (no cutoffs) and verifying the node counts are correct. So, that makes a bug seem unlikely.

What I took away from our chat during the tournament is that I should focus on improving NPS scaling and see what happens from there.

Edit: Just wanted to point out, I don't ever see a decrease in nodes searched, just a "lower increase." And just to be clear: my intent wasn't to say "Bob, look, I have a lower increase than you!" ... it was more like "Bob, something must be wrong here... any ideas?"

--
James

I have also started to fix NPS scaling. Well, I need to fix node overhead as well, but getting NPS is simpler to measure and diagnose. However, it seems hard to fix NPS. I've looked at cache coherency, and it doesn't seem to be too much of a problem. There is a good amount of copying, but I don't think it should be more than Crafty. I might try to reduce lock contention by increasing granularity (the same lock is used for attaching to a split point and getting a move from it, when those might be able to be separated). The weird thing is though that these shouldn't account for much, I would think this might total .2 out of 4 or so...

Here's my current numbers:

Code: Select all

CPUS  Time    Speedup Nodes   NPS
1     1.00    1.00    1.00    1.00
2     0.67    1.49    1.30    1.94
3     0.50    1.98    1.36    2.69
4     0.48    2.09    1.60    3.36

BTW, are the numbers there just for one position? I thought I remembered you saying your 4x speedup was 2.2x or so.

bob · Post by **bob** » Sat Nov 15, 2008 12:01 am

Zach Wegner wrote:
jswaff wrote:I'm glad to read this... I was wondering how you arrived at that conclusion myself.

My concern was (is) that you've stated roughly 30% increase in node count with each additional processor, with nearly perfect NPS scaling. I'm seeing a much lower overhead, and much worse NPS scaling. The NPS scaling doesn't concern me that much (not that it doesn't, but I can "explain" it). It's the fact that my search overhead seems to be so much lower than yours... it makes me suspect "bug" as well.

That said, on every test I've run Prophet has produced the same answers with eight processors as it has with one, just faster. Also, I've gone as far as splitting min-max style (no cutoffs) and verifying the node counts are correct. So, that makes a bug seem unlikely.

What I took away from our chat during the tournament is that I should focus on improving NPS scaling and see what happens from there.

Edit: Just wanted to point out, I don't ever see a decrease in nodes searched, just a "lower increase." And just to be clear: my intent wasn't to say "Bob, look, I have a lower increase than you!" ... it was more like "Bob, something must be wrong here... any ideas?"

--
James
I have also started to fix NPS scaling. Well, I need to fix node overhead as well, but getting NPS is simpler to measure and diagnose. However, it seems hard to fix NPS. I've looked at cache coherency, and it doesn't seem to be too much of a problem. There is a good amount of copying, but I don't think it should be more than Crafty. I might try to reduce lock contention by increasing granularity (the same lock is used for attaching to a split point and getting a move from it, when those might be able to be separated). The weird thing is though that these shouldn't account for much, I would think this might total .2 out of 4 or so...

Here's my current numbers:
Code: Select all
CPUS  Time    Speedup Nodes   NPS
1     1.00    1.00    1.00    1.00
2     0.67    1.49    1.30    1.94
3     0.50    1.98    1.36    2.69
4     0.48    2.09    1.60    3.36
BTW, are the numbers there just for one position? I thought I remembered you saying your 4x speedup was 2.2x or so.

I have a lock in each split block to limit access. That lock is used when I try to select a move at that ply and want to make sure that the "next move critical section" is serialized. But with a lock per split block, different splits do not interact via a lock contention issue. NPS scaling is an issue. Cache is the most likely candidate whenever this is a problem, but it is not always easy to discover the specific problem. Anything that is shared needs examination. If it is modified at all, it really needs examination. Just having two vales adjacent is enough to cause a problem, even if each cpu only modifies one of the adjacent values... Cache coherency is by the line/block, not by the byte or word.

jswaff · Post by **jswaff** » Sat Nov 15, 2008 3:32 pm

I've posted three different sets of numbers at various points in this thread, and all were for sets of positions. The first shows a max speedup in Prophet of 2.27 splitting only at the root. Bob pointed out even that was too high and indicated move ordering problems. I remarked that those numbers were derived with some parts of Prophet's search disabled (null move for example), and posted some numbers using _completely_ random move ordering. Later I posted some numbers with "everything on", which showed a max speedup of 1.24 (or something like that). But remember, we were talking about splitting only at the root....

I've implemented both YBW and DTS in Prophet. The DTS scales a bit better but still just a little buggy, so I ran with YBW in the ACCA tournament. With YBW I see speedups in the 2.2-2.3 range using four CPUS. With DTS it's just a little higher.

Zach Wegner wrote:BTW, are the numbers there just for one position? I thought I remembered you saying your 4x speedup was 2.2x or so.

Ovyron · Post by **Ovyron** » Tue Nov 18, 2008 9:04 am

Vasik Rajlich's on Rybka Forum:

"These things are really complicated. The cycles are:

1) Think
2) Implement
3) Do other things
4) Think again
5) Realize that it can be done much better

Slowly, everything cycles forward.

It's true that centaur analysis during freestyle games has been an excellent source of ideas. I typically have several machines to coordinate and the machines have limited ability to share state. It's basically the same problem.

Anyway, the 100 Elo figure going from one node to five nodes is real. I'm willing to make a wager on this.

Vas"

bob · Post by **bob** » Tue Nov 18, 2008 6:45 pm

Ovyron wrote:Vasik Rajlich's on Rybka Forum:

"These things are really complicated. The cycles are:

1) Think
2) Implement
3) Do other things
4) Think again
5) Realize that it can be done much better

Slowly, everything cycles forward.

It's true that centaur analysis during freestyle games has been an excellent source of ideas. I typically have several machines to coordinate and the machines have limited ability to share state. It's basically the same problem.

Anyway, the 100 Elo figure going from one node to five nodes is real. I'm willing to make a wager on this.

Vas"

That's always the way to wriggle out of a discussion. You _can_ do as I usually do and produce real data and post it here. Or you can say "I'll bet you what I claimed is true..."

I'm not interested enough to make a wager and waste further time... I know what splitting at the root can gain. It has been documented for 30 years. I suppose I could run the test on Crafty easily enough since it splits at the root in addition to everywhere else. I could just make it split only at the root. Stay tuned and I'll post some real data since they obviously are not...

bob · Post by **bob** » Tue Nov 18, 2008 8:59 pm

As I said, I prefer to let the data do the talking, rather than making claims that are obviously exaggerations...

I modified Crafty to only split at the root. Was trivial to do since it already splits at the root as well as everywhere else in the tree. I just commented out the "split everywhere else" code, and then commented out the "search the first move at root serially before starting parallel search" (young-brothers wait idea) so that we start splitting at the root and search the first N moves in parallel as "cluster rybka does." This test is actually a little more favorable to parallel search than a cluster approach, because even though root moves are searched in parallel in this test, the hash table is still shared, while it is not on a cluster. So this gives an upper bound on split-at-root only performance, a cluster will actually be worse...

I am copying one line from 12 log files. This represents three positions (I chose kopec 20, 21 and 22 as I already had a quick test set up for those). log.001 thru log.004 represent runs on kopec20, depth=16, with 1, 2, 4 and 8 cpus. longs 5-8 are the same for kopec21 and logs 9-12 are for kopec22.

Code: Select all

log.001:              time=26.36  mat=0  n=62520166  fh=94%  nps=2.4M
log.002:              time=1:30  mat=0  n=270966345  fh=94%  nps=3.0M
log.003:              time=55.44  mat=0  n=248918593  fh=94%  nps=4.5M
log.004:              time=1:38  mat=0  n=336495120  fh=94%  nps=3.4M
log.005:              time=15.42  mat=0  n=41452680  fh=95%  nps=2.7M
log.006:              time=16.75  mat=0  n=64363895  fh=95%  nps=3.8M
log.007:              time=33.40  mat=0  n=145168285  fh=96%  nps=4.3M
log.008:              time=12.43  mat=0  n=75031988  fh=95%  nps=6.0M
log.009:              time=11.73  mat=0  n=24606252  fh=94%  nps=2.1M
log.010:              time=36.05  mat=0  n=84805095  fh=94%  nps=2.4M
log.011:              time=41.37  mat=0  n=162147757  fh=93%  nps=3.9M
log.012:              time=1:06  mat=0  n=198969789  fh=93%  nps=3.0M

Analysis. Position K22 does horribly. Every split at root is slower. Since we search N moves in parallel before we have a good alpha bound, the tree blows up badly. over 4x bigger. NFG on #1

2 actually shows a minimal speedup with 8 cpus, somewhat close to the 1.5X upper bound I had mentioned previously. Note NPS doesn't scale very well either as when a processor runs out of work at the root, it sits, where in Crafty it rips back into the search to help out.

#3 is another bad one.

Do you_really_ believe that is going to give you a +100 Elo rating boost.

If so I have some ocean-front property for sale in Kansas (USA) that you might be interested in as well.

For comparison, here is a real parallel search that splits everywhere, same positions, same number of processors...

Code: Select all

log.001:              time=26.44  mat=0  n=62520166  fh=94%  nps=2.4M
log.002:              time=24.20  mat=0  n=113450816  fh=94%  nps=4.7M
log.003:              time=7.99  mat=0  n=78127749  fh=94%  nps=9.8M
log.004:              time=12.69  mat=0  n=214517406  fh=94%  nps=16.9M
log.005:              time=15.38  mat=0  n=41452680  fh=95%  nps=2.7M
log.006:              time=8.68  mat=0  n=45952809  fh=95%  nps=5.3M
log.007:              time=6.35  mat=0  n=64642998  fh=94%  nps=10.2M
log.008:              time=4.85  mat=0  n=76170088  fh=94%  nps=15.7M
log.009:              time=11.73  mat=0  n=24606252  fh=94%  nps=2.1M
log.010:              time=5.66  mat=0  n=24232074  fh=94%  nps=4.3M
log.011:              time=3.03  mat=0  n=25672767  fh=94%  nps=8.5M
log.012:              time=5.27  mat=0  n=71320093  fh=92%  nps=13.5M

I used these because I test with this particular group frequently. #1 gives Crafty fits with a parallel search, and #3 is often very bad because the evaluation tuning can make it change from Bxe4 to another move and back every other iteration...

You decide. Crafty produced better than a 4x speedup overall above. Which is probably worth 100 Elo. In the split at root, it was often _worse_ and not better.

Believe what you want. But facts speak louder than hyperbole, IMHO..

Eizenhammer · Post by **Eizenhammer** » Wed Nov 19, 2008 1:10 am

bob wrote: Believe what you want. But facts speak louder than hyperbole, IMHO..

By showing data about crafty you prove absolutely nothing about the gain Vasik gets by whatever he does (and you dont know what he does).

Gandalf · Post by **Gandalf** » Wed Nov 19, 2008 1:46 am

Could the increase in Elo be due to evaluation rather than speed increase? By somehow making good (very good) use of the additional information from splitting at the root?

Cluster Rybka

Re: more data

Re: more data

Re: more data

Re: more data

Re: more data

Re: Cluster Rybka

Re: Cluster Rybka

Re: Cluster Rybka (real data)

Re: Cluster Rybka (real data)

Re: Cluster Rybka (real data)