Dirt wrote:
bob wrote: ...But on one of your tests you sent, I thought I saw a drop in nodes at say cpus=2, compared to 1, then a jump back up later, but I might well be remembering wrong. But you actually can search fewer nodes with a parallel search at times; that is what produces the super-linear speedups we had discussed elsewhere.
Maybe you were thinking about this data that James posted earlier in the thread:
Code: Select all
CPUS  Time  Speedup  Nodes  NPS
  1   1.00    1.00    1.00  1.00
  2   0.49    2.03    0.98  1.94
  3   0.37    2.68    1.06  2.77
  4   0.32    3.16    1.14  3.58
  5   0.28    3.54    1.22  4.30
  6   0.27    3.76    1.31  4.95
  7   0.26    3.90    1.39  5.50
  8   0.25    3.95    1.49  6.00
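As a quick sanity check (my own script, not from the thread), the table's columns are mutually consistent: Speedup is just the inverse of relative Time, NPS follows from Nodes divided by Time, and the effective speedup can also be read as NPS scaling divided by node overhead. Small deviations are rounding in the posted figures.

```python
# Relative measurements from the table above (1-cpu run normalized to 1.00).
rows = [
    # cpus, time, speedup, nodes, nps
    (1, 1.00, 1.00, 1.00, 1.00),
    (2, 0.49, 2.03, 0.98, 1.94),
    (3, 0.37, 2.68, 1.06, 2.77),
    (4, 0.32, 3.16, 1.14, 3.58),
    (5, 0.28, 3.54, 1.22, 4.30),
    (6, 0.27, 3.76, 1.31, 4.95),
    (7, 0.26, 3.90, 1.39, 5.50),
    (8, 0.25, 3.95, 1.49, 6.00),
]

for cpus, time, speedup, nodes, nps in rows:
    # Speedup ~= 1/Time and NPS ~= Nodes/Time; tolerance covers rounding.
    assert abs(speedup - 1.0 / time) < 0.2
    assert abs(nps - nodes / time) < 0.2
    # Another way to read the table: speedup ~= NPS scaling / node overhead.
    print(f"{cpus} cpus: speedup {speedup:.2f} ~= nps/nodes = {nps / nodes:.2f}")
```

This makes the 2-cpu row easy to interpret: with node overhead below 1.00 (0.98) and NPS scaling of 1.94, the speedup 2.03 comes out slightly super-linear.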
Cluster Rybka
Re: more data
Ah, that would make sense. That was splitting at the root only using completely random move ordering...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more data
Could be. That was also in the email he sent to me. That 0.98 nodes figure caught my eye for 2 cpus, although that is hardly unheard of. I am not sure if that is a single position or a set of positions. If a set, it would be problematic; if a single position, it is not that unusual...
-
- Posts: 1922
- Joined: Thu Mar 09, 2006 12:51 am
- Location: Earth
Re: more data
I have also started to work on NPS scaling. Well, I need to fix node overhead as well, but NPS is simpler to measure and diagnose. However, it seems hard to fix. I've looked at cache coherency, and it doesn't seem to be too much of a problem. There is a good amount of copying, but I don't think it should be more than in Crafty. I might try to reduce lock contention by increasing granularity (the same lock is used for attaching to a split point and for getting a move from it, when those might be separable). The weird thing, though, is that these shouldn't account for much; I would think they might total 0.2 out of 4 or so...

jswaff wrote:
I'm glad to read this... I was wondering how you arrived at that conclusion myself.
My concern was (is) that you've stated roughly 30% increase in node count with each additional processor, with nearly perfect NPS scaling. I'm seeing a much lower overhead, and much worse NPS scaling. The NPS scaling doesn't concern me that much (not that it doesn't, but I can "explain" it). It's the fact that my search overhead seems to be so much lower than yours... it makes me suspect "bug" as well.
That said, on every test I've run Prophet has produced the same answers with eight processors as it has with one, just faster. Also, I've gone as far as splitting min-max style (no cutoffs) and verifying the node counts are correct. So, that makes a bug seem unlikely.
What I took away from our chat during the tournament is that I should focus on improving NPS scaling and see what happens from there.
Edit: Just wanted to point out, I don't ever see a decrease in nodes searched, just a "lower increase." And just to be clear: my intent wasn't to say "Bob, look, I have a lower increase than you!" ... it was more like "Bob, something must be wrong here... any ideas?"
--
James
Here's my current numbers:
Code: Select all
CPUS Time Speedup Nodes NPS
1 1.00 1.00 1.00 1.00
2 0.67 1.49 1.30 1.94
3 0.50 1.98 1.36 2.69
4 0.48 2.09 1.60 3.36
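Zach's granularity idea above (separating the "attach to a split point" lock from the "get the next move" lock) can be sketched as a toy model. This is my own illustration, not Prophet's or Crafty's actual code; all names are hypothetical.

```python
import threading

class SplitPoint:
    """Toy split point: helper threads attach, then pull moves to search.

    Hypothetical sketch of the granularity idea: 'attach' and 'next move'
    each get their own lock, so a thread joining the split point does not
    block a thread that is merely grabbing its next move.
    """

    def __init__(self, moves):
        self.moves = list(moves)
        self.helpers = []
        self.attach_lock = threading.Lock()   # protects self.helpers
        self.move_lock = threading.Lock()     # protects self.moves

    def attach(self, thread_id):
        with self.attach_lock:
            self.helpers.append(thread_id)

    def next_move(self):
        with self.move_lock:
            return self.moves.pop(0) if self.moves else None

def helper(sp, tid, searched):
    sp.attach(tid)
    while (move := sp.next_move()) is not None:
        searched.append((tid, move))  # stand-in for actually searching the move

sp = SplitPoint(range(20))
searched = []
threads = [threading.Thread(target=helper, args=(sp, t, searched)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every move is handed out exactly once, regardless of which helper got it.
assert sorted(m for _, m in searched) == list(range(20))
print(f"{len(sp.helpers)} helpers searched {len(searched)} moves")
```

With a single lock covering both operations, an attaching thread would stall every helper trying to pick up its next move; splitting the lock removes that interaction at the cost of one extra lock per split point.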
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more data
I have a lock in each split block to limit access. That lock is used when I try to select a move at that ply, to make sure that the "next move" critical section is serialized. But with a lock per split block, different splits do not interact through lock contention.

NPS scaling is an issue. Cache is the most likely candidate whenever this is a problem, but it is not always easy to discover the specific cause. Anything that is shared needs examination; if it is modified at all, it really needs examination. Just having two values adjacent is enough to cause a problem, even if each cpu only modifies one of the two... cache coherency works on the line/block, not on the byte or word.
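Bob's point about adjacent values is the classic false-sharing problem: coherency traffic is per cache line, so two per-cpu counters that happen to share a 64-byte line ping-pong between caches even though neither cpu reads the other's value. A small layout sketch of my own (field names hypothetical, 64-byte line size assumed; the real line size is hardware-dependent):

```python
import ctypes

LINE = 64  # assumed cache line size; actual size depends on the hardware

class Unpadded(ctypes.Structure):
    # Two counters, each touched by a different cpu.  They land in the same
    # 64-byte line, so every write by one cpu invalidates the line in the
    # other cpu's cache (false sharing).
    _fields_ = [("node_count_a", ctypes.c_uint64),
                ("node_count_b", ctypes.c_uint64)]

class Padded(ctypes.Structure):
    # Padding pushes the second counter into its own cache line.
    _fields_ = [("node_count_a", ctypes.c_uint64),
                ("_pad", ctypes.c_uint8 * (LINE - 8)),
                ("node_count_b", ctypes.c_uint64)]

def line_of(struct, field):
    # Which cache line a field's offset falls into.
    return getattr(struct, field).offset // LINE

assert line_of(Unpadded, "node_count_a") == line_of(Unpadded, "node_count_b")
assert line_of(Padded, "node_count_a") != line_of(Padded, "node_count_b")
print("unpadded: both counters in line", line_of(Unpadded, "node_count_a"))
print("padded:   counters in lines",
      line_of(Padded, "node_count_a"), "and", line_of(Padded, "node_count_b"))
```

In C one would get the same effect with alignment attributes rather than an explicit pad array; the point is only that shared, modified data needs line-granular separation to avoid hidden NPS loss.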
Re: more data
Zach Wegner wrote: BTW, are the numbers there just for one position? I thought I remembered you saying your 4x speedup was 2.2x or so.

I've posted three different sets of numbers at various points in this thread, and all were for sets of positions. The first set showed a max speedup in Prophet of 2.27 splitting only at the root. Bob pointed out that even that was too high and indicated move-ordering problems. I remarked that those numbers were derived with some parts of Prophet's search disabled (null move, for example), and posted some numbers using _completely_ random move ordering. Later I posted numbers with "everything on", which showed a max speedup of 1.24 (or something like that). But remember, we were talking about splitting only at the root....

I've implemented both YBW and DTS in Prophet. DTS scales a bit better but is still a little buggy, so I ran with YBW in the ACCA tournament. With YBW I see speedups in the 2.2-2.3 range using four CPUs. With DTS it's just a little higher.
-
- Posts: 4558
- Joined: Tue Jul 03, 2007 4:30 am
Re: Cluster Rybka
Vasik Rajlich, on the Rybka Forum:
"These things are really complicated. The cycles are:
1) Think
2) Implement
3) Do other things
4) Think again
5) Realize that it can be done much better
Slowly, everything cycles forward.
It's true that centaur analysis during freestyle games has been an excellent source of ideas. I typically have several machines to coordinate and the machines have limited ability to share state. It's basically the same problem.
Anyway, the 100 Elo figure going from one node to five nodes is real. I'm willing to make a wager on this.
Vas"
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Cluster Rybka
That's always the way to wriggle out of a discussion. You _can_ do as I usually do and produce real data and post it here. Or you can say "I'll bet you what I claimed is true..."
I'm not interested enough to make a wager and waste further time... I know what splitting at the root can gain; it has been documented for 30 years. I suppose I could run the test on Crafty easily enough, since it splits at the root in addition to everywhere else. I could just make it split only at the root. Stay tuned and I'll post some real data, since those claims obviously are not...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Cluster Rybka (real data)
As I said, I prefer to let the data do the talking, rather than making claims that are obviously exaggerations...
I modified Crafty to split only at the root. That was trivial to do, since it already splits at the root as well as everywhere else in the tree. I just commented out the "split everywhere else" code, and then commented out the "search the first move at the root serially before starting the parallel search" (young-brothers-wait idea), so that we start splitting at the root and search the first N moves in parallel, as "cluster Rybka" does. This test is actually a little more favorable to parallel search than a cluster approach, because even though root moves are searched in parallel in this test, the hash table is still shared, while on a cluster it is not. So this gives an upper bound on split-at-root-only performance; a cluster will actually be worse...

I am copying one line from each of 12 log files. This represents three positions (I chose Kopec 20, 21 and 22 as I already had a quick test set up for those). log.001 thru log.004 represent runs on kopec20, depth=16, with 1, 2, 4 and 8 cpus. Logs 5-8 are the same for kopec21, and logs 9-12 are for kopec22.

Code: Select all
log.001: time=26.36 mat=0 n=62520166 fh=94% nps=2.4M
log.002: time=1:30 mat=0 n=270966345 fh=94% nps=3.0M
log.003: time=55.44 mat=0 n=248918593 fh=94% nps=4.5M
log.004: time=1:38 mat=0 n=336495120 fh=94% nps=3.4M
log.005: time=15.42 mat=0 n=41452680 fh=95% nps=2.7M
log.006: time=16.75 mat=0 n=64363895 fh=95% nps=3.8M
log.007: time=33.40 mat=0 n=145168285 fh=96% nps=4.3M
log.008: time=12.43 mat=0 n=75031988 fh=95% nps=6.0M
log.009: time=11.73 mat=0 n=24606252 fh=94% nps=2.1M
log.010: time=36.05 mat=0 n=84805095 fh=94% nps=2.4M
log.011: time=41.37 mat=0 n=162147757 fh=93% nps=3.9M
log.012: time=1:06 mat=0 n=198969789 fh=93% nps=3.0M

Analysis: position K22 does horribly. Every split at the root is slower. Since we search N moves in parallel before we have a good alpha bound, the tree blows up badly, over 4x bigger. NFG on #1.

#2 actually shows a minimal speedup with 8 cpus, somewhat close to the 1.5x upper bound I had mentioned previously. Note that NPS doesn't scale very well either: when a processor runs out of work at the root, it sits idle, where in normal Crafty it rips back into the search to help out.

#3 is another bad one.

Do you _really_ believe that is going to give you a +100 Elo rating boost? If so, I have some ocean-front property for sale in Kansas (USA) that you might be interested in as well.

For comparison, here is a real parallel search that splits everywhere, same positions, same number of processors...

Code: Select all
log.001: time=26.44 mat=0 n=62520166 fh=94% nps=2.4M
log.002: time=24.20 mat=0 n=113450816 fh=94% nps=4.7M
log.003: time=7.99 mat=0 n=78127749 fh=94% nps=9.8M
log.004: time=12.69 mat=0 n=214517406 fh=94% nps=16.9M
log.005: time=15.38 mat=0 n=41452680 fh=95% nps=2.7M
log.006: time=8.68 mat=0 n=45952809 fh=95% nps=5.3M
log.007: time=6.35 mat=0 n=64642998 fh=94% nps=10.2M
log.008: time=4.85 mat=0 n=76170088 fh=94% nps=15.7M
log.009: time=11.73 mat=0 n=24606252 fh=94% nps=2.1M
log.010: time=5.66 mat=0 n=24232074 fh=94% nps=4.3M
log.011: time=3.03 mat=0 n=25672767 fh=94% nps=8.5M
log.012: time=5.27 mat=0 n=71320093 fh=92% nps=13.5M

I used these positions because I test with this particular group frequently. #1 gives Crafty fits with a parallel search, and #3 is often very bad because the evaluation tuning can make it change from Bxe4 to another move and back every other iteration...

You decide. Crafty produced better than a 4x speedup overall above, which is probably worth 100 Elo. With split-at-root only, it was often _worse_, not better.

Believe what you want. But facts speak louder than hyperbole, IMHO.
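The log lines above can be reduced to per-position speedups with a few lines of arithmetic. This is my own script, not from the thread; it just converts Crafty's time format and divides the 1-cpu time by each parallel time. It reproduces the analysis: split-at-root is slower at every cpu count on kopec20 and kopec22, and kopec21 tops out around 1.24x, while the split-everywhere run on kopec21 reaches about 3.2x with 8 cpus.

```python
def seconds(t):
    """Convert a Crafty-style time string ('55.44' or '1:38') to seconds."""
    if ":" in t:
        m, s = t.split(":")
        return 60 * int(m) + float(s)
    return float(t)

# Times for 1, 2, 4 and 8 cpus, copied from the two log listings above.
root_only = {
    "kopec20": ["26.36", "1:30", "55.44", "1:38"],
    "kopec21": ["15.42", "16.75", "33.40", "12.43"],
    "kopec22": ["11.73", "36.05", "41.37", "1:06"],
}
split_everywhere = {
    "kopec20": ["26.44", "24.20", "7.99", "12.69"],
    "kopec21": ["15.38", "8.68", "6.35", "4.85"],
    "kopec22": ["11.73", "5.66", "3.03", "5.27"],
}

for name, logs in (("root only", root_only), ("split everywhere", split_everywhere)):
    for pos, times in logs.items():
        t = [seconds(x) for x in times]
        sp = [t[0] / x for x in t]  # speedup relative to the 1-cpu run
        print(f"{name:16s} {pos}: speedups", [f"{s:.2f}" for s in sp])
```

A speedup below 1.00 in the root-only rows is a genuine slowdown: the parallel run took longer than the serial one on the same position.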
Re: Cluster Rybka (real data)
bob wrote: Believe what you want. But facts speak louder than hyperbole, IMHO..

By showing data about Crafty you prove absolutely nothing about the gain Vasik gets from whatever he does (and you don't know what he does).
Re: Cluster Rybka (real data)
Could the increase in Elo be due to evaluation rather than a speed increase? By somehow making good (very good) use of the additional information from splitting at the root?