If You have cluster/PC with 100 CPUs...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: If You have cluster/PC with 100 CPUs...

Post by FlavusSnow »

Do the math. You are at depth D, searching each root move at the same time. It will take as long to search them all as to search just one, so you spend 38x the horsepower to reach depth D compared to putting all processors on one tree. And since 38 is a bit over 2^5 (compared to 32), we could go 5 plies _deeper_ in the time you spend reaching the nominal depth D (we are 5 doublings faster using all 32 processors on one tree, and with an effective branching factor of 2.0 each doubling is one ply). Which sounds better?
If we're only searching the top two moves at the root node, we're not using 38x the horsepower, we would be using roughly 2x the horsepower.

And really what I've been suggesting all along is that at some depth, we're spending too much time searching nodes that will never be played. We might be able to ignore many of these nodes by advancing the search up one ply along the PV. The reason I'd search the top two moves at the root node is so that new findings could still affect the best move in the current position. Simply advancing along a sole PV would be pointless because it would never change the decision made at the root node.

So the million-dollar question: when does spending 2x the horsepower near the root node pay off in depth along the PV? I'd feel pretty confident that in a fixed-depth game at, say, 30 plies, you could reasonably reach that depth using the method I just described. I think it would be time-prohibitive to let an engine search to 30 plies on every move. Granted, a true 30-ply search would be more accurate (stronger), but it would take much longer. And a 30-ply search using my method may be stronger than a 28-ply search (or however deep it would get in the same time) using traditional algorithms.
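The quoted arithmetic can be sanity-checked in a few lines. This is a toy model assuming an effective branching factor of exactly 2.0 and perfect parallel speedup; neither holds exactly in practice:

```python
# Idealized model of the quoted trade-off: 38 root moves, 32 processors,
# effective branching factor (EBF) of 2.0. Perfect parallel speedup is
# assumed, which is optimistic.
import math

EBF = 2.0
PROCESSORS = 32

# Putting one processor on each root move reaches depth D in the time
# one processor needs for a single move's subtree. Putting all 32
# processors on one tree is 32x faster instead, and since 32 = 2**5,
# that speedup buys log2(32) = 5 extra plies at an EBF of 2.0.
extra_plies = math.log2(PROCESSORS)
print(extra_plies)  # 5.0
```

At an EBF near 2, those 5 plies are the whole argument: the split-at-the-root scheme spends 38x the hardware to stand still at depth D.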
hgm
Posts: 27820
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: If You have cluster/PC with 100 CPUs...

Post by hgm »

So why don't you try it?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: If You have cluster/PC with 100 CPUs...

Post by bob »

FlavusSnow wrote:
Do the math. You are at depth D, searching each root move at the same time. It will take as long to search them all as to search just one, so you spend 38x the horsepower to reach depth D compared to putting all processors on one tree. And since 38 is a bit over 2^5 (compared to 32), we could go 5 plies _deeper_ in the time you spend reaching the nominal depth D (we are 5 doublings faster using all 32 processors on one tree, and with an effective branching factor of 2.0 each doubling is one ply). Which sounds better?
If we're only searching the top two moves at the root node, we're not using 38x the horsepower, we would be using roughly 2x the horsepower.
How, exactly, do you find the top 2 moves without searching them? And the top two moves will likely be different depending on what depth you use to obtain them.


And really what I've been suggesting all along is that at some depth, we're spending too much time searching nodes that will never be played. We might be able to ignore many of these nodes by advancing the search up one ply along the PV. The reason I'd search the top two moves at the root node is so that new findings could still affect the best move in the current position. Simply advancing along a sole PV would be pointless because it would never change the decision made at the root node.

So the million-dollar question: when does spending 2x the horsepower near the root node pay off in depth along the PV? I'd feel pretty confident that in a fixed-depth game at, say, 30 plies, you could reasonably reach that depth using the method I just described. I think it would be time-prohibitive to let an engine search to 30 plies on every move.

Sounds good in theory, until you analyze it more carefully. How do you know which two moves to search to depth 30, without searching _all_ moves to depth 29? This is subject to _major_ tactical oversights. And again, each additional ply typically costs only 2x the time of the previous ply. Searching two different moves on two different nodes will also take 2x longer to go one ply deeper.


Granted, a true 30-ply search would be more accurate (stronger), but it would take much longer. And a 30-ply search using my method may be stronger than a 28-ply search (or however deep it would get in the same time) using traditional algorithms.
You keep saying "much longer" where IMHO 2x is not "much longer"... And with the problem of identifying the 2 best moves, I simply don't see how it will work.
FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: If You have cluster/PC with 100 CPUs...

Post by FlavusSnow »

I want to in the worst way. I dabble a bit, but I am an engineer with a full plate of work in my own field. The best I could do is probably chop up some MSCP (or similar) code as a proof of concept, but I don't think that would do it justice, because it's not the same type of search the premier engines use. Otherwise, I could manually perform the search through a few games, but a few games are statistically close to worthless (not to mention that critics would question my manual methods... as they should).

So for now I think I'm relegated to the peanut gallery.
FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: If You have cluster/PC with 100 CPUs...

Post by FlavusSnow »

Well, you'd take the top two moves from some shallower depth and search them deeper (not full-width, only deeper along their respective PVs). I agree it would result in oversights, but I think that's normal for any form of pruning. And maybe that would be the downfall... I don't know how it would pan out if implemented.

As for the 'much longer': it comes from watching engines play... I've seen a good number (probably several thousand complete) of games played by Stockfish, dating back to version 1.6 when it caught my eye. Running on a 64-bit 4-core AMD machine at 3.4 GHz, it rattles through the early plies very quickly up to some depth (which depends heavily on the position). Sometimes it starts slowing down at 18 ply, sometimes it's 26 or 27 ply. But inevitably it hits something like a brick wall and spends roughly half of its time thinking on the final ply. In some positions I'd say it sits for 2/3rds of its time on the same ply. And after all that, it plays a move, the opposing engine replies with the same move Stockfish was pondering, and then Stockfish is thinking again and the score changes up or down because now it's seeing further out. So why doesn't the engine just do that before it plays the move: what happens to the score if I play the move and my opponent responds how I think they will? How does that compare to what I thought was the 2nd best move? It's essentially 6 searches (3 for each of the top 2 moves). The question is, can it perform the 6 reduced-depth searches faster than searching 2 more plies full-width?

And keep in mind, this is in the context of a large core-count machine. I don't think it's a good idea at all to try on a single-threaded machine.

Maybe this weekend I'll pull Singularity(C) off of FICS and use the hardware to play a couple of games of Crafty vs. Stockfish, and document some times per move, depths, and game results.
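The closing question above can at least be bounded with a back-of-envelope model. Two assumptions not stated in the post: an effective branching factor of exactly 2.0, and that each of the 6 reduced searches starts two plies down a PV and runs to the original nominal depth:

```python
# Back-of-envelope cost comparison for the "6 searches vs. 2 more
# plies full-width" question, assuming an effective branching factor
# of 2.0 and that each reduced search is rooted 2 plies down a PV and
# searched to the original nominal depth D. Model, not measurement.
EBF = 2.0
D = 20                                    # nominal depth of the base search

full_width_two_more = EBF ** (D + 2)      # one full-width search to D+2
six_reduced = 6 * EBF ** (D - 2)          # six searches rooted 2 plies down

print(six_reduced / full_width_two_more)  # 6/16 = 0.375
```

Under this model the six reduced searches are cheaper, which is the intuition behind the proposal; bob's objection is about what the full-width search would have seen that those six lines miss.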
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: If You have cluster/PC with 100 CPUs...

Post by bob »

FlavusSnow wrote:Well, you'd take the top two moves from some shallower depth and search them deeper (not full-width, only deeper along their respective PVs). I agree it would result in oversights, but I think that's normal for any form of pruning. And maybe that would be the downfall... I don't know how it would pan out if implemented.

As for the 'much longer': it comes from watching engines play... I've seen a good number (probably several thousand complete) of games played by Stockfish, dating back to version 1.6 when it caught my eye. Running on a 64-bit 4-core AMD machine at 3.4 GHz, it rattles through the early plies very quickly up to some depth (which depends heavily on the position). Sometimes it starts slowing down at 18 ply, sometimes it's 26 or 27 ply. But inevitably it hits something like a brick wall and spends roughly half of its time thinking on the final ply. In some positions I'd say it sits for 2/3rds of its time on the same ply. And after all that, it plays a move, the opposing engine replies with the same move Stockfish was pondering, and then Stockfish is thinking again and the score changes up or down because now it's seeing further out. So why doesn't the engine just do that before it plays the move: what happens to the score if I play the move and my opponent responds how I think they will? How does that compare to what I thought was the 2nd best move? It's essentially 6 searches (3 for each of the top 2 moves). The question is, can it perform the 6 reduced-depth searches faster than searching 2 more plies full-width?

And keep in mind, this is in the context of a large core-count machine. I don't think it's a good idea at all to try on a single-threaded machine.

Maybe this weekend I'll pull Singularity(C) off of FICS and use the hardware to play a couple of games of Crafty vs. Stockfish, and document some times per move, depths, and game results.
Just for clarity, with an effective branching factor of around 2, the _last_ ply will, by definition, take about 1/2 the total time. :) It takes about as long as everything that has been searched so far.
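The "half the total time" figure is just a geometric series; a short check, assuming an effective branching factor of exactly 2.0:

```python
# With an effective branching factor of 2, iteration d costs ~2**d,
# so the last iteration costs about as much as all earlier iterations
# combined, i.e. roughly half the total time.
EBF = 2.0
D = 20

costs = [EBF ** d for d in range(1, D + 1)]
share_of_last = costs[-1] / sum(costs)
print(share_of_last)   # ~0.5 (exactly 2**20 / (2**21 - 2))
```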
FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: If You have cluster/PC with 100 CPUs...

Post by FlavusSnow »

But on a 64-core machine, I could divide the cores up a little more to gain efficiency. 32 cores could be searching the root node, while 16 cores each could be extending the top two PVs.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: If You have cluster/PC with 100 CPUs...

Post by bob »

FlavusSnow wrote:But on a 64-core machine, I could divide the cores up a little more to gain efficiency. 32 cores could be searching the root node, while 16 cores each could be extending the top two PVs.
Trust me, this has been analyzed to death. The best approach is to use all processors together, rather than dividing things up at the root. My Ph.D. dissertation has the mathematical proof, if you want this from a math perspective. The minute you search two moves independently, you violate the basis of alpha/beta and efficiency goes in the toilet.
FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: If You have cluster/PC with 100 CPUs...

Post by FlavusSnow »

Do you think that a less efficient method could ever yield a higher ELO performance?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: If You have cluster/PC with 100 CPUs...

Post by bob »

FlavusSnow wrote:Do you think that a less efficient method could ever yield a higher ELO performance?
No. That is a contradiction in terms, in fact...