Actual speedups from YBWC and ABDADA on 8+ core machines?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan »

bob wrote: ...
20cpus: 16.3
That seems pretty spectacular: a 15.3x speedup with 16.3x the nodes per second.

That means your tree with 20 cores is only 6.5% bigger than with 1 core (16.3 / 15.3 ≈ 1.065). I would have thought that to be almost impossible. Roughly 14% of the nodes that I split at end up failing high; your percentage must be astronomically low.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by mar »

TomKerrigan wrote:I've seen a number of references to Lazy SMP but I can't find an explanation of it anywhere. Could somebody provide a link please?

If Lazy SMP is inferior to ABDADA though, I'm not really sure why anybody would use it, since ABDADA is crazy easy to implement.
I haven't tried ABDADA myself, but you should find references to Lazy SMP easily in this forum (Dan Homan came up with it).
It's basically just a shared TT where every other slave starts at depth+1 (or variations on this scheme).
So I think it would be interesting to compare it to ABDADA.
I didn't do any speedup tests, only measured elo gain in real games.
It's important to make sure that whenever a "slave" finishes the root before the master (which is what blocks the iterative-deepening loop), the master is aborted and the slave's result is properly propagated.
This can become a problem especially at high depths. In fact my initial implementation suffered from this as well; I noticed it while analyzing a position with 4 cores when only two were working :)
Of course broken Lazy SMP, aka "fancy" shared TT, often fails to do that, and then we see lots of rumors here ;)
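
A minimal sketch of the scheme, assuming a search() that probes the shared TT and polls the stop flag; all names here are illustrative placeholders, not from any particular engine:

Code: Select all

/* Lazy SMP sketch: all threads share the TT (inside search(), not shown)
   and search the same root position; every other helper goes one ply
   deeper.  Position, search() and INF are engine-specific placeholders.
   Assumes nthreads <= 64. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define INF 32000
typedef struct { char squares[112]; } Position;  /* placeholder board state */
extern int search(Position *pos, int depth, int alpha, int beta);

static atomic_bool stop_all;   /* raised by whichever thread finishes the root first */
static atomic_int  root_score; /* that thread's result, propagated to the ID loop */

typedef struct { Position pos; int depth; } HelperArgs;

static void *helper_thread(void *arg)
{
    HelperArgs *a = (HelperArgs *)arg;
    int score = search(&a->pos, a->depth, -INF, INF);
    /* The critical part: if a helper finishes the root before the master,
       it must abort the others and its result must not be thrown away. */
    if (!atomic_exchange(&stop_all, true))
        atomic_store(&root_score, score);
    return NULL;
}

void lazy_smp_iteration(const Position *root, int depth, int nthreads)
{
    pthread_t  tid[64];
    HelperArgs args[64];
    atomic_store(&stop_all, false);
    for (int i = 0; i < nthreads; i++) {
        args[i].pos   = *root;
        args[i].depth = depth + (i & 1);   /* every other thread: depth+1 */
        pthread_create(&tid[i], NULL, helper_thread, &args[i]);
    }
    for (int i = 0; i < nthreads; i++)     /* search() is assumed to unwind
                                              as soon as stop_all is set */
        pthread_join(tid[i], NULL);
}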
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by syzygy »

bob wrote:(2) threads now exclusively do what Cray Blitz (DTS) called a "late join". Once a thread requests a split, it enters a tight loop looking for split points. As soon as it finds one (or more) it chooses the best and then joins by allocating a split block and repeating the parent's actions for itself.
So an iterative search is not needed after all? :-)

I wonder how much of this old discussion applies. Repeating parent's actions = replaying the moves from the root position to the split node?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob »

TomKerrigan wrote:
bob wrote: ...
20cpus: 16.3
That seems pretty spectacular: a 15.3x speedup with 16.3x the nodes per second.

That means your tree with 20 cores is only 6.5% bigger than with 1 core (16.3 / 15.3 ≈ 1.065). I would have thought that to be almost impossible. Roughly 14% of the nodes that I split at end up failing high; your percentage must be astronomically low.
I have the actual tree size increases, which is a normal measurement. And yes, the overhead is less than expected. But then these are not tactical positions with an oddball solution move; they are positions from a real game, so there is not a lot of failing high and failing low going on. That certainly helps compared to searching random positions, where the best move is often ranked last and normal move ordering is lousy due to the convoluted nature of the positions...

Looking at the spreadsheet, the average search overhead is about 10% (tree contains about 10% more nodes than the 1 cpu tree).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob »

syzygy wrote:
bob wrote:(2) threads now exclusively do what Cray Blitz (DTS) called a "late join". Once a thread requests a split, it enters a tight loop looking for split points. As soon as it finds one (or more) it chooses the best and then joins by allocating a split block and repeating the parent's actions for itself.
So an iterative search is not needed after all? :-)

I wonder how much of this old discussion applies. Repeating parent's actions = replaying the moves from the root position to the split node?
Not sure what you mean. If you are talking about the DTS-like iterative search, its benefit is that you can split wherever you want. Pre-splitting at those good split points seems to be a pretty decent alternative that gives about the same effect. But comparing these results to the older DTS results is a bit unfair, since the tree has changed so much. I think it is much easier to search a depth-25 tree in parallel than a depth-10 tree. DTS looked better in 1992 or so, searching trees of depth 10 or so, than it did in my 1988 dissertation, where the Bratko-Kopec positions were searched to only 5 plies on a 30-cpu Sequent Balance 21000.

"repeating parent's actions" means allocating a split block, and then copying whatever data is needed from the parent split block to the current split block. IE repetition list, killer move list, a very few other things... particularly the current board position for the split point..
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by syzygy »

bob wrote:"repeating parent's actions" means allocating a split block, and then copying whatever data is needed from the parent split block to the current split block. IE repetition list, killer move list, a very few other things... particularly the current board position for the split point..
So the parent had to copy all that data to prepare for the split?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob »

syzygy wrote:
bob wrote:"repeating parent's actions" means allocating a split block, and then copying whatever data is needed from the parent split block to the current split block. IE repetition list, killer move list, a very few other things... particularly the current board position for the split point..
So the parent had to copy all that data to prepare for the split?
Yes, but "all" is not very big. The board is 112 bytes. The rest of the stuff maybe 512 bytes or so total, I have not counted precisely.

But that is all the parent does before it is searching again. Each child that joins has to copy that same data to its own local split block and link to the parent's split block to find moves at the split ply. It really is a low-overhead approach compared to previous versions of Crafty.
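
The shared move list at the split ply might look something like this, with each joined thread grabbing the next untried move under a lock; again illustrative, not Crafty's actual code:

Code: Select all

/* Illustrative shared move list at a split ply; joined threads pull the
   next untried move under a lock.  Move is a placeholder type. */
#include <pthread.h>

typedef unsigned int Move;

typedef struct {
    pthread_mutex_t lock;
    Move moves[256];
    int  count;   /* number of generated moves at the split ply */
    int  next;    /* index of the next move not yet handed out */
} SharedMoveList;

/* Returns 1 and fills *m if a move remains, 0 once the split ply is done. */
int next_split_move(SharedMoveList *ml, Move *m)
{
    int ok = 0;
    pthread_mutex_lock(&ml->lock);
    if (ml->next < ml->count) {
        *m = ml->moves[ml->next++];
        ok = 1;
    }
    pthread_mutex_unlock(&ml->lock);
    return ok;
}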
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan »

bob wrote: I have the actual tree size increases, which is a normal measurement. And yes, the overhead is less than expected. But then these are not tactical positions with an oddball solution move; they are positions from a real game, so there is not a lot of failing high and failing low going on. That certainly helps compared to searching random positions, where the best move is often ranked last and normal move ordering is lousy due to the convoluted nature of the positions...

Looking at the spreadsheet, the average search overhead is about 10% (tree contains about 10% more nodes than the 1 cpu tree).
Ah, I see. I did some quick runs with your positions and am getting significantly better speedups: ~5x on 8 cores instead of ~4x. Unfortunately, though, my trees are often getting much more than 10% bigger.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan »

bob wrote: ...
(1) very lightweight splitting code. When a processor goes idle, any active thread will try to split and give it something to do. In fact, this is pretty useful as with 20 cores, when a thread goes idle, it will likely see 6-8 potential split points at a minimum, and it gets to choose which one to accept. All a splitting thread does is grab a new split block, link it to the parent, copy the necessary data, and then continue searching.
...
I find this interesting.

In my code, every time the search function finishes searching a move (that didn't fail high), it checks to see if there are idle threads to help it search the rest of the moves. If so, it creates a split data structure and assigns the idle threads to it. In other words, the idle threads just work on whatever becomes available first. Maybe this isn't the best.

Can you explain a bit about how the idle threads choose which split node to work on?
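
I imagine it scans the active split points and scores them somehow; here is a guess at the shape of it, reusing the split block idea from earlier in the thread (the remaining-depth criterion is purely an assumption for illustration):

Code: Select all

/* Guess at a "late join" selection loop: an idle thread scans the active
   split points and joins the most attractive one.  The remaining-depth
   criterion and the helper functions are assumptions, not Crafty's code. */
#define MAX_SPLITS 64

typedef struct SplitBlock SplitBlock;
extern SplitBlock *active_splits[MAX_SPLITS];
extern int split_remaining_depth(const SplitBlock *sp);  /* assumed helper */
extern int split_is_done(const SplitBlock *sp);          /* assumed helper */

SplitBlock *find_best_split_point(void)
{
    SplitBlock *best = NULL;
    int best_score = -1;
    for (int i = 0; i < MAX_SPLITS; i++) {
        SplitBlock *sp = active_splits[i];
        if (sp == NULL || split_is_done(sp))
            continue;
        int score = split_remaining_depth(sp);  /* deeper = more work left */
        if (score > best_score) {
            best_score = score;
            best = sp;
        }
    }
    return best;   /* the idle thread loops until this returns non-NULL,
                      then joins as in join_split_point() above */
}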
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob »

TomKerrigan wrote:
bob wrote: I have the actual tree size increases, which is a normal measurement. And yes, the overhead is less than expected. But then these are not tactical positions with an oddball solution move; they are positions from a real game, so there is not a lot of failing high and failing low going on. That certainly helps compared to searching random positions, where the best move is often ranked last and normal move ordering is lousy due to the convoluted nature of the positions...

Looking at the spreadsheet, the average search overhead is about 10% (tree contains about 10% more nodes than the 1 cpu tree).
Ah, I see. I did some quick runs with your positions and am getting significantly better speedups: ~5x on 8 cores instead of ~4x. Unfortunately, though, my trees are often getting much more than 10% bigger.
Choosing positions would make for an interesting debate. I tend to think game positions are more interesting, but if one is only interested in problem positions, then unrelated positions might give a better indication.

The 8-core tests just finished. I usually run the same set 4 times, no changes, since SMP is pretty non-deterministic.

The 8-core numbers look like this compared to 1 cpu. To explain the numbers: I compute the speedup for each position, then take the mean of those speedups as well as their geometric mean. I am also going to add a raw speedup number: the total time taken by 1 cpu over all positions, divided by the total time taken with N cpus. In any case, here are the results for 8:

mean: 7.8 7.3 7.4 7.3, average = 7.4
geomean: 6.9 6.2 7.0 6.9, average = 6.7
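
In code, those three statistics amount to something like this (a sketch; t1 and tn hold the per-position times for 1 cpu and N cpus):

Code: Select all

/* The three statistics described above, computed from per-position
   times: mean of per-position speedups, their geometric mean, and the
   raw speedup (total 1-cpu time over total N-cpu time). */
#include <math.h>

void speedup_stats(const double *t1, const double *tn, int npos,
                   double *mean, double *geomean, double *raw)
{
    double sum = 0.0, logsum = 0.0, total1 = 0.0, totaln = 0.0;
    for (int i = 0; i < npos; i++) {
        double s = t1[i] / tn[i];      /* per-position speedup */
        sum    += s;
        logsum += log(s);
        total1 += t1[i];
        totaln += tn[i];
    }
    *mean    = sum / npos;
    *geomean = exp(logsum / npos);     /* never exceeds the arithmetic mean */
    *raw     = total1 / totaln;        /* weights long searches more heavily */
}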

For tree growth, I get 8.2%, 9.3%, 2.4%, and the last run was -0.6%. The average (for 8) was about 4.8%, roughly half of what I saw with 16.

One note, as I mentioned: I clear the hash between positions, but I don't clear killers or history counters. I had to clear the hash because otherwise the depth limits would take forever to set when trying to get a good average time per move: if you change the depth at position N, you affect every position after it and have to re-adjust those as well. Clearing the hash lets me make one run, then set the depths for each position, with little influence from one to the next...