On parallelization

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Onno Garms
Posts: 224
Joined: Mon Mar 12, 2007 7:31 pm
Location: Bonn, Germany

On parallelization

Post by Onno Garms »

I wonder about the multiprocessor implementation in some open source
engines.

Worst is Toga with a shared transposition table approach. Just run
many searches in parallel, communicate only over the transposition
table. The author knew that this was a quick hack.

I have studied Glaurung's SMP implementation. It is much better, but
still some things seem strange to me. Has SMP implementation been
changed in Stockfish?

What seemed strange to me is that an idle thread gets new work close to the
leaves. Is there some reasoning behind that or is that mostly due to
the implementation (more convenient to split where we are right now
instead of scanning the stack for a split position)?

My multiprocessor (YBW) implementation follows Feldmann's thesis of
1993, see
http://wwwcs.uni-paderborn.de/fachberei ... nn_phd.pdf
That means, even though I work on one computer only, I
implemented virtual messaging. Threads send messages requesting work,
send results etc. This way, we can chose where to split, narrow
windows etc.

It took some time to fully implement this. Feldmann did not say much
how to adapt to scout searches, search window changes such as bad
pruning etc. Also I found it necessary to send receipts for many
messages in order to sync properly. And you have to avoid message
floods. But as you know it works and scales well now.

I have some ideas how to improve selection of split points. Splitting
as close to the root as possible is not everything. But these ideas
are not yet well formulated, let alone implemented or tested.

Suppose we have three threads. Thread 1 is master, thread 2 is slave
of thread 1. When thread 3 is idle, it is most likely better to ask
thread 1 for work then to ask thread 2 for work. In general, every
thread should have a counter how many split points of the global
search tree there are before the current node. Work requests should go
the the thread(s) that has (have) the fewest.

Also, if there is more then one thread working at the same split point
(mostly the case, master and one slave, but there might also be master
and more slaves) and there are no more moves available at this split
point, split requests should go to the thread that searches the
earliest move. (This is not necessarily the master!) We have spent
lots of effort on move ordering, to parallel search should also follow
this move ordering.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: On parallelization

Post by zamar »

Onno Garms wrote: I have studied Glaurung's SMP implementation. It is much better, but
still some things seem strange to me. Has SMP implementation been
changed in Stockfish?

What seemed strange to me is that an idle thread gets new work close to the
leaves. Is there some reasoning behind that or is that mostly due to
the implementation (more convenient to split where we are right now
instead of scanning the stack for a split position)?
We tried to change this and implemented joining to old split point, but this didn't give any difference in strength. But we only have quads for testing, so your approach _might_ be better when one has at least octal.
Joona Kiiski
User avatar
Onno Garms
Posts: 224
Joined: Mon Mar 12, 2007 7:31 pm
Location: Bonn, Germany

Re: On parallelization

Post by Onno Garms »

What I observed with my approach is that the speedup (measured in time to fixed depth) varies considerably when a search is run multiple times on the same position. This might be due to lucky or not-so-lucky selection of split points. With the reasonning in the original post it might be possible to improve performance with smart selection of split points to get good speedup most of the time.

Does the speedup vary that much with your approach too? One might think that as you split close to the leaves, you split and join much more often, and my the law of large numbers you get more stable results than I do.

I tried to figure out by code inspection where your approach splits. (I did not watch it in the debugger. Maybe I should have done that.) I thought if you have just two CPUs, the splitpoint would walk in the direction to the root quickly, so most of the time the threads would work in fairly different branches of the search tree. I thought that this happens because in the moment where the master goes back to the parent, the other thread has no work, so it waits until the master splits at the parent. Am I correct so far?

However, if you have more threads, things might become different. Suppose threads A and B search fairly different branches. C becomes a slave of A close to the leaves. But when C is finished, C might become a slave of B (also close to the leaves) before A gives work at the parent to C. Is that correct?

Are improvements where to split also applicable to your approach?