re-inventing the SMP wheel

bob · Post by **bob** » Sun Aug 19, 2007 8:49 pm

I am not seeing how this avoids the classic race condition, which would allow multiple processors to search the same exact tree. If you introduce critical sections, I believe their overhead will _far_ exceed the overhead of doing a normal split algorithm like DTS or YBW...

When you start a brand new iteration, how do you avoid this, which is the easiest case to avoid it on? How does the open/close updating logic add to overhead and those have to be locked to avoid races, which introduces further overhead and bypasses cache in the process...

diep · Post by **diep** » Sun Aug 19, 2007 9:54 pm

hgm wrote:Before looking into existing solutions for SMP parallellized search, I have been thinking a little myself how I would like to have a bunch of CPUs search a tree. This in order to be better able to pick a solution I like. Properties I value are simplicity and efficiency.

Hello Harm-Geert,

Nice to know you're interested in parallellization of your chess engine. I see that most replies here in this thread deal with implementation issues regarding deadlocks and or race conditions. Of course that's far below the level you like to be busy at, as i'm sure you can fix implementation details (though certain approaches might take a few years of bugfixing).

The most important question IMHO is how much TIME are you prepared to invest into parallellization in chess?

As i see it there is 3 different approaches to parallellization.

Category I : the shared hashtable idea
Central speedup idea is that the hashtable will give you the speedup, whereas search gets its speedup by means of hashtable cutoffs.

The advantage is that a good programmer can get this real quickly to work, and that it gives some speedup. By far most programs getting parallellized use this approach, and there is a good reason, because the step to category II is real big in terms of bugfixing.

Before presenting the basic idea of just sharing hashtable, some ancestors were there such as the 'alibaba' (or whatever that french dude named it) approach which was doing some idiotic locking inside hashtable of nodes that were shared. Real clumsy IMHO.

Programs using this approach are not limited by but at least include:

SOS (Rudolf being the godfather of this approach), Baron, Rybka, Junior, Shredder and many others

An additional reason for some to implement a category 1 parallel approach is simply the sheer complexity of getting a category II to work.

Average speedup out of 2 processors is about 2.0 out of 4. When scaling the method up, it's still pretty ok up to a cpu or 8 (baron has roughly 4.0 out of 8 with this method), above that for most programs (a few crappy ones excepted) speedup is ugly. For some programs speedup is very bad even at a few cpu's. Both Frans Morsch and myself can report a bad speedup when using this method for their program. Both Frans and i have about 1.2 out of 2 when using this and when i tried at 60 cpu's speedup was not very good, about a few out of 60.

Aphid (Jonathan Schaeffer) definitely belongs in category 1.

Category II : centralized lock and splitting according to some method, usually YBW

This is already a mature way to parallellize your program. You need to be a good programmer to get it to work. Basically you can split positions, usually following the YBW concept. The central idea is that all parallel functions (for example: split, abort-fail-high, abort-all, abort-me), first do a lock to a centralized lock, and are not allowed to undertake any action when that lock has been taken.

Many programs with good speedups belong in this category: Deep Sjeng, Ferret, Crafty, Fritz, Glaurung (?) and several other programs, hopefully Buzz soon too.

Speedup is usually about 3 out of 4.

Of course a logical result from the centralized lock approach is that it is hard to scale well with it, as you really need to be a good programmer to keep communication down. Cray Blitz copied about 64KB, original crafty copied up to 44KB of data (thanks to Nalimov and others to improve that to a few kilobyte now). Usually when number of cpu's goes up a lot, the communication latency between cpu's increases a lot. Most NUMA oriented 'supercomputers', soon have a latency of 5-7 us between cpu's. So this approach is not possible at supercomputers. Therefore already decades ago another category started to exist.

Spearheaded by Rainer Feldmann with Zugzwang:

Category III : YBW with decentralized locking

Of course stubbornly i've tried several experiments without using the YBW constraints to split at many cpu's, and all of them had a real UGLY speedup. The key to a good parallellization is having an algorithm that preserves your branching factor a tad. YBW is the only algorithm that does do. Its main concept is real simple: only split *after* you have searched 1 move.

This is real hard to implement at a supercomputer. Only 2 programs in history succeeded doing that without slowing down the program tens of factors: Diep and Zappa.

With a few statistics it is easy to demonstrate why a centralized locking approach doesn't work very well at supercomputers. You split many positions a second a cpu on average, the latency from cpu to cpu is simply too ugly to be able to handle that. Long waiting rows start to exist. Additionally locking at supercomputers is quite slow. After some x-nanoseconds then a lock at cpu X will get modified by the OS automatically to letting your process idle. That's hammering down performance.

You simply CANNOT copy 64KB of data and move it from cpu X to cpu Y. You CANNOT let processes search themselve for the best splitpoint. That's all TOO expensive at NUMA machines.

To quote a sysadmin of SARA (www.sara.nl) : "supercomputers are delicate machines"

In Diep up to a cpu or 8-16 i'm using next split strategy (and more conditions) : if a position is known by hashtable to previously have given a >= beta cutoff, i do not split this position until at least 2 moves have been searched. In other positions i basically split after first move has been searched (nullmove doesn't count).

Actually it is a problem to be very selective in picking your splitpoints at 500 cpu's. I was forced to make a modification that above a cpu 8-16 Diep already splits after the first move has been searched, regardless of which type of node we are in now. Getting the cpu's as quickly as possible into action is simply priority 1.

I also tried at the end of the world champs some experiments splitting already in PV nodes, before the first move had been searched. That was good for scaling, but it was horrible for speedup.

From the last experiment that ran overnight i do not have the logfile, as they had directly cut off access from me to the machine, so i never managed to get that logfile. It would have been interesting to see the outcome of that experiment.

If it would be possible to buy a machine now with 500 cpu's @ 50Mhz or so for a decent price, i would directly buy it, just to get diep better to work at it (don't confuse cpu's with gpu's by the way), knowing that such hardware is a lot slower than a single quadcore.

Vincent

hgm · Post by **hgm** » Sun Aug 19, 2007 9:56 pm

bob wrote:The problem is this:

You probe A and miss and do a cache line fill. Now you search all kinds of nodes, and then return to store position A's results. What's the probability it is still in L1/L2?? Pretty thin I would suspect...

On the contrary. A 2MB L2 cache contains 32,768 cache lines, so the chances that an entry gets flushe out of L2 becomes only apreciable after 32K probes, i.e. in a search with 32K nodes. Now only 1/32K = 0.003% of all searches (= hash probes) can have that size.

So the probability that the hash entry is still around from the original proobing by the time the hash store occurs is 99.997%. I would hesitate to call that 'pretty thin'.

I must be more careful with the last threat to leave a node having to wait for saturated daughters t complete. Threads might very well start waiting for each other (with resulting deadlock), if they hit upon each others branches below the depth where concurrency is allowed. So a thread that really has nothing useful to do, should try to do something useless (helping in a branch that does't really benifit from help) rather than wait.

Beta cutoffs indeed seem a problem, but I wonder how real this is. At the level where most splits will occur, several move in an all-node are likely being searched by a single thread. In such a situation, a move giving an unexpected beta cutoff is likely to take much longer than the others, so the beta cutoff will become manifest only after the search of the other moves have completed, and their threads have already left the node for lack of work. There wouldn't be any threads to recall.

Dan Andersson · Post by **Dan Andersson** » Sun Aug 19, 2007 11:32 pm

Nagging is a pretty neat and conceptually simple algorithm that works with alpha-beta. It consists of a master search that is a serial implementation of the search. It selects nag points forward in its current search window and communicates the point and current search. The nag search parameters can then be further constrained by chance or heuristics and the search order may also be permuted to give a further chance for an early cutoff. Recursive nagging is possible.

http://vinci.cs.uiowa.edu/papers/
http://vinci.cs.uiowa.edu/papers/nag.pdf
http://citeseer.ist.psu.edu/90244.html

MvH Dan Andersson

bob · Post by **bob** » Mon Aug 20, 2007 2:11 am

hgm wrote:
bob wrote:The problem is this:

You probe A and miss and do a cache line fill. Now you search all kinds of nodes, and then return to store position A's results. What's the probability it is still in L1/L2?? Pretty thin I would suspect...
On the contrary. A 2MB L2 cache contains 32,768 cache lines, so the chances that an entry gets flushe out of L2 becomes only apreciable after 32K probes, i.e. in a search with 32K nodes. Now only 1/32K = 0.003% of all searches (= hash probes) can have that size.

I assume your program only probes the hash table, but never generates moves, does repetition checks, evaluation, etc? All of those place an even _bigger_ demand on L1/L2 than the infrequent hash probes, which was my point. What is the chances of doing a probe, then that cache line surviving for a significant amount of time with all the other crap going on between the two probes? I'd suspect that it is far worse than you are predicting.

So the probability that the hash entry is still around from the original proobing by the time the hash store occurs is 99.997%. I would hesitate to call that 'pretty thin'.

See above. I think it is far worse than that overall.

I must be more careful with the last threat to leave a node having to wait for saturated daughters t complete. Threads might very well start waiting for each other (with resulting deadlock), if they hit upon each others branches below the depth where concurrency is allowed. So a thread that really has nothing useful to do, should try to do something useless (helping in a branch that does't really benifit from help) rather than wait.

Beta cutoffs indeed seem a problem, but I wonder how real this is. At the level where most splits will occur, several move in an all-node are likely being searched by a single thread. In such a situation, a move giving an unexpected beta cutoff is likely to take much longer than the others, so the beta cutoff will become manifest only after the search of the other moves have completed, and their threads have already left the node for lack of work. There wouldn't be any threads to recall.

All "waiting" has to be eliminated, obviously. Otherwise you get performance like the original PVS parallel search, which was OK for 2 or even 4 processors, but not good beyond that. We already know that 4 cores on a single chip is not the end of the pathway...

hgm · Post by **hgm** » Mon Aug 20, 2007 8:28 am

bob wrote:I assume your program only probes the hash table, but never generates moves, does repetition checks, evaluation, etc?

It doesn't use much memory for that. About 6KB of global data, with, the board, piece list and a few lokup tables (e.g. Zobrist random table), ~256 bytes of local variables, and 4 bytes per move in the move list. Say 0.5 KB per ply. Subtrees of less than 32K nodes are ~ 8 ply deep, s that is another 4KB, 10 KB in total.

So space for other stuff is completely negligible to the L2 size. It certainly fits within a single L2 cache way. Code of course also makes a demand on L2, and this is more important. But still only ~50 KB.

Even hash probes that collide with these 'heavily used' L2 regions, have a goos chance to survive: the LRU replacement works such that the heavily used stuff permanently occupies one cache way, and the hash probes are diverted to the other ways (L2 is usually 4-way, so there are 3 other ways.) This means that the survival time in L2 of these 5% of probes that are so unlucky to map into the 'heavily' used cache area, is only shortened by 25% compared to the remaining 95% of probes. Totally negligible overall.

Vincent:
Our posts crossed, and I see yours only now.
The doctor doesn't want me to type, which gives me lots of time to think. (Basically I can do little else then reading and thinking,

and SMP is a nice subject to ponder about )

bob · Post by **bob** » Mon Aug 20, 2007 7:41 pm

hgm wrote:
bob wrote:I assume your program only probes the hash table, but never generates moves, does repetition checks, evaluation, etc?
It doesn't use much memory for that. About 6KB of global data, with, the board, piece list and a few lokup tables (e.g. Zobrist random table), ~256 bytes of local variables, and 4 bytes per move in the move list. Say 0.5 KB per ply. Subtrees of less than 32K nodes are ~ 8 ply deep, s that is another 4KB, 10 KB in total.

So space for other stuff is completely negligible to the L2 size. It certainly fits within a single L2 cache way. Code of course also makes a demand on L2, and this is more important. But still only ~50 KB.

Even hash probes that collide with these 'heavily used' L2 regions, have a goos chance to survive: the LRU replacement works such that the heavily used stuff permanently occupies one cache way, and the hash probes are diverted to the other ways (L2 is usually 4-way, so there are 3 other ways.) This means that the survival time in L2 of these 5% of probes that are so unlucky to map into the 'heavily' used cache area, is only shortened by 25% compared to the remaining 95% of probes. Totally negligible overall.

Vincent:
Our posts crossed, and I see yours only now.
The doctor doesn't want me to type, which gives me lots of time to think. (Basically I can do little else then reading and thinking, and SMP is a nice subject to ponder about )

For mine, with the magic tables, a _lot_ of code, and a lot of data, I still see significant improvement going from 2M to 4M L2 cache, meaning that I am overwriting way too much data. Also when you start to use threads, you replicate some local data for each thread, further increasing the demands on the L1/L2 system.

hgm · Post by **hgm** » Mon Aug 20, 2007 8:16 pm

Well, big magic tables is a total waste of memory. But even those usually won't waste more than one cache way, leaving 3 cache ways for the hash probes. In my estimate I was already using 2MB of L2 cache, while the usual size nowadays is 4MB. This is shared by 2 cores, but the cores would also share the magic tables. Even if if the shared tables/code plus private lists would be 3MB, and only one cache way of 1MB would be available for storing hash probes, there would be 16K cache lines in there.

So the probability to stay in L2 from initial probe to hash store would be 99.994% in stead of 99.997%. Big deal...

A much more interesting question would be what is the best place for split points. Should they be concentrated at remaining depth =4, or 8, or ~16 ply?

bob · Post by **bob** » Mon Aug 20, 2007 11:11 pm

I don't like the assumption however. You assume that with 4-way, 4mb cache, each way is 1mb. I agree there. But then you assume that your program is laid out in memory in such a way that the "important stuff" doesn't have any aliasing issues (here I am talking about memory to cache set addressing conflicts). It is very likely that on a 4mb L2, you can only use a part of that in the normal case anyway, because things are not going to lay out well enough to get the important stuff to map uniformly into the cache set address space. There will be lots of unused lines in some sets due to this aliasing, and there will be lots of sets with too many lines mapping into them causing thrashing.

hgm · Post by **hgm** » Mon Aug 20, 2007 11:50 pm

This is very unlikely. For one, compilers and linkers allocate variables as a contiguous set. they have no reason at all to leave large, unused holes between them. So if your total of global variables / tables does not exceed 1MB, they will all nicely map in one L2 cache way without possibilities for collisions. The system stack grows as a contiguous address range.

Secondly, when there might be doubt, I make sure that there will not be any collisions, by laying out the memory map of variables and tables by hand (declaring them as part of a single struct). This is so far only critical for L1, as the total memory usage of the rest of my engine (code + data + stacks) is only a tiny, tiny fraction of a single L2 cache way.

Even if yuo want to accept that you have several frequently used areas wandering through memory because the are stacks, there will not be enough to give more than e 3-fold collision. (System stack, move stack, globals, code.) In L1 this can still be fatal on some machines with a 2-way L1 (P-IV, AMD), despite the fat that code does not go in their. On the scale of L2 the move stack is just part of the contiguous block of globals, as it is declared as a global array. So I have only 3 contiguous blocks, each small compared to a cache way, which can never be a problem in a 4-way cache.

If I would make many threads that share an L2, obviously I would have to keep their stacks apart after mapping into L2. But so far L2 is never shared betwen more than 2 cores, so this is not such a problem. I plan to go to an iterative implementation of the search, which maintains its own stack of locals as structs that are put on the move stack, to merge them into one contiguous stack and guarantee they will never collide.

Having programs that cause cache thrashing wrecks efficiency. Never do it!

re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Nagging search.

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel

Re: re-inventing the SMP wheel