re-inventing the SMP wheel

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Well, with 100 threads each searching up to 40 ply deep, there would only be 4000 open nodes max. And as I would prefer split points close to the leaves, most of the lines would largely coincide, so 400 is a more likely number (not to mention that my number of threads is more likely to be 8 than 100...) My main hash is significantly larger than 400 entries (about 150,000 times?). This 400-entry table easily fits in L2. Sharing L2 data is not so expensive. Of course I would not put any L2-miss processing in a critical section, or attempt busy waiting for a critical section that could take long.
How do you figure? Wouldn't _every_ node need to be "open"? Otherwise, what would prevent another thread from reaching that same position via a different path and then searching the thing a second time, wasting cycles?

My vision of a really functional search is one where any idle processor can jump in and start searching somewhere immediately. If you limit it to one open node per thread, and you have 40 threads active, what happens if one thread becomes idle and the other 39 are busy searching the _last_ move at their open nodes? There is no place to do any parallel search, so that 40th processor is just wasting away doing nothing...
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

We seem to be miscommunicating. In my terminology node = tree node, not thread or processor. In each thread, all nodes in the tree it is searching on the direct path from the root to the position currently being searched (i.e. the top of its recursion stack) would be open. All other nodes in that tree would either already have been searched (in which case their score is in the hash, and other threads hitting it would get hash pruned), or they would not have been searched yet (in which case they are not found in the hash and other threads would be free to start searching them, after listing them as open first).

In the scheme I described in my first post, if 40 threads had entered the same node to search it together, and all nodes in the sub-tree for which the search is not yet completed are already saturated with searchers, each thread becoming idle would simply leave the tree in an abort-like fashion (leaving it to the remaining ones to return a score) to attempt a split at a lower level (closer to the root).
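
For concreteness, here is a minimal C sketch of the probe-and-join control flow described above. All names (Entry, probe_hash, MAX_SEARCHERS, and so on) are illustrative assumptions, not code from any engine mentioned in this thread, and the searcher-count updates are shown unguarded for brevity.

[code]
#include <stdint.h>

#define INVALID_SCORE 32767
#define MAX_SEARCHERS 4

typedef struct {
    uint64_t key;        /* Zobrist key of the position                 */
    int16_t  score;
    int8_t   depth;
    uint8_t  searchers;  /* number of threads currently inside the node */
} Entry;

/* Stand-ins for the engine's own routines. */
extern Entry *probe_hash(uint64_t key);
extern Entry *new_entry(uint64_t key);
extern void   store(Entry *e, int score, int depth);
extern int    search(void *pos, int alpha, int beta, int depth);

int visit_node(void *pos, uint64_t key, int alpha, int beta, int depth)
{
    Entry *e = probe_hash(key);

    if (e && e->depth >= depth)
        return e->score;            /* already searched: hash prune     */

    if (e && e->searchers >= MAX_SEARCHERS)
        return INVALID_SCORE;       /* saturated: leave in an abort-like
                                       fashion, try to split closer to
                                       the root instead                 */

    if (!e)
        e = new_entry(key);         /* list the node as open            */
    e->searchers++;                 /* join the search of this node
                                       (a real version needs atomics)   */

    int score = search(pos, alpha, beta, depth);

    e->searchers--;                 /* node is closed again for us      */
    store(e, score, depth);
    return score;
}
[/code]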
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:We seem to be miscommunicating. In my terminology node = tree node, not thread or processor. In each thread, all nodes in the tree it is searching on the direct path from the root to the position currently being searched (i.e. the top of its recursion stack) would be open. All other nodes in that tree would either already have been searched (in which case their score is in the hash, and other threads hitting it would get hash pruned), or they would not have been searched yet (in which case they are not found in the hash and other threads would be free to start searching them, after listing them as open first).

In the scheme I described in my first post, if 40 threads had entered the same node to search it together, and all nodes in the sub-tree for which the search is not yet completed are already saturated with searchers, each thread becoming idle would simply leave the tree in an abort-like fashion (leaving it to the remaining ones to return a score) to attempt a split at a lower level (closer to the root).
In each thread, all nodes in the tree it is searching on the direct path from the root to the position currently being searched (i.e. the top of its recursion stack) would be open.

That is exactly what I thought you meant. But those nodes change so frequently (each time you make/unmake a move, you have a new one) that "opening" them adds a lot of overhead, because eventually you end up opening "every node you search", and that is way beyond expensive...

User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, I hash-probe every node I search. That requires a read from DRAM. If that is not 'beyond expensive', why would this be? Accessing a table in a shared L2 seems very cheap in comparison with DRAM access.

Is your point that in CPUs with more than 2 cores L2 is not shared by all cores, and that it would lead to too many memory write cycles as a side effect of maintaining cache coherency between the different L2 caches if one core modifies a shared cache line?

That could be solved by keeping the information that the node is open in the main hash (e.g. as the number of the thread that owns it). Every node is going to be hash-probed anyway. After that the entry will in general be in the 'exclusive' state in its caches, if none of the other threads is using it, as the main hash is so big that there is little chance that two threads would accidentally use the same cache line for different nodes. Marking the node as 'open' if the depth was not sufficient (or, on a miss, to create the entry) would put it into the 'modified' state. This might require a snoop cycle (that misses the caches of all other cores), but this cycle is unavoidable anyway, as on a miss we are at some point going to write the entry with the new search result.
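
A small sketch of what such an entry might look like. The field names and sizes are my own guesses, not the layout of any real engine; the point is only that the 'owner' byte doubles as the open marker, so the write that claims the node is the same write that would have dirtied the line anyway.

[code]
#include <stdint.h>

#define NOT_OPEN 0xFF   /* sentinel: no thread owns this node */

typedef struct {
    uint32_t lock;      /* upper bits of the key, for verification    */
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;     /* bound type etc.                            */
    uint8_t  owner;     /* NOT_OPEN, or the id of the owning thread   */
} Entry;

/* On a miss, or a hit with insufficient depth, the probing thread
   claims the node.  This is the write that moves the cache line from
   the Exclusive to the Modified state, as described above. */
static inline void mark_open(Entry *e, uint64_t key, uint8_t my_id)
{
    e->lock  = (uint32_t)(key >> 32);
    e->owner = my_id;
}
[/code]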

If additional information on open nodes is needed (such as the number of threads working on it, or the best score so far), it could be put in a small hash table maintained by the thread that owns the node, but shared with other threads for reading. Normally the data in this table would be in the M state, as the owner constantly updates it (for every node it searches and opens/closes). Only if another thread enters the same node would that thread 'steal' ownership of the entry through a snoop hit (triggering a DRAM write cycle) when incrementing the count, which would eventually be undone as the original owner cleans up the entry.

But what happens in truly shared nodes hardly affects performance. Even if all threads in a 20-ply search share the first 15 nodes from the root, this would still be only a minority of the nodes in the total tree.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Well, I hash-probe every node I search. That requires a read from DRAM. If that is not 'beyond expensive', why would this be? Accessing a table in a shared L2 seems very cheap in comparison with DRAM access.
Here's the problem: you probe once, which is not cheap. But now you have to (a) store positions every time as well, which doubles the cost, and (b) have everyone probe the "open position" hash as well.

If I am counting correctly, that is 4x the hash traffic we started with???

It would probably be better to use the normal hash table for this stuff, as that would cut the overhead to 50%: add a position at the front of the search and flag it "open", then when exiting that node's search store the search result, removing the "open" flag and setting the proper UPPER/LOWER/EXACT flag instead.

This all assumes I am understanding your idea for "open".

I think it far better to simply "split" the tree at reasonable points, and go from there.
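
To illustrate the "open flag in the normal hash entry" idea from a couple of paragraphs up, here is a minimal sketch. The struct layout and function names are assumptions; only the OPEN/UPPER/LOWER/EXACT flag convention is taken from the text.

[code]
#include <stdint.h>

enum { OPEN, UPPER, LOWER, EXACT };   /* flag values in the entry */

typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flag;
} Entry;

/* At the front of the node's search: store the position, flagged open. */
static void enter_node(Entry *e, uint64_t key, int depth)
{
    e->key   = key;
    e->depth = depth;
    e->flag  = OPEN;
}

/* When exiting the node: store the result, replacing OPEN with the
   proper bound type. */
static void leave_node(Entry *e, int score, int alpha, int beta, int depth)
{
    e->score = (int16_t)score;
    e->depth = (uint8_t)depth;
    e->flag  = (score <= alpha) ? UPPER
             : (score >= beta)  ? LOWER
             : EXACT;
}
[/code]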


Is your point that in CPUs with more than 2 cores L2 is not shared by all cores, and that it would lead to too many memory write cycles as a side effect of maintaining cache coherency between the different L2 caches if one core modifies a shared cache line?
That's simply another problem, but yes. A very small table being updated by a large number of threads is going to smoke the MOESI protocol and make the caches the bottleneck.

That could be solved by keeping the information that the node is open in the main hash (e.g. as the number of the thread that owns it). Every node is going to be hash-probed anyway. After that the entry will in general be in the 'exclusive' state in its caches, if none of the other threads is using it, as the main hash is so big that there is little chance that two threads would accidentally use the same cache line for different nodes. Marking the node as 'open' if the depth was not sufficient (or, on a miss, to create the entry) would put it into the 'modified' state. This might require a snoop cycle (that misses the caches of all other cores), but this cycle is unavoidable anyway, as on a miss we are at some point going to write the entry with the new search result.
That was my suggestion above. But it still doubles the traffic to/from the hash table, which means to/from memory. It is unlikely that the line from the original fetch to mark the node open will still be cached when the node is completed and stored with the right hash result.


If additional information on open nodes is needed (such as the number of threads working on it, or the best score so far), it could be put in a small hash table maintained by the thread that owns the node, but shared with other threads for reading. Normally the data in this table would be in the M state, as the owner constantly updates it (for every node it searches and opens/closes). Only if another thread enters the same node would that thread 'steal' ownership of the entry through a snoop hit (triggering a DRAM write cycle) when incrementing the count, which would eventually be undone as the original owner cleans up the entry.

But what happens in truly shared nodes hardly affects performance. Even if all threads in a 20-ply search share the first 15 nodes from the root, this would still be only a minority of the nodes in the total tree.
The problem is that you would/could end up with everyone searching the exact same sub-tree, or else waiting on upper/lower bounds or having to search without them, etc.

It is trivially easy to blow the size of the tree searched by 4 CPUs up to more than 4 times the size of the tree searched by just one processor. The net result is then no speedup at all.

This is really a non-trivial problem. The efficiency of the splitting algorithm is insignificant compared to the extra search overhead produced when threads search the same search space multiple times, or when they search without proper upper/lower bound information...
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, originally I was counting on the small table being in L2, and on L2 access being nearly infinitely fast compared to DRAM access. (Like I do now in my non-SMP engine, where the small hash table of open nodes serves to identify rep-draws.) Then it would be nowhere near a factor of 4.

Your factor 2 also seems a gross overestimate. In a normal non-SMP engine I need a DRAM read for the hash probe. This brings the entry into L2. If it was an OK hit, that's it. If not, almost always it would still be in L2 by the time the search of the node completes. Then it is made dirty (Exclusive -> Modified) by the hash store, which hits the cache. In due time the dirty line will be replaced, leading to a DRAM write.

Now if I mark the node as 'open' on a probe that is not an OK hit, hardly anything changes. The cache line containing the entry is now dirty from the beginning, but as long as it is not replaced this has no consequences for the other cores and the memory system. And even if it were replaced (because other agents are or have been using that cache line), merely the same thing would happen that would otherwise have happened when I finally stored the search score. When I do finally store the search score, the line is already in the M state, so no state change (and accompanying bus snoop cycle) occurs.

So it seems to me that the factor is pretty close to 1.00, as in both cases the bus activity is exactly the same: a DRAM read on probing, a potential snoop cycle on the first write (be it for opening or for the final hash store) if the line was not Exclusive, but Shared, and a DRAM write when the line is flushed. The only extra thing is an L2 write hit, which does not even suffer from the L2 latency, as writes are simply posted.

In Joker I now have 7 hash entries per cache line (9 bytes each), but that could easily be changed to six 10-byte entries, with 4 bytes left for storing info about an open node. That would limit the number of open nodes per hash bucket to one, but that does not seem a really severe limitation, as it is exceedingly unlikely that two nodes from the same bucket would have to be open simultaneously. For threads that do not cooperate (i.e. after they split off) the only 'overhead' would be the reduced storage density of the hash table (which now holds 14% fewer entries), which is next to nothing.
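
A layout sketch of such a 64-byte bucket. The field names are guesses (Joker's real entry format may look different); only the sizes follow the description above: six 10-byte entries plus 4 bytes of open-node info in one cache line.

[code]
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t lock;        /* key verification bits         */
    int16_t  score;
    uint16_t move;
    uint8_t  depth;
    uint8_t  flags;
} Slot;                   /* 10 bytes                      */

typedef struct {
    Slot     slot[6];     /* 60 bytes of ordinary entries  */
    uint8_t  open_slot;   /* which slot (if any) is open   */
    uint8_t  searchers;   /* threads inside that node      */
    int16_t  open_score;  /* best score found there so far */
} Bucket;                 /* 64 bytes = one cache line     */
#pragma pack(pop)
[/code]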

Only in the split points would real overhead occur, as the hash bucket would be shuttled between the L2s of the cooperating CPUs through snoop cycles, with (posted) DRAM writes as a side effect. This argues for putting the split points somewhat farther from the leaves than one otherwise would. But the scheme I described fully allows that, by setting the maximum thread number for searches below a certain depth to one.

I do realize now that I have to be a bit more careful about returning scores from nodes worked on by several threads: all cooperating threads might have reached this node from a different parent, so they all must return a score. That is a problem if one of the threads is kicked out early (because there was no move left to search that wasn't already saturated), perhaps even from the beginning (because the node itself was already saturated). After an early return with an 'invalid' score (meaning it got a hash hit on a position under search by other threads, where it could not usefully help speed up that search) it can postpone the scoring of that move until after the search of some other moves. But if there are no other moves to search (because we have done them all, or because this is a cut node), it seems there is no useful way to continue the search before the score of this move comes in. This means we can get stalled. That does not have to be a problem if it happens only very rarely. If it happens more often, it would require the CPU to find other useful work, e.g. by having multiple (virtual) threads per CPU and switching to another thread while the current one is suspended.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: re-inventing the SMP wheel

Post by Zach Wegner »

H.G.,

Your scheme sounds to me rather non-trivial to implement, and it might become much more complex than a YBW search. Things like sharing a move list or a value will require a semaphore. What happens when a child processor gets a value > alpha? It has to write it in the hash table. How will the other processors know? Furthermore, only the main thread will be able to return alpha, so if there are no more moves left, it must wait for a child to write its score to the hash table. The only thing it can do is walk up the tree on a node it already visited, hoping to complete nodes that the child has not seen yet. This is basically like a normal tree-splitting approach, but you can only guess where to split and hope that you don't duplicate work. I think that splitting in the normal way is not really that complicated, unless you do DTS. But having to handle incomplete scores and using virtual threads means that it could become more complex than even DTS. I think if simplicity is your goal, then YBW is the way to go.

Zach

P.S. I am getting close to completing my DTS implementation, and would be happy to discuss it with you. Having virtual threads would require some of the same framework, i.e. an iterative search.
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, perhaps sharing the move list is not a great idea. I was supposing it would be in L2 anyway, and that it would be faster to get it from there than to generate it. But the degree of L2 sharing in current multi-core x86 CPUs might not be enough for this (unlike e.g. the UltraSPARC T2, where 64 threads share one L2). Anyway, it is not essential to the basic scheme. So suppose each thread maintains its own move list.

I am also not convinced that the virtual threads are then really needed. The only situation where a thread could get stalled is when it is the only thread in a node, there are still unfinished moves, all of these moves are already being searched ('open'), and none of them accepts another searcher. All threads in the node except the last can simply leave with an invalid score. Each thread sorts such postponed moves to the back of its move list, so they will be hash-probed again later.

Very little complexity is left, then. Just keep in the main hash the number of times a node is open, plus the best score so far, and the number of unsearched moves not yet saturated with searchers. This would be updated each time a thread returns to the node with a score. A thread probing it would know from this whether the score being built is already a cutoff, and if not, whether it can enter the node to help or should back off and retry later. This inquiry can be done unprotected and leaves the hash entry Shared in its L2. The retries are then L2 hits (so they can really be done after every move that is searched instead, without much cost), until the state of the saturated node changes somehow.
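
As a rough C sketch of that decision, with all structure and names being my own guesses at what "very little complexity" might look like in code:

[code]
#include <stdint.h>

/* Per-node bookkeeping kept in the main hash entry. */
typedef struct {
    uint8_t times_open;     /* how many threads have the node open     */
    int16_t best;           /* best score of the moves finished so far */
    uint8_t free_moves;     /* unsearched moves not yet saturated      */
} OpenInfo;

enum { HASH_PRUNE, ENTER_NODE, BACK_OFF };

/* Unprotected read; the entry stays Shared in this core's L2, so a
   later retry of the same inquiry is just an L2 hit. */
static int probe_open_node(const OpenInfo *n, int beta)
{
    if (n->best >= beta)     /* score being built is already a cutoff  */
        return HASH_PRUNE;
    if (n->free_moves > 0)   /* room to help: enter and grab a move    */
        return ENTER_NODE;
    return BACK_OFF;         /* saturated: retry after another move    */
}
[/code]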
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

The problem is this:

You probe A and miss and do a cache line fill. Now you search all kinds of nodes, and then return to store position A's results. What's the probability it is still in L1/L2?? Pretty thin I would suspect...
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: re-inventing the SMP wheel

Post by Zach Wegner »

hgm wrote:Well, perhaps sharing the move list is not a great idea. I was supposing it would be in L2 anyway, and that it would be faster to get it from there than to generate it. But the degree of L2 sharing in current multi-core x86 CPUs might not be enough for this (unlike e.g. the UltraSPARC T2, where 64 threads share one L2). Anyway, it is not essential to the basic scheme. So suppose each thread maintains its own move list.

I am also not convinced that the virtual threads are then really needed. The only situation where a thread could get stalled is when it is the only thread in a node, there are still unfinished moves, all of these moves are already being searched ('open'), and none of them accepts another searcher. All threads in the node except the last can simply leave with an invalid score. Each thread sorts such postponed moves to the back of its move list, so they will be hash-probed again later.
Thinking about it, a separate hash table might be a better idea. Each entry could be a little larger to include which threads are searching, and which moves they are searching. That way each thread could walk through this list each time it needs a new move and delete those moves from its local move list. A lock could be used for each entry, rather than for the whole table. About incomplete scores: helper threads (i.e. threads that don't return the final iteration score) can just ignore them and act like that move didn't exist. However, the main thread must know the scores of all the moves before it can return a score. So if it finishes searching the last move at a split point, it can either just wait for the other threads to finish, or walk up the tree on a move that is being searched by another thread. I suppose the second option would be preferable, but not as efficient as in a YBW scheme.
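
A minimal C sketch of what such a separate split-point table could look like. The table size, field names, and the choice of a pthread mutex per entry are assumptions on my part, not a description of any existing engine.

[code]
#include <stdint.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;      /* one lock per entry, not per table     */
    uint64_t key;              /* position this split point belongs to  */
    uint32_t thread_mask;      /* which threads are searching here      */
    uint64_t moves_taken;      /* bit i set = move i already claimed
                                  (limits the move list to 64 entries)  */
} SplitEntry;

/* A thread wanting a new move locks only this entry, claims the next
   unclaimed move, and deletes it from its own local move list.
   (Each lock must have been set up with pthread_mutex_init first.) */
static int claim_move(SplitEntry *e, int nmoves, int my_id)
{
    int m = -1;
    pthread_mutex_lock(&e->lock);
    e->thread_mask |= 1u << my_id;
    for (int i = 0; i < nmoves; i++)
        if (!(e->moves_taken & (1ULL << i))) {
            e->moves_taken |= 1ULL << i;
            m = i;
            break;
        }
    pthread_mutex_unlock(&e->lock);
    return m;                  /* -1 means nothing left to claim here   */
}
[/code]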

The whole scheme looks like an asynchronous YBW, where instead of waiting for the first move to be finished, each thread follows its path until the depth is so low that any overhead of searching without a proper bound is negligible. The only real problem I see is that there is no way to see beta cutoffs at split points until you return to them, unless you want to incur the massive overhead of checking, at every node, each element in the current path that had more than one thread searching it.

Zach