hgm wrote: Well, I hash-probe every node I search. That requires a read from DRAM. If that is not 'beyond expensive', why would this be? Accessing a table in a shared L2 seems very cheap in comparison with a DRAM access.
Here's the problem: you probe once, which is not cheap. But now you have to (a) store the position as an "open" entry every time as well, which doubles the cost, and (b) every thread also has to probe the "open position" hash.
If I am counting correctly (one normal probe, one probe of the open-position table, one store to mark the node open, and one store to clear it again when the node completes), that is 4x the hash traffic we started with???
It would probably be better to use the normal hash table for this, which would cut the overhead to 50%: add the position at the front of the search and flag it "open", then store the search result when exiting that node's search, removing the "open" flag and setting the proper UPPER/LOWER/EXACT flag instead.
This all assumes I am understanding your idea for "open".
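Roughly what I have in mind, as a minimal sketch. The entry layout, flag values and the hash_open/hash_close helpers are invented here purely for illustration, not taken from any real engine:

```c
#include <stdint.h>

/* Hypothetical entry layout and flag values -- purely for illustration. */
enum { FLAG_OPEN, FLAG_UPPER, FLAG_LOWER, FLAG_EXACT };

typedef struct {
    uint64_t key;    /* Zobrist signature of the position                 */
    int16_t  score;
    int8_t   depth;
    uint8_t  flag;   /* FLAG_OPEN while being searched, bound type after  */
} HashEntry;

#define HASH_SIZE (1u << 20)             /* power-of-two number of entries */
static HashEntry hash_table[HASH_SIZE];

/* Called at the front of a node's search, right after the normal probe:
   the same entry is reused to announce "this position is being searched". */
static HashEntry *hash_open(uint64_t key, int depth)
{
    HashEntry *e = &hash_table[key & (HASH_SIZE - 1)];
    e->key   = key;
    e->depth = (int8_t)depth;
    e->flag  = FLAG_OPEN;
    return e;
}

/* Called when exiting the node: the entry is overwritten with the search
   result, which removes the "open" marker as a side effect.              */
static void hash_close(HashEntry *e, int score, int alpha, int beta)
{
    e->score = (int16_t)score;
    e->flag  = (score >= beta)  ? FLAG_LOWER :
               (score <= alpha) ? FLAG_UPPER :
                                  FLAG_EXACT;
}
```

No separate open-position table, so the only extra cost over a normal search is the one store at node entry.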
I think it far better to simply "split" the tree at reasonable points, and go from there.
Is your point that in CPUs with more than 2 cores the L2 is not shared by all cores, and that it would lead to too many memory write cycles as a side effect of maintaining cache coherency between the different L2 caches when one core modifies a shared cache line?
That's simply another problem, but yes. A very small table being updated by a large number of threads is going to smoke the MOESI protocol and make the caches the bottleneck.
That could be solved by keeping the information that the node is open in the main hash (e.g. as the number of the thread that owns it). Each node is going to probe the hash anyway. After that the entry will in general be in the 'exclusive' state in that core's cache if none of the other threads is using it, as the main hash is so big that there is little chance two threads would accidentally use the same cache line for different nodes. Marking the node as 'open' if the stored depth was not sufficient (or on a miss, to create the entry) would put the line into the 'modified' state. This might require a snoop cycle (that misses the caches of all other cores), but that cycle is unavoidable anyway, as on a miss we are at some point going to write the entry with the new search result.
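Something like this is what I mean, as a rough sketch; the entry layout, the owner field and the function names are made up here for illustration:

```c
#include <stdint.h>

#define NO_OWNER 0xFF          /* hypothetical: no thread is searching this node */

typedef struct {
    uint64_t key;
    int16_t  score;
    int8_t   depth;
    uint8_t  owner;            /* id of the thread that "opened" the node        */
} HashEntry;

/* The probe every node does anyway.  A hit with sufficient depth only reads
   the entry, so the cache line stays clean (Exclusive/Shared).  On a miss,
   or when the stored depth is insufficient, we claim the node by writing our
   thread id; that write moves the line to the Modified state and costs at
   most the one snoop cycle we would pay anyway when storing the new result. */
static int probe_and_open(HashEntry *e, uint64_t key, int depth,
                          int my_thread, int *score)
{
    if (e->key == key && e->depth >= depth) {
        *score = e->score;     /* usable hit: read-only, no extra traffic        */
        return 1;
    }
    e->key   = key;            /* (re)create the entry ...                       */
    e->depth = (int8_t)depth;
    e->owner = (uint8_t)my_thread;   /* ... and mark the node as open            */
    return 0;                  /* caller has to search this node itself          */
}

/* When the node's search completes, the result is stored and the owner
   field is cleared again.                                                 */
static void close_node(HashEntry *e, int score)
{
    e->score = (int16_t)score;
    e->owner = NO_OWNER;
}
```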
That was my suggestion above. But it still doubles the traffic to/from the hash table, which means to/from memory. It is unlikely that the cache line fetched to mark the node open will still be there when the node is completed and stored with the proper hash result.
If additional information on open nodes is needed (such as the number of threads working on them, or the best score so far), it could be put in a small hash table maintained by the thread that owns the node, but shared with other threads for reading. Normally the data in this table would be in the M state, as the owner constantly updates it (for every node it searches and opens / closes). Only if another thread enters the same node would that thread 'steal' ownership of the entry through a snoop hit (triggering a DRAM write cycle) when incrementing the count, and this would eventually be undone as the original owner cleans up the entry.
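Again only a sketch, with entirely made-up names and sizes, of the small per-thread side table I mean:

```c
#include <stdint.h>

#define MAX_THREADS 8          /* hypothetical sizes, chosen only for the sketch */
#define OPEN_SLOTS  256

typedef struct {
    uint64_t key;              /* which position this slot describes             */
    int16_t  best_score;       /* best score found so far in this node           */
    uint8_t  workers;          /* number of threads currently inside the node    */
} OpenNode;

/* One small table per thread.  The owning thread writes its own table for
   every node it opens and closes, so those cache lines normally stay in the
   Modified state of that thread's cache.  Other threads only read it, and
   only write (stealing ownership of the line via a snoop hit) in the rare
   case that they enter the very same node.                                  */
static OpenNode open_table[MAX_THREADS][OPEN_SLOTS];

/* Owner thread: open a node in its own table (cheap, line stays in M state). */
static void open_node(int thread, uint64_t key)
{
    OpenNode *n = &open_table[thread][key % OPEN_SLOTS];
    n->key        = key;
    n->workers    = 1;
    n->best_score = -32000;
}

/* Another thread joining the same node: this increment is the write that
   triggers the coherency transaction (and eventually a DRAM write-back).   */
static void join_node(int owner_thread, uint64_t key)
{
    OpenNode *n = &open_table[owner_thread][key % OPEN_SLOTS];
    if (n->key == key)
        n->workers++;
}
```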
But what happens in truly shared nodes hardly affects performance. Even if all threads in a 20-ply search shared the first 15 ply from the root, those nodes would still be only a minority of the nodes in the total tree.
The problem is that you could end up with every thread searching the exact same sub-tree, or else waiting on upper/lower bounds, or having to search without them, etc.
It is trivially easy to blow the size of the tree searched by 4 CPUs up to more than 4 times the size of the tree searched by a single processor. The net result is then no speedup at all.
This is really a non-trivial problem. The efficiency of the splitting algorithm is insignificant compared to the extra search overhead produced when threads search the same search space multiple times, or when they search without proper upper/lower bound information...