nps scaling

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:How much of a bottleneck is it on 4 cpus? Does the nps drop significantly, for example, if you comment out the "smp_idle" check at the beginning of split? I really couldn't notice any change with or without it.
I don't do it that way in Crafty, but on 8 cores I don't notice _any_ issue with NPS scaling at all; it is nearly perfect. I found the same on a 4 x 4-core box last year as well. But locks (spinlocks using xchg()) can be expensive if over-used... particularly in hashing code, which gets executed at every node. You do _not_ want to set/clear a lock on every node, period.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

I was only interested in its effect in this particular case, where splits are done only with 3 plies or more left, which is not a significant portion of the search. That is why I am guessing that its effect is minimal for me. Just to make sure, I did a quick test by enclosing the node counter like this:

Code: Select all

       l_lock(lock_smp);
       nodes++;
       l_unlock(lock_smp);
This dropped the nps only slightly. I then enclosed the quiescence-search node counter similarly, and this time the nps dropped significantly. In this case the locks are held for only a short time (compared to hash-table updates, which could be far worse, especially if done in QS). I am just trying to justify the result I got with and without it.
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: nps scaling

Post by frankp »

bob wrote:
5. Hashing is a key issue. Do not use a lock to protect hash storing as that will be a bottleneck for _every_ node searched. Either use the XOR trick I discussed a month or two ago, or ignore the problem if it won't make you crash or break.
I get bad (invalid) moves back when I don't lock the hash, which probably implies bad bounds as well. I guess the implication of ignoring bad bounds is that they are too infrequent to matter on average. Have not seen the XOR trick, so will do an archive search. Certainly the hash lock is expensive.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:I was only interested in its effect in this particular case, where splits are done only with 3 plies or more left, which is not a significant portion of the search. That is why I am guessing that its effect is minimal for me. Just to make sure, I did a quick test by enclosing the node counter like this:

Code: Select all

       l_lock(lock_smp);
       nodes++;
       l_unlock(lock_smp);
This dropped the nps only slightly. I then enclosed the quiescence-search node counter similarly, and this time the nps dropped significantly. In this case the locks are held for only a short time (compared to hash-table updates, which could be far worse, especially if done in QS). I am just trying to justify the result I got with and without it.
That makes sense to me. Non-qsearch nodes don't add up very quickly, while q-search nodes do. My point was that the xchg() instruction is very expensive, and if it is done too frequently, performance nose-dives.
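One common way to avoid the lock on the counter entirely (a sketch only, not necessarily what either program does; MAX_THREADS and the padding size are illustrative) is to give each thread its own counter and sum them only when a total is needed:

Code: Select all

#define MAX_THREADS 8
#define CACHE_LINE_SIZE 64

/* One counter per thread, padded so the counters sit on separate cache
   lines and do not cause false sharing. */
struct padded_counter {
    unsigned long long nodes;
    char pad[CACHE_LINE_SIZE - sizeof(unsigned long long)];
};

static struct padded_counter node_count[MAX_THREADS];

/* In the search: each thread touches only its own slot, so no lock is needed. */
static void count_node(int thread_id)
{
    node_count[thread_id].nodes++;
}

/* When printing NPS: sum the slots.  The total may be a few nodes stale,
   which is harmless for statistics. */
static unsigned long long total_nodes(int nthreads)
{
    unsigned long long sum = 0;
    int i;
    for (i = 0; i < nthreads; i++)
        sum += node_count[i].nodes;
    return sum;
}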
CThinker
Posts: 388
Joined: Wed Mar 08, 2006 10:08 pm

Re: nps scaling

Post by CThinker »

Daniel Shawul wrote:Thanks very much Lance! I made the changes you recommended, but they didn't improve things on the 4 cpus (same nps for 30-sec runs). Hopefully they do better on the 8 cpus, where the effect becomes more pronounced.
I found another one...

Your lock method really locks the bus in a loop.

Code: Select all

while(InterlockedExchange((LPLONG)&(x),1) != 0)
I suggest that you only lock if you have a chance of getting the lock.

Code: Select all

while(InterlockedExchange((LPLONG)&(x),1) != 0) { while((x)!=0); }
Also, I suggest that for Linux you replace the calls to pthreads with inline assembly. I think the pthreads primitives are too expensive.

Lastly, you can remove the Windows critical-section code, since the Interlocked primitives are all you need.

Cheers...
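For reference, a sketch of the kind of xchg-based spinlock being discussed, written with GCC inline assembly for x86 as suggested for Linux; it spins on a plain read and retries the locked exchange only when the lock looks free. The type and function names are illustrative:

Code: Select all

/* Test-and-test-and-set spinlock sketch for x86/x86-64 with GCC.
   xchg with a memory operand is implicitly locked, so it is the expensive
   part; the inner loop spins on an ordinary read instead of hammering
   the bus. */
typedef volatile int spinlock_t;

static void spin_lock(spinlock_t *lock)
{
    int value;
    for (;;) {
        value = 1;
        /* atomically swap 'value' with *lock; old lock value ends up in 'value' */
        asm volatile("xchgl %0, %1"
                     : "+r"(value), "+m"(*lock)
                     :
                     : "memory");
        if (value == 0)
            return;             /* lock was free and is now ours */
        while (*lock != 0)
            ;                   /* wait with plain reads until it looks free */
    }
}

static void spin_unlock(spinlock_t *lock)
{
    asm volatile("" : : : "memory");  /* compiler barrier */
    *lock = 0;                        /* a plain store releases the lock on x86 */
}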
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

frankp wrote:
bob wrote:
5. Hashing is a key issue. Do not use a lock to protect hash storing as that will be a bottleneck for _every_ node searched. Either use the XOR trick I discussed a month or two ago, or ignore the problem if it won't make you crash or break.
I get bad (invalid) moves back when I don't lock the hash, which probably implies bad bounds as well. I guess the implication of ignoring bad bounds is that they are too infrequent to matter on average. Have not seen the XOR trick, so will do an archive search. Certainly the hash lock is expensive.
If the bad moves don't crash you, they won't hurt. I doubt you get more than a couple per search unless you run on 8 cores, and even then 20-30 per minute has zero effect on the search, based on the testing Cozzie and I did for the ICGA paper we wrote...

Hash locks are very expensive because they are acquired so often. And I don't even do them in the q-search; they slow me down by over 30% if I put them in...
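For anyone searching the archives, the XOR trick bob refers to works roughly like this: the stored signature is the hash key XORed with the data word, so a torn (half-updated) entry fails the signature check on probe instead of handing back a bogus move. A minimal sketch, assuming one 64-bit signature word and one 64-bit data word per entry; field and function names are illustrative:

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t key_xor_data;   /* hash key XORed with the packed data word */
    uint64_t data;           /* packed best move, score, bound, draft, ... */
} HashEntry;

/* Store with two plain writes, no lock. */
static void hash_store(HashEntry *entry, uint64_t key, uint64_t data)
{
    entry->data = data;
    entry->key_xor_data = key ^ data;
}

/* Probe: if another thread tore the entry, the signature will not match. */
static int hash_probe(const HashEntry *entry, uint64_t key, uint64_t *data)
{
    uint64_t d = entry->data;
    if ((entry->key_xor_data ^ d) != key)
        return 0;            /* mismatch: empty, unrelated, or torn entry */
    *data = d;
    return 1;
}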
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

TTTT: Two Tier Transposition Table

Post by sje »

I've been doing some experiments with a two-tier storage system with my SMP movepath enumerator (i.e., perft). Each thread gets its own transposition table implemented in a typical, non-locking manner. However, only the relatively cheaply counted nodes (i.e., low draft) are stored there. The high cost nodes (i.e., big counts) are stored in a locking structure that's shared by all the enumerator threads.

The shared structure is actually a binary tree, not a table. Also, instead of a single lock, there's a lock for each of the four shared trees (trees are indexed by ply).

So far it looks like this approach is better than a monolithic table. But the caution is that only a relatively small number of low ply levels can be stored in the shared trees, or else the locking overhead excessively degrades the overall speed.

For movepath enumeration of the initial array, I have found that storing only the top four plies gives the best results. Note: as a side effect, this also gives the 72,078 unique positions at ply four; that's about 37% of the emp(4) value of 197,281, for a factor of about 2.74 speedup due to transpositions.

Extending the tree to include the 822,518 unique positions at the next ply made for a big slowdown with four concurrent enumeration threads. It might be the case that this slowdown reverses itself at much higher enumeration depths, like perft(10) or more.
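A rough sketch of the routing decision described above, assuming a cutoff of four plies and one lock per shared tree; the container helpers and names here are placeholders, not the enumerator's actual code:

Code: Select all

#include <stdint.h>
#include <pthread.h>

#define SHARED_PLY_LIMIT 4   /* only the top four plies go to the shared trees */

/* One lock per shared tree, trees indexed by ply (placeholder declarations). */
extern pthread_mutex_t shared_tree_lock[SHARED_PLY_LIMIT];
void shared_tree_insert(int ply, uint64_t key, uint64_t count);
void local_table_store(int thread_id, uint64_t key, uint64_t count);

void store_count(int thread_id, int ply, uint64_t key, uint64_t count)
{
    if (ply < SHARED_PLY_LIMIT) {
        /* expensive, near-root counts: shared tree, locked per ply */
        pthread_mutex_lock(&shared_tree_lock[ply]);
        shared_tree_insert(ply, key, count);
        pthread_mutex_unlock(&shared_tree_lock[ply]);
    } else {
        /* cheap, deeper counts: this thread's private table, no lock */
        local_table_store(thread_id, key, count);
    }
}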
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

I read somewhere that the optimization works only if there is cache coherency, and that is probably the reason why I left it as it is. When I tested once or twice with the asm spinlocks I found here (http://www.cis.temple.edu/~ingargio/ci ... nsem.html), they did not really perform any better than mutexes. On Windows, too, critical sections perform almost as well as spinlocks on my computer.

I am currently testing how busy the processors are during the search. What I am trying to do is split only at the root and search at a large depth to check whether I get better scaling. Theoretically I should, since I don't lock the HT, which is the only thing that could connect them. If the result is still the same, it is definitely a lock-contention problem.

Thanks again.
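A sketch of the root-only split being tested here: the only shared state is the index of the next unsearched root move, claimed atomically, so the threads never need a lock during the search itself. The names, the fixed depth, and the GCC __sync builtin are illustrative, and each thread is assumed to search from its own copy of the board:

Code: Select all

#include <pthread.h>

#define SEARCH_DEPTH 12      /* illustrative fixed depth */

extern int root_move_count;  /* number of legal root moves */
/* Assumed to search one root move to 'depth' on this thread's own board copy. */
void search_root_move(int thread_id, int move_index, int depth);

static int next_root_move = 0;   /* shared index, advanced atomically */

void *root_worker(void *arg)
{
    int thread_id = (int)(long)arg;
    for (;;) {
        /* atomically claim the next root move */
        int m = __sync_fetch_and_add(&next_root_move, 1);
        if (m >= root_move_count)
            break;               /* all root moves taken */
        search_root_move(thread_id, m, SEARCH_DEPTH);
    }
    return NULL;
}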
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: TTTT: Two Tier Transposition Table

Post by Daniel Shawul »

That is an interesting experiment. It might be difficult to directly apply the results of your experiment to a regular search because of the relatively smaller search depth attained in perft. Also, I guess that locks have to be used in the perft case because you need exact counts, unlike a regular search, where you can get away with an incorrect move and score. Have you tried to use this system for a regular search with alpha-beta, null move and all? If so, what is the optimum depth for separate use of the locked/unlocked TT?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: TTTT: Two Tier Transposition Table

Post by bob »

Daniel Shawul wrote:That is an interesting experiment. It might be difficult to directly apply the results of your experiment to a regular search because of the relatively smaller search depth attained in perft. Also, I guess that locks have to be used in the perft case because you need exact counts, unlike a regular search, where you can get away with an incorrect move and score. Have you tried to use this system for a regular search with alpha-beta, null move and all? If so, what is the optimum depth for separate use of the locked/unlocked TT?
The biggest problem is that our normal chess searches go 3x deeper than a perft search can reach, which greatly reduces the chances for transpositions in perft. And we normally see the majority of the transpositions nearer the leaves than the root, which is exactly where a local transposition table hurts...