nps scaling

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:I read somewhere that the optimization works only if there is cache coherency, and that is probably the reason why I left it as it is. When I tested once or twice with the asm spinlocks I found from here http://www.cis.temple.edu/~ingargio/ci ... nsem.html they did not really perform any better than mutexes. On Windows, critical sections also perform almost as well as spinlocks on my computer.

I am currently testing how busy the processors are during search. What I am trying to do is split only at the root and search at a large depth to check whether I get better scaling. Theoretically I should, since I don't lock the HT, which is the only thing that could connect the threads. If the result is still the same, it is definitely a lock-contention problem.

Thanks again.
There is _always_ cache coherency, or else you could not write multi-threaded code that will execute properly. The "shadow lock" code in Crafty depends on this. It uses bus snooping to avoid the long periods where the bus (or, in newer architectures, all caches) gets locked while the interlocked exchange is executed...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

Basic spinlock implementations I found on the net use the simple test-and-set operation with no optimization. In the link I gave above I found this:
A big problem of the spinlock is that by repeatedly test-and-setting a location in memory it can use a high percentage of the memory-bus bandwidth, thus reducing the utilization of other processors sharing the memory bus. This location that is repeatedly accessed is called a hot spot in memory. A partial solution (in the case of cache-coherent shared memories) is to change the Lock operation as follows:

Code: Select all

void Lock(SpinLock *L)
{
  while (TestAndSet(L))
    while (*L == 1)
      ;
}
which uses less of the memory bandwidth, since the TestAndSet instruction involves both a read and a write transaction on memory, while (*L == 1) involves only a read operation, and this read is normally carried out on a cached copy. [The inner while loop operates on the cached copy without affecting the shared memory. It assumes that the hardware supports some form of cache-coherence algorithm (like snooping caches) which invalidates the cached copy when the shared-memory copy is updated.]
Those statements are the reason why I was reluctant to use it.
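For reference, the quoted test-and-test-and-set pattern can be written portably with C11 atomics. This is only a sketch of the same idea; spinlock_t, spin_lock, and spin_unlock are illustrative names, not from any of the programs discussed here:

```c
#include <stdatomic.h>

typedef atomic_int spinlock_t;

/* Test-and-test-and-set: only retry the expensive atomic exchange
   after a read-only spin has seen the lock become free. */
static void spin_lock(spinlock_t *l) {
  while (atomic_exchange_explicit(l, 1, memory_order_acquire)) {
    /* The inner loop is a plain load, so it spins on the cached copy
       and stays off the memory bus until coherency invalidates it. */
    while (atomic_load_explicit(l, memory_order_relaxed))
      ;
  }
}

static void spin_unlock(spinlock_t *l) {
  atomic_store_explicit(l, 0, memory_order_release);
}
```

The acquire/release memory orders give the same guarantees the hand-written versions below rely on implicitly: writes inside the critical section cannot be reordered past the unlock.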

I do not know specifically what the "shadow lock" in Crafty is, but from what I see in lock.h it looks to me like you do the same thing as I do. I am pretty sure I have misunderstood something.

Code: Select all

#    if ((defined (_M_IA64) || defined (_M_AMD64)) && !defined(NT_INTEREX))
#      include <windows.h>
#      pragma intrinsic (_InterlockedExchange)
typedef volatile LONG lock_t[1];

#      define LockInit(v)      ((v)[0] = 0)
#      define LockFree(v)      ((v)[0] = 0)
#      define Unlock(v)        ((v)[0] = 0)
__forceinline void Lock(volatile LONG * hPtr)
{
  int iValue;

  for (;;) {
    iValue = _InterlockedExchange((LPLONG) hPtr, 1);
    if (0 == iValue)
      return;
    while (*hPtr);
  }
}
#    else                       /* NT non-Alpha/Intel, without assembler Lock() */
#      define lock_t           volatile int
#      define LockInit(v)      ((v) = 0) <===========
#      define LockFree(v)      ((v) = 0)
#      define Lock(v)          do {                                         \
                             while(InterlockedExchange((LPLONG)&(v),1) != 0);  \ <==============
                           } while (0)
#      define Unlock(v)        ((v) = 0)
#    endif                      /* architecture check */
#  else
That optimization does not seem to be used in the section I marked. Another question: why are you using ((v)[0] = 0) instead of simply ((v) = 0) in some of the above code?

Yet another question: after the SearchParallel() call has finished for a thread and you are about to copy the result from the child, you take the general lock, i.e. lock_smp. Is that really necessary? I lock only the parent lock, so that children do not simultaneously update their results, and not taking lock_smp there has never given me a problem.

Code: Select all

value =
        SearchParallel(thread[tid], thread[tid]->alpha, thread[tid]->beta,
        thread[tid]->value, thread[tid]->wtm, thread[tid]->depth,
        thread[tid]->ply);
    Lock(lock_smp);
    Lock(thread[tid]->parent->lock);
   ....

Thanks.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:[snip: the spinlock quote and the lock.h code from the previous post]
That optimization does not seem to be used in the section I marked. Another question: why are you using ((v)[0] = 0) instead of simply ((v) = 0) in some of the above code?

Yet another question: after the SearchParallel() call has finished for a thread and you are about to copy the result from the child, you take the general lock, i.e. lock_smp. Is that really necessary? I lock only the parent lock, so that children do not simultaneously update their results, and not taking lock_smp there has never given me a problem.
You are looking at the wrong lock. That is the inline asm for the DEC Alpha chip. :)

Here is the intel lock:

Code: Select all

static void __inline__ LockX86(volatile int *lock) {
  int dummy;
  asm __volatile__(
      "1:          movl    $1, %0"   "\n\t"
      "            xchgl   (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jz      3f"       "\n\t"
      "2:          pause"            "\n\t"
      "            movl    (%1), %0" "\n\t"
      "            testl   %0, %0"   "\n\t"
      "            jnz     2b"       "\n\t"
      "            jmp     1b"       "\n\t"
      "3:"                           "\n\t"
      :"=&q"(dummy)
      :"q"(lock)
      :"cc", "memory");
}
That is simply inline asm for this:

while (exchange(lock, 1)) while (lock);

The idea is that the exchange is an atomic operation, and very slow. So you try it first; if it returns a 1 (the lock is owned by another thread), you drop into the while (lock) loop, which is just a fetch and jnz (not atomic). The lock value gets copied into cache and we spin on the cache hit until bus snooping detects that another processor/thread has set the lock back to zero. At that point while (lock) becomes false, and we go back and try the safe atomic xchg instruction again...
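To see why that first atomic exchange matters, here is a hypothetical stand-alone demo in the same spirit (the names acquire, release, worker, and counter are mine, not Crafty's): two threads increment a shared counter under a test-and-test-and-set lock built from C11 atomics.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical demo, not Crafty code: two threads increment a shared
   counter under a test-and-test-and-set spinlock. Without the atomic
   exchange, both threads could observe the lock free and enter the
   critical section at once. */
static atomic_int lock;
static long counter;                  /* protected by the lock above */

static void acquire(void) {
  while (atomic_exchange(&lock, 1))   /* atomic read-and-write */
    while (atomic_load(&lock))        /* cheap read-only spin */
      ;
}

static void release(void) { atomic_store(&lock, 0); }

static void *worker(void *arg) {
  (void)arg;
  for (int i = 0; i < 100000; i++) {
    acquire();
    counter++;                        /* critical section */
    release();
  }
  return NULL;
}
```

With two threads running worker(), counter ends at exactly 200000; replace acquire() with a naive read-then-write check and updates are lost.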

I do not know of _any_ multiprocessor architecture today that does not have a coherent cache. Otherwise you could not write SMP-safe code on it...

As far as the lock goes: if you are looking at the one at the bottom of SearchParallel(), you might have overlooked the key point that it is only taken in a fail-high situation, where I need to stop the other threads that are helping at this split point, or at any split point below it that is part of this search. I set that lock so that I can safely set stop indicators for every split block that is helping me, without the ugly case of another thread trying to set my stop flag while I am setting his, because more than one thread can fail high at the same node. Is that what you are looking at?
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

You are looking at the wrong lock. That is the inline asm for the DEC Alpha chip.
OK, then I take it that the optimization works except on that chip.
As far as the lock goes: if you are looking at the one at the bottom of SearchParallel(), you might have overlooked the key point that it is only taken in a fail-high situation, where I need to stop the other threads that are helping at this split point, or at any split point below it that is part of this search. I set that lock so that I can safely set stop indicators for every split block that is helping me, without the ugly case of another thread trying to set my stop flag while I am setting his, because more than one thread can fail high at the same node. Is that what you are looking at?
No, that is not it. I quoted the relevant code in my previous post. It is after a thread runs out of work (no more moves to search at the split point): you then return from SearchParallel() and call CopyFromChild(). I think this section needs to be protected with only the parent lock.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: TTTT: Two Tier Transposition Table

Post by sje »

It's hard enough to predict two-tier movepath enumeration times, let alone those of a deep search. Only testing a real implementation of a TTTT over many runs will help.

----

For what it's worth, I have the latest enumeration results on a four-core 2.66 GHz Xeon, and the program managed to come up with perft(10) in under three hours. The top five plies (approximately one million unique positions and their counts) were kept in a shared locking tree; the rest were kept in per-thread local transposition tables.

Code: Select all

Depth: 8   Count: 84998978956   Elapsed: 42.0357
2.02207e+09 Hz   4.94543e-10 seconds

Depth: 9   Count: 2439530234167   Elapsed: 617.726
3.94921e+09 Hz   2.53215e-10 seconds

Depth: 10   Count: 69352859712417   Elapsed: 9922.21
6.98966e+09 Hz   1.43069e-10 seconds
Edited to add: the shared tree is accessed via pthread locking while the threads are synchronized using sleeplocks.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:
You are looking at the wrong lock. That is the inline asm for the DEC Alpha chip.
OK, then I take it that the optimization works except on that chip.
As far as the lock goes: if you are looking at the one at the bottom of SearchParallel(), you might have overlooked the key point that it is only taken in a fail-high situation, where I need to stop the other threads that are helping at this split point, or at any split point below it that is part of this search. I set that lock so that I can safely set stop indicators for every split block that is helping me, without the ugly case of another thread trying to set my stop flag while I am setting his, because more than one thread can fail high at the same node. Is that what you are looking at?
No, that is not it. I quoted the relevant code in my previous post. It is after a thread runs out of work (no more moves to search at the split point): you then return from SearchParallel() and call CopyFromChild(). I think this section needs to be protected with only the parent lock.

Aha. No, there are other race conditions. One thread can be backing up a result at the same instant another is saying "we need to stop everyone, I've found a beta cutoff." I can't let those two things overlap, as it can wreck the copying process. Normal exits don't need the SMP lock, but I use it to protect against a race condition I found a long time back. I think I actually fixed the underlying problem, so that lock might not be needed, but it will take a _ton_ of testing to prove it works without it, since the timing window is so small. However, since I'm not seeing any performance issues with it at the moment, I've not taken the time to continue that testing...
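The serialization being described can be sketched abstractly: one shared lock covers both "stop everyone" and "copy a result back", so the two operations can never interleave. Everything below (struct split_block, stop_helpers, copy_result) is illustrative naming, not Crafty's actual code.

```c
#include <pthread.h>

/* One lock serializes both operations, so a helper is never mid-copy
   while another thread is setting its stop flag. */
static pthread_mutex_t lock_smp = PTHREAD_MUTEX_INITIALIZER;

struct split_block {
  int stop;        /* tells a helper thread to abandon its search */
  int value;       /* result a helper copies back to its parent */
};

/* Fail-high path: flag every helper under the global lock. */
static void stop_helpers(struct split_block *helpers, int n) {
  pthread_mutex_lock(&lock_smp);
  for (int i = 0; i < n; i++)
    helpers[i].stop = 1;
  pthread_mutex_unlock(&lock_smp);
}

/* Back-up path: copy a result only while holding the same lock, so
   it cannot interleave with stop_helpers() above. Returns 1 if the
   result was copied, 0 if the child had already been stopped. */
static int copy_result(struct split_block *child, int *dest) {
  int copied = 0;
  pthread_mutex_lock(&lock_smp);
  if (!child->stop) {
    *dest = child->value;
    copied = 1;
  }
  pthread_mutex_unlock(&lock_smp);
  return copied;
}
```

Using only a per-parent lock, as suggested upthread, protects siblings at one split point from each other, but not a copy at one split point racing a fail-high sweep that spans several split points; the single global lock closes that window at some cost in contention.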