nps scaling

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

CThinker
Posts: 388
Joined: Wed Mar 08, 2006 10:08 pm

Re: nps scaling

Post by CThinker »

Daniel Shawul wrote:Recently I had a chance to run Scorpio on an 8-processor machine.
Unfortunately it did not seem to do better than the 4-processor version at all! I see that all the CPUs are busy (100%), but they are obviously sitting idle. The work allocation scheme is pretty much straightforward, but I don't know if that is good enough for keeping the processors alive. The 4-processor version scales almost 4x NPS-wise as expected, but the 8-CPU run does not increase the NPS at all! Bewildered, I took Crafty and ran it there, and it swept to 10 Mnps with no problem! I do not get many chances to run on that machine, but I can test on 4 CPUs as much as I want, which I hope will help me figure out the problem. I have heard of this kind of problem happening to others here, so if you could point me to things to watch out for, it would be much appreciated.
Daniel
I found something in your parallel search code...

Every time you need to find out whether you can do a split, you take a global lock and then loop through every processor's state to see if there are idle processors.

This looks really expensive, and the more processors there are, the worse it gets. Most of the time there are no idle processors, and yet you have to do this loop for every move. At the same time, you lock everything up.

I suggest that you maintain a counter of idle CPUs and simply check that counter to decide whether you can split.

Here is your code:

Code:

            if(n_processors > 1
                && pstack->depth > 2 * UNITDEPTH
                ) {
                register int i;
                /*
                attach idle processors to current processor
                */
                l_lock(lock_smp);
                for(i = 0; i < n_processors; i++) {
                    if(processors[i].state == WAIT) {
                        attach_processor(i);
                    }
                }
                if(n_workers) {
                    attach_processor(processor_id);
                    for(i = 0; i < n_processors; i++) {
                        if(workers[i])
                            processors[i].state = GO;
                    }
                }
                l_unlock(lock_smp);
            ...
            }
And this is my suggestion:

Code:

            if(g_n_idle_processors > 0 // there has to be an available CPU
                && pstack->depth > 2 * UNITDEPTH
                ) {
                register int i;
                /*
                attach idle processors to current processor
                */
                l_lock(lock_smp);
                if (g_n_idle_processors) { // check again, in case another thread acquired the CPUs while we were blocked on the lock
                    for(i = 0; i < n_processors; i++) {
                        if(processors[i].state == WAIT) {
                            attach_processor(i);
                        }
                    }
                    if(n_workers) {
                        attach_processor(processor_id);
                        for(i = 0; i < n_processors; i++) {
                            if(workers[i])
                                processors[i].state = GO;
                        }
                    }
                }
                l_unlock(lock_smp);
            ...
            }
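For the counter to be trustworthy, it has to be kept in sync with the processor states. Here is a minimal sketch of the bookkeeping I have in mind (the function names and update points are my own illustration, not Scorpio code; processors[], WAIT, l_lock and lock_smp are taken from the snippet above):

Code:

            /* idle-CPU counter, updated only while holding lock_smp */
            volatile int g_n_idle_processors = 0;

            /* called by a processor when it runs out of work and starts waiting */
            void processor_goes_idle(int id) {
                l_lock(lock_smp);
                processors[id].state = WAIT;
                g_n_idle_processors++;
                l_unlock(lock_smp);
            }

            /* attach_processor() is already called with lock_smp held, so the
               counter can simply be decremented there when a waiting processor
               is claimed for a split point */
            void attach_processor(int id) {
                g_n_idle_processors--;
                /* ... existing attach logic ... */
            }

With that in place, the unlocked read of g_n_idle_processors in the split condition is only a hint, which is why it is re-checked after the lock is taken.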
frankp
Posts: 228
Joined: Sun Mar 12, 2006 3:11 pm

Re: nps scaling

Post by frankp »

Yes. The NPS is scaling nicely, but the time to fixed depth is disappointing - only a factor of two to three faster. Still a long way to go.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

Daniel Shawul wrote:Yes, no splits below 3 plies left. Last time I tried 4 plies, the idle time of the processors increased (lower NPS), but I never checked whether the real speedup might be better. Without going to low-level cache optimization, are there any general guidelines useful for avoiding such problems before they crop up? I remember you gave such advice before, but I cannot find it. So if possible, could you repeat it here?
Here are some things that can affect NPS scaling (I am assuming threads here rather than fork() processes for some of these).

1. If you have an array like nodes[n], or any array indexed by thread id, where each thread does something like nodes[thread_id]++, that can be a killer, because the cache line holding those values gets bounced from cache to cache as each CPU modifies one of them.

2. The above can be generalized: if a thread regularly modifies a value in memory, no other thread should regularly modify any value within 64 bytes of it, or cache-line bouncing will kill performance. This is true whether you have MESI or MOESI cache coherency. (See the padding sketch after this list.)

3. Atomic locks have to be used sparingly. They result in huge overhead/delays because of the way they "lock" the memory accesses, which stalls other caches as well. Use a different lock for each thing that can be locked, rather than one global lock, and you can reduce this to manageable levels.

4. Minimize memory traffic. When copying split-block data, copy only the parts that are actually used in the parallel search below that node; don't copy everything.

5. Hashing is a key issue. Do not use a lock to protect hash storing, as that will be a bottleneck for _every_ node searched. Either use the XOR trick I discussed a month or two ago (sketched after this list), or ignore the problem if it won't make you crash or break.
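To make points 1 and 2 concrete, here is a minimal sketch of per-thread node counters with and without cache-line padding. The names and the 64-byte line size are illustrative; nothing here is taken from Scorpio or Crafty:

Code:

            #include <stdint.h>

            #define MAX_THREADS 8
            #define CACHE_LINE  64

            /* Bad: adjacent counters share a cache line, so every nodes[thread_id]++
               invalidates that line in every other CPU's cache. */
            uint64_t nodes[MAX_THREADS];

            /* Better: pad each per-thread counter out to its own cache line, so
               updates from different threads never touch the same line. */
            typedef struct {
                uint64_t nodes;
                char     pad[CACHE_LINE - sizeof(uint64_t)];
            } per_thread_t;

            per_thread_t counters[MAX_THREADS] __attribute__((aligned(CACHE_LINE)));

            /* in the search: counters[thread_id].nodes++; sum the totals once, at the end */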
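And a sketch of the lockless "XOR" hashing trick mentioned in point 5, as it is usually described: store the key XORed with the data, so that a probe racing with a store yields an entry that merely fails the key check instead of returning mismatched data. The structure layout and names below are illustrative, not Scorpio's or Crafty's:

Code:

            #include <stdint.h>

            typedef struct {
                uint64_t key;   /* actually holds hash_key ^ data */
                uint64_t data;  /* packed score, depth, bound, best move, ... */
            } hash_entry_t;

            static hash_entry_t *hash_table;  /* allocated elsewhere           */
            static uint64_t      hash_mask;   /* table size - 1 (power of two) */

            void hash_store(uint64_t hash_key, uint64_t data) {
                hash_entry_t *e = &hash_table[hash_key & hash_mask];
                e->key  = hash_key ^ data;    /* no lock: a torn store just fails the probe check */
                e->data = data;
            }

            int hash_probe(uint64_t hash_key, uint64_t *data) {
                hash_entry_t *e = &hash_table[hash_key & hash_mask];
                uint64_t d = e->data;
                if ((e->key ^ d) == hash_key) { /* matches only if key and data are consistent */
                    *data = d;
                    return 1;
                }
                return 0;
            }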
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

CThinker wrote:
(full post quoted above - the suggestion to keep an idle-CPU counter, together with the original and modified split code, snipped here)
Even better:

Once you determine there are idle processors in the above, save the number and set the global value to zero, so that others will not be stacking up on the above code while you are grabbing the processors that are idle. That leaves a very tiny window where a second processor might wait on the above code, but it will be quite rare. Right now it is probably guaranteed to happen, since I'll bet the "attach_processor()" call takes more than one node's worth of time to execute, which lets _everybody_ notice there are idle processors and stack up on the spinlock doing nothing.
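A minimal sketch of that refinement, reusing the names from the snippets quoted above (this is an illustration of the idea, not code from either engine; in this variant attach_processor() would no longer decrement the counter itself):

Code:

            if(g_n_idle_processors > 0          // cheap, lock-free pre-check
                && pstack->depth > 2 * UNITDEPTH
                ) {
                register int i;
                int n_idle;
                l_lock(lock_smp);
                /* snapshot the count and zero it immediately, so other threads doing
                   the lock-free pre-check above move on instead of queueing on
                   lock_smp while the slow attach_processor() calls run */
                n_idle = g_n_idle_processors;
                g_n_idle_processors = 0;
                if(n_idle) {
                    for(i = 0; i < n_processors; i++) {
                        if(processors[i].state == WAIT) {
                            attach_processor(i);
                        }
                    }
                    if(n_workers) {
                        attach_processor(processor_id);
                        for(i = 0; i < n_processors; i++) {
                            if(workers[i])
                                processors[i].state = GO;
                        }
                    }
                }
                l_unlock(lock_smp);
            ...
            }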
CThinker
Posts: 388
Joined: Wed Mar 08, 2006 10:08 pm

Re: nps scaling

Post by CThinker »

bob wrote:
(full post quoted above - the idle-CPU-counter suggestion and the "snapshot the count and zero it" refinement, snipped here)
Interestingly, Bob, you don't use what you have suggested in your Crafty code. Your 'thread idle counter' is incremented/decremented only by the Thread Procedure. So, your split code has no control over it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nps scaling

Post by bob »

CThinker wrote:
(earlier quotes snipped)
Interestingly, Bob, you don't use what you have suggested in your Crafty code. Your 'thread idle counter' is incremented/decremented only by the Thread Procedure. So, your split code has no control over it.
I don't have the extra overhead of the code given above. I find this works a bit better for me, because a thread can go idle while I am splitting, and I will then pick it up and simply add it to the current split point. He obviously has some bottlenecks, and my suggestion will get rid of what might be a pretty significant "lock contention" bottleneck...
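For contrast, a minimal sketch of the arrangement described above, in which only the thread procedure touches the idle counter and the split code merely reads it as a hint. This is not Crafty's actual code; the function names and the two stubs are placeholders (smp_idle is the name mentioned later in this thread), and a real engine would update the counter atomically or under the threads lock:

Code:

            /* threads currently waiting for work; updated only by the waiting thread itself */
            volatile int smp_idle = 0;

            extern int  work_assigned(int tid);        /* stub: has a split point given us work?   */
            extern void attach_idle_threads(int tid);  /* stub: lock, scan, attach waiting threads */

            void ThreadWait(int tid) {      /* the "thread procedure" while a thread has no work */
                smp_idle++;                 /* the idle thread announces itself ...              */
                while (!work_assigned(tid))
                    ;                       /* ... and spins until some split point picks it up  */
                smp_idle--;                 /* the same thread removes itself when it gets work  */
            }

            int Split(int tid) {
                if (!smp_idle)              /* cheap, unlocked early-out: nobody available to help */
                    return 0;
                attach_idle_threads(tid);   /* under the split lock, attach whoever is idle right
                                               now - including a thread that went idle after the
                                               test above */
                return 1;
            }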
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

Thanks very much, Lance! I made the changes you recommended, but they didn't improve things on the 4 CPUs (same NPS for 30-second runs). Hopefully it does better on the 8 CPUs, where the effect becomes more pronounced.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

Thank you. I think my problem is probably 3, and maybe 2 as well. The rest I don't have, at least not that I am aware of.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nps scaling

Post by Daniel Shawul »

How much of a bottleneck is it on 4 CPUs? Does the NPS drop significantly, for example, if you comment out the "smp_idle" check at the beginning of split? I really couldn't notice any change with or without it.