quality of parallel search question

See the following thread in the Rybka forum:
http://rybkaforum.net/cgi-bin/rybkaforu ... l?tid=7762
I wonder whether the quality of analysis can suffer significantly when the number of threads is not constant (i.e. during the analysis you start a non-chess process that runs at the same time).
This question is relevant for every program that uses parallel search, not only Rybka.
Suppose you have 4 cores and you plan to use 1 core for other tasks during part of the analysis.
Is it better to start with 4 cores for chess or with 3 cores for chess?
Uri
bob
Re: quality of parallel search question
For reasonable implementations, more cores = better analysis. What you appear to be asking is "is it better to have constant skill when analyzing, or would you prefer more skill when possible?" I think max-cores is the way to go. Yes, if you steal a core for something else, the analysis will suffer a bit, but why would you want to do that voluntarily for all moves rather than just on occasional moves?
There is another interpretation of your question: always use the maximum number of threads and then run something else at random times? This is bad. A good program will use spin-locks to synchronize threads here and there. An operating system cannot recognize the case where three threads are spinning, waiting on the fourth to finish what it is doing and release the lock. There is a good chance it will run the three spinning threads for a full time slice, doing absolutely no useful work, while the fourth thread, the one actually doing something, has to sit idle waiting for a CPU.
If this is what you mean, I would _always_ use 3 cores and leave one free. Using 4 can _greatly_ hurt search speed in the cases above. This can be partially solved by using MUTEX-type locks, but they are _far_ slower in the normal case because blocking/unblocking is inefficient (hence why we use spinlocks).
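As a rough illustration of the kind of spin-lock being described, here is a minimal C sketch (illustrative only, not taken from any particular engine) using C11 atomics and POSIX threads. The busy-wait loop is exactly what burns a whole time slice when the lock holder has been preempted.

```c
/* spin.c - minimal spin-lock sketch (illustrative only).
 * Build: gcc -O2 -pthread spin.c -o spin */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void) {
    /* Busy-wait until the current holder clears the flag.  A thread
     * stuck here does no useful work but still consumes its CPU. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* a pause/yield hint could go here */
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock();
        counter++;             /* the protected shared update */
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* expect 4000000 */
    return 0;
}
```

If the OS preempts the thread that currently holds the flag while three others sit in spin_lock, those three can spin for their entire time slices before the holder runs again, which is the scenario described above.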
Tord Romstad
Re: quality of parallel search question
I am not sure spinlocks vs. mutex locks is the only issue. I use mutex locks (because they are more easily portable than spinlocks, and spinlocks are not measurably faster for my program), but I still usually see a slowdown far bigger than 50% if my program is using two threads while some other single-threaded, CPU-intensive program is running. Because of this, I hardly ever run my program with two search threads, except in tournaments.
Tord
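For contrast with the spin-lock sketch above, here is the portable pthread_mutex form of the same kind of critical section (again only an illustration, not taken from anyone's engine). A waiter blocks instead of spinning, so a preempted lock holder does not leave other threads burning CPU; the price is a slower lock/unlock path in the common, uncontended case.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Drop-in replacement for the spin_lock/spin_unlock pair above. */
void add_to_counter(void) {
    pthread_mutex_lock(&lock);
    counter++;                  /* the protected shared update */
    pthread_mutex_unlock(&lock);
}
```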
sje
Re: quality of parallel search question
I've seen this too, and I'd say that a big contributor to the slowdown Tord describes is (mostly) instruction cache contention caused by the second application.
bob
Re: quality of parallel search question
That should not be a problem. Both "processes" are using the _same_ instruction stream, with the same virtual/physical addresses. Poor NPS scaling can almost always be traced back to cache thrashing, most commonly caused by the previously mentioned poor memory-update approach of keeping arrays of values, one per thread...
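The "arrays of values, one per thread" problem is usually described as false sharing. Here is a hedged sketch (illustrative names and sizes, not engine code) of what it looks like and the usual cache-line-padding fix.

```c
/* false_sharing.c - per-thread counters with and without cache-line
 * padding (the names and NTHREADS are arbitrary). */
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 8

/* BAD: adjacent per-thread counters share cache lines, so every
 * increment by one thread invalidates the line in the other cores'
 * caches, and NPS scaling suffers. */
uint64_t nodes_bad[NTHREADS];

/* BETTER: give each counter its own 64-byte cache line so no two
 * threads ever write to the same line. */
struct padded_counter {
    _Alignas(64) uint64_t nodes;   /* padded out to a full cache line */
};
struct padded_counter nodes_good[NTHREADS];

int main(void) {
    printf("unpadded: %zu bytes per counter, padded: %zu bytes per counter\n",
           sizeof(nodes_bad[0]), sizeof(nodes_good[0]));
    return 0;
}
```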
Here's a quick test I ran on my dual-core Core 2 laptop: a single position searched to a fixed depth, first with 1 thread and then with 2. The NPS value at the end is interesting because it shows the raw node rate, while the speedup is the more useful number for measuring actual search effectiveness...
log.001: time=37.12 mat=0 n=61133510 fh=91% nps=1.6M
log.002: time=19.33 mat=0 n=57938373 fh=91% nps=3.0M
So NPS scales as 3.0 / 1.6 = 1.875,
while the speedup (the ratio of the search times, 37.12 / 19.33) is 1.92.
The Core 2 has one issue: two CPUs, two L1 caches, one L2 cache, and one path to memory, which can be a bit of a bottleneck. Quad-core boxes show a bigger NPS scaling issue, and dual quads (the machines I typically use at CCT-type events) are even worse:
log.001: time=40.12 mat=0 n=84174375 fh=93% nps=2.1M
log.002: time=20.65 mat=0 n=86718338 fh=93% nps=4.2M
log.003: time=8.62 mat=0 n=71404894 fh=93% nps=8.3M
log.004: time=6.44 mat=0 n=94728780 fh=92% nps=14.7M
On this box, NPS scales to right at 7x (14.7M / 2.1M = 7.0). On longer searches it climbs a bit more, but the same search on one CPU takes too long to watch.

Notice that two threads scale perfectly (4.2M / 2.1M = 2.0) and four are not bad (8.3M / 2.1M ≈ 4.0), but eight cores fighting over two memory paths become a bit of a choke point; that is the drawback of multi-core chips.
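To see the memory-path choke point outside of a chess engine, here is a hedged sketch (the file name bandwidth.c, the buffer size, and the pass count are all arbitrary choices, not anything from the post) that streams memory from N threads and reports aggregate throughput. On a machine like the dual quad above, the aggregate MB/s generally stops growing once the shared memory paths saturate, which is the same effect that caps NPS scaling.

```c
/* bandwidth.c - rough memory-bandwidth scaling probe (illustrative only).
 * Build: gcc -O2 -pthread bandwidth.c -o bandwidth
 * Run:   ./bandwidth 1   then   ./bandwidth 8   and compare MB/s. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define BUF_BYTES   (64u * 1024u * 1024u)  /* 64 MB per thread, larger than any cache */
#define PASSES      8
#define MAX_THREADS 64

static void *stream(void *arg) {
    volatile uint64_t *buf = arg;
    uint64_t sum = 0;
    for (int p = 0; p < PASSES; p++)
        for (size_t i = 0; i < BUF_BYTES / sizeof(uint64_t); i++)
            sum += buf[i];                /* read traffic through the memory path */
    return (void *)(uintptr_t)sum;        /* return the sum so the reads have a use */
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 1;
    if (n < 1) n = 1;
    if (n > MAX_THREADS) n = MAX_THREADS;

    pthread_t t[MAX_THREADS];
    uint64_t *bufs[MAX_THREADS];
    struct timespec t0, t1;

    for (int i = 0; i < n; i++)
        bufs[i] = calloc(BUF_BYTES, 1);   /* one private buffer per thread */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, stream, bufs[i]);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb   = (double)n * PASSES * BUF_BYTES / (1024.0 * 1024.0);
    printf("%d thread(s): %.1f MB/s aggregate\n", n, mb / secs);
    return 0;
}
```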