Explanation for non-expert?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

lucasart wrote:
bob wrote: I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.
If you've understood all this since the 80's, how come Crafty scales so badly from 8 to 16 cores?
It doesn't on my machine. It depends on the box. The PC is not exactly the perfect platform. A 16-core machine typically means two chips, although it could be a single one now. And 16 cores on a single chip, with a single path to memory, has a problem.

Additionally, I chose to write Crafty in a simpler way than the DTS algorithm used in Cray Blitz. I may well go back and rewrite that one day, because DTS is significantly better. It just hasn't been a high priority. But the issues for SMP scaling are not new. SMP search has been around for almost 40 years now... Alpha/Beta hasn't changed.

On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores. You do have to know what you are doing when testing, and make certain turbo is disabled, or the answers are useless. Most miss that.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

zamar wrote:Hi Bob,
bob wrote: #1 is not quite so simple. If you split below a split point, you still incur about the same amount of overhead. This is similar to speculative execution in a modern CPU. Remember that the only overhead you incur is the work this new thread searches, if the split turns out to be wrong. It really doesn't matter where it happens.
You missed the point. If the thread is searching under a split point which is under an existing split point which is under another existing split point, and so on (the depth of this chain is called spLevel in SF), you are speculating on speculations on speculations. If any of those speculations in the chain turns out to be wrong, the work of the thread is lost. Just don't do that.

An analogy from real life:
- If an idiot helps a wise man, something good can still result from it.
- If an idiot helps an idiot who helps an idiot who helps an idiot who helps a wise man, nothing good comes of it.
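A minimal sketch of why (the probability p here is illustrative, not anything measured in SF): if each split point in the chain is independently "good" with probability p, a thread buried k levels deep only does useful work with probability p^k, which decays fast:

#include <cmath>
#include <cstdio>

int main() {
    const double p = 0.9;  // illustrative chance that one split point survives
    for (int k = 1; k <= 5; ++k)
        // chance that ALL k split points above the thread turn out correct,
        // i.e. that the thread's work is not thrown away
        std::printf("spLevel %d: useful with probability %.2f\n", k, std::pow(p, k));
    return 0;
}

With p = 0.9 that is 0.90, 0.81, 0.73, 0.66, 0.59; every extra level of speculation eats into the expected return.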
#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such thing as a sure ALL node, but in principle I agree. An ideal splitting algorithm would take the node type and the count of already searched moves into account, as this affects the probability of a cut-off.
(a) a "late joiner" joins but does nothing, because all work has already been completed, or has been scheduled by other threads and is in progress;
SF sets a flag when there is no more work to be done.
(b) N threads split but only N-m have work to do due to a small branching factor at that node (this is particularly bad in endgames and is why I still use a minimum thread group as I did in Cray Blitz to avoid having too many threads at one point.)
In SF this is controlled by minimum split depth and max slaves per split point parameters.
I understood the spLevel idea. You missed what I wrote. We are talking about a single thread. Wherever you stick it, if you are wrong its effort will be wasted. If you stick it below one or more split points, yes, if one of those is wrong then this will waste effort. But the probability of one of those split points being bad is no worse than the probability of a brand new split point being bad. So no matter where you stick this thread, all you risk is wasting its effort.

When I worked on my dissertation, using up to 30 CPUs at the time, I didn't have much luck trying to take advantage of unpredictable behavior. The obvious idea of trying to split near the root is problematic, because any mis-ordering can turn that ALL node into a CUT node in unpredictable ways. The only really good split point is the root since that is a guaranteed ALL node (PV nodes are also ALL).

But there are so many issues to deal with, and so many solutions, that doing things automatically is hard. You have to be very careful with cache management and how you share things to avoid all sorts of cache invalidation/forwarding traffic; memory bandwidth on a single chip is limited; you don't want a thread to bounce to a different CPU, so processor affinity is important; and then there are other things, starting with NUMA memory allocation.
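For the affinity part, a minimal Linux-specific sketch (this is not Crafty's actual code, and pin_to_core is a name I made up):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core so the scheduler cannot bounce it
// (and its warm cache contents) to a different CPU. glibc/Linux only;
// g++ defines _GNU_SOURCE by default, which this call requires.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Each search thread would call pin_to_core() with its own id once at startup; NUMA-aware memory placement is a separate problem on top of this.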

I've tried repeatedly to do something automatic in regard to setting things up optimally. No luck so far, because there is too much variability in the hardware. I tested on a dual 8 core a few weeks back and it really ran well. I tested on a dual 12 core a little later and the plain NPS scaling was bad and took a lot of tweaking to get right. I see this almost every time I run on something newer...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

zullil wrote:Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.
This is not easy. First, such a log could be quite big. As a humorous note, Cray once asked me to do this when I was testing on the T90 with 32 cpus. I asked "are you sure you want to see that?" I ran the test, told them where to find the log, Dennis called me back and said "7+ terabytes is a bit large, eh?" :)

There are several issues to deal with. IF NPS scaling is poor, that is almost always a result of either (a) threads sitting idle waiting for work too long or (b) cache/memory bottlenecks. Discovering which is not so easy. If you watch Crafty run, you will see it display time as "time=3:14(97%)". What that means is that 97% of the time ALL threads were busy, or the inverse, 3% of the total performance power of the machine (N cores) was lost due to waiting for a place to split. If NPS is low but that percentage is high, then we are looking at cache/memory issues which are difficult for a user to deal with. Intel changes their cache mechanism regularly, particularly with automatic pre-fetching. This makes figuring out where the NPS bottleneck is quite difficult at times. Intel / AMD have some internal tools I have used in the past to help me discover unknown cache hot-spots and the like. But every time they make an improvement, it bites some program until the programmer fixes that.

Easy-to-tune is a goal. But a VERY difficult one. I'm interested in how the dual 18-core boxes might run and what quirks they are going to have...
mar
Posts: 2567
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Explanation for non-expert?

Post by mar »

bob wrote:On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores.
11x speedup on 2x6 cores?...
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Explanation for non-expert?

Post by zullil »

bob wrote:
zullil wrote:Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.
This is not easy. First, such a log could be quite big. As a humorous note, Cray once asked me to do this when I was testing on the T90 with 32 cpus. I asked "are you sure you want to see that?" I ran the test, told them where to find the log, Dennis called me back and said "7+ terabytes is a bit large, eh?" :)

There are several issues to deal with. IF NPS scaling is poor, that is almost always a result of either (a) threads sitting idle waiting for work too long or (b) cache/memory bottlenecks. Discovering which is not so easy. If you watch Crafty run, you will see it display time as "time=3:14(97%)". What that means is that 97% of the time ALL threads were busy, or the inverse, 3% of the total performance power of the machine (N cores) was lost due to waiting for a place to split. If NPS is low but that percentage is high, then we are looking at cache/memory issues which are difficult for a user to deal with. Intel changes their cache mechanism regularly, particularly with automatic pre-fetching. This makes figuring out where the NPS bottleneck is quite difficult at times. Intel / AMD have some internal tools I have used in the past to help me discover unknown cache hot-spots and the like. But every time they make an improvement, it bites some program until the programmer fixes that.

Easy-to-tune is a goal. But a VERY difficult one. I'm interested in how the dual 18-core boxes might run and what quirks they are going to have...
Thanks. Even having access to such a percentage (time threads active per total time) for Stockfish might be helpful. Perhaps there is some way to measure this by monitoring the Stockfish process as it runs. Don't know enough.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

mar wrote:
bob wrote:On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores.
11x speedup on 2x6 cores?...
No, 11x NPS scaling. I didn't take the time to run the long speedup test; it takes a lot of time when you want the 12 core or 16 core test to take 3 minutes per position, which means you want the 1 core test to take an hour per position or so. My test has over 300 positions, so the 1 cpu run alone takes almost 2 weeks (300 positions at about an hour each is 300+ hours, roughly 12.5 days).

The NPS scaling is always a good starting point, because if it is bad, the parallel speedup will be even worse. It took quite a bit of program tweaking and tuning, as I started off on that particular box at maybe 8x NPS scaling, which translates to maybe a 6x speedup. Not so good. When I finished (this code is in 25.0 and will be released soon) it was "back to normal".

Until the next CPU change by Intel or AMD of course.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

zullil wrote:
bob wrote:
zullil wrote:Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.
This is not easy. First, such a log could be quite big. As a humorous note, Cray once asked me to do this when I was testing on the T90 with 32 cpus. I asked "are you sure you want to see that?" I ran the test, told them where to find the log, Dennis called me back and said "7+ terabytes is a bit large, eh?" :)

There are several issues to deal with. IF NPS scaling is poor, that is almost always a result of either (a) threads sitting idle waiting for work too long or (b) cache/memory bottlenecks. Discovering which is not so easy. If you watch Crafty run, you will see it display time as "time=3:14(97%)". What that means is that 97% of the time ALL threads were busy, or the inverse, 3% of the total performance power of the machine (N cores) was lost due to waiting for a place to split. If NPS is low but that percentage is high, then we are looking at cache/memory issues which are difficult for a user to deal with. Intel changes their cache mechanism regularly, particularly with automatic pre-fetching. This makes figuring out where the NPS bottleneck is quite difficult at times. Intel / AMD have some internal tools I have used in the past to help me discover unknown cache hot-spots and the like. But every time they make an improvement, it bites some program until the programmer fixes that.

Easy-to-tune is a goal. But a VERY difficult one. I'm interested in how the dual 18-core boxes might run and what quirks they are going to have...
Thanks. Even having access to such a percentage (time threads active per total time) for Stockfish might be helpful. Perhaps there is some way to measure this by monitoring the Stockfish process as it runs. Don't know enough.
What I do is have each thread count the number of milliseconds it spends waiting on work. I know how long a search takes (elapsed time), so this usage number is

usage = 100 - 100 * total_idle_time / (num_cores * elapsed_time)

I believe. If idle time is zero, which happens on my 4 core iMac all the time, then usage = 100%. With 4 cores I have been seeing 97-100% which is what I want. On that 12 core box, I finally drove it up to the numbers previously given (92% or so). It's a challenge to work well everywhere.
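A minimal sketch of that bookkeeping (the names are mine, not Crafty's): each thread adds its wait time to a shared counter, and at the end of the search that becomes a percentage:

#include <atomic>

std::atomic<long> total_idle_ms(0);   // summed across all threads

// each thread calls this after waiting 'waited_ms' for a split point
void add_idle(long waited_ms) { total_idle_ms += waited_ms; }

// usage = 100 - 100 * idle / (cores * elapsed); zero idle gives 100%
double usage_percent(int cores, long elapsed_ms) {
    return 100.0 - 100.0 * total_idle_ms.load() / (cores * (double)elapsed_ms);
}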
petero2
Posts: 697
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Explanation for non-expert?

Post by petero2 »

zamar wrote:
#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such thing as a sure ALL node, but in principle I agree. An ideal splitting algorithm would take the node type and the count of already searched moves into account, as this affects the probability of a cut-off.
Texel already does that. It collects cut-off statistics during search and uses the information to estimate the probability that a move at a potential split point would be searched by a corresponding single-threaded search. It also collects "overhead" statistics during search, which estimate how much time a thread needs to spend on starting/stopping the search, as a function of the depth of the subtree to search. For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.
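A minimal sketch of that selection rule (my reading of the description above, not Texel's actual code; the probability and overhead numbers are assumed to come from the collected statistics):

#include <vector>

struct Candidate {
    std::vector<double> chainProbs; // p1..pN for this and all parent split points
    double overhead;                // estimated start/stop cost as a fraction
};

// expected efficiency = p1 * p2 * ... * pN * (1 - overhead)
double efficiency(const Candidate& c) {
    double e = 1.0 - c.overhead;
    for (double p : c.chainProbs) e *= p;
    return e;
}

// pick the move/split point with the largest estimated efficiency
int best(const std::vector<Candidate>& cands) {
    int bi = 0;
    for (int i = 1; i < (int)cands.size(); ++i)
        if (efficiency(cands[i]) > efficiency(cands[bi])) bi = i;
    return bi;
}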

An early version of the algorithm is described here.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Explanation for non-expert?

Post by zamar »

petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
Joona Kiiski
petero2
Posts: 697
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Explanation for non-expert?

Post by petero2 »

zamar wrote:
petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
I wanted to design my own algorithm, and I wanted it to include some of the good properties from DTS.

I also wanted the algorithm to give a larger elo improvement than YBWC when using many cores, but I don't know if that succeeded. At least YBWC does not seem as good to me, because it basically replaces all the probability estimates with 0 (when the first move has not been searched) and 1 (when the first move has been searched).
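In the efficiency terms above, YBWC amounts to the degenerate estimate (my framing of the comparison, not any engine's actual code):

// YBW: a later move is treated as certain to need searching once the
// first move at the node is done, and as never worth splitting before that
double p_ybw(int movesAlreadySearched) {
    return movesAlreadySearched >= 1 ? 1.0 : 0.0;
}

so every split it allows is treated as equally good, regardless of node type or how many moves remain.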