parallel search speedup/overhead

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search speedup/overhead

Post by bob »

Sven Schüle wrote:Hi Folkert,

two questions:
1) How many different testpositions were used to obtain these data?
2) Is this a fixed-depth test so that you can calculate "time to depth"?

If the answer to 2) were "yes" then my conclusion would be (after calculating TTD, NPS speedup and TTD-based SMP speedup):

Code: Select all

nCores  nodes    NPS    TTD    NPS speedup  TTD-based SMP speedup
     1   300000  17000  17.65  1.00         1.00
     2  2100000  32000  65.63  1.88         0.27
     3  2000000  48000  41.67  2.82         0.42
     4  2700000  64000  42.19  3.76         0.42
     5  2800000  69000  40.58  4.06         0.43
     6  3500000  83000  42.17  4.88         0.42
     7  3300000  86000  38.37  5.06         0.46
     8  4000000  87000  45.98  5.12         0.38
     9  5000000  90000  55.56  5.29         0.32
    10  4700000  90000  52.22  5.29         0.34
    11  6000000  97000  61.86  5.71         0.29
a) Your NPS speedup looks almost perfect up to 4 logical cores, and still acceptable for 5 and 6 logical cores. With >6 logical cores you get practically no further NPS speedup, which is not unexpected since your machine has only 6 physical cores. So better not to read too much into the values for >6.

b) Your "time to depth" is horrible. It *increases* drastically from 1 to 2 logical cores, where everyone would expect it to *decrease*, then goes down a bit for 3 cores, and basically remains constant at about 42 seconds for more logical cores. That is not how it should be. The TTD-based SMP speedup for 3..6 logical cores would be around 0.42 in your case, which is absolutely ineffective. At least 2.0 for 4 cores would be OK, 2.5 already much better, and 3.0 really professional.

So I hope your underlying test is actually not a "fixed-depth" test :-)

Sven
My students continue to measure time incorrectly on various parallel machines. One can use clock(), and on some machines it returns a value that is the sum of the processor times of all the threads; on others, just the CPU time of the main thread. This data sort of looks like the cumulative time, perhaps...



I've used gettimeofday() on Unix forever, which solves that and gives reasonable resolution as well.
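The same pitfall exists in a Java program like Folkert's: System.currentTimeMillis() and System.nanoTime() give wall-clock time (the right basis for time to depth), while per-thread CPU time has to be queried via ThreadMXBean. A minimal sketch of the difference (illustrative only, not DeepBrutePos code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class TimingDemo {
    // Returns {wall-clock seconds, CPU seconds of the calling thread}
    static double[] measure() throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long wallStart = System.nanoTime();             // wall clock: the right basis for TTD
        long cpuStart = bean.getCurrentThreadCpuTime(); // CPU time of this thread only

        Thread worker = new Thread(() -> {
            long x = 0;
            for (int i = 0; i < 50_000_000; i++) x += i; // busy work on another core
        });
        worker.start();
        worker.join();   // the main thread mostly sleeps here, burning almost no CPU

        double wallSec = (System.nanoTime() - wallStart) / 1e9;
        double cpuSec = (bean.getCurrentThreadCpuTime() - cpuStart) / 1e9;
        return new double[] { wallSec, cpuSec };
    }

    public static void main(String[] args) throws InterruptedException {
        double[] t = measure();
        // Wall time covers the worker's whole run; the main thread's CPU time stays near zero.
        System.out.printf("wall=%.3fs cpu=%.3fs%n", t[0], t[1]);
    }
}
```

If a clock()-style cumulative CPU time were reported instead of wall time, NPS and TTD figures would be inflated roughly in proportion to the thread count, which is what bob is warning about.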
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: parallel search speedup/overhead

Post by Sven »

bob wrote:
Sven Schüle wrote:

Code: Select all

nCores  nodes    NPS    TTD    NPS speedup  TTD-based SMP speedup
     1   300000  17000  17.65  1.00         1.00
     2  2100000  32000  65.63  1.88         0.27
     3  2000000  48000  41.67  2.82         0.42
     4  2700000  64000  42.19  3.76         0.42
     5  2800000  69000  40.58  4.06         0.43
     6  3500000  83000  42.17  4.88         0.42
     7  3300000  86000  38.37  5.06         0.46
     8  4000000  87000  45.98  5.12         0.38
     9  5000000  90000  55.56  5.29         0.32
    10  4700000  90000  52.22  5.29         0.34
    11  6000000  97000  61.86  5.71         0.29
This data sort of looks like the cumulative time perhaps...
I don't think so, since the NPS data look almost plausible for nCores <= 6. But look at the "nodes" and "TTD" columns (where I simply set TTD := nodes/NPS). 300000 nodes with nCores=1 and 2100000 with nCores=2 can't be right, and there is nothing about time measurement involved. It looks like the second core leads to an overall search-tree explosion for Folkert, the third helps a bit, and all additional cores do not help at all.
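The derived columns in Sven's table follow mechanically from the raw nodes and NPS figures. A small sketch of the arithmetic, using the first four rows of the table (the helper names are mine):

```java
public class SpeedupTable {
    // Time to depth in seconds, as Sven defines it: TTD := nodes / NPS
    static double ttd(long nodes, long nps) {
        return (double) nodes / nps;
    }

    public static void main(String[] args) {
        // Raw (nodes, NPS) pairs from the table above, nCores = 1..4
        long[] nodes = {300_000, 2_100_000, 2_000_000, 2_700_000};
        long[] nps   = {17_000, 32_000, 48_000, 64_000};
        double ttd1 = ttd(nodes[0], nps[0]);    // single-core time to depth
        for (int c = 0; c < nodes.length; c++) {
            double t = ttd(nodes[c], nps[c]);
            double npsSpeedup = (double) nps[c] / nps[0];
            double smpSpeedup = ttd1 / t;       // what actually matters for playing strength
            System.out.printf("%2d cores: TTD=%6.2fs NPS speedup=%.2f SMP speedup=%.2f%n",
                              c + 1, t, npsSpeedup, smpSpeedup);
        }
    }
}
```

The point of the two speedup columns: NPS speedup only says the cores are busy; the TTD-based SMP speedup says whether that work finishes the iteration sooner.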

Sven
flok

Re: parallel search speedup/overhead

Post by flok »

Hi,

I did some tweaking.

Code: Select all

7 1 8 22025 1.637 13454.489920586439 E7-E6
7 2 8 104101 3.849 27046.245778124186 E7-E5
7 3 8 256249 6.105 41973.62817362817 D7-D5
7 4 8 328653 6.352 51740.081863979845 E7-E5
7 5 8 372687 6.518 57178.122123350724 D7-D5
7 6 8 366094 6.223 58829.1820665274 D7-D5
7 7 8 672516 10.126 66414.77384949636 E7-E5
7 8 8 515614 7.9 65267.594936708854 E7-E6
7 9 8 453629 6.848 66242.55257009345 E7-E5
7 10 8 700660 10.531 66533.09277371569 E7-E5
7 11 8 500537 7.79 64253.786906290115 D7-D5
7 12 8 677317 9.659 70122.89056838182 E7-E6
Each row is a fresh start of the program.
So the fourth row is the program started with 4 cores. It searched 328653 nodes in 6.352 seconds, at 51740 nodes/s. It then came up with e7-e5 as the result.
The time shown is measured with "System.currentTimeMillis()" (this is a Java program).

If anyone would like to give the program a try: http://vps001.vanheusden.com/~folkert/D ... 130807.jar

invoke it like this (using java 5!):

java -jar DeepBrutePos-1.7-20130807.jar --io-mode console --depth 7 --max-search-duration 0 --logfile test.log

If you would like to run it from xboard, it might work to do:
java -jar DeepBrutePos-1.7-20130807.jar --io-mode xboard --depth 7 --max-search-duration 0 --logfile test.log
but I have not tested that for a while
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search speedup/overhead

Post by bob »

Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:

Code: Select all

nCores  nodes    NPS    TTD    NPS speedup  TTD-based SMP speedup
     1   300000  17000  17.65  1.00         1.00
     2  2100000  32000  65.63  1.88         0.27
     3  2000000  48000  41.67  2.82         0.42
     4  2700000  64000  42.19  3.76         0.42
     5  2800000  69000  40.58  4.06         0.43
     6  3500000  83000  42.17  4.88         0.42
     7  3300000  86000  38.37  5.06         0.46
     8  4000000  87000  45.98  5.12         0.38
     9  5000000  90000  55.56  5.29         0.32
    10  4700000  90000  52.22  5.29         0.34
    11  6000000  97000  61.86  5.71         0.29
This data sort of looks like the cumulative time perhaps...
I don't think so, since the NPS data look almost plausible for nCores <= 6. But look at the "nodes" and "TTD" columns (where I simply set TTD := nodes/NPS). 300000 nodes with nCores=1 and 2100000 with nCores=2 can't be right, and there is nothing about time measurement involved. It looks like the second core leads to an overall search-tree explosion for Folkert, the third helps a bit, and all additional cores do not help at all.

Sven
Almost looks like an ABDADA-type program, but every thread is searching the same tree at the same time. Lots of nodes, no extra depth.
flok

Re: parallel search speedup/overhead

Post by flok »

Almost looks like an ABDADA-type program, but every thread is searching the same tree at the same time. Lots of nodes, no extra depth.
I optimized it slightly more:

Code: Select all

7 1 8 21218 1.662 12766.546329723225 E7-E6
7 2 8 46230 1.763 26222.34826999433 E7-E5
7 3 8 85331 2.468 34574.959481361424 E7-E5
7 4 8 89856 2.164 41523.10536044362 E7-E5
7 5 8 182247 3.721 48977.96291319538 D7-D5
7 6 8 136816 2.802 48827.98001427552 D7-D5
7 7 8 176298 3.275 53831.45038167939 E7-E5
7 8 8 487211 8.394 58042.768644269716 E7-E6
7 9 8 256852 4.76 53960.50420168068 D7-D5
7 10 8 138661 2.877 48196.38512339243 E7-E5
7 11 8 405952 7.776 52205.76131687243 E7-E5
7 12 8 182149 3.401 53557.483093207884 E7-E5
relevant columns:
2nd: thread count
4th: nodes visited
5th: how long it took
6th: average number of nodes per second

Note: each line is a completely new invocation of the program. This is also with a transposition table of only 65536 elements (that is, entries, not bytes).

Also, some other VM on that PC was sometimes using one complete core or more, which is why 6 threads showed no improvement over 5.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: parallel search speedup/overhead

Post by bob »

flok wrote:
Almost looks like an ABDADA-type program, but every thread is searching the same tree at the same time. Lots of nodes, no extra depth.
I optimized it slightly more:

Code: Select all

7 1 8 21218 1.662 12766.546329723225 E7-E6
7 2 8 46230 1.763 26222.34826999433 E7-E5
7 3 8 85331 2.468 34574.959481361424 E7-E5
7 4 8 89856 2.164 41523.10536044362 E7-E5
7 5 8 182247 3.721 48977.96291319538 D7-D5
7 6 8 136816 2.802 48827.98001427552 D7-D5
7 7 8 176298 3.275 53831.45038167939 E7-E5
7 8 8 487211 8.394 58042.768644269716 E7-E6
7 9 8 256852 4.76 53960.50420168068 D7-D5
7 10 8 138661 2.877 48196.38512339243 E7-E5
7 11 8 405952 7.776 52205.76131687243 E7-E5
7 12 8 182149 3.401 53557.483093207884 E7-E5
relevant columns:
2nd: thread count
4th: nodes visited
5th: how long it took
6th: average number of nodes per second

Note: each line is a completely new invocation of the program. This is also with a transposition table of only 65536 elements (that is, entries, not bytes).

Also, some other VM on that PC was sometimes using one complete core or more, which is why 6 threads showed no improvement over 5.
The peculiar thing is that there appears to be no speedup at all. And it slows down, in fact, although the 1-CPU to 2-CPU slowdown is minor.
flok

Re: parallel search speedup/overhead

Post by flok »

bob wrote:
flok wrote:
Almost looks like an ABDADA-type program, but every thread is searching the same tree at the same time. Lots of nodes, no extra depth.
I optimized it slightly more:

Code: Select all

7 1 8 21218 1.662 12766.546329723225 E7-E6
7 2 8 46230 1.763 26222.34826999433 E7-E5
7 3 8 85331 2.468 34574.959481361424 E7-E5
7 4 8 89856 2.164 41523.10536044362 E7-E5
7 5 8 182247 3.721 48977.96291319538 D7-D5
7 6 8 136816 2.802 48827.98001427552 D7-D5
7 7 8 176298 3.275 53831.45038167939 E7-E5
7 8 8 487211 8.394 58042.768644269716 E7-E6
7 9 8 256852 4.76 53960.50420168068 D7-D5
7 10 8 138661 2.877 48196.38512339243 E7-E5
7 11 8 405952 7.776 52205.76131687243 E7-E5
7 12 8 182149 3.401 53557.483093207884 E7-E5
relevant columns:
2nd: thread count
4th: nodes visited
5th: how long it took
6th: average number of nodes per second

Note: each line is a completely new invocation of the program. This is also with a transposition table of only 65536 elements (that is, entries, not bytes).

Also, some other VM on that PC was sometimes using one complete core or more, which is why 6 threads showed no improvement over 5.
The peculiar thing is that there appears to be no speedup at all. And it slows down, in fact, although the 1-CPU to 2-CPU slowdown is minor.
The improvement is in the aspiration algorithm and the proper handling of cut-offs, so that the search duration for a certain depth (7 plies in this case) is decreased.
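For readers unfamiliar with the term: an aspiration search starts with a narrow window around the previous score and re-searches with a widened window whenever the result falls outside it. A generic sketch, where the initial window size, the fail-hard stub search(), and all names are my assumptions, not flok's actual code:

```java
public class Aspiration {
    static final int TRUE_SCORE = 180;   // stand-in for what a real search would return

    // Fail-hard stub standing in for the engine's alpha-beta search
    static int search(int depth, int alpha, int beta) {
        return Math.max(alpha, Math.min(beta, TRUE_SCORE));
    }

    static int aspirationSearch(int depth, int prevScore) {
        int window = 50;                          // initial half-width in centipawns (assumed)
        int alpha = prevScore - window, beta = prevScore + window;
        while (true) {
            int score = search(depth, alpha, beta);
            if (score <= alpha)     { window *= 2; alpha = score - window; }  // fail low: widen down
            else if (score >= beta) { window *= 2; beta  = score + window; }  // fail high: widen up
            else return score;                    // score inside the window: trustworthy
        }
    }

    public static void main(String[] args) {
        // Re-searches with a widening window until it contains the score, then returns 180
        System.out.println(aspirationSearch(7, 0));
    }
}
```

The win comes from the narrower bounds producing more cut-offs in the common case; the occasional re-search is the price paid when the guess is wrong.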
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: parallel search speedup/overhead

Post by Sven »

flok wrote:The improvement is in the aspiration algorithm and the proper handling of cut-offs, so that the search duration for a certain depth (7 plies in this case) is decreased.
This is indeed an improvement of the search itself: the 1-core version now needs only 21000 instead of 300000 nodes for the 7-ply search. But the key problem remains: going from one to two cores more than doubles the overall tree size, and adding a third core almost doubles it again.

Even if each thread (i.e., each core) searched exactly the same tree and did not interact with any other search thread at all, you would expect a total tree size of at most nCores * (tree size of the 1-core search). So I would say there is still some serious error in the parallel search algorithm.

Maybe you could add separate node counters per thread, and other statistics, to see a bit more of what actually happens?
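Per-thread node counters of the kind Sven suggests can be as simple as an array indexed by thread id; a hedged sketch (the names and structure are mine, not DeepBrutePos internals):

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class NodeCounters {
    final AtomicLongArray nodesPerThread;

    NodeCounters(int nThreads) {
        nodesPerThread = new AtomicLongArray(nThreads);
    }

    // Called from each search thread at every node it visits
    void countNode(int threadId) {
        nodesPerThread.incrementAndGet(threadId);
    }

    public static void main(String[] args) throws InterruptedException {
        int nThreads = 3;
        NodeCounters counters = new NodeCounters(nThreads);
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                // Fake "search work": thread t counts 1000*(t+1) nodes
                for (int i = 0; i < 1000 * (id + 1); i++) counters.countNode(id);
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        for (int t = 0; t < nThreads; t++)
            System.out.println(t + ": " + counters.nodesPerThread.get(t));
    }
}
```

In a real engine a plain per-thread long summed at the end is cheaper than atomics (no cache-line contention on the hot path); the atomic array here just keeps the sketch short.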

Sven
flok

Re: parallel search speedup/overhead

Post by flok »

Sven, thanks for pointing me at that; I had not noticed it, in fact.
This is indeed rather strange.

I added the counters and did some tests:

1 thread:
0: 216223

2 threads:
0: 150864
1: 1368185

3 threads:
0: 10569
1: 746257
2: 9471

this is...odd.
flok

Re: parallel search speedup/overhead

Post by flok »

flok wrote:Sven, thanks for pointing me at that; I had not noticed it, in fact.
This is indeed rather strange.
Found the problem: when iterating through the list of moves, I sometimes accidentally incremented the index twice. So about half of all moves were not even looked at.
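That bug class is easy to reproduce. A hypothetical reconstruction of the double-increment pattern and its fix (not flok's actual code):

```java
import java.util.List;

public class MoveLoop {
    // Buggy pattern: the index is advanced both by the for-header and inside
    // the body, so every other move is silently skipped.
    static int countVisitedBuggy(List<String> moves) {
        int visited = 0;
        for (int i = 0; i < moves.size(); i++) {
            visited++;
            i++;               // accidental extra increment
        }
        return visited;
    }

    // Fixed: only the loop header advances the index.
    static int countVisitedFixed(List<String> moves) {
        int visited = 0;
        for (int i = 0; i < moves.size(); i++) {
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> moves = List.of("e2e4", "d2d4", "g1f3", "c2c4");
        // Prints "2 vs 4": the buggy loop sees only half the moves
        System.out.println(countVisitedBuggy(moves) + " vs " + countVisitedFixed(moves));
    }
}
```

Skipping half the root and interior moves explains both symptoms in the thread: wrong-looking trees and extra threads only adding redundant work instead of depth.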