two questions:
1) How many different test positions were used to obtain these data?
2) Is this a fixed-depth test so that you can calculate "time to depth"?
If the answer to 2) were "yes" then my conclusion would be (after calculating TTD, NPS speedup and TTD-based SMP speedup):
a) Your NPS speedup looks almost perfect up to 4 logical cores used, and is still acceptable for 5 and 6 logical cores. With >6 logical cores you get practically no further NPS speedup, which is not unexpected since your machine only has 6 physical cores. So better not to read too much into the values for >6.
b) Your "time to depth" is horrible. It *increases* drastically from 1 to 2 logical cores, where everyone would expect it to *decrease*, then goes down a bit for 3 cores, and basically remains constant at about 42 seconds for more logical cores. That is not how it should be. The TTD-based SMP speedup for 3..6 logical cores would be around 0.42 in your case, which is absolutely ineffective. At least 2.0 for 4 cores would be o.k., 2.5 already much better, and 3.0 really professional.
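For clarity, this is all the arithmetic involved; a minimal sketch in Java (the engine's language), with illustrative placeholder numbers roughly matching the figures quoted in this thread, not the exact measured data:

// Sketch: deriving time-to-depth (TTD) and SMP speedups from measured
// nodes and NPS. All numbers are illustrative placeholders.
public class SmpSpeedup {
    public static void main(String[] args) {
        long nodes1 = 300_000;      // assumed 1-core node count
        double nps1 = 17_000;       // assumed 1-core NPS
        long nodesN = 2_100_000;    // assumed node count with N cores
        double npsN = 50_000;       // assumed NPS with N cores

        double ttd1 = nodes1 / nps1;   // TTD := nodes / NPS, in seconds
        double ttdN = nodesN / npsN;

        double npsSpeedup = npsN / nps1;  // raw throughput speedup
        double ttdSpeedup = ttd1 / ttdN;  // the speedup that matters

        System.out.printf("TTD: %.1f s -> %.1f s%n", ttd1, ttdN);
        System.out.printf("NPS speedup %.2f, TTD speedup %.2f%n",
                          npsSpeedup, ttdSpeedup);
    }
}

The point is that NPS speedup only measures raw throughput; TTD speedup is what actually matters for playing strength.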
So I hope your underlying test is actually not a "fixed-depth" test.
Sven
My students continue to measure time incorrectly on various parallel machines. One can use clock(), and on some machines it returns a value that is the sum of the processor times of all the threads; on others, just the CPU time of the main thread. This data sort of looks like the cumulative time perhaps...
I've used gettimeofday() on Unix forever, which solves that and gives reasonable resolution as well.
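Since the program in question turns out to be Java (see below), here is what the same distinction looks like there; a minimal sketch, nothing engine-specific assumed:

// Wall-clock time vs. per-thread CPU time in Java. Summing
// getThreadCpuTime() over all worker threads gives the "cumulative"
// figure; System.nanoTime() / currentTimeMillis() give wall time.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class TimingDemo {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();

        long wallStart = System.nanoTime();              // wall clock
        long cpuStart  = bean.getCurrentThreadCpuTime(); // this thread only
                                                         // (-1 where unsupported)
        Thread.sleep(100); // placeholder for real work; with N busy threads,
                           // total CPU time grows ~N times faster than wall time

        double wallMs = (System.nanoTime() - wallStart) / 1e6;
        double cpuMs  = (bean.getCurrentThreadCpuTime() - cpuStart) / 1e6;
        System.out.printf("wall %.1f ms, main-thread CPU %.1f ms%n", wallMs, cpuMs);
    }
}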
This data sort of looks like the cumulative time perhaps...
I don't think so since the NPS data look almost plausible for nCores <= 6. But look at the "nodes" and "TTD" columns (where I simply set TTD := nodes/NPS). 300000 nodes with nCores=1 and 2100000 with nCores=2 can't be right, and there is nothing about time measurement involved. It looks like the second core leads to an overall search tree explosion for Folkert, the third helps a bit, and all additional cores do not help at all.
Each row is a fresh start of the program.
So the fourth row is the program started with 4 cores. It searched 328653 nodes in 6.352 seconds, at 51740 nodes/s. It then came up with e7-e5 as the result.
The time shown is measured with "System.currentTimeMillis()" (this is a Java program).
If you would like to run it from xboard, it might work to do: java -jar DeepBrutePos-1.7-20130807.jar --io-mode xboard --depth 7 --max-search-duration 0 --logfile test.log
but I have not tested that for a while.
Almost looks like an ABDADA-type program, but every processor is searching the same tree at the same time. Lots of nodes. No extra depth.
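For reference, the ABDADA idea in a very rough sketch (hypothetical names, and simplified: no bounds, no cutoffs): each shared transposition-table entry counts the threads currently inside that node, and a thread defers children that are already being searched elsewhere to a second pass.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class AbdadaSketch {
    // One shared entry per position; nproc counts threads inside the node.
    static class TTEntry { final AtomicInteger nproc = new AtomicInteger(); }

    // Stand-in for a child position reachable by one move.
    interface Child { TTEntry entry(); int search(); }

    static int searchChildren(List<Child> children) {
        int best = Integer.MIN_VALUE;
        List<Child> deferred = new ArrayList<>();
        for (Child c : children) {              // pass 1: skip busy subtrees
            if (c.entry().nproc.get() > 0) { deferred.add(c); continue; }
            c.entry().nproc.incrementAndGet();
            try { best = Math.max(best, c.search()); }
            finally { c.entry().nproc.decrementAndGet(); }
        }
        for (Child c : deferred) {              // pass 2: search what is left
            c.entry().nproc.incrementAndGet();
            try { best = Math.max(best, c.search()); }
            finally { c.entry().nproc.decrementAndGet(); }
        }
        return best;
    }
}

That deferral is what keeps threads on *different* subtrees most of the time; data like the above suggests the opposite is happening.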
relevant columns:
2nd: thread count
4th: nodes visited
5th: how long it took
6th: average number of nodes per second
Note: each line is a completely new invocation of the program. This is also with a transposition table of only 65536 elements (that is, entries, not bytes; see the sketch below).
Also, some other VM on that PC was sometimes using one complete core and more, so that's why 6 threads showed no improvement over 5.
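For reference, a minimal sketch of such a fixed-size table (hypothetical names; a power-of-two size so the index is just a mask). With only 65536 always-replace entries, threads constantly overwrite each other's results, which alone can inflate node counts:

class TranspositionTable {
    static final int SIZE = 65536;                // entries, not bytes
    private final long[] keys = new long[SIZE];
    private final long[] data = new long[SIZE];   // packed score/depth/move

    int indexOf(long zobristKey) {
        return (int) (zobristKey & (SIZE - 1));   // works because SIZE is a power of two
    }

    void store(long key, long packed) {
        int i = indexOf(key);
        keys[i] = key;                            // always-replace scheme
        data[i] = packed;
    }

    Long probe(long key) {
        int i = indexOf(key);
        return keys[i] == key ? data[i] : null;   // null on miss or collision
    }
}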
The peculiar thing is that there appears to be no speedup at all. In fact it slows down, although the 1-CPU to 2-CPU slowdown is minor.
The improvement is in the aspiration algorithm and the proper handling of cut-offs, so that the search duration for a certain depth (7 plies in this case) is decreased.
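For reference, a typical aspiration-window loop looks roughly like this (a sketch with hypothetical names and a stand-in alphaBeta, not DeepBrutePos's actual code):

class AspirationSketch {
    // Stand-in for the engine's full alpha-beta search.
    static int alphaBeta(int depth, int alpha, int beta) { return 0; }

    static int aspirationSearch(int depth, int prevScore) {
        int window = 50;    // assumed initial half-window, in centipawns
        int alpha = prevScore - window, beta = prevScore + window;
        while (true) {
            int score = alphaBeta(depth, alpha, beta);
            if (score <= alpha)     { window *= 2; alpha = score - window; } // fail low: widen down
            else if (score >= beta) { window *= 2; beta  = score + window; } // fail high: widen up
            else return score;      // score landed inside the window: done
        }
    }
}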
flok wrote: The improvement is in the aspiration algorithm and the proper handling of cut-offs, so that the search duration for a certain depth (7 plies in this case) is decreased.
This is indeed an improvement of the search itself, the 1-core version now needs only 21000 instead of 300000 nodes for the 7-ply search. But the key problem remains: going from one to two cores more than doubles the overall tree size, and adding a third core almost doubles it again.
Even if each thread (i.e., each core) searched exactly the same tree and did not interact with any other search thread at all, you would expect a total tree size of at most nCores * (tree size of the 1-core search). So I would say there is still some serious error in the parallel search algorithm.
Maybe you could add separate node counters per thread, and other statistics, to see a bit more of what actually happens?
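Something along these lines would do (a sketch, assuming a fixed pool of worker threads with ids 0..n-1):

import java.util.concurrent.atomic.AtomicLongArray;

class SearchStats {
    private final AtomicLongArray nodesPerThread;

    SearchStats(int nThreads) { nodesPerThread = new AtomicLongArray(nThreads); }

    // Call once per node visited, from the owning worker thread.
    void countNode(int threadId) { nodesPerThread.incrementAndGet(threadId); }

    void print() {
        long total = 0;
        for (int i = 0; i < nodesPerThread.length(); i++) {
            System.out.printf("thread %d: %d nodes%n", i, nodesPerThread.get(i));
            total += nodesPerThread.get(i);
        }
        System.out.println("total: " + total);
    }
}

Plain per-thread long counters summed at the end would avoid the atomic overhead, but for statistics this is fine.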
flok wrote: Sven, thanks for pointing me at that; I had not noticed that, in fact.
This is indeed rather strange.
Found the problem: when iterating through the list of moves, I accidentally increased the index twice sometimes. So about half of all moves were not even looked at.
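In essence it was the classic double-increment pattern; a hypothetical reconstruction (not the literal code):

import java.util.List;

class MoveLoopBug {
    static void searchAll(List<String> moves) {
        for (int i = 0; i < moves.size(); i++) {
            String m = moves.get(i);
            System.out.println("searching " + m); // stand-in for make/search/unmake
            i++; // BUG: index advanced a second time, skipping every other move
        }
        // Fix: delete the stray i++ (or iterate with for (String m : moves)).
    }
}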