For fun and profit I did a test how well my chess-program behaves and improves if I let it use more threads.
Well, not so good.
In fact: it plays best with only 1 thread.
Now a while ago I noticed that it was not always using all cores but it is always using at least one. I figured: any other running thread should help the search of the "primary" search-thread because of their updates to the transposition table. So when adding more threads, the tt hit rate should increase.
To measure is to know:
What I've learned: with more threads, the average amount of cpu time used per thread decreases. But also: with more threads, the average hit ratio of the transposition table does not increase linearly, far from it.
Question: how does the tt behave on other multithreaded programs? What should I expect?
To me that seems to indicate that your parallel search algorithm is incorrect, causing most threads to be idle (or waiting for something) most of the time. It does not seem related to TT usage at all.
Sven Schüle wrote:To me that seems to indicate that your parallel search algorithm is incorrect, causing most threads to be idle (or waiting for something) most of the time. It does not seem related to TT usage at all.
Just to verify I removed all locking from the TT. Of course that can give a couple of incorrect results but just to see how it affects the threading. Result? Exact(+/-) same graph.
My threading search alg. is: when going through a list of moves, I check each time if there's a thread-slot free. If so, then that move is searched in a thread, if not: in the current thread.
I can't think of an other way to do this?
Sven Schüle wrote:To me that seems to indicate that your parallel search algorithm is incorrect, causing most threads to be idle (or waiting for something) most of the time. It does not seem related to TT usage at all.
Just to verify I removed all locking from the TT. Of course that can give a couple of incorrect results but just to see how it affects the threading. Result? Exact(+/-) same graph.
My threading search alg. is: when going through a list of moves, I check each time if there's a thread-slot free. If so, then that move is searched in a thread, if not: in the current thread.
I can't think of an other way to do this?
I don't know your current program (I only knew the previous Java program), so I can't tell what you could do better. But if the CPU usage ratio decreases to much less than 50% with 8 or more cores then more than half of your computing resources are "idle" on average.
Sven Schüle wrote:To me that seems to indicate that your parallel search algorithm is incorrect, causing most threads to be idle (or waiting for something) most of the time. It does not seem related to TT usage at all.
Just to verify I removed all locking from the TT. Of course that can give a couple of incorrect results but just to see how it affects the threading. Result? Exact(+/-) same graph.
My threading search alg. is: when going through a list of moves, I check each time if there's a thread-slot free. If so, then that move is searched in a thread, if not: in the current thread.
I can't think of an other way to do this?
I don't know your current program (I only knew the previous Java program), so I can't tell what you could do better. But if the CPU usage ratio decreases to much less than 50% with 8 or more cores then more than half of your computing resources are "idle" on average.
It is basically the same apart from the make/unmake code and new chess algos (null move etc). The threading is the same.
And idle: yes could be. I do the threading only in the regular (non-sq) search. Should I do it in qs as well? As that is usually only 1 or 2 ply I think?
Sven Schüle wrote:To me that seems to indicate that your parallel search algorithm is incorrect, causing most threads to be idle (or waiting for something) most of the time. It does not seem related to TT usage at all.
Just to verify I removed all locking from the TT. Of course that can give a couple of incorrect results but just to see how it affects the threading. Result? Exact(+/-) same graph.
My threading search alg. is: when going through a list of moves, I check each time if there's a thread-slot free. If so, then that move is searched in a thread, if not: in the current thread.
I can't think of an other way to do this?
I don't know your current program (I only knew the previous Java program), so I can't tell what you could do better. But if the CPU usage ratio decreases to much less than 50% with 8 or more cores then more than half of your computing resources are "idle" on average.
It is basically the same apart from the make/unmake code and new chess algos (null move etc). The threading is the same.
And idle: yes could be. I do the threading only in the regular (non-sq) search. Should I do it in qs as well? As that is usually only 1 or 2 ply I think?
There is some interesting theory about the topic of "random walk" that proves that eventually it will get to the final destination. But it takes a LONG time to do so.
This is NOT the way to develop a chess engine. "Just insert code and test??" Everything needs to have a valid reason behind it before adding code. If you can't keep 8 cores almost 100% busy, there is a basic flaw in your algorithm. But until you understand this "lazy SMP" type search, just writing and testing random code is not going to take you very far. In fact, it is more likely to take you in a backward direction.
flok wrote:What I've learned: with more threads, the average amount of cpu time used per thread decreases.
Maybe you're only using 1 core. Maybe your JVM cannot handle more than 1 core. Maybe your machine only has 1 core...
Or maybe you should simply not lock when accessing the TT (this has been suggested to you a long time ago, but you were sure that your locking did not hurt... well..).
flok wrote:What I've learned: with more threads, the average amount of cpu time used per thread decreases.
Maybe you're only using 1 core.
No systemwide there's > 100% cpu usage.
Maybe your JVM cannot handle more than 1 core. Maybe your machine only has 1 core...
It is a C++ application so no jvm. Also the system is 12 cores.
Or maybe you should simply not lock when accessing the TT (this has been suggested to you a long time ago, but you were sure that your locking did not hurt... well..).
Well I measured it and it does not help to remove locking.
Note that I use (number of cores) * some_value read/write locks. So unless every thread wants to do an update of the same position at the same time, I don't expect lock-contention. Just to be sure I also measured lock-contention and there is none (I used "mutrace" for that).