SMP search in Viper and idea about search in cluster system

Discussion of chess software programming and technical issues.

Moderator: Ras

nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: SMP search in Viper and idea about search in cluster sys

Post by nkg114mc »

Hi Daniel and everyone:

I began studying the Scorpio code this week. Meanwhile, a question came to mind, and it may be worth hearing some opinions from others as well.

If I understand correctly, the search architecture of Scorpio has two "levels": thread-level parallel search (a thread is called a "processor" in the Scorpio source) and process-level parallel search (a process is called a "host"). At a position where the search can split, it has at least two choices: share its work with threads located on the same host, share it with other hosts, or do both. When a split point allows both kinds of splits, is there any priority ordering between them? I see that Scorpio allows both, trying a "process split" first and then a "thread split".

I ask because, if we allow a "process split" and a "thread split" at the same split point, is there a risk of load imbalance between the "thread work" and the "process work"? Obviously the computational capability of a process is greater than that of a thread, since a process may itself contain several threads.

Another perspective on this question is: how should the relation between threads and processes be organized?

One possible, and intuitive, way is this: each thread is a child (or slave) of a process. The threads in one host are invisible to other hosts. The main thread of each host is responsible for sending and receiving messages to communicate with the other hosts, while the non-main threads can only share the work obtained by the main thread. Processes and threads thus form a two-level architecture, and two threads in different hosts cannot communicate directly.

Another way, which just came to mind this afternoon, is to use the process only as a container for threads and allow threads in different hosts to communicate with each other directly, bypassing the process. The process would only provide some methods to assist communication between threads in different hosts, but the implementation of these "helper methods" would be almost transparent to the threads (there may be technical difficulties I have not realized yet). At each split point the master thread would try to assign work to threads in the same host first, and then to threads in other hosts, so that the transposition table can be used to improve efficiency.

Which of these two thread/process organizations do you think would perform better?

Thanks in advance!
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SMP search in Viper and idea about search in cluster sys

Post by Daniel Shawul »

Hi Chao
The naming of the search units may be a little confusing, but 'processors' are threads and 'hosts' are remote nodes, as you said. It is beneficial to search with threads when possible, because the transposition table and other data structures are shared, which makes the search more efficient. The difference used to be really big, so a hybrid approach of MPI plus threads was worth the effort. But after I switched to polling for incoming messages from other nodes with a dedicated thread, the performance of the MPI-only implementation improved significantly. I still advise you to do MPI only first; you can add threads later. Look at the results at the bottom, where nps decreased from 6.4M to 6M when the 8 threads were split across two processes.
When a split point allows both kinds of splits, is there any priority ordering between them?
Well, when a node finds idle threads or hosts, it grabs them all, so it doesn't matter which comes first. There is a limit on the number of workers (separate limits for threads and for hosts) that can participate at a split point, but that is the only restriction. Prioritized selection makes sense only if you want to delay grabbing the helper hosts until later splits; I don't do that now.
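A minimal sketch of that idea, with hypothetical names and limits (this is not Scorpio's actual code): the splitting node simply attaches every idle worker it finds, subject to separate caps for local threads and remote hosts.

Code: Select all

/* hypothetical sketch: grab every idle worker for this split point,
   with separate caps for threads and for hosts */
enum { MAX_SPLIT_THREADS = 8, MAX_SPLIT_HOSTS = 4 };

struct SplitPoint;                                          /* the split block  */
int  grab_idle_thread(void);                                /* thread id, or -1 */
int  grab_idle_host(void);                                  /* host id, or -1   */
void attach_thread_to_split(struct SplitPoint* sp, int id);
void send_split_message(struct SplitPoint* sp, int host);

void attach_idle_workers(struct SplitPoint* sp) {
    int id, threads = 0, hosts = 0;
    /* local threads share the transposition table, so they are cheap helpers */
    while (threads < MAX_SPLIT_THREADS && (id = grab_idle_thread()) >= 0) {
        attach_thread_to_split(sp, id);
        threads++;
    }
    /* remote hosts are recruited with an MPI SPLIT message */
    while (hosts < MAX_SPLIT_HOSTS && (id = grab_idle_host()) >= 0) {
        send_split_message(sp, id);
        hosts++;
    }
}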
How should the relation between threads and processes be organized?
For a hybrid MPI-threads approach, processes represent nodes and threads represent cores.
Another way, which just came to mind this afternoon, is to use the process only as a container for threads and allow threads in different hosts to communicate with each other directly, bypassing the process.
I would advise you not to do inter-node communication from the threads. It is very messy to implement and also violates the intuitive model in which processes represent nodes. MPI provides some thread-safe calls, but you would need the latest version to implement what you need efficiently (MPI-3 IIRC). Look at the MPI_Init_thread(...) function, which requests the available thread support. The worst you can get is serialized support, which means your code will work but very slowly. Starting processes on all cores is a lot easier in this regard and is very intuitive. For efficiency, use a hybrid approach that keeps the essence of the previous method while disallowing communication between threads in different processes.
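A rough sketch of that hybrid startup (this is not Scorpio's actual code; MPI_THREAD_FUNNELED is just the support level that formally matches "only one designated thread calls MPI"):

Code: Select all

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int provided;
    /* one MPI process per node; only one designated thread will ever call MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        printf("warning: MPI only provides thread support level %d\n", provided);

    int rank, hosts;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this host's id  */
    MPI_Comm_size(MPI_COMM_WORLD, &hosts);  /* number of hosts */

    /* ... start one search thread per core here; the search threads never
       touch MPI, all inter-host messages go through the designated thread ... */

    MPI_Finalize();
    return 0;
}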

---------------------------------------------------------------------
8 threads:

Code: Select all

feature done=0
ht 4194304 X 16 = 64.0 MB
eht 1048576 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
+ Thread 1 started.
+ Thread 2 started.
+ Thread 3 started.
+ Thread 4 started.
+ Thread 5 started.
+ Thread 6 started.
+ Thread 7 started.
processors [8]
EgbbProbe not Loaded!
loading_time = 0s
st 120
go
23 35 7421 450965809  e2-e4 e7-e5 Ng1-f3 d7-d5 d2-d4 e5xd4 e4xd5 Qd8xd5 Qd1xd4 Qd5xd4 Nf3xd4 Ng8-f6 Nb1-c3 Bf8-b4 Bf1-c4 Ke8-g8 Bc1-e3 Rf8-e8 Ke1-g1 Bc8-d7 Ra1-d1 Bb4xc3 b2xc3 Nb8-c6 Rf1-e1 Bd7-g4
nodes = 772663161 <64 qnodes> time = 120003ms nps = 6438699
splits = 278986 badsplits = 22055 egbb_probes = 0
move e2e4
2 processes each starting 4 threads
+ 2 more threads for input output

Code: Select all

feature done=0
feature done=0
ht 4194304 X 16 = 64.0 MB
eht 1048576 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
+ Thread 1 started.
+ Thread 2 started.
+ Thread 3 started.
processors [4]
ht 4194304 X 16 = 64.0 MB
eht 1048576 X 8 = 8.0 MB
pht 32768 X 24 = 0.8 MB
EgbbProbe not Loaded!
loading_time = 0s
+ Thread 1 started.
+ Thread 2 started.
+ Thread 3 started.
processors [4]
EgbbProbe not Loaded!
loading_time = 0s
Process [1/2] on hnd20 : pid 14513
Process [0/2] on hnd20 : pid 14512
[hnd20:14511] 1 more process has sent help message help-mpi-btl-openib.txt / reg mem limit low
[hnd20:14511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
st 120
go
23 25 10500 639270836  e2-e4 e7-e5 Ng1-f3 Nb8-c6 Bf1-c4 Bf8-c5 d2-d3 Ng8-f6 Bc1-g5 Ke8-g8 Nb1-c3 h7-h6 Bg5xf6 Qd8xf6 Ke1-g1 Ra8-b8 Nc3-d5 Qf6-d6 a2-a4 a7-a6 a4-a5 Nc6-d4 c2-c3 Nd4-c6
nodes = 730317369 <63 qnodes> time = 120008ms nps = 6085572
splits = 155266 badsplits = 19319 egbb_probes = 0
move e2e4
quit
Process [0/2] terminated.
Process [1/2] terminated.
Bye Bye
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: SMP search in Viper and idea about search in cluster sys

Post by nkg114mc »

Hi, Daniel:

Thanks a lot for the valuable advice! I wrote some simple tests these days and also thought about my idea again. It is really not a good idea to allow threads in different hosts to communicate independently without going through the host, because it increases the communication load (message passing). Since MPI_Isend, MPI_Recv and MPI_Iprobe are not thread safe, the search has to take a mutex lock when checking messages. This significantly decreases performance, because every thread has to wait on the mutex before checking the messages from other hosts (instead of only the main thread checking them).
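To illustrate what I tried (a rough sketch, not code from any real engine): every searching thread has to serialize behind a single mutex just to probe for messages, and that lock quickly becomes the bottleneck.

Code: Select all

#include <mpi.h>
#include <pthread.h>

static pthread_mutex_t mpi_lock = PTHREAD_MUTEX_INITIALIZER;

/* returns 1 if a message is pending; every searching thread queues up on the lock */
int probe_messages_locked(MPI_Status* st) {
    int pending = 0;
    pthread_mutex_lock(&mpi_lock);
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &pending, st);
    pthread_mutex_unlock(&mpi_lock);
    return pending;
}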

I still have one question: can each non-main process only work on one split point at a time, which would mean that all of its threads always have to work together on a single split point?
Say a node machine in the cluster has 8 cores, so on this machine we launch one process (one host) containing 8 threads; call this process host2, and suppose it is not the main process. When the main process makes a split at some position, say at split point "sp1", some local threads (threads in the main process) and some other hosts might share the work at sp1. Suppose 4 local threads plus one non-main host, host2, join the shared work at sp1. That means all 8 threads in host2 have to join the work at sp1, so 12 threads in total would be working on sp1. However, it might be a waste of resources to let 12 threads work together on sp1; maybe 8 threads would already be enough. Since all 8 threads in the process have to be assigned to a piece of work (or not) at the same time, this does not look very flexible.

Is it possible to let each process be more flexible in thread management during splitting? For example, could one process like host2 handle more than one split point: 4 threads for sp1 and the remaining 4 threads for another split point? (All threads in the same process would still share their transposition table.)
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SMP search in Viper and idea about search in cluster sys

Post by Daniel Shawul »

Thanks a lot for the valuable advice! I wrote some simple tests these days and also thought about my idea again. It is really not a good idea to allow threads in different hosts to communicate independently without going through the host, because it increases the communication load (message passing). Since MPI_Isend, MPI_Recv and MPI_Iprobe are not thread safe, the search has to take a mutex lock when checking messages. This significantly decreases performance, because every thread has to wait on the mutex before checking the messages from other hosts (instead of only the main thread checking them).
Note that you don't have to do the mutex locking yourself. At first I did it like that and had a horrible time managing the threads, especially if you allow all threads to probe for messages. The better way is to let MPI handle it by requesting the required thread support.

Code: Select all

/* ask MPI for the required thread support; 'provided' reports the level actually available */
int namelen, provided, requested = MPI_THREAD_SINGLE;
MPI_Init_thread(&argc, &argv, requested, &provided);
Right now I have two schemes for message management. The first uses one message-polling thread per node, just like other engines use a thread for polling console input. This is the best approach, as it makes life a lot easier, but it needs a core for itself, so out of 8 processors only 7 remain available for search. On loosely coupled systems messages arrive very infrequently, so you can let the polling thread give up its time slice to reclaim some CPU time. The second approach is to let the main thread do the search as well as check for messages every X nodes, which is a bit messier. In both cases one thread takes care of messages, which is why I requested MPI_THREAD_SINGLE. If you want multiple threads doing message polling, you have to request a higher support level, but I strongly advise you not to do that.
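A minimal sketch of the first scheme (placeholder names, not Scorpio's actual code): the polling thread loops on MPI_Iprobe, dispatches whatever arrives, and yields its time slice when nothing is pending.

Code: Select all

#include <mpi.h>
#include <sched.h>

void handle_message(int source, int tag);   /* placeholder: receive and act on SPLIT/MERGE/... */
static volatile int searching = 1;          /* cleared when the engine shuts down */

void* message_poll_loop(void* arg) {
    (void)arg;
    while (searching) {
        int pending = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &pending, &st);
        if (pending)
            handle_message(st.MPI_SOURCE, st.MPI_TAG);
        else
            sched_yield();   /* give up the time slice on loosely coupled systems */
    }
    return 0;
}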
I still have one question: can each non-main process only work on one split point at a time, which would mean that all of its threads always have to work together on a single split point?
No, the threads work at different split points just like in a normal YBW implementation. When you send a message to a helper node, also send the address (pointer) of the split block; when the helper is finished with the work, it sends back its result together with that address. The message-polling thread then puts the information at the right split point. With this scheme the message-polling thread is quite busy on SMP systems, and it should take up 100% of a CPU if you want the same results as the threads implementation. The second approach, which polls every N nodes, is very bad even with N as low as 50 nodes.
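For illustration only (these layouts are hypothetical, not Scorpio's wire format): the split-block address travels with the split message and is echoed back in the merge message, so the polling thread knows exactly where to store the result.

Code: Select all

/* hypothetical message layouts; the pointer is only meaningful on the
   master host and is simply echoed back untouched by the helper */
struct SPLIT_MSG {
    void* split_block;     /* address of the master's split point               */
    int   alpha, beta, depth;
    int   pv_length;
    int   pv[128];         /* moves leading from the root to the split position */
};

struct MERGE_MSG {
    void* split_block;     /* copied verbatim from the SPLIT_MSG                */
    int   score;           /* result for the searched move(s)                   */
    long  nodes;           /* statistics for the searched subtree               */
};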

Daniel
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: SMP search in Viper and idea about search in cluster sys

Post by nkg114mc »

Hi Daniel,

Thanks a lot for the detailed reply!

I have checked the latest Scorpio code, and there is a macro called "THREAD_POLLING". I think it is the switch between the two different ways of message handling, isn't it?

By the way Daniel, can you explain a little more why "the second approach, which polls every N nodes, is very bad even with N as low as 50 nodes"?
Is it because thread 0 does almost nothing except poll for messages (and hardly searches at all), or because messages arriving at the host cannot be processed in time, leaving too many unprocessed messages in the MPI buffer and stalling the communication?

If I understand correctly, each host in Scorpio can only deal with one SPLIT_MESSAGE at a time (a position with a move waiting to be searched; I see Scorpio transforms the position into a pv line to compress the message size), is that right? But the subtree of that move can still be split among the host's threads to keep the host fully loaded, as in YBW.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SMP search in Viper and idea about search in cluster sys

Post by Daniel Shawul »

Hi Chao
I have checked the latest Scorpio code, and there is a macro called "THREAD_POLLING". I think it is the switch between the two different ways of message handling, isn't it?
That is better to read, isn't it? The previous version had locks around the MPI calls, which are gone now.
By the way Daniel, can you explain a little more why "the second approach, which polls every N nodes, is very bad even with N as low as 50 nodes"?
Is it because thread 0 does almost nothing except poll for messages (and hardly searches at all), or because messages arriving at the host cannot be processed in time, leaving too many unprocessed messages in the MPI buffer and stalling the communication?
The problem is that MPI is not interrupt driven, which means it doesn't inform you when a message arrives. There are no condition variables, so you can't tell the OS to wake you up when an event occurs. In fact, even MPI_Recv() vigorously polls for messages by default, burning lots of CPU cycles in the process, unlike cin or scanf. Similarly, with MPI_Iprobe one has to 'manually' check the queue as often as required. On clusters messages arrive very infrequently, so both methods of polling are OK to use. But on SMP machines, messages have to be processed as soon as possible to avoid idle time. I used the every-N-nodes polling at first, but my NPS was much lower than with the SMP threads implementation (even with N as low as 50). Polling with a thread solved that problem, and now I get comparable NPS values on SMP machines, but it doesn't really improve actual cluster performance, since there messages arrive infrequently anyway. Both approaches are basically the same, but from a design point of view a separate thread is better. On a single processor with hyperthreading, the I/O thread can take that extra 20% of CPU time, whereas the single-thread approach cannot take advantage of this situation. If MPI were interrupt driven (which some special editions are, btw), you could just handle messages in the interrupt handler without the above polling methods.
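For comparison, here is a rough sketch of that second approach (process_pending_messages() is a placeholder): the searching thread itself probes only once every N nodes, so a message can sit unprocessed for the time it takes to search N nodes, which already hurts on SMP machines at N = 50.

Code: Select all

#include <mpi.h>

#define POLL_NODES 50
void process_pending_messages(void);   /* placeholder: receive and dispatch waiting messages */

/* called once for every node searched by the main thread */
void poll_every_n_nodes(void) {
    static long nodes_since_poll = 0;
    if (++nodes_since_poll >= POLL_NODES) {
        nodes_since_poll = 0;
        int pending = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &pending, &st);
        if (pending)
            process_pending_messages();
    }
}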
If I understand correctly, each host in Scorpio can only deal with one SPLIT_MESSAGE at a time (a position with a move waiting to be searched; I see Scorpio transforms the position into a pv line to compress the message size), is that right? But the subtree of that move can still be split among the host's threads to keep the host fully loaded, as in YBW.
Yes, that is correct. If you are keen on letting each thread in a host work on a different split point, then it is much better to make the threads separate processes so that they receive messages for themselves. Otherwise the SPLIT_MESSAGE has to contain information for many split points, and so does the MERGE_MESSAGE, which is messy to say the least...

Daniel