I suspect that a "static split depth" is not going to work well at all, any more than it does in a normal SMP search. You should be able to split anywhere from the root (and splitting at the root is a good thing, once you have searched the first move to establish alpha) to some point limited by communication overhead. Probably this limit needs to be "soft" and dynamically adjusted as the search progresses. 
Yes, I plan to use a variable split depth for different nodes. The d=8 that I mentioned was when
every processor communicated by message passing. Later, when I tested by adding SMP search on
processors at the same node, I noticed I had to increase the split depth, because the search
power at a given node had doubled. Another issue is communication cost: one can use larger split
depths for compute nodes with larger latency. That is not an issue for me right now, but I could
see it becoming an obstacle.
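To make the "soft" limit concrete, here is a minimal Python sketch of how such a split depth might be computed. The function name, the constants and the log2 scaling are all assumptions made up for illustration, not what my engine does. The only things taken from the discussion above are that the minimum split depth should grow with the search power behind a host and with the message latency.

```python
import math

def soft_split_depth(base_depth, threads_per_host, latency_us, idle_fraction,
                     ref_latency_us=50.0, min_depth=4, max_depth=16):
    """Hypothetical heuristic for a dynamically adjusted minimum split depth.

    base_depth       -- split depth tuned for one thread per host (e.g. 8)
    threads_per_host -- SMP threads searching below each message-passing split
    latency_us       -- measured round-trip message latency in microseconds
    idle_fraction    -- share of time helpers spent idle recently (0.0 to 1.0)
    """
    depth = float(base_depth)
    # More search power per host finishes a subtree sooner, so split higher up.
    depth += math.log2(max(1, threads_per_host))
    # Slower links need bigger work units to amortise the message cost.
    depth += math.log2(max(1.0, latency_us / ref_latency_us))
    # If helpers are starving, relax the limit by one ply.
    if idle_fraction > 0.5:
        depth -= 1.0
    return int(min(max_depth, max(min_depth, round(depth))))
```

With these made-up constants, `soft_split_depth(8, 2, 50, 0.0)` gives 9, which at least matches the direction of what I observed when two SMP threads sat behind each host. The real tuning would have to come from measurements.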
I have two problems with synchronization. One is that processors have to wait for the first
move to be searched before splitting. The other sync point is when all moves have been searched
except the last one, which is being searched by a different host. What to do there?
Currently the owner waits until it is finished. This happens only if another host is helping
at the split. For the SMP search, threads will of course help out down the tree, sometimes completely
yielding ownership of that node. But for the distributed search I don't know how to do it.
I thought of some ways to solve the issue, but I expect that the code can get really messy.
I believe these synchronization issues and communication costs are what are killing the performance.
Splitting at the root the "YBW way" may help some but I do not expect it to improve my situation much.
If the second or third move were the best, the speculative root splitting would result in more effort
being spent on the non-relevant part of the tree. I understand that the chances of that happening are 
slim most of the time. I would definitely give it a try though.
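For reference, root splitting the YBW way can be sketched roughly as below. This is a toy under stated assumptions: the tree is nested Python lists with integer leaf scores, a thread pool stands in for the helper hosts, and the names `negamax` and `root_ybw` are made up for the example. It is not my engine's code.

```python
from concurrent.futures import ThreadPoolExecutor

INF = 10**9

def negamax(node, alpha, beta):
    """Plain alpha-beta on a toy tree: ints are leaf scores, lists are child nodes."""
    if isinstance(node, int):
        return node
    best = -INF
    for child in node:
        best = max(best, -negamax(child, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # cutoff
    return best

def root_ybw(root, workers=4):
    """Search the first root move alone, then hand the remaining moves to helpers."""
    # Step 1: the eldest brother establishes alpha.
    alpha = -negamax(root[0], -INF, INF)
    best_move = 0
    # Step 2: younger brothers are searched in parallel, all with that alpha.
    # (A real engine would also share improvements to alpha between helpers.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(negamax, child, -INF, -alpha) for child in root[1:]]
        for i, fut in enumerate(futures, start=1):
            score = -fut.result()
            if score > alpha:
                alpha, best_move = score, i
    return alpha, best_move
```

Note that every helper starts from the bound set by the first move. When a later move turns out to be best, the helpers on the other moves have worked with a weaker bound than a serial search would have had, which is the wasted effort mentioned above.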
Related to this is asynchronous iterative search on the root moves. What is your experience with this?
I plan to do APHID or something similar for a server-client application to be used on a loosely connected
network, such as a wifi network. Asynchronous search for distributed search really starts to make sense to
me. When you have a lot of processors at your disposal, it gets harder to keep them busy. With a split depth
of 12, I usually don't see splits occurring before iteration 15.
Another idea is that as you split, keep up with where you send the work (you can use the hash signature for this) and always try to send the same position to the same node as you deepen the search or encounter transpositions, since this will be more efficient because of hash contents.
Currently each node sends a "HELP" message to a randomly selected node, and the node that receives the message either cancels the request or accepts the sender as a helper. That is what Feldmann and co. suggested, but they use distributed transposition tables and I don't, so trying to keep track of who searched what could be a good idea for me.
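A minimal sketch of how the hash signature could steer work to the same node each time, written from the split owner's side. Everything here is an assumption for illustration: the function names, the simple modulo mapping, and the idea that the owner knows which ranks are busy. It also needs at least two ranks.

```python
import random

def preferred_helper(hash_key, my_rank, world_size):
    """Map a position's hash signature to a fixed helper rank other than my own."""
    others = [r for r in range(world_size) if r != my_rank]
    return others[hash_key % len(others)]

def pick_helper(hash_key, my_rank, world_size, busy, rng=random):
    """Prefer the rank that saw this position before; fall back to a random idle rank."""
    first = preferred_helper(hash_key, my_rank, world_size)
    if first not in busy:
        return first
    idle = [r for r in range(world_size) if r != my_rank and r not in busy]
    return rng.choice(idle) if idle else None
```

Since the mapping depends only on the signature, the same position reached at a deeper iteration or through a transposition goes to the same rank whenever that rank is free, so its local hash table is more likely to hold useful entries. The random fallback keeps the current behaviour when the preferred rank is busy.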
First thing to do is to get something working _reliably_. Efficiency comes later once the thing works reliably.
That is the top priority now. MPI really helps there by hiding the "gory" details of process communication (sockets, managing system buffers, etc.). I am just glad I don't have to do those from scratch.