If we're only searching the top two moves at the root node, we're not using 38x the horsepower; we would be using roughly 2x the horsepower.
Do the math. You are at depth D, searching each root move at the same time. It will take as long to search them all as it will to search just one. So you are spending 38x the horsepower to reach the same depth D that you would reach otherwise, when all of those processors could be working on one tree. And since 38 is a bit over 2^5 (38 compared to 32), we could go 5 plies _deeper_ while you are searching to the nominal depth D. (Since we are 5 doublings faster using all 32 processors on one tree, we go 5 plies deeper with an effective branching factor of 2.0.) Which sounds better?
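To make the "5 plies deeper" arithmetic concrete, here is a minimal sketch. It assumes an idealized model in which each extra ply multiplies search time by a constant effective branching factor and the speedup from extra processors is perfectly linear; real parallel search does not scale that cleanly. The function name `extra_plies` is just an illustration, not something from any engine.

```python
import math

def extra_plies(speedup: float, ebf: float = 2.0) -> float:
    """Plies gained from a `speedup`-fold faster search.

    Assumption: each additional ply costs `ebf` times more time,
    so the gain is log base `ebf` of the speedup.
    """
    return math.log(speedup) / math.log(ebf)

# 32 processors on one tree, effective branching factor 2.0:
# five doublings, so about 5 plies.
print(round(extra_plies(32), 2))

# 38x the horsepower is slightly more than five doublings.
print(round(extra_plies(38), 2))
```

With a larger effective branching factor the same speedup buys fewer plies, so the 5-ply figure depends on the 2.0 assumption.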
And really what I've been suggesting all along is that at some depth, we're spending too much time searching nodes that will never be played. We might be able to ignore many of these nodes by advancing the search up one ply along the PV. The reason I'd search the top two moves at the root node is so that new findings could still affect the best move in the current position. Simply advancing along a sole PV would be pointless because it would never change the decision made at the root node.
So the million-dollar question: when does spending 2x the horsepower near the root node pay off in depth along the PV? I'd feel pretty confident that if you played a fixed-depth game at, say, 30-ply depth, you could reasonably get that depth using the method I just devised. I think it would be time-prohibitive to allow an engine to search to 30 ply on each move. Granted, a true 30-ply search would be more accurate (stronger), but it would take much longer. And a 30-ply search using my method may be stronger than a 28-ply (or however deep it would get in the same time) search using traditional algorithms.