So I added a thread that samples, every 250ms, the number of idle thread slots (on a 6-core + HT system) and the amount of CPU time (user + sys) used in that slice.
In a nice graph:

The x-axis is the sample number, starting 250ms after the search started.
So yes, there are some moments where one or more threads are idle, but they never last longer than the 250ms sampling interval (well, maybe 499ms).
System overhead is also very low.
Conclusion: it must be a locking issue.
Next step: run it through mutrace and see which locks are holding things back.
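
For anyone who wants to try the same measurement: the sampling thread described at the top boils down to roughly the following. This is only a minimal sketch, written in Python for brevity, and not the actual code. The names `IdleSampler` and `get_idle_count` are made up for illustration; `get_idle_count` stands for whatever callback reports how many thread slots are idle at that moment.

```python
import os
import threading


class IdleSampler:
    """Records (idle_slots, cpu_seconds_used_in_slice) once per interval."""

    def __init__(self, get_idle_count, interval=0.25):
        # get_idle_count is a hypothetical callback supplied by the program
        # being measured; it returns the current number of idle thread slots.
        self.get_idle_count = get_idle_count
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    @staticmethod
    def _cpu_time():
        # Process-wide CPU time: user + sys, in seconds.
        t = os.times()
        return t.user + t.system

    def _run(self):
        last = self._cpu_time()
        # Event.wait() returns False on timeout and True once stop() is
        # called, so the first sample is taken one interval after start().
        while not self._stop.wait(self.interval):
            now = self._cpu_time()
            self.samples.append((self.get_idle_count(), now - last))
            last = now

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        return self.samples
```

Call `start()` just before the search begins and `stop()` when it finishes; the returned list has one `(idle_slots, cpu_seconds)` pair per slice, which is what gets plotted against the sample number.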