Edsel Apostol wrote:
Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than NPS alone, but it is only really accurate if the algorithm traverses the same search tree as the serial one. Modern chess engines' SMP search does not traverse the same tree as the serial search; some engines prune more, and some prune less, when running SMP, for example.

Performance/Elo based

abulmo wrote:
Elo performance is related to speed increase, so that doubling the speed is worth about 50 Elo. When dealing with a speed improvement, which is the only thing SMP provides, measuring Elo is not a necessary tool, IMHO.

What I like here is to use the following formula:
t(n) = W/n + C
where n is the number of threads, t(n) is the time spent, W is the parallelizable part of the work, and C is the serial (non-parallelizable) part.
Once you have measured t(n) for several values of n, you can estimate W and C and get an idea of the scalability of your program. Indeed:
W + C = t(1)
C = t(infinity)
(W + C) / C = theoretical upper limit of the achievable speedup.
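Since t(n) is linear in 1/n, W and C can be estimated from a handful of timings with an ordinary least-squares fit. Below is a minimal C sketch of that fit; the thread counts and times-to-depth in it are made-up placeholders, not measurements from anyone's engine.

/* Least-squares fit of t(n) = W/n + C from measured times to a fixed depth.
 * Replace the arrays with your own timings for the same position and depth. */
#include <stdio.h>

int main(void) {
    const double n[] = {1, 2, 4, 8, 16};                 /* threads (hypothetical) */
    const double t[] = {100.0, 52.0, 28.0, 16.0, 10.0};  /* seconds to depth (hypothetical) */
    const int k = sizeof(n) / sizeof(n[0]);

    /* ordinary least squares on x = 1/n:  t = W*x + C */
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < k; i++) {
        double x = 1.0 / n[i];
        sx += x;  sy += t[i];  sxx += x * x;  sxy += x * t[i];
    }
    double W = (k * sxy - sx * sy) / (k * sxx - sx * sx);
    double C = (sy - W * sx) / k;

    printf("W = %.2f, C = %.2f\n", W, C);
    printf("t(1) = W + C = %.2f\n", W + C);
    printf("t(infinity) = C = %.2f\n", C);
    printf("max speedup = (W + C) / C = %.2f\n", (W + C) / C);
    return 0;
}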
michiguel wrote:
Unfortunately, it is not that simple, because Amdahl's law (which is what you imply) does not apply to chess engines. It assumes the same work is done by the parallel and the single-threaded programs. We do not have that. The parallel program explores different branches, many times increasing the work and adding extra inefficiency, and other times finding "short cuts". That is the reason why the data does not fit Amdahl's formula well.

Miguel

bob wrote:
For the record, Amdahl's law applies to ALL parallel programs.

michiguel wrote:
No, it doesn't! It applies to the ones in which the size of the problem remains the same.

bob wrote:
For EVERY problem, this size can remain constant.

michiguel wrote:
It does not for chess engines, and you know it.

bob wrote:
You are mangling the word "size". In chess, in a parallel search, we search a larger tree due to overhead. THAT is why we don't get perfectly linear speedup.

michiguel wrote:
That is why it deviates from Amdahl.
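For what it's worth, the two formulas in this exchange are the same model written two ways; the disagreement is about whether a parallel chess search keeps the work W fixed, not about the algebra. Writing p = W/(W + C) for the parallel fraction:

S(n) = t(1)/t(n) = (W + C)/(W/n + C) = 1/((1 - p) + p/n)
S(infinity) = 1/(1 - p) = (W + C)/C

which is Amdahl's law, and is also exactly of the form y = a*x/(b+x), with a = 1/(1 - p) and b = p/(1 - p), that comes up again below.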
bob wrote:
Who cares? Those extra nodes, for almost every program on the planet, do not give any additional information; they are simply searched due to alpha/beta considerations and imperfect move ordering / splitting. Can we return to the regularly scheduled discussion now?

michiguel wrote:
Not actually true: they fill the hash table with information that is sometimes useful and occasionally even produces bigger speedups. And no, we were discussing this in this sub-branch.
bob wrote:
Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input. Chess certainly does that for fixed depth, measuring time to depth. Since Elo has ALWAYS been directly correlated with speed, one could probably measure Elo, extrapolate to speedup, and Amdahl's law would still apply just fine...

michiguel wrote:
No, the work done needs to be the same. You can measure the speed and plot it against the number of processors. Does it fit an equation of the form y = a * x / (b + x)? With the data we have seen over the years, it doesn't. Does it fit today?

Miguel
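As an aside, the "measure Elo, extrapolate to speedup" direction can also be run the other way, using the roughly-50-Elo-per-doubling rule of thumb quoted earlier in the thread. A tiny C sketch; both the constant and the sample speedups are assumptions, not data:

/* Rough Elo estimate from a time-to-depth speedup, using the ~50 Elo per
 * doubling rule of thumb mentioned earlier in the thread. */
#include <math.h>
#include <stdio.h>

static double elo_from_speedup(double speedup, double elo_per_doubling) {
    return elo_per_doubling * log2(speedup);
}

int main(void) {
    const double elo_per_doubling = 50.0;        /* rule of thumb, not a constant */
    const double speedup[] = {1.8, 3.1, 5.4};    /* hypothetical 2-, 4-, 8-thread speedups */
    for (int i = 0; i < 3; i++)
        printf("speedup %.1f -> roughly %.0f Elo\n",
               speedup[i], elo_from_speedup(speedup[i], elo_per_doubling));
    return 0;
}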
bob wrote:
In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...

michiguel wrote:
Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

Miguel
bob wrote:
The primary issue is: which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth metric that has been in use forever, or the Elo gain from adding CPUs, which takes a huge amount of computational resources to measure? For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except one, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently because I have access to large clusters, but even for me it is unpleasant. On the 70-node cluster with 8 cores per node, I can either play 560 games at a time, or 70 games using 8 threads each. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, which slows testing further. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
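To put a number on "a BUNCH of games": here is a small C sketch of the approximate 95% Elo error margin as a function of the number of games, using the usual logistic Elo model. The score and draw rate are assumed values chosen only for illustration:

/* Approximate 95% error margin (in Elo) of a match result after N games.
 * Score and draw rate below are assumptions, not measurements. */
#include <math.h>
#include <stdio.h>

static double elo(double score) {                 /* logistic Elo model */
    return -400.0 * log10(1.0 / score - 1.0);
}

int main(void) {
    const double score = 0.55;                    /* assumed score of the SMP version */
    const double draws = 0.40;                    /* assumed draw rate */
    double win = score - draws / 2.0;
    double var = win + 0.25 * draws - score * score;  /* per-game variance of the result */

    const int games[] = {100, 1000, 10000, 30000};
    for (int i = 0; i < 4; i++) {
        double se = sqrt(var / games[i]);          /* standard error of the score */
        double margin = (elo(score + 1.96 * se) - elo(score - 1.96 * se)) / 2.0;
        printf("%6d games: %.1f Elo +/- about %.1f (95%%)\n",
               games[i], elo(score), margin);
    }
    return 0;
}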
