michiguel wrote: To optimize my engine I was checking the speed with a benchmark of 90 positions (1 second each, see PS). Since I have been improving the speed in small steps, I noticed certain inconsistencies that sent me in the wrong direction a couple of times, and I realized that the noise was much higher than I expected. What I am doing now to finally test two versions is to run a script that repeats
loop 1 to 50 {
benchmark for version A
benchmark for version B
}
and later analyze the data with another script that compares the two sets of 50 numbers.
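As a minimal sketch of such an interleaved benchmark script, in Python. The binary names, the "bench" argument, and the "... nps" output format here are assumptions, not anything from the actual engines; the command and the parsing would have to be adapted.
# Sketch only: interleave one run of A and one run of B per iteration.
import re
import subprocess

RUNS = 50
ENGINES = {"A": "./engine_A", "B": "./engine_B"}   # placeholder binaries

def run_bench(binary):
    # Run one benchmark and return the reported nodes per second.
    out = subprocess.run([binary, "bench"], capture_output=True, text=True).stdout
    return int(re.search(r"(\d+)\s*nps", out).group(1))   # assumed output format

results = {name: [] for name in ENGINES}
for _ in range(RUNS):
    # Alternating A and B inside the same iteration means slow drifts in
    # machine load affect both versions about equally.
    for name, binary in ENGINES.items():
        results[name].append(run_bench(binary))

print(results["A"])
print(results["B"])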
How are you analyzing the performance data? The calculated statistic can affect the "noise" greatly.
In almost all cases it's best to take the fastest result of the N runs, discarding the other values. I'm more used to measuring the time for a fixed quantity of calculation, and taking the minimum time from N runs. Such a result is often adequately reproducible ("noise-free") with N as small as 3 or 5. Your test is arranged differently, so that you would take the maximum number of nodes from the N runs, but the principle should be the same.
Calculating the mean result of N is much less satisfactory because its variance decreases only slowly with increasing N. The distribution of times for, say, repeated fixed-depth searches has a definite minimum but is long-tailed to the right.
Robert P.
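To put that "best of N" statistic in code form (just a sketch, not Robert's code; the workload callable is a placeholder for whatever fixed quantity of calculation is being timed):
import time

def best_of(workload, n=5):
    # Time the same fixed workload n times and keep only the fastest run;
    # the long right tail of the timing distribution is simply discarded.
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        workload()               # e.g. a fixed-depth search over the test set
        times.append(time.perf_counter() - t0)
    return min(times)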
I evaluate the average, the standard deviation, the run-by-run ratio between engines A and B (run 1 of A vs. run 1 of B, run 2 vs. run 2, etc.), and the standard deviation of that ratio. The ratio is more accurate when there are drifts and spikes. Generally, the standard deviation tells me whether something went wrong or not. If I see 500 +/- 2 knps, it seems everything went fine. If I see 500 +/- 15 knps, I have to watch it, because something funny happened.
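A minimal sketch of that kind of comparison, assuming the interleaved loop above produced two equally long lists of nps figures (the function and argument names are made up for illustration):
from statistics import mean, stdev

def summarize(nps_a, nps_b):
    # Run-by-run ratio: run 1 of A vs. run 1 of B, run 2 vs. run 2, ...
    ratios = [a / b for a, b in zip(nps_a, nps_b)]
    print("A:   %.0f +/- %.0f nps" % (mean(nps_a), stdev(nps_a)))
    print("B:   %.0f +/- %.0f nps" % (mean(nps_b), stdev(nps_b)))
    # The paired ratio largely cancels drifts and spikes that hit both
    # versions in the same iteration.
    print("A/B: %.4f +/- %.4f" % (mean(ratios), stdev(ratios)))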
phhnguyen wrote: I have seen similar things in my tests, but mainly lower results or unstable data.
My guess is that when your engine has just started and requests a huge amount of resources, the system has to do some heavy work such as disk access (I usually hear disk sounds for a while after starting an engine) to write out caches, page out the memory of other software, reorganize memory... so the clock tick may not be kept up to date correctly (some tick events may be delayed; overall it evens out, but not in the first few seconds), making your measurements incorrect for that period.
Perhaps you could try:
// Warm up only, do not take data
loop 1 to 50 {
benchmark for version A
benchmark for version B
}
// OK now, take data
loop 1 to 50 {
benchmark for version A
benchmark for version B
}
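A sketch of that warm-up scheme, assuming bench_a and bench_b are callables that each launch one benchmark and return its nps figure (the names and the warm-up count are placeholders):
def measure(bench_a, bench_b, warmup=5, runs=50):
    for _ in range(warmup):          # warm up only, do not take data
        bench_a()
        bench_b()
    data_a, data_b = [], []
    for _ in range(runs):            # OK now, take data
        data_a.append(bench_a())
        data_b.append(bench_b())
    return data_a, data_b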
I have seen a spike long after the first runs. One case looked like this:
Engine A    Engine B
 449,000     450,000
 452,000     451,000
     ...         ...    (many runs around 450,000)
 500,000     502,000
 501,000     500,000
 500,000     503,000
 451,000     449,000
     ...         ...    (all the rest around 450,000)
I have also seen the numbers drift slowly from 430 to 425, etc.
hgm wrote: When optimizing my qperft application, I encountered a case where removing some dead code produced a slowdown of 10%. Strange things can happen. Even when the assembly code is identical, there can still be differences due to code alignment. (Some assembler statements produce a variable number of nop instructions depending on alignment, so identical assembly code does not necessarily mean identical machine code. And identical machine code that is relocated to another address might run at a different speed because of alignment issues, and of course caching issues.)
That is why I was trying to avoid this problem with PGO compiles. But apparently there is a risk that one PGO compile may differ from the next (I guess this may happen if the bench test used for profiling is not long enough).
Miguel
PS: I remember that a long time ago I got a 5% slowdown just by renaming a file or a function (I do not remember the details...).