
Optimization and benchmarking

Posted: Sat Aug 21, 2010 8:25 pm
by michiguel
To optimize my engine I was checking the speed with a benchmark of 90 positions (1 second each, see PS). Since I have been improving the speed in small steps, I observed certain inconsistencies that sent me in the wrong direction a couple of times. I realized that the noise was much higher than I expected. What I am doing now to finally test two versions is to run a script that repeats

loop 1 to 50 {
benchmark for version A
benchmark for version B
}

and later analyze the data with another script comparing two sets of 50 numbers.
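
For illustration, a minimal Python sketch of such a driver script; the engine binaries, the "bench" command, and the nps output line it parses are assumptions, not the actual setup:

# Minimal sketch of an interleaved A/B benchmark driver (hypothetical paths
# and output format; adjust the command and the regex to your engine).
import re
import subprocess

ENGINES = {"A": "./engine_A", "B": "./engine_B"}   # hypothetical binaries
NPS_RE = re.compile(r"nps\s*[:=]\s*(\d+)")         # assumed output line, e.g. "nps: 456400"
RUNS = 50

results = {"A": [], "B": []}
for run in range(RUNS):
    for name, path in ENGINES.items():             # interleave A and B on every iteration
        out = subprocess.run([path, "bench"], capture_output=True,
                             text=True, check=True).stdout
        results[name].append(int(NPS_RE.search(out).group(1)))

for name, nps in results.items():
    print(name, sorted(nps))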

1) I just observed that in one case, all of a sudden, the nps increased ~10% for about 3 benchmarks and later came back down to the original value. Are these types of sudden spikes common? Any hardware expert?

[EDIT: This was in a dual, Linux, using one core, with nothing else in the background]

2) How the heck do I explain that version A is _consistently_ 2% faster than version B when... they are identical at the source level? Version B only has a few comments added. I could easily verify that with "git diff".
Is it possible that the PGO optimization became different in one case compared to the other? I am about to recompile both versions and re-test.

You may wonder why I care. I have been optimizing in small steps and was able to speed up the engine 50% after a long process. I just had a regression of ~10% that is driving me nuts, when (apparently) it does not make a lot of sense.

Miguel
PS: I preferred 90 positions at 1 second each over fewer positions for a longer time. I am not sure this is best, but I thought it would be reliable enough for repeated quick tests.
PS2: All this would be near impossible w/o GIT and of course a profiler.

Re: Optimization and benchmarking

Posted: Sat Aug 21, 2010 8:41 pm
by Mincho Georgiev
Did you compare the node counts too, Miguel?
This is mandatory for me when I optimize. Otherwise I could miss some 'just-introduced' bug.
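
A minimal sketch of such a node-count check, assuming both versions accept a fixed-depth "bench <depth>" command and print a "nodes: N" line (both assumptions, adapt to your engine):

# Sketch: verify two versions search identical trees before comparing speed.
import re
import subprocess

def total_nodes(binary, depth=10):
    out = subprocess.run([binary, "bench", str(depth)], capture_output=True,
                         text=True, check=True).stdout
    return int(re.search(r"nodes\s*[:=]\s*(\d+)", out).group(1))

nodes_a = total_nodes("./engine_A")
nodes_b = total_nodes("./engine_B")
assert nodes_a == nodes_b, f"node counts differ: {nodes_a} vs {nodes_b}"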

Re: Optimization and benchmarking

Posted: Sat Aug 21, 2010 8:51 pm
by michiguel
Mincho Georgiev wrote:Did you compare the node counts too, Miguel?
This is mandatory for me when I optimize. Otherwise I could miss some 'just-introduced' bug.
Not in this benchmark, because it is limited by time. If I limit it by depth, some positions take much longer and some finish too quickly. I check the node count separately, with fewer positions, also running a debug version (filled with assert()'s).

Miguel

Re: Optimization and benchmarking

Posted: Sat Aug 21, 2010 9:02 pm
by michiguel
michiguel wrote:To optimize my engine I was checking the speed with a benchmark of 90 positions (1 second each, see PS). Since I have been improving the speed in small steps, I observed certain inconsistencies that sent me in the wrong direction a couple of times. I realized that the noise was much higher than I expected. What I am doing now to finally test two versions is to run a script that repeats

loop 1 to 50 {
benchmark for version A
benchmark for version B
}

and later analyze the data with another script comparing two sets of 50 numbers.

1) I just observed that in one case, all of a sudden, the nps increased ~10% for about 3 benchmarks and later came back down to the original value. Are these types of sudden spikes common? Any hardware expert?

[EDIT: This was in a dual, Linux, using one core, with nothing else in the background]

2) How the heck do I explain that version A is _consistently_ 2% faster than version B when... they are identical at the source level? Version B only has a few comments added. I could easily verify that with "git diff".
Is it possible that the PGO optimization became different in one case compared to the other? I am about to recompile both versions and re-test.

You may wonder why I care. I have been optimizing in small steps and was able to speed up the engine 50% after a long process. I just had a regression of ~10% that is driving me nuts, when (apparently) it does not make a lot of sense.

Miguel
PS: I preferred 90 positions at 1 second each over fewer positions for a longer time. I am not sure this is best, but I thought it would be reliable enough for repeated quick tests.
PS2: All this would be near impossible w/o GIT and of course a profiler.
2) After recompiling version B, they now look like they have the same speed.
A = 456.4 +/- 2.4 knps
B = 457.0 +/- 3.8 knps

This is after 8 runs. I will let it run longer, but it looks like the PGO compile could have significant variation!
I am guilty, because I thought I could get away with a quick PGO compile for these tests, letting the profile runs go for only 3 minutes. Maybe that is enough most of the time... but once in a while it could vary by 2%.
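
For reference, a small sketch of how a mean +/- error summary could be computed from the collected knps samples; the standard error of the mean is assumed here, since the post does not say which error estimate was used:

# Sketch: summarize knps samples as mean +/- standard error of the mean (assumed statistic).
from statistics import mean, stdev

def summarize(knps_samples):
    """Return (mean, standard error of the mean) for a list of knps values."""
    m = mean(knps_samples)
    sem = stdev(knps_samples) / len(knps_samples) ** 0.5
    return m, sem

# e.g. summarize(results["A"]) with the samples collected by a driver script like the one above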

Miguel

Re: Optimization and benchmarking

Posted: Sun Aug 22, 2010 12:54 am
by jwes
Just be glad you are not using Windows.

Re: Optimization and benchmarking

Posted: Sun Aug 22, 2010 3:16 am
by bob
michiguel wrote:To optimize my engine I was checking the speed with a benchmark of 90 positions (1 second each, see PS). Since I have been improving the speed in small steps, I observed certain inconsistencies that sent me in the wrong direction a couple of times. I realized that the noise was much higher than I expected. What I am doing now to finally test two versions is to run a script that repeats

loop 1 to 50 {
benchmark for version A
benchmark for version B
}

and later analyze the data with another script comparing two sets of 50 numbers.

1) I just observed that in one case, all of a sudden, the nps increased ~10% for about 3 benchmarks and later came back down to the original value. Are these types of sudden spikes common? Any hardware expert?

[EDIT: This was in a dual, Linux, using one core, with nothing else in the background]

2) How the heck do I explain that version A is _consistently_ 2% faster than version B when... they are identical at the source level? Version B only has a few comments added. I could easily verify that with "git diff".
Is it possible that the PGO optimization became different in one case compared to the other? I am about to recompile both versions and re-test.

You may wonder why I care. I have been optimizing in small steps and was able to speed up the engine 50% after a long process. I just had a regression of ~10% that is driving me nuts, when (apparently) it does not make a lot of sense.

Miguel
PS: I preferred 90 positions at 1 second each over fewer positions for a longer time. I am not sure this is best, but I thought it would be reliable enough for repeated quick tests.
PS2: All this would be near impossible w/o GIT and of course a profiler.
There are lots of issues that affect speed, but by far the two most common ones are:

(1) Something else running, even if you don't realize it. For example, someone doing excessive network broadcasts on another host will cause your machine to spend time handling those packets.

(2) The memory layout of your program. A program is loaded into various physical memory pages, and those pages map directly to a specific group of cache blocks. Ideally you would like your program laid out so that there are no conflict issues causing excessive cache line fills (cache misses). Each time you start the program, it will tend to get a different set of physical pages. Sometimes you will get a layout that spreads memory accesses evenly across the cache, and other times you will get one that avoids certain parts of the cache, which simply reduces the effective cache size for that particular test. This is a real problem. There is a solution, commonly called "page coloring", but not all operating system versions use it because it increases the cost of allocating real pages of memory.
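
To make the conflict issue concrete, a small sketch that maps one 4 KB page to the cache sets it touches; the cache parameters (64-byte lines, 512 sets, i.e. a 256 KB 8-way cache) are plausible example values, not any particular CPU:

# Sketch: which cache sets a 4 KB page lands in depends on its physical address.
LINE = 64      # cache line size in bytes (example value)
SETS = 512     # number of cache sets (example value)
PAGE = 4096    # page size in bytes

def sets_covered(page_addr):
    """Cache sets touched by one page starting at physical address page_addr."""
    return {((page_addr + off) // LINE) % SETS for off in range(0, PAGE, LINE)}

# Two different physical pages handed out by the OS on two different runs
# can map to the same group of sets (conflicting) or to different groups:
print(sorted(sets_covered(0x00000000)))   # sets 0..63
print(sorted(sets_covered(0x00020000)))   # also sets 0..63 -> conflicts with the first page
print(sorted(sets_covered(0x00011000)))   # sets 64..127 -> no conflict

Which physical pages the OS happens to hand out on a given run determines how evenly the working set spreads across the sets, which is why the same binary can run at measurably different speeds from run to run.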

Re: Optimization and benchmarking

Posted: Sun Aug 22, 2010 7:07 am
by hgm
When optimizing my qperft application, I encountered a case where removing some dead code produced a slowdown of 10%. Strange things can happen. Even when the assembly code is identical, there can still be differences due to alignment of the code. Some assembler statements produce a variable number of nop instructions, depending on alignment, so identical assembly code does not necessarily mean identical machine code. And identical machine code that is relocated to another address might run at a different speed due to alignment issues (and of course caching issues).

Re: Optimization and benchmarking

Posted: Mon Aug 23, 2010 11:57 pm
by wgarvin
hgm wrote:When optimizing my qperft application, I encountered a case where removing some dead code produced a slowdown of 10%. Strange things can happen. Even when the assembly code is identical, there can still be differences due to alignment of the code. Some assembler statements produce a variable number of nop instructions, depending on alignment, so identical assembly code does not necessarily mean identical machine code. And identical machine code that is relocated to another address might run at a different speed due to alignment issues (and of course caching issues).
Example: branch prediction. Moving some code around might add or remove a case of collision in the BTB, increasing (or decreasing) branch mispredicts.

Re: Optimization and benchmarking

Posted: Tue Aug 24, 2010 2:54 am
by phhnguyen
I have seen similar things in my tests, but mainly lower results or unstable data.

I guess that when your engine has just started and requests a large amount of resources, the system has to do some heavy work, such as disk access (I usually hear the disk for a while after starting an engine) to write back caches, page out the memory of other software, reorganize memory, and so on. The clock ticks may therefore not be delivered on time (some tick events may be delayed; the overall count will be OK, but not for the first few seconds), making your measurements incorrect for that period.

Perhaps you may try:

// Warm up only, do not take data
loop 1 to 50 {
benchmark for version A
benchmark for version B
}

// OK now, take data
loop 1 to 50 {
benchmark for version A
benchmark for version B
}

Re: Optimization and benchmarking

Posted: Tue Aug 24, 2010 4:43 am
by micron
michiguel wrote:To optimize my engine I was checking the speed with a benchmark of 90 positions (1 second each, see PS). Since I have been improving the speed in small steps, I observed certain inconsistencies that sent me in the wrong direction a couple of times. I realized that the noise was much higher than I expected. What I am doing now to finally test two versions is to run a script that repeats

loop 1 to 50 {
benchmark for version A
benchmark for version B
}

and later analyze the data with another script comparing two sets of 50 numbers.
How are you analyzing the performance data? The calculated statistic can affect the "noise" greatly.
In almost all cases it's best to take the fastest result of the N runs, discarding the other values. I'm more used to measuring the time for a fixed quantity of calculation, and taking the minimum time from N runs. Such a result is often adequately reproducible ("noise-free") with N as small as 3 or 5. Your test is arranged differently, so that you would take the maximum number of nodes from the N runs, but the principle should be the same.

Calculating the mean result of N is much less satisfactory because its variance decreases only slowly with increasing N. The distribution of times for, say, repeated fixed-depth searches has a definite minimum but is long-tailed to the right.
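
A small sketch contrasting the two statistics on simulated long-tailed timings (the distribution is purely illustrative, not measured data):

# Sketch: minimum-of-N stabilizes much faster than mean-of-N when timings are
# long-tailed to the right.
import random

random.seed(1)

def one_timing():
    # true cost 1.00 s plus occasional interference pushing the time up (long right tail)
    return 1.00 + random.expovariate(1 / 0.05)

def spread(xs):
    return max(xs) - min(xs)

for n in (3, 5, 50):
    trials = [[one_timing() for _ in range(n)] for _ in range(200)]
    mins = [min(t) for t in trials]
    means = [sum(t) / n for t in trials]
    print(f"N={n:2d}  spread of min-of-N: {spread(mins):.3f} s"
          f"   spread of mean-of-N: {spread(means):.3f} s")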

Robert P.