Reliable speed comparison: some math required

mcostalba · Post by **mcostalba** » Tue Feb 27, 2018 9:53 pm

What about this one?

http://www.brendangregg.com/perf.html#CPUstatistics

This could be the final weapon! In particular look at the table in the above link:

Code: Select all

5,649,595,479 cycles

See also https://perf.wiki.kernel.org/index.php/Tutorial

syzygy · Post by **syzygy** » Tue Feb 27, 2018 9:59 pm

Without background noise and running just a single bench at a time:

Code: Select all

$ taskset -c 4 cfish bench 2>&1 >/dev/null | grep second

Results:

Code: Select all

Nodes/second    &#58; 2775955
Nodes/second    &#58; 2778906
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2778906
Nodes/second    &#58; 2780385
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2778906
Nodes/second    &#58; 2780385
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2775955
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2777430

So max speed is 2780385.

lucasart · Post by **lucasart** » Wed Feb 28, 2018 2:56 am

syzygy wrote:

lucasart wrote:
syzygy wrote:Testing in parallel is only more noisy
Did you verify this hypothesis of yours with empirical data ? I suggest you try…

I took the liberty to create some noise:

Code: Select all

run       base       test     diff
  1    1929540    2652016  +722476
  2    1929540    2657409  +727869
  3    2645305    1925985  -719320
  4    2670988    2639961   -31027
  5    2665540    2669624    +4084
  6    2625376    2666900   +41524
  7    2604446    2604446       +0
  8    2673720    2664181    -9539
  9    2637297    2006573  -630724
 10    1930965    2662824  +731859
 11    2670988    2670988       +0
 12    2672353    2677829    +5476
 13    2668261    2672353    +4092
 14    2660113    2666900    +6787
 15    2670988    2639961   -31027
 16    2444866    2460981   +16115
 17    2574937    2656058   +81121
 18    1926695    2653362  +726667
 19    1921736    2669624  +747888
 20    1926695    2656058  +729363

Result of  20 runs
==================
base &#40;cfish          ) =    2422517  +/- 147234
test &#40;cfish          ) =    2578702  +/- 94082
diff                   =    +156184  +/- 191677

speedup        = +0.0645
P&#40;speedup > 0&#41; =  0.9446

CPU&#58; 6 x Intel&#40;R&#41; Core&#40;TM&#41; i7-3930K CPU @ 3.20GHz
Hyperthreading&#58; on

And ?

I don't see that you have proven anything here.

You need to compare 2 methods. Running one sample of one method achieves nothing. You should do (severalruns of) each method (sequential vs. concurrent) and compare.

And the only thing you should compare is the standard deviation of the "diff" vector. We don't care about the master and test: of course those are noisier in concurrent vs. sequential, that's trivial and missing the point. Point is more noise but more *correlated* such that the stdev of diff is lower in concurrent testing.

And you should definitely not take any liberty with the testing protocol. Should be all else equal between both (as much as possible). No external noise, other than the bare minimum possible on your machine/OS.

lucasart · Post by **lucasart** » Wed Feb 28, 2018 2:59 am

syzygy wrote:Without background noise and running just a single bench at a time:

Code: Select all

$ taskset -c 4 cfish bench 2>&1 >/dev/null | grep second

Results:

Code: Select all

Nodes/second    &#58; 2775955
Nodes/second    &#58; 2778906
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2778906
Nodes/second    &#58; 2780385
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2778906
Nodes/second    &#58; 2780385
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2775955
Nodes/second    &#58; 2777430
Nodes/second    &#58; 2777430

So max speed is 2780385.

Same comment as my previous post. This data proves nothing.

bob · Post by **bob** » Wed Feb 28, 2018 3:26 am

lucasart wrote:
syzygy wrote:Testing in parallel is only more noisy
Did you verify this hypothesis of yours with empirical data ? I suggest you try…

It has been verified MANY times in the past. Even worse, testing in parallel can result in some CPUS throttling back due to heat or power consumption, which adds MORE noise.

The concept of pinning a process to a specific CPU is always a good place to start. If a process bounces to a different CPU, that CPU has to "throttle up". The more bounces, the slower it runs.

My advice: (1) disable hyper-threading if possible; (2) disable turbo-boost if possible; (3) run in single-user mode (Linux) to reduce system noise; (4) You can even unplug/disable network which can cause overhead.

Then run a quick test to "warm up" the CPU and cache (get important stuff loaded), then run real test. You can get pretty consistent timing. If you run tests that run for a few minutes, you will get repeatability to a second nearly every time, within a second every time. Measuring "short things" is a basic no-no for benchmarking. If you want to go as far as I did, run a lightweight kernel. No virtual memory or anything there, rock-solid repeatability.

Jouni · Post by **Jouni** » Wed Feb 28, 2018 7:29 am

I think the problem is not math, but too many things to consider: linux/windows, AMD/Intel, popcount, bmi, 64 bit single/multi etc!?

Fulvio · Post by **Fulvio** » Wed Feb 28, 2018 9:46 am

mcostalba wrote:Given N noisy measures of speed of NEW and N noisy measures of MASTER, we want to know:

1. If NEW is faster than MASTER with 95% reliability

2. Maximum speed up of NEW, defined as (Speed NEW - Speed Master) / Speed MASTER that is guaranteed with 95% reliability

The second point needs a clarification. Suppose NEW is faster then MASTER of 0.5% with 95% guarantee), then it will be faster also of at least 0,4% and lower. On the contrary it will not be faster than 0.8% with 95% reliability, maybe just with 30% reliability. We want to find the max speed-up (in this case 0.5%) that has 95% reliability.

What's wrong with the simple approach?
Measure alternatively MASTER, NEW, MASTER, etc, let's say 111 times (the number of repetitions depends on how noisy is the system).
Then calculate the derivative values comparing each NEW measurement with the MASTER measurement immediately before and after it.

Code: Select all

for &#40;int i = 0; i < 110; i++) &#123;
	v&#91;i&#93; -= v&#91;i + 1&#93;;
	if (&#40;i & 1&#41; == 0&#41;
		v&#91;i&#93; = -v&#91;i&#93;;
&#125;

Remove the last element and the first 10 elements, then sort the vector in descending order.
v[0] will be the maximum gain and v[99] the minimum gain.
v[94] will be the minimum gain obtained in at least 95% measurements.
The index of the first negative element is the percentage of measurements with a gain.
The middle element of the vector will be the median, etc...

mcostalba · Post by **mcostalba** » Thu Mar 01, 2018 5:23 pm

I have done it !

Finally I was able to enable performance counters on my VMware virtual Linux and run the amazing perf tool.

This is the magical command line:

Code: Select all

sudo perf stat -r 5 -a -B -e cycles&#58;u ./stockfish bench > /dev/null

It _counts_ the CPU instructions of a bench run, repeating the measure 5 time and averaging the result, that is the one below:

Code: Select all

 Performance counter stats for 'system wide' &#40;5 runs&#41;&#58;

    11.694.771.724      cycles&#58;u                                                      ( +-  0,08% )

       4,370493245 seconds time elapsed                                          ( +-  0,23% )

Please not the amazing precision and repeatability of the measure, just +- 0,08%, moreover there is no influence of other OS activity, time , disk, etc...it just counts instructions (userspace instructions to be precise).

BeyondCritics · Post by **BeyondCritics** » Fri Mar 02, 2018 1:26 am

bob wrote:...
If you want to go as far as I did, run a lightweight kernel. No virtual memory or anything there, rock-solid repeatability.

I have googled it, but it appears you can not disable virtual memory in Linux. What kernel did you run?
What test metrics do you use for testing?

bob · Post by **bob** » Fri Mar 02, 2018 2:45 am

Jouni wrote:I think the problem is not math, but too many things to consider: linux/windows, AMD/Intel, popcount, bmi, 64 bit single/multi etc!?

Certainly isn't math. It takes a little thought and preparation to get good benchmarking results...

Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required

Re: Reliable speed comparison: some math required