gperft

sje · Post by **sje** » Mon Jul 22, 2013 8:05 am

Symbolic takes 11 seconds with a small transposition table, and 326 seconds without:

[] sf 8/2p5/3p4/KP5r/1R3p1k/8/4P1P1/8 w - - 0 1
[] pctran 9
Path(9): 50,086,749,815   Pt: 40.811   Wt: 10.809   F/P: 4.63361e+09/2.15814e-10
[] pcbulk 9
Path(9): 50,086,749,815   Pt: 21:01.865   Wt: 5:26.365   F/P: 1.53468e+08/6.516e-09

The ply 1 subtotals appear in the log:

[2013.07.22 05:53:56.508] < pcbulk 9
[2013.07.22 05:55:46.649] Rb3 4,258,569,600
[2013.07.22 05:55:49.857] Rd4 4,521,736,449
[2013.07.22 05:56:01.832] Rc4 4,966,212,754
[2013.07.22 05:56:17.117] Rb1 5,688,812,201
[2013.07.22 05:57:30.018] Re4 3,920,504,030
[2013.07.22 05:57:38.389] g4 3,628,220,761
[2013.07.22 05:57:42.472] Ka6 4,580,161,609
[2013.07.22 05:57:59.870] Ka4 4,031,452,737
[2013.07.22 05:58:51.319] e3 3,087,382,842
[2013.07.22 05:59:01.119] e4 2,321,736,882
[2013.07.22 05:59:09.019] Ra4 3,399,463,489
[2013.07.22 05:59:10.913] Rb2 3,576,462,883
[2013.07.22 05:59:22.763] g3+ 1,220,138,764
[2013.07.22 05:59:22.873] Rxf4+ 885,894,814
[2013.07.22 05:59:22.874] > Path(9): 50,086,749,815   Pt: 21:01.865   Wt: 5:26.365   F/P: 1.53468e+08/6.516e-09
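
For anyone who wants to sanity-check a log like this, here is a minimal Python sketch (not part of Symbolic) that re-adds the ply 1 subtotals and compares them with the reported Path(9) total. It also derives a nodes-per-wall-second figure; reading "F/P" as that rate and its reciprocal is my assumption, as is taking "Wt" to be wall time. The names `parse_divide`, `LOG` and `WALL_SECONDS` are made up for the example.

```python
import re

# Ply-1 subtotals copied from the pcbulk 9 log (timestamps dropped).
LOG = """\
Rb3 4,258,569,600
Rd4 4,521,736,449
Rc4 4,966,212,754
Rb1 5,688,812,201
Re4 3,920,504,030
g4 3,628,220,761
Ka6 4,580,161,609
Ka4 4,031,452,737
e3 3,087,382,842
e4 2,321,736,882
Ra4 3,399,463,489
Rb2 3,576,462,883
g3+ 1,220,138,764
Rxf4+ 885,894,814
"""

EXPECTED_TOTAL = 50_086_749_815   # Path(9) as reported in the log
WALL_SECONDS = 5 * 60 + 26.365    # "Wt: 5:26.365", assumed to be wall time


def parse_divide(text):
    """Return {move: node_count} from lines of the form 'Rb3 4,258,569,600'."""
    counts = {}
    for line in text.splitlines():
        match = re.fullmatch(r"(\S+)\s+([\d,]+)", line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2).replace(",", ""))
    return counts


subtotals = parse_divide(LOG)
total = sum(subtotals.values())
rate = total / WALL_SECONDS       # nodes per wall-clock second

print(f"{len(subtotals)} moves, total {total:,}")
print("matches reported total:", total == EXPECTED_TOTAL)
print(f"nodes/s {rate:.5e}, s/node {1 / rate:.4e}")
```

The same check works for any per-move breakdown, as long as each line is a move followed by a comma-grouped count.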

Macintosh · Post by **Macintosh** » Mon Jul 22, 2013 6:22 pm

Thanks, Paul.

Did you only change this option thingy (multiple threads - no hash), because I already did some of the scaling tests on gperft 1.0.2? (Just to make sure I compare basically the same code.)

Hopefully, I will find some time during this week, if not, at latest on the weekend. I will post the raw scaling results then here. Just for those that may be interested in Intel's hyper-threading and turbo-boost.

PS: I do these tests basically for myself, to get a feeling for the different factors that might blur speed tests. If I need to compare raw computation speed, I assume using no hash at all gives a better estimate.

Regards

Marcus

ibid · Post by **ibid** » Thu Dec 04, 2014 4:58 am

I thought gperft was pretty well optimized. I was mistaken. So it is time for a new version.
gperft 1.1 is about 12-16% faster than 1.0.3 with no hash tables, 9-12% with hash tables.

Overall, run times for gperft have dropped about 30% since the original release without
hash tables and 23% with hash tables.

In addition, it appears gettimeofday() has a resolution of about 16 ms with MinGW, so I've
changed to QueryPerformanceCounter() for the Windows build. Run times under Windows have
a fair bit of variation anyhow, though, so it makes little real difference.
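
A quick way to see this kind of timer granularity for yourself is to spin until a clock's reading changes and note the smallest step. The sketch below does that in Python rather than in gperft's own code, so it only illustrates the effect. `observed_tick` is a name invented here. On Windows, CPython's `time.perf_counter()` is backed by QueryPerformanceCounter(), while `time.time()` reads the system clock; the step sizes you see will depend on the OS and Python version.

```python
import time


def observed_tick(clock, samples=20):
    """Smallest positive step seen between consecutive readings of clock()."""
    smallest = float("inf")
    for _ in range(samples):
        start = clock()
        now = clock()
        while now <= start:          # spin until the reading advances
            now = clock()
        smallest = min(smallest, now - start)
    return smallest


if __name__ == "__main__":
    for name, clock in (("time.time", time.time),
                        ("time.perf_counter", time.perf_counter)):
        print(f"{name:18s} smallest observed step: {observed_tick(clock):.9f} s")
```

A coarse clock shows up as a step in the millisecond range; a high-resolution counter typically shows steps well under a microsecond.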

A new flag has been added, -quiet, which reduces the output to a single line containing only
the result and the time required:

$ gperft -quiet -threads 6 "" 8
84998978956 27.968
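
As a rough idea of how a benchmarking script might consume that line, here is a small Python sketch. It only parses the two fields shown above (node count, then seconds); `parse_quiet` is a hypothetical helper, and the script does not launch gperft itself.

```python
def parse_quiet(line):
    """Split a '-quiet' result line into (node_count, seconds)."""
    nodes_text, seconds_text = line.split()
    return int(nodes_text), float(seconds_text)


# Sample line taken from the run above.
nodes, seconds = parse_quiet("84998978956 27.968")
print(f"{nodes:,} nodes in {seconds} s, {nodes / seconds / 1e9:.2f} Gnodes/s")
```

In a real script the line would come from the captured standard output of the gperft process rather than from a string literal.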

I find this useful for testing and benchmarking scripts. If someone wants to try to use gperft
to assist with Steven Edwards' perft(14) computation, perhaps this will help.

The link remains the same:
https://www.dropbox.com/s/bxoogxwkbncxd68/gperft.zip

I'll leave you with 1/400th of perft(14). Although it is definitely a small 1/400th... :)

$ gperft -memory 6000 -threads 6 "rnbqkbnr/ppppp1pp/5p2/8/8/5P2/PPPPP1PP/RNBQKBNR w KQkq -" 12
gperft 1.1 (linux)
Low hash table ready (4096 MB, 2-3 ply).
High hash table ready (2048 MB, 4-9 ply).
Using 6 threads (split after 3 ply).
Depth is 12.

rnbqkbnr
ppppp-pp
-----p--
--------  w KQkq -
--------
-----P--
PPPPP-PP
RNBQKBNR

Na3        1,173,592,538,222,169
Nc3        1,557,766,236,902,564
Nh3        1,542,185,762,012,030
a3         1,006,176,524,547,376
b3         1,345,384,338,086,527
c3         1,541,303,404,928,247
d3         2,731,836,553,038,204
e3         2,585,982,055,662,217
g3         1,342,651,888,215,425
h3           970,387,210,701,079
f4         1,464,824,328,258,908
a4         1,416,051,275,269,845
b4         1,355,469,574,643,311
c4         1,773,485,132,081,481
d4         3,661,868,648,225,375
e4         2,917,944,214,756,015
g4         1,329,054,816,289,657
h4         1,426,383,480,447,864
Kf2        1,663,119,461,550,316
TOTAL     32,805,467,443,838,610
56713.204 seconds

-paul

Tom Likens · Post by **Tom Likens** » Thu Dec 04, 2014 2:19 pm

ibid wrote:New version. A few small changes, mostly trying to get things more cache-friendly. Good for perhaps 1%. And a bigger speedup from a change in compiler. The linux build is about 7-8% faster with gcc 4.8.1 compared to my old 4.4.5 (mostly because I added -flto and PGO). The windows build gets a heftier 10-14% with MinGW (4.8.1 again) compared to Visual Studio 2010 (Release build + PGO).

Despite the fact they are both now built with basically the same compiler, the linux build is still 7-10% faster. Is this typical for a MinGW build? It seems odd to me, since the basic perft is, except for a handful of printf's, pure computation. And yet is 7% slower -- seems it should be very similar for linux/windows. The multithreaded search adds spinlocks and such, so I can see why there might be some difference in that case.

[--snip--]

Unfortunately, I run into this issue every time I release a version of my chess engine. The MinGW Windows release is 10-20% slower than the Linux release using the exact same version of the GCC compiler, with the exact same compiler settings. I've tried a lot of different things but I've never been able to get that speed back. If you ever manage to figure out the speed deficit please let me know. It would be nice to reclaim those elo points, however small.

regards,
--tom

ibid · Post by **ibid** » Thu Dec 04, 2014 8:41 pm

Tom Likens wrote: Unfortunately, I run into this issue every time I release a version of my chess engine. The MinGW Windows release is 10-20% slower than the Linux release using the exact same version of the GCC compiler, with the exact same compiler settings. I've tried a lot of different things but I've never been able to get that speed back. If you ever manage to figure out the speed deficit please let me know. It would be nice to reclaim those elo points, however small.

regards,
--tom

As Richard Delorme pointed out, the different ABIs are a likely suspect. Since well over half
the CPU time for gperft is spent in one (rather complex) function, I have been looking at the
assembly produced for it to get hints on how things might be optimized. I should point out
that I am definitely *not* an expert in this stuff; this is all pure speculation!

The function in question is quite complex and uses all 15 available registers. For linux, this
involves pushing 6 registers onto the stack then restoring them before it returns. For windows,
8 registers are saved this way. So it appears any function complex enough to need a lot
of registers will need to save more registers for windows. Not a big overhead I am sure, but
a contributing factor.

The bigger problem seems to be that MinGW-x64 produces position-independent code, which is
adding some overhead accessing global variables (all those knight_attacks arrays, magic arrays,
and such). So while linux can access the bits[] array by directly copying the entry in the array:
movq bits(,%rax,8), %rsi
The windows build needs to first figure out where bits[] is located:
movq .refptr.bits(%rip), %rdi
movq (%rdi,%rax,8), %rsi
Which of course adds an instruction each time one of the many global arrays is needed. Worse,
it is requiring a temporary register, %rdi here, in a function already under register pressure.
Also, since bits[] is used a lot, it would like to keep the location of bits[] in %rdi to avoid the extra
instruction every time you use it. If you have several commonly used globals like this, you are
losing several registers keeping track of them. To my amateur eye, the resulting assembly is
dumping things onto the stack more and is considerably more convoluted than the more elegant
linux assembly.

Can position-independent code be turned off? All I can say is that -fno-pic didn't do it. And I have
had no luck getting understandable information on this via google. If anyone knows how this can
be done, I'd love to know...

-paul
