gperft

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

gperft

Post by ibid »

Does the world need yet another perft program? Absolutely not.
It is getting one anyhow. :)

gperft is a reasonably fast multithreaded perft program. There
are binaries for 64 bit linux and windows, with and without POPCNT.
The linux build is about 10% faster than the windows one. Both
POPCNT-free builds are considerably slower: 25-30% slower than the
linux executable.

gperft is a pretty conventional program with hash tables -- this
is NOT the program using unique positions that I used for perft(13).

gperft can do a single-threaded perft without hash tables:

Code: Select all

gperft 1.0 (linux)
Using simple perft function (one thread, no hash tables).
Depth is 8.

rnbqkbnr
pppppppp
--------
--------  w KQkq -
--------
--------
PPPPPPPP
RNBQKBNR

Na3        3,193,522,577
Nc3        3,926,684,340
Nf3        3,937,354,096
Nh3        3,221,278,282
a3         2,863,411,653
b3         3,579,299,617
c3         3,806,229,124
d3         6,093,248,619
e3         8,039,390,919
f3         2,728,615,868
g3         3,641,432,923
h3         2,860,408,680
a4         3,676,309,619
b4         3,569,067,629
c4         4,199,667,616
d4         7,184,581,950
e4         8,102,108,221
f4         3,199,039,406
g4         3,466,204,702
h4         3,711,123,115
TOTAL     84,998,978,956
239.664 seconds
Or a multithreaded one with hash tables:

Code: Select all

gperft 1.0 (linux)
Low hash table ready (4096 MB, 2-3 ply).
High hash table ready (2048 MB, 4-8 ply).
Using 6 threads (split after 3 ply).
Depth is 11.

rnbqkbnr
pppppppp
--------
--------  w KQkq -
--------
--------
PPPPPPPP
RNBQKBNR

Na3        70,080,800,068,168
Nc3        91,451,554,526,572
Nf3        89,933,046,388,964
Nh3        71,046,267,678,634
a3         60,403,292,887,824
b3         79,510,326,025,357
c3         92,235,553,734,553
d3        151,857,971,385,067
e3        241,074,613,621,302
f3         51,614,296,095,395
g3         82,762,826,570,051
h3         60,097,879,424,719
a4         85,054,341,127,064
b4         80,419,308,561,211
c4        103,605,670,223,681
d4        211,583,204,457,112
f4         68,372,448,303,691
e4        245,841,494,675,197
g4         73,966,186,324,024
h4         86,739,921,618,220
TOTAL   2,097,651,003,696,806
8520.216 seconds
These times are on my Phenom 1090T @ 3.6 GHz. If someone with a fast
intel cpu and/or a lot of memory could run a few tests, I'd love to
see the numbers.

I imagine perft(12) could be done in 2 or 3 days. And perhaps 2 or 3
months for perft(13) (if one has a good power company, as there is no
way to save/restore). Using this program to verify perft(14) does not
seem very likely.

The source is not available (yet!) -- it needs considerable cleanup
before I'd let anyone else see it. How many chess programs require
one to "#define CHESS" in order to get it to function? :)

https://www.dropbox.com/s/dtqdlhaeba3ts ... _linux.zip

https://www.dropbox.com/s/yynchzovz79tb ... indows.zip

-paul
Macintosh
Posts: 13
Joined: Wed Jun 26, 2013 8:23 pm
Location: Jena, Germany

Re: gperft

Post by Macintosh »

Thanks a lot. I downloaded the windows version, works like a charm. Looking forward to when/if you release the sources! I am sure I could learn a lot there.

I noticed you use two different hash tables (low/high). May I ask how you decide what to overwrite/update if a position is already in the hash table?

To elaborate: I tried Steven Edwards' "two kings" position (http://www.talkchess.com/forum/viewtopi ... 55&t=48423) with my program, and somehow it starts scaling exponentially in time around depth 12.

I only use one hash table, but with separate parts for btm/wtm. Here is how I handle collisions (basically due to depth mismatch):

Code: Select all

				// sub-tree finished, so update & store to hash
				pGameHashEntry = &m_pGameHashPrimary[m_nSideToMove][(m_piTreeStatus[m_nCurrentPly].hkGame) & m_nGameHashMask];
				if (GetOptionUseHash() &&
					((nMaxDepth - m_nCurrentPly) > 0) &&
					(llNodes[m_nCurrentPly] > 0) &&
					(!pGameHashEntry->IsGoodHashKey(m_piTreeStatus[m_nCurrentPly].hkGame) ||
						(pGameHashEntry->GetDepth() < (nMaxDepth - m_nCurrentPly))))
				{
					pGameHashEntry->Store(m_piTreeStatus[m_nCurrentPly].hkGame, BYTE(nMaxDepth - m_nCurrentPly), llNodes[m_nCurrentPly]);
				}
Comparing directly with gperft and YACE (0.99.87), which both return depth 20 literally instantly, it seems I still have a bug somewhere. I am trying to narrow it down by counting hash store tries/updates (correct entry, wrong depth)/overwrites (wrong entry), but it's likely I have some beginner's bug... :-(

I tried different depths/drafts for when to store a sub-tree, and in this case it did not matter whether I stored the leaf nodes or only up to one ply before that. I also implemented the trivial approach of re-synchronizing all threads when a sub-tree finishes - all with no effect on the exponential timings. Even disabling multithreading did not help. Surely I am overlooking something very trivial, but I do not see it (yet).

Maybe you have an idea how to isolate the bug...

Greetings from Jena


Marcus
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: gperft

Post by syzygy »

ibid wrote:These times are on my Phenom 1090T @ 3.6 GHz. If someone with a fast intel cpu and/or a lot of memory could run a few tests, I'd love to see the numbers.
On my i7-3930k at 4.2 GHz, gperft 7 takes 6.233s and gperft 8 takes 168.383s.
My own perft does it in 4.987s and 134.870s (using popcnt to count leaves).
Allard Siemelink's frcperft reports 7.29s and 195.91s (using clock() instead of gettimeofday()).

This is on Linux.

I think there are some positions on which frcperft performs better than my program, but I don't have one at the moment.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: gperft

Post by syzygy »

syzygy wrote:I think there are some positions on which frcperft performs better than my program, but I don't have one at the moment.
Frcperft seems to be quite a bit faster than my perft on positions with fewer pieces:
8/PPP4k/8/8/8/8/4Kppp/8 w - -
8/7p/p5pb/4k3/P1pPn3/8/P5PP/1rB2RK1 b - d3
8/3K4/2p5/p2b2r1/5k2/8/8/1q6 b - -
8/2p5/3p4/KP5r/1R3p1k/8/4P1P1/8 w - -

On the last one gperft is also a bit faster than mine (but is beaten easily by frcperft).
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: gperft

Post by ibid »

Macintosh wrote:Thanks a lot. I downloaded the windows version, works like a charm. Looking forward to when/if you release the sources! I am sure I could learn a lot there.

I noticed you use two different hash tables (low/high). May I ask how you decide what to overwrite/update if a position is already in the hash table?

To elaborate: I tried Steven Edwards' "two kings" position (http://www.talkchess.com/forum/viewtopi ... 55&t=48423) with my program, and somehow it starts scaling exponentially in time around depth 12.

I only use one hash table, but with separate parts for btm/wtm. Here is how I handle collisions (basically due to depth mismatch):

Code: Select all

				// sub-tree finished, so update & store to hash
				pGameHashEntry = &m_pGameHashPrimary[m_nSideToMove][(m_piTreeStatus[m_nCurrentPly].hkGame) & m_nGameHashMask];
				if (GetOptionUseHash() &&
					((nMaxDepth - m_nCurrentPly) > 0) &&
					(llNodes[m_nCurrentPly] > 0) &&
					(!pGameHashEntry->IsGoodHashKey(m_piTreeStatus[m_nCurrentPly].hkGame) ||
						(pGameHashEntry->GetDepth() < (nMaxDepth - m_nCurrentPly))))
				{
					pGameHashEntry->Store(m_piTreeStatus[m_nCurrentPly].hkGame, BYTE(nMaxDepth - m_nCurrentPly), llNodes[m_nCurrentPly]);
				}
Comparing directly with gperft and YACE (0.99.87), which both return depth 20 literally instantly, it seems I still have a bug somewhere. I am trying to narrow it down by counting hash store tries/updates (correct entry, wrong depth)/overwrites (wrong entry), but it's likely I have some beginner's bug... :-(

I tried different depths/drafts for when to store a sub-tree, and in this case it did not matter whether I stored the leaf nodes or only up to one ply before that. I also implemented the trivial approach of re-synchronizing all threads when a sub-tree finishes - all with no effect on the exponential timings. Even disabling multithreading did not help. Surely I am overlooking something very trivial, but I do not see it (yet).

Maybe you have an idea how to isolate the bug...
I don't see anything obviously wrong with that code, although there could be something hidden in the implementation of the various functions.

One thing I do in gperft which is a major help for positions like that one is to incorporate the depth of the subtree into the hash key. The problem is that by 20 ply you are seeing the exact same positions at 2, 4, 6, 8, ... ply and will be constantly overwriting useful results. With the depth as part of the hash key, they will likely land in different slots and there won't be an issue. I just disabled that in gperft as a test and it is taking 3 seconds to complete 12 ply. With the depth included, 20 ply is near-instant regardless of replacement scheme.

In gperft I added the second hash table simply so I could use 6 GB instead of 4 GB on my machine -- it seemed wasteful. :) And it does speed up long-running perfts. The low depth table stores two positions in a slot -- the deepest and the most recent. In practice, with default settings, this means each slot will usually contain the most recent 2 ply position and the most recent 3 ply position. The high depth table was originally set up the same, but I got some improvement (again, on multi-hour perfts) by moving to 4 positions in each slot -- the 3 deepest and the most recent.
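The slot layout described above (one depth-preferred entry plus one always-replaced "most recent" entry) could be sketched roughly like this; again, the names are illustrative rather than gperft's:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative two-entry slot in the spirit described above:
// e[0] keeps the deepest result seen so far ("depth-preferred"),
// e[1] is always overwritten by the latest store ("always-replace").
struct Entry {
    uint64_t key   = 0;
    uint64_t nodes = 0;
    int      depth = -1;
};

struct Slot {
    Entry e[2];

    void store(uint64_t key, int depth, uint64_t nodes) {
        if (depth >= e[0].depth) {
            e[1] = e[0];                      // demote old deepest to "recent"
            e[0] = Entry{key, nodes, depth};  // new deepest
        } else {
            e[1] = Entry{key, nodes, depth};  // just refresh "recent"
        }
    }

    bool probe(uint64_t key, int depth, uint64_t &nodes) const {
        for (const Entry &x : e)
            if (x.key == key && x.depth == depth) {
                nodes = x.nodes;
                return true;
            }
        return false;
    }
};
```

Widening `e[]` to four entries (the three deepest plus the most recent) would correspond to the high-table layout mentioned above.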

Hope that was helpful,
-paul
ibid
Posts: 89
Joined: Mon Jun 13, 2011 12:09 pm

Re: gperft

Post by ibid »

syzygy wrote:
ibid wrote:These times are on my Phenom 1090T @ 3.6 GHz. If someone with a fast intel cpu and/or a lot of memory could run a few tests, I'd love to see the numbers.
On my i7-3930k at 4.2 GHz, gperft 7 takes 6.233s and gperft 8 takes 168.383s.
My own perft does it in 4.987s and 134.870s (using popcnt to count leaves).
Allard Siemelink's frcperft reports 7.29s and 195.91s (using clock() instead of gettimeofday()).

This is on Linux.

I think there are some positions on which frcperft performs better than my program, but I don't have one at the moment.
Thank you for running that, I don't have access to a good intel cpu. Now I know how much of the deficit is my code, not my cpu. :)

And I'll add having a look at frcperft to my to-do list...

Thanks,
-paul
Macintosh
Posts: 13
Joined: Wed Jun 26, 2013 8:23 pm
Location: Jena, Germany

Re: gperft

Post by Macintosh »

Thank you for the explanation. I suspected that overwrites/updates might be the problem, and so far I had been thinking of storing two positions for different depths - something I already do for the alpha-beta search.

I would never have thought of combining the depth into the hash key. That seems even more flexible. I will try this starting on the weekend.

On the other hand, I guess these "trivial" positions might gain a lot from this idea; I would expect "normal" middle-game positions not to gain anything, or even to suffer a bit. My current code takes roughly 17.5h for perft(11) of the starting position with 8GB hash and 4 threads on an i7-3770S, so I think I am not doing too badly for a not-so-specialized perft.

I will post an update on the outcome of this experiment once I have the implementation done correctly. (I still make many slips of the pen when I try something completely new or rush things.)

Thanks again for your elaborations and

Greetings from Jena


Marcus
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: gperft

Post by Rein Halbersma »

ibid wrote: One thing I do in gperft which is a major help for positions like that one is incorporate the depth of the subtree into the hash key. The problem is that by 20 ply you are seeing exact same positions at 2, 4, 6, 8... ply and will be constantly overwriting useful results. With the depth as part of the hash key, they will likely land in different slots and there won't be an issue. I just disabled that in gperft as a test and it is taking 3 seconds to complete 12 ply. With the depth included, 20 ply is near-instant regardless of replacement scheme.
That sounds clever! Is this something that you also do in the search function? Do other programs use it?
Macintosh
Posts: 13
Joined: Wed Jun 26, 2013 8:23 pm
Location: Jena, Germany

Re: gperft

Post by Macintosh »

Could not wait until the weekend. :-)

Code: Select all

position fen k7/8/8/8/8/8/8/7K w - -
performance tree 20 split
Testing performance of tree ...
  h1g1: 1,162,131,370,259,516N.
  h1g2: 2,081,241,911,804,424N.
  h1h2: 1,162,131,370,259,516N.
Hash: Stores: 54.51k, Hits: 388.3k, Retrieved: 4.405PN
Total number of nodes: 4,405,504,652,323,456. Time: 0.795s.
0ns/N (raw: 0ns/N, CPU load: 92.2%).
Performance test completed.

setoption name Clear Hash
info string Hash cleared
performance tree 20
Testing performance of tree ...
  Ply 1: 3N.
  Ply 2: 9N.
  Ply 3: 54N.
  Ply 4: 324N.
  Ply 5: 1,890N.
  Ply 6: 11,024N.
  Ply 7: 71,758N.
  Ply 8: 466,210N.
  Ply 9: 3,083,208N.
  Ply 10: 20,296,614N.
  Ply 11: 137,688,284N.
  Ply 12: 928,817,526N.
  Ply 13: 6,338,164,714N.
  Ply 14: 43,010,471,992N.
  Ply 15: 294,704,424,538N.
  Ply 16: 2,009,892,516,456N.
  Ply 17: 13,775,845,493,902N.
  Ply 18: 94,070,961,872,736N.
  Ply 19: 644,662,402,680,014N.
  Ply 20: 4,405,504,652,323,456N.
Hash: Stores: 231.7k, Hits: 5.602M, Retrieved: 5.160PN
Total number of nodes: 5,160,368,898,384,712. Time: 12.05s.
0ns/N (raw: 0ns/N, CPU load: 83.8%).
Performance test completed.

hashstats perft
Perft-Hash statistics: (Total: 131,072)
 Pl.  Ply  Min     Max     Ave     StdDev  Usage     Used
  B     1       3      64      33      14  0.02142           2,807
  B     2      12     511     247     112  0.08330          10,918
  B     3      65  3.852k  1.508k     783  0.03012           3,948
  B     4     288  26.72k  11.97k  5.889k  0.09302          12,192
  B     5  1.698k  183.9k  96.32k  42.15k  0.02811           3,685
  B     7  9.232k  1.295M  503.5k  264.9k  0.02960           3,880
  B     8  118.5k  9.086M  3.964M  1.995M  0.00195             255
  B     9  680.5k  62.70M  27.66M  14.38M  0.00896           1,174
  B    10  15.75M  408.4M  170.4M  87.97M  0.00113             148
  B    11  95.98M  2.887G  1.375G  9.223E  0.00533             698
  B    12  2.130G  17.84G  8.207G  9.223E  0.00059              77
  B    13  6.338G  140.8G  67.99G  9.223E  0.00290             380
  B    14  77.73G  678.5G  275.4G  9.223E  0.00027              35
  B    15  294.7G  6.619T  2.831T  9.223E  0.00117             153
  B    16  3.634T  14.78T  7.782T  9.223E  0.00007               9
  B    17  24.96T  215.2T  92.20T  9.223E  0.00023              30
  B    18  169.9T  304.8T  237.3T  9.223E  0.00002               2
  B    19  1.162P  2.081P  1.468P  9.223E  0.00002               3
  W     1       3      64      37      14  0.07090           9,293
  W     2      12     511     253     113  0.06576           8,619
  W     3      65  3.852k  1.700k     810  0.09833          12,888
  W     4     288  26.72k  12.24k  5.916k  0.06973           9,139
  W     5  1.698k  183.9k  81.24k  43.08k  0.05035           6,599
  W     7  17.83k  1.257M  467.4k  232.7k  0.01157           1,517
  W     8  118.5k  9.086M  4.165M  2.055M  0.01751           2,295
  W     9  2.759M  61.00M  25.77M  13.58M  0.00204             268
  W    10  15.75M  431.7M  195.1M  9.223E  0.00880           1,154
  W    11  245.9M  2.887G  1.166G  9.223E  0.00095             124
  W    12  928.8M  20.51G  9.764G  9.223E  0.00484             635
  W    13  11.36G  136.2G  49.10G  9.223E  0.00055              72
  W    14  43.01G  965.6G  473.1G  9.223E  0.00211             277
  W    15  530.5G  5.718T  1.666T  9.223E  0.00021              28
  W    16  2.009T  45.35T  16.13T  9.223E  0.00066              87
  W    17  44.71T  44.71T  44.71T  9.223E  0.00001               1
  W    18  305.4T  980.8T  489.5T  9.223E  0.00007               9
I simply XORed the current depth into the hash key.
Preliminary results show, however, that the "normal" positions suffer. I will investigate further before quoting any results, just to make sure I did not miss something.

Greetings from Jena


Marcus
Macintosh
Posts: 13
Joined: Wed Jun 26, 2013 8:23 pm
Location: Jena, Germany

Re: gperft

Post by Macintosh »

I don't think it makes much sense in a "normal" chess search (not perft):

1) In perft, you need the exact depth of the searched sub-tree in order to store the exact number of nodes/leaves in that sub-tree. If you revisit the position later at a different depth (e.g. after moving pieces back and forth), you cannot reuse the node/leaf count you might find in the hash table. That is why you need a good strategy for not overwriting precious results that might come in handy later. The trick of including the depth in the hash key elegantly solves the problem of storing the same position at different depths.

2) In a "normal" chess search, you will probably store the most valuable information about the current position. However, you face a dilemma when using e.g. alpha-beta or PVS: you need to store more information than the node counts in perft (best move, score, bound type, ...), and this information might be incomplete, e.g. you only have a best move and a bound, but no exact score.
Usually you simply keep the position with the larger sub-tree (depth) in the hash table on an update. Some might use a different precedence, e.g. preferring an exact score over two plies more depth.
The basic problem during search arises when different positions hash to the same memory location. Then you need some other way of determining whether the already stored entry should be replaced (overwritten) or kept. I assume most programs mitigate that dilemma with additional entries, in case both pieces of information are valuable (1st and 2nd slot).

In a normal search, additionally incorporating the depth into the hash key would only waste entries, and it would not solve any of the collision problems, because usually larger depth means better information content.

Just my interpretation of things...

Greetings from Jena


Marcus