Perft search speed bottleneck

syzygy · Post by **syzygy** » Fri Jun 07, 2013 1:56 pm

hgm wrote:
syzygy wrote:When you do an alpha/beta search, you will need to make at least all legal moves towards the leaf nodes. Checking for legality before making the move speeds up perft enormously, but will only marginally, if at all, pay off in a regular search (since you'll end up making most moves anyway). It might be worth it to give it a try at some point, but if you're just starting with your engine there are far easier ways to make progress.
I always felt it wast just the other way around. When you have an engine, you generally won't make any moves of those you generate in the leaf nodes. (If you did, they would not be leaf nodes...)

So by making the final ply in a perft you destroy all relationship between the perft and the efficiency of the engine on which it was based. You would be optimizing MakeMove() rather than MoveGen(), while an engine spends much more time in MoveGen() than in MakeMove().

I see your point, but I think this is an inherent problem of using perft to optimise a chess engine. Doing the legality check for all moves is also something you normally don't do during alpha/beta. Each legality check might be cheap, but in a perft you do so many that it will probably be the biggest consumer of time.

Maybe it would be better (when using perft for the purpose of optimising the move generator) to let perft only count the number of generated pseudo-legal moves at the leaves?

syzygy · Post by **syzygy** » Fri Jun 07, 2013 2:03 pm

spambanane wrote:
hgm wrote:Actually I always found it much more efficient to not use difficult-to-unpack layouts in the move format. I.e. uses the low-order 16 bits as 8 + 8 for from and to-square, so they can be accessed by simple byte accesses, rather than load + shift + mask.
hi harm geert, i always wonder how you work efficiently with byte values. i tried to make my data structure more dense, but i always end up with slower code, due to the fact that loading byte values into an 32 bit register is slow (just a guess, you know better probably.). so i use int32 most of the times. how do you (and others) deal with that? any resources about this matter in the web you can recommend?

(i don't want to hijack this topic, in case i have more questions i will open a new topic.)

What in any case should be avoided is alternating between 32-bit and 8-bit operations when addressing the same memory locations.

Evert · Post by **Evert** » Fri Jun 07, 2013 2:03 pm

lucasart wrote: You should have a look at my code:
https://github.com/lucasart/chess/blob/ ... movegen.cc

It may sound arrogant, but I'm quite confident that it blows away _almost_ anyone's perft here by a margin. About 5 million nodes/second: no hash table, no SMP, no cheating by counting leaves (I count interior nodes and do bulk counting at the leaves, if I coult the leaves, then it's 200 million nodes per second).

Something sounds off for those numbers. On my laptop, Jazz does a 22Mnps perft from the opening position (6-ply perft takes a little over 5 seconds), or 90 Mnps if I use a legal move generator rather than the pseudo-legal one (6-ply in 1.3 seconds, using bulk counting at the leaves). I don't think Jazz is particularly fast, so I'd be surprised if your code is slower, as the numbers you posted suggest...

The key idea is that I never have to play/undo to check for legality or even to filter pseudo-legal moves by a function like
Code: Select all
move_is_self_check(pseudo_legal_move)
such a function is inevitable atrocious and full of branches.

This is a big win for perft for which you know in advance that you will make (and check) every move. When I did a quick test with legal moves in the search (instead of pseudo-legal moves) it didn't seem to make much of a difference, really...

One advantage is that with full legal move generation you can do interesting stuff if there is only one legal move (say, extend the search), or more generally with the knowledge of how many legal moves there are at this node. I haven't really explored that though.

lucasart · Post by **lucasart** » Fri Jun 07, 2013 2:27 pm

Evert wrote:
lucasart wrote: You should have a look at my code:
https://github.com/lucasart/chess/blob/ ... movegen.cc

It may sound arrogant, but I'm quite confident that it blows away _almost_ anyone's perft here by a margin. About 5 million nodes/second: no hash table, no SMP, no cheating by counting leaves (I count interior nodes and do bulk counting at the leaves, if I coult the leaves, then it's 200 million nodes per second).
Something sounds off for those numbers. On my laptop, Jazz does a 22Mnps perft from the opening position (6-ply perft takes a little over 5 seconds), or 90 Mnps if I use a legal move generator rather than the pseudo-legal one (6-ply in 1.3 seconds, using bulk counting at the leaves). I don't think Jazz is particularly fast, so I'd be surprised if your code is slower, as the numbers you posted suggest...

You misread my post, yet I was very clear (interior nodes, no leaves, bulk counting). Anyway, in order to compare apples to apples, I compute perft(6) in about 0.6 seconds (with bulk counting).

To come back to the question, my point is that legal move generation is what chose, because:
(i) it makes code cleaner and easier, and there's no post processing. i just straightforwardly compute the target squares so as to never even consider illegal moves
(ii) it is much faster for perft, and (contrary to what is often said) it is not slower for an engine benefit from early cutoffs (there's alost no overhead in (i) and you completely avoid the nightmare of filtering pseudo legal moves before playing them which is a big overhead full of branches).

What was initially a choice for code smplicity, also turned out to be superior in terms of performance... So it's a no brainer, and if I was to rewrite it, I would do it exactly the same.

JarvisXerxes · Post by **JarvisXerxes** » Fri Jun 07, 2013 2:31 pm

lucasart wrote:
Evert wrote:
lucasart wrote: You should have a look at my code:
https://github.com/lucasart/chess/blob/ ... movegen.cc

It may sound arrogant, but I'm quite confident that it blows away _almost_ anyone's perft here by a margin. About 5 million nodes/second: no hash table, no SMP, no cheating by counting leaves (I count interior nodes and do bulk counting at the leaves, if I coult the leaves, then it's 200 million nodes per second).
Something sounds off for those numbers. On my laptop, Jazz does a 22Mnps perft from the opening position (6-ply perft takes a little over 5 seconds), or 90 Mnps if I use a legal move generator rather than the pseudo-legal one (6-ply in 1.3 seconds, using bulk counting at the leaves). I don't think Jazz is particularly fast, so I'd be surprised if your code is slower, as the numbers you posted suggest...

You misread my post, yet I was very clear (interior nodes, no leaves, bulk counting). Anyway, in order to compare apples to apples, I compute perft(6) in about 0.6 seconds (with bulk counting).

Hi Lucas, my code's recursion returns 1 at depth 0, without any bulk counting ... so my perft's actually counting interior nodes?

JarvisXerxes · Post by **JarvisXerxes** » Fri Jun 07, 2013 2:45 pm

Chan Rasjid wrote:
JarvisXerxes wrote: My move representation follows the common approach: a dozen bits for the from/to squares, a dozen bits for the captured/moving pieces and a few more for special flags (castling, EP capture).
When I pass a Move object (basically a 32-int) to make/unmake(), they simply call get() methods on the move object to obtain relevant info.
I think it is best to follow what the top experts(Ippolit, komodo) are doing in storing moves - done right at the very start and you never ever need to make changes to this part of your code:
1) store move with an integer.
2) lower 12 bits for to/from square.
3) next 3 bits for promote type - if any; or it means not a promotion move. Komodo has minor other uses for these 3 bits; pawn push x2 ?
4) remaining 1 bit of lower 16 bits - unused; or for later use to set checking moves when info available.
5) upper 16 bits for other use - order scores for capture moves; or unused for quiet moves.

I you think a little, you would understand why most mature programs follow what's above and nothing more. There is no need to store castling/ep/captured piece, etc as those information are available when we come to making the moves; or we usually have to get those info before making the moves. Undoing a captured piece comes from a stack history where we store the captured piece type.

Ughhh. Thanks Chan, I guess that's one more optimiztion I can make. Currently I'm doing a lot of m.setCapture() and m.getCapture(), because the captured pieces are stored internally in the 32-word.

JarvisXerxes · Post by **JarvisXerxes** » Fri Jun 07, 2013 2:49 pm

Steve Maughan wrote:Hi Jim,

You might find this post interesting:

http://www.chessprogramming.net/compute ... rft-speed/

I had a significant breakthrough why I found the pinned pieces at the start of my GenMoves routine. I then checked for legality before I call MakeMove. As others have pointed out, you only need to do this check of the piece is pinned or the move is a king or en-passant move. This was a win for me.

Thanks Steve. Actually that's the first article I read today.

I'm already on my way to implement that idea.

tpetzke · Post by **tpetzke** » Fri Jun 07, 2013 3:13 pm

Hi Jim,

it is a design decision to make. If you include the captured piece type and also moved piece type in the move data structure you don't need a reference board for some operations (e.g. for sorting the moves with a MVV/LVA scheme)

As long as your moves are not classes that you allocate with new whenever you need one you can't make much wrong here. The speed difference of the several methods will not determine the strength of your engine in this phase of development.

Thomas...

Chan Rasjid · Post by **Chan Rasjid** » Fri Jun 07, 2013 4:24 pm

tpetzke wrote:Hi Jim,

it is a design decision to make. If you include the captured piece type and also moved piece type in the move data structure you don't need a reference board for some operations (e.g. for sorting the moves with a MVV/LVA scheme)

As long as your moves are not classes that you allocate with new whenever you need one you can't make much wrong here. The speed difference of the several methods will not determine the strength of your engine in this phase of development.

Thomas...

A m.setCapture in itself may not seem costly, but it is accumulation of pre-calculation on moves that may not be tried during search that has a cost. Also, almost all program need an external stack to store information(eg. zobrist hash key, ep-square, etc). Having the captured type in the stack may have other uses. We may later want to know if any prior moves of the current line under examination(eg. 2 plys earlier) are captures. If the info is hidden within the move structure, it may be difficult to access. Usually, the authors of Ippolit, Komodo are people who have made many attempts at optimization. For beginners, settling on their method (if clear and intuitive) should be ok.

For ordering MVV/LVA, what I do is I have a move generation phase for QS(captures/promote/ep) separate from the quiet move phase. I put the order scores in the upper 16-bits within movegen(); and sorting is just comparing the moves themselves (my move type is unsigned int). With quiet moves, sorting has to reference the global history[side][pctype][to-sq] array.

Best Regards,
Rasjid.

hgm · Post by **hgm** » Fri Jun 07, 2013 5:24 pm

spambanane wrote:..., due to the fact that loading byte values into an 32 bit register is slow (just a guess, you know better probably.). so i use int32 most of the times. how do you (and others) deal with that?

Shouldn't be any slower than loading 32-bit data in a 32-bit register.

Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck

Re: Perft search speed bottleneck