Profiling Comparisons

Discussion of chess software programming and technical issues.

Moderator: Ras

kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: Profiling Comparisons

Post by kbhearn »

The BITSCAN() refers to calls to a 32-bit DeBruijn solution for BitScanForward. I agree that bitboards are used everywhere, but the question is how much profile time is acceptable for their calculation? Someone out there must have done a profile on their chess engine before.
That entirely depends on how often you use them. It's not a useful metric between different engines.

If i understand correctly that your attackboard() is used mostly for move generation then it sounds like either your move generation is taking too long or your evaluation is exceedingly light. It's normal these days for evaluation (and its subcalls) to be the largest time expenditure in an engine.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Profiling Comparisons

Post by D Sceviour »

Kevin Hearn wrote,
It's not a useful metric between different engines.
It is strange for you to say that because your next statement contradicts this.

Kevin Hearn wrote,
It's normal these days for evaluation (and its subcalls) to be the largest time expenditure in an engine.
Thank you for your opinion. It would be better if it were backed up by a real profile test result.

Kevin Hearn wrote,
If i understand correctly that your attackboard() is used mostly for move generation then it sounds like either your move generation is taking too long or your evaluation is exceedingly light.
No. The moves are generated without a legality test or attack board use. Attack boards are used for legality tests and calculated after each make move. Attack boards are also use for futility tests, lazy evaluation, static evaluation, sorting, SEE tests, and a new idea called gain score pruning (that doesn’t work very well yet). It seems that every time I try to reduce the use of attack boards, updated ones seem to be needed everywhere.

But again, it is not the use of the attack boards that is taking the time. It is the time to calculate the attack boards that might be the problem. I cannot see why this is so difficult to understand.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Profiling Comparisons

Post by bob »

D Sceviour wrote:Bob Hyatt asked,
When you generate legal moves, you use magics. When you evaluate sliding piece mobility, you use magics. So how much of the usage goes to evaluate, and how much to movgen?
Perhaps the error is mine in the use of magics as my algorithm uses attack boards (bitboards) and not magics to generate legal moves, and perhaps that is why my ATTACKBOARD() usage is so high. For example, a specific profile call to generate a Rook attack (in pseudo code):

ulonglong Rank_Attack(sq,occupancy)
{
occupancy AND= rookmask(sq);
Index = (occupancy * magicrook(sq)) SHR ROOK_SHIFT;
j = R_Index(sq,index);
return (R_Moves(j));
}

This returns a bitboard of the rook attack.

Likewise, there are profiled routines for Bishop, Queen, pawns, etc. These are added to produce 23.91% of the cumulative time. The profile above does not give the percentage breakdown of how many calls to generate attacks are made from the move generator.
_______________________________________________________

Bob Hyatt wrote,
I'd imagine ALL of the popcnt's are in evaluation, but maybe not...
That does not address the question being asked in the OP. The popcnt profile in question would be the equivalent of Crafty's 32-bit representation:

int PopCnt(register BITBOARD a)

The BITSCAN() refers to calls to a 32-bit DeBruijn solution for BitScanForward. I agree that bitboards are used everywhere, but the question is how much profile time is acceptable for their calculation? Someone out there must have done a profile on their chess engine before.
The whole point of magics is to compute the sliding piece attacks at the point they are needed. NOTHING gets updated, you just compute them from scratch..

I don't think anyone can answer that bit board question. We allow the compiler to inline trivial functions, which means the bit board stuff is spread across all the chess procedures that use them. It really would not help at all to know that PopCnt() is burning 20% of the time. Question is WHERE is it being done so often to burn that much time...

Do you track int multiply instruction usage? Most of us consider bit board operations to be that same level of simplicity. MOST bit board operations turn into AND, OR, XOR, SHIFT and occasionally popcnt, but more frequently BSF/BSR. Doesn't tell me very much to tell me popcnt is using N%. I'm interested WHERE that is being done to see if there is a way to reduce the number done there.

For Crafty, on modern hardware, popcnt turns into one instruction (popcnt obviously). Not much information there, more in the time used by the procedures that use that instruction.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Profiling Comparisons

Post by bob »

D Sceviour wrote:Kevin Hearn wrote,
It's not a useful metric between different engines.
It is strange for you to say that because your next statement contradicts this.

Kevin Hearn wrote,
It's normal these days for evaluation (and its subcalls) to be the largest time expenditure in an engine.
Thank you for your opinion. It would be better if it were backed up by a real profile test result.

Kevin Hearn wrote,
If i understand correctly that your attackboard() is used mostly for move generation then it sounds like either your move generation is taking too long or your evaluation is exceedingly light.
No. The moves are generated without a legality test or attack board use. Attack boards are used for legality tests and calculated after each make move. Attack boards are also use for futility tests, lazy evaluation, static evaluation, sorting, SEE tests, and a new idea called gain score pruning (that doesn’t work very well yet). It seems that every time I try to reduce the use of attack boards, updated ones seem to be needed everywhere.

But again, it is not the use of the attack boards that is taking the time. It is the time to calculate the attack boards that might be the problem. I cannot see why this is so difficult to understand.
Here is a fairly recent one from Crafty:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
8.91 20.57 20.57 137350661 0.00 0.00 Quiesce
8.14 39.36 18.79 197608464 0.00 0.00 GenerateCaptures
7.98 57.78 18.42 494 0.04 0.47 Search
7.44 74.93 17.16 551972714 0.00 0.00 MakeMove
--------------------------------------------------------------------------------
5.90 88.54 13.61 971982467 0.00 0.00 Attacks
5.65 101.58 13.04 279482293 0.00 0.00 Evaluate
5.48 114.23 12.65 324437326 0.00 0.00 Swap
4.97 125.70 11.47 224410340 0.00 0.00 EvaluateRooks
4.52 136.13 10.44 551972714 0.00 0.00 UnmakeMove
4.18 145.78 9.65 98180809 0.00 0.00 HashProbe
4.06 155.14 9.37 224410340 0.00 0.00 EvaluateBishops
4.04 164.46 9.32 355604008 0.00 0.00 AttacksTo
3.63 172.82 8.37 377016019 0.00 0.00 NextMove
3.20 180.20 7.38 15932322 0.00 0.00 EvaluatePawns
3.06 187.27 7.07 224410340 0.00 0.00 EvaluateKnights
2.50 193.03 5.76 102103945 0.00 0.00 EvaluatePassedPawns
1.90 197.42 4.39 346944856 0.00 0.00 EvaluateDevelopment
1.69 201.32 3.90 53958582 0.00 0.00 GenerateChecks
1.66 205.15 3.83 11547617 0.00 0.00 GenerateNoncaptures
1.57 208.77 3.62 224410340 0.00 0.00 EvaluateQueens
1.40 211.99 3.22 173472499 0.00 0.00 EvaluateMaterial
1.33 215.06 3.08 80327586 0.00 0.00 HashStore
1.19 217.81 2.75 224410340 0.00 0.00 EvaluateKings
1.16 220.49 2.68 63729288 0.00 0.00 EvaluateKingsFile
1.12 223.08 2.59 17326486 0.00 0.00 GenerateCheckEvasions
0.79 224.89 1.82 34273662 0.00 0.00 NextEvasion
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Profiling Comparisons

Post by D Sceviour »

Bob Hyatt wrote,
The whole point of magics is to compute the sliding piece attacks at the point they are needed.
That is very interesting! Therefore, the attack boards are not recalculated unless they affect a particular intersection point. I assume that point is anything which intersects the from/to square of the move to be legally tested. Correct? I tried a temporary test of this and found a minimum 20% increase in the speed of overall attack board calculation. That is significant but not enough to explain the dominant position in the profile list. The cautionary issue here is to be careful that the calculation of points as "they are needed" does not exceed the amount of time to calculate the attack boards from scratch.

However, even with some fine-tuning, there is still nothing to compare the percentage time to be expected for calculating the attack boards from scratch. It would be nice to have a profile comparison.

Bob Hyatt wrote,
We allow the compiler to inline trivial functions, which means the bit board stuff is spread across all the chess procedures that use them.
It is not that clear. The "spread out" procedures are the sort of thing expected for a macro expansion. The profiling output from my compile produced information for 107 different functions. However, that does not mean I know how to compile correctly. There must be some special reasons that some programs produce 300,000 nps while other programs produce 7,000,000 nps. Can these inline functions be identified by the compiler for special cache loading or something?

I understand your point about not tracking hardware instructions. The profile was intended for software substitutions.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Profiling Comparisons

Post by D Sceviour »

Thank you for the profile Bob. :D

Interesting is the total evaluation routines = 24.9%, and the total Attacks = 9.94%. That gives an aspiring goal to look forward. It is one-half of the battle towards creating an optimized program.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Profiling Comparisons

Post by bob »

D Sceviour wrote:Thank you for the profile Bob. :D

Interesting is the total evaluation routines = 24.9%, and the total Attacks = 9.94%. That gives an aspiring goal to look forward. It is one-half of the battle towards creating an optimized program.
Beware of false conclusions. I can't say anything about what that profiled. IE several positions, one position, what kind of position. I just had the profile output laying around and copied it. Generally Crafty's eval hangs around 50% or close to it. I've been playing with an eval hash/cache, which is a bit on the tricky side when done in conjunction with lazy eval. Once I get the kinks worked out so that it is reliable, I will post a real profile output.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Profiling Comparisons

Post by bob »

D Sceviour wrote:Bob Hyatt wrote,
The whole point of magics is to compute the sliding piece attacks at the point they are needed.
That is very interesting! Therefore, the attack boards are not recalculated unless they affect a particular intersection point. I assume that point is anything which intersects the from/to square of the move to be legally tested. Correct? I tried a temporary test of this and found a minimum 20% increase in the speed of overall attack board calculation. That is significant but not enough to explain the dominant position in the profile list. The cautionary issue here is to be careful that the calculation of points as "they are needed" does not exceed the amount of time to calculate the attack boards from scratch.

However, even with some fine-tuning, there is still nothing to compare the percentage time to be expected for calculating the attack boards from scratch. It would be nice to have a profile comparison.

Bob Hyatt wrote,
We allow the compiler to inline trivial functions, which means the bit board stuff is spread across all the chess procedures that use them.
It is not that clear. The "spread out" procedures are the sort of thing expected for a macro expansion. The profiling output from my compile produced information for 107 different functions. However, that does not mean I know how to compile correctly. There must be some special reasons that some programs produce 300,000 nps while other programs produce 7,000,000 nps. Can these inline functions be identified by the compiler for special cache loading or something?

I understand your point about not tracking hardware instructions. The profile was intended for software substitutions.
No. If you want bishop/rook/queen attacks, you directly generate them with a magic multiply lookup. Right at the point where the attacks are needed. No incrementally updated stuff, nothing kept around for later use.

As far as inlining goes, the -O2/-O3 takes care of that automatically. All really small functions get inlined, and some larger ones will if they are not called in a bunch of different places which would bloat the executable and really beat on cache badly...
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Profiling Comparisons

Post by D Sceviour »

Bob Hyatt wrote,
No. If you want bishop/rook/queen attacks, you directly generate them with a magic multiply lookup. Right at the point where the attacks are needed. No incrementally updated stuff, nothing kept around for later use.
Looking at Crafty's method of attack, it radiates outward from the attacked square rather than radiating inward from the attacking piece. The result (as I see it it) is a single dirty bitboard which gives the minimum required information to test the legality of a position. It cannot be used for anything else. Although I have not specifically tested that method, it may be that is what is necessary to maximize the speed of attack board functions.

However, I use a general attack board for a multiple of uses (futility, checking, legality, sorting, etc.) and that requires an exact attack board of all the pieces and pawns - for both sides on the same move. If this method is correct, then twice the amount of calculation time would be required for the Attacks().

The ultimate question is of course does it work? Arnaud lohéac posted these results for Schooner:

http://talkchess.com/forum/viewtopic.php?t=57496
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Profiling Comparisons

Post by cdani »

bob wrote: No. If you want bishop/rook/queen attacks, you directly generate them with a magic multiply lookup. Right at the point where the attacks are needed. No incrementally updated stuff, nothing kept around for later use.
In Andscacs I keep an array of attacks of sliders once they are calculated with magics on attack generation function. This showed as a little speed gain as they are used at least in evaluation function and move generation.