if statement and calculation faster than constant

mvanthoor · Post by **mvanthoor** » Fri Apr 10, 2020 1:41 am

A few weeks ago there was a discussion about someone pushing to replace large table-arrays with a calculation in Stockfish where possible, because it was deemed faster. I've been noticing this myself several times in my optimization rounds for Rustic's move generator, make, and unmake function.

In make_move, the en passant square needs to be set. At first, I did it like this:

Code: Select all

// A double-step is the only move that sets the ep-square.
if double_step {
		let ep_square = if us == WHITE { to - 8 } else { to + 8 };
		board.set_ep_square(ep_square);
}

This works fine, but the profiler flags the "let ep_square" line to be expensive.
So, I thought to be smart, and replace this with a lookup-array:

Code: Select all

const DOUBLE_STEP_SQUARE_MAP: [u8; NR_OF_SQUARES as usize] = [
    0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,
    16, 17, 18, 19, 20, 21, 22, 23,
    40, 41, 42, 43, 44, 45, 46, 47,
    0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,
    0,  0,  0,  0,  0,  0,  0,  0,
];

So, if a white pawn moves to square 24, it'll find 16 as the EP-square.

This gets rid of the if-statement, the calculation "to - 8", and even the entire variable "ep_square".

The result is negative; perft 7 is consistently a second slower (+/- 0.2) with the lookup-array than it is with the the original code. The speed loss is about 1% or so; just a fraction, but consistently measurable. I've noticed the same in square_attacked(), where actually calculating the pawn attack on the fly is faster than getting it from mg.get_pawn_attacks(color, square) which is a function that just copies the bitboard from a lookup table. (Entirely replacing the lookup table with a calculation is negative though; strangely enough, the calculation vs. lookup is only beneficial in square_attacked(); not everywhere else as well.)

If the Stockfish developers are finding the same thing, that replacing a lookup-table with a calculation that then speeds up Stockfish by 1% for each replacement, I can imagine them replacing the tables.

(However... In my case I think the original code is MUCH cleaner and less bulky than the lookup table; it's only when the calculation is hard to understand that I think the lookup table is clearly better, even IF it's 1% slower.)

Terje · Post by **Terje** » Fri Apr 10, 2020 2:17 am

To calculate the en passant square of a double step you can do

Code: Select all

ep_square = to ^ 8

regardless of side to move; for white, 'to' will be 24-31 which has the '8-bit' set and flipping it will thus move you down one rank, while for black 'to' will be 32-39 which doesn't have that bit set so flipping it will move you up one rank

If you try this, does the profiler no longer mark it as expensive?

mvanthoor · Post by **mvanthoor** » Fri Apr 10, 2020 3:49 am

Thanks for the tip, but... believe it or not, it actually is slower. Perft 7 times rose by 1.5 seconds, consistently.

The last few evenings, I've optimized Rustic where possible.
- Removed the move list pool, and reverted to the array + counter in a struct, but using unsafe { } to not zero-init it as Fabian showed.
- Remove a few Some() checks (C-equivalent would be: if x != null.) If there are no mistakes in the board, the values are either the correct ones, or the program will crash or won't play legal chess. As this is not user-facing software, it doesn't have to try and recover from errors. (It probably even can't if you'd try...)
- Removed the vector for the history list. Even when initializing with enough capacity, push() and pop() were still slower than an array backed with a counter and push() and pop() implemented by myself. (pop() actually doesn't even clear the element; another push() will just overwrite.)
- Regrouped some more if-constructions and evaluation orders (put the one that has the greatest chance of short-curcuit the if-statement first.)
- There's one more Some() that I can remove (for the EP-square), but that's going to be for tomorrow.

With these optimizations, I managed to drop perft 7 times from 95.5 seconds to 89 seconds; Rustic's speed is now within a second of Weiss on this system. Rustic perfts at 89.0 - 89.2 seconds, Weiss at 88.4 - 89.5, so it has a somewhat bigger margin.

Code: Select all

Benchmarking perft 1-7:

8   r n b q k b n r
7   i i i i i i i i
6   . . . . . . . .
5   . . . . . . . .
4   . . . . . . . .
3   . . . . . . . .
2   I I I I I I I I
1   R N B Q K B N R

    A B C D E F G H

Zobrist key:        819aa694337673fb
Active Color:       White
Castling:           KQkq
En Passant:         -
Half-move clock:    0
Full-move number:   1

Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (5 ms, 39456200 leaves/sec)
Perft 5: 4865609 (139 ms, 35004381 leaves/sec)
Perft 6: 119060324 (3357 ms, 35466286 leaves/sec)
Perft 7: 3195901860 (89045 ms, 35890862 leaves/sec)
Total time spent: 92546 ms
Execution speed: 35874423 leaves/second

With regard to nodes/second: I only count leaves, not all nodes traversed. I still don't know what the "correct" way is.

The one thing I keep being flabbergasted about with Rust is this:

A:

Code: Select all

let mut current_game_state = board.game_state;
current_game_state.this_move = m;
board.unmake_list.push(current_game_state);

B:

Code: Select all

board.game_state.this_move = m;
board.unmake_list.push(current_game_state);

Sometimes, code becomes like A during refactoring. As I keep testing after each change, I can see the speed of such code.
Obviously, it has an extra copy to current_game_state, so I remove that like in B. And perft times rise by 2 seoncds. How is it even possible that eliminating an extra copy is actually SLOWER than leaving it in?

hgm · Post by **hgm** » Fri Apr 10, 2020 7:29 am

My experience is that you cannot really optimize programs this way (i.e. make very small changes and see how they affect execution time) on these modern multi-scalar out-of-order CPUs. Execution times of instructions just depend too much on each other. E.g. on the order in which instructions appear, at what absolute address they appear. When I was writing Qperft, at some point some of its code had become unreachable: it was in an if-statement with a condition that could not possibly be fulfilled anymore due to other changes. But when I deleted that code, the time for perft(6) became appreciably slower (some 10-20%, IIRC). Very much like what you are reporting here. Apparently deleting the unused code moved other code to a different address, which had a large impact on its execution speed. Or perhaps insertion of an always-taken conditional branch resolved a collision between branch patterns that otherwise would have confused the global branch predictor. So many obscure things are going on in these CPUs that it is impossible to know.

So the way I approach it now is that I just use the code that I know will require the least effort from the CPU, in the hope that on average (i.e. averaged over the many different future versions of the program) the speed difference between a cheap and an expensive group of instructions will be in favor of the cheap group, even to for individual cases it can sometimes be the reverse.

mar · Post by **mar** » Fri Apr 10, 2020 8:48 am

hgm wrote: ↑Fri Apr 10, 2020 7:29 am My experience is that you cannot really optimize programs this way (i.e. make very small changes and see how they affect execution time) on these modern multi-scalar out-of-order CPUs. Execution times of instructions just depend too much on each other. E.g. on the order in which instructions appear, at what absolute address they appear. When I was writing Qperft, at some point some of its code had become unreachable: it was in an if-statement with a condition that could not possibly be fulfilled anymore due to other changes. But when I deleted that code, the time for perft(6) became appreciably slower (some 10-20%, IIRC). Very much like what you are reporting here. Apparently deleting the unused code moved other code to a different address, which had a large impact on its execution speed. Or perhaps insertion of an always-taken conditional branch resolved a collision between branch patterns that otherwise would have confused the global branch predictor. So many obscure things are going on in these CPUs that it is impossible to know.

I have a very similar experience, recently I found a strange regression (40%) in a switch (bytecode VM), reordering one VM opcode (=case inside switch) and the problem was gone. This happened on Ryzen, on Intel no such problem was present. So I suspect either icache/uop cache (not sure how this one really works) or branch predictor, but since it was an indirect jump based on a register that got resolved just before the jump, I don't think that branch predictor was involved.
What puzzles me that the switch fits in 3 4k pages, so that should fit comfortably into L1i, I don't know. These things can happen and there's nothing to do about it, I'm afraid.

Terje · Post by **Terje** » Fri Apr 10, 2020 12:46 pm

The to^8 trick tested well in Weiss but not in Ethereal (likely due to factors discussed by hgm), which is why I was curious about what your profiler would say (and the actual performance).

If you want to see Weiss speed without the eval/move order code you can comment out the lines marked as 'update material', 'update phase', and 'update various piece lists' (this comment needs updating lol), in Add/Clear/MovePiece functions in makemove.c and line 67-75 in movegen.c. On my machine perft 7 drops from 104 to 88 seconds, a ~15% speedup.

I too only count leaf nodes so my perft output is a bit misleading. I'll update this to properly specify that it means leaves not all nodes, as well as leaves/s.

Nice speedup again!

leanchess · Post by **leanchess** » Fri Apr 10, 2020 1:30 pm

I specifically optimised my perft implementation and got it to half those times (albeit with no castling and incomplete promotion implementation). I'm wondering if this has to do with your language of choice. The reason I'm asking is that I'm considering switching to Rust as you advertised it as capable of addressing 128-bit registers, which could be useful in implementing that idea of mine.

Terje · Post by **Terje** » Fri Apr 10, 2020 1:49 pm

leanchess wrote: ↑Fri Apr 10, 2020 1:30 pm I specifically optimised my perft implementation and got it to half those times (albeit with no castling and incomplete promotion implementation). I'm wondering if this has to do with your language of choice. The reason I'm asking is that I'm considering switching to Rust as you advertised it as capable of addressing 128-bit registers, which could be useful in implementing that idea of mine.

If you used hash tables / bulk counting then it isn't comparable to the times we are posting here. Even just given the incomplete implementation it's already pretty borderline.

leanchess · Post by **leanchess** » Fri Apr 10, 2020 2:18 pm

Terje wrote: ↑Fri Apr 10, 2020 1:49 pm If you used hash tables / bulk counting then it isn't comparable to the times we are posting here.

I used no such tricks (or multithreading, in case you're wondering). Much of it is due to some clang magic that I find impossible to explain (e.g. trying to manually inline a function yielded the opposite of what I had expected).

Even just given the incomplete implementation it's already pretty borderline.

I agree. I wasn't bragging (after all, qperft is still about x2 faster on my machine), just trying to provide a baseline.

mvanthoor · Post by **mvanthoor** » Fri Apr 10, 2020 4:35 pm

Terje wrote: ↑Fri Apr 10, 2020 12:46 pm The to^8 trick tested well in Weiss but not in Ethereal (likely due to factors discussed by hgm), which is why I was curious about what your profiler would say (and the actual performance).

It's not a dramatic difference, but "to ^ 8" is measurably slower than "if us == WHITE {to - 8} else {to + 8}" construction.

In another thread I posted that calculating the pawn attacks was faster than getting them from the attack-table.

Normally, from that call, you get the squares they attack, so get_pawn_attacks[WHITE][E2] gives you d3 and f3 in a bitboard. For the square_attacked() function, I needed the squares the attacking pawns are ON, not where they are attacking TO. If us == WHITE and attacker == BLACK, and you want to know which squares are attacked by black, you can't just do get_pawn_attacks[attacker][square] as you do with other pieces, because in case of E2, a black pawn on E2 would give you D1 and F1; in this case, you want to have the square where the black panws ARE, not what they can attack.

The trick is to flip the attacker, and get the pawns from the other side: get_pawn_attacks[attacker ^ 1][E2], which gives you D3 and F3, which are indeed the squares where an opponent pawn must be to attack E2.

As this lookup has "attacker ^ 1" in it, and "to ^ 1" is also much slower than a massive statement, I'm starting to believe the XOR operation is slow; at least in Rust. (I'll see if ! works in this context; as in, !attacker. Could be that this is only for bools though, and side is a u8).

If you want to see Weiss speed without the eval/move order code you can comment out the lines marked as 'update material', 'update phase', and 'update various piece lists' (this comment needs updating lol), in Add/Clear/MovePiece functions in makemove.c and line 67-75 in movegen.c. On my machine perft 7 drops from 104 to 88 seconds, a ~15% speedup.

I too only count leaf nodes so my perft output is a bit misleading. I'll update this to properly specify that it means leaves not all nodes, as well as leaves/s.

Don't need to do that; I also calculate hashes, piece-table updates and material updates.

Nice speedup again!

Thanks

I've switched to the GNU toolchain to test this again. It seems that HGM is right on this; suddenly, the GNU toolchain became much slower than MSVC. Now they are almost equal again, with GNU pulling ahead a little bit.

Rustic's perft 7 from starting position on this system is now 86.xx or 87.xx seconds:

Code: Select all

MSVC toolchain:
===============
Benchmarking perft 1-7:

8   r n b q k b n r
7   i i i i i i i i
6   . . . . . . . .
5   . . . . . . . .
4   . . . . . . . .
3   . . . . . . . .
2   I I I I I I I I
1   R N B Q K B N R

    A B C D E F G H

Zobrist key:        819aa694337673fb
Active Color:       White
Castling:           KQkq
En Passant:         -
Half-move clock:    0
Full-move number:   1

Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (5 ms, 39456200 leaves/sec)
Perft 5: 4865609 (139 ms, 35004381 leaves/sec)
Perft 6: 119060324 (3340 ms, 35646803 leaves/sec)
Perft 7: 3195901860 (87941 ms, 36341431 leaves/sec)
Total time spent: 91425 ms
Execution speed: 36314294 leaves/second

GNU toolchain:
==============
Benchmarking perft 1-7:

8   r n b q k b n r
7   i i i i i i i i
6   . . . . . . . .
5   . . . . . . . .
4   . . . . . . . .
3   . . . . . . . .
2   I I I I I I I I
1   R N B Q K B N R

    A B C D E F G H

Zobrist key:        819aa694337673fb
Active Color:       White
Castling:           KQkq
En Passant:         -
Half-move clock:    0
Full-move number:   1

Perft 1: 20 (0 ms, inf leaves/sec)
Perft 2: 400 (0 ms, inf leaves/sec)
Perft 3: 8902 (0 ms, inf leaves/sec)
Perft 4: 197281 (5 ms, 39456200 leaves/sec)
Perft 5: 4865609 (133 ms, 36583526 leaves/sec)
Perft 6: 119060324 (3272 ms, 36387629 leaves/sec)
Perft 7: 3195901860 (86980 ms, 36742950 leaves/sec)
Total time spent: 90390 ms
Execution speed: 36730107 leaves/second

For my feeling, I've now "proven" that Rust can be as fast as C, even in non-trivial projects like mini-benchmarks, but it does require a bit of special work. I have two arrays backed by a counter, in a struct, which implement the normal vector methods such as push, pop, clear, len, etc, just because it's faster than even a fixed length vector. Also, they are created with non-initialized memory... so with some heart-ache, I had to use a line of unsafe code:

Code: Select all

list: unsafe { mem::MaybeUninit::uninit().assume_init() },

(Where list is an array of 256 Move elements.)

This is basically the equivalent of C's malloc(); it creates a block of memory without initializing. The ".assume_init()" makes sure Rust drops the memory as soon as "list" (or in this case, the struct where "list" is in) goes out of scope. The memory is full with garbage, and it's my responsibility to fill it with usable data, just likein C. So, if I'd try to get "list[0]" without ever putting something in the memory block first, I get some random number.

In this case, it doesn't matter; I create the MoveList struct, and then immediately fill it with a call to gen_moves(). The struct's counter points to the first element that is still free. (Exactly the way it'd be done in C.)

The move list pool did work as well, and was as fast as this, but it required more code, and it's "not standard" with regard to chess programs. This is more understandable.

As I can't optimize any further right now (every change I make, degrades performance), I'm going to make a generic out of those two arrays to avoid code duplication, and call it quits on the speed optimizations. I'll start working on the search and UCI-protocol this weekend.

if statement and calculation faster than constant

if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant

Re: if statement and calculation faster than constant