Perft speed optimization (renamed)

JohnWoe · Post by **JohnWoe** » Thu Apr 09, 2020 10:18 pm

mvanthoor wrote: ↑Tue Apr 07, 2020 1:05 pm
abulmo2 wrote: ↑Tue Apr 07, 2020 10:30 am IMHO, the reference program for perft without bulk counting nor hash table should be crafty. In my computer, it does perft 7 in 64.78s vs 3m30s for minic 2.00.
Crafty runs Perft 7 from the starting position in 55 seconds on my system. That's about half the time of Rustic. I assume Crafty has a more advanced move generator. It probably checks for being in check, double check, pinned pieces, etc... i.e., generating less illegal moves, and thus saving a lot of time making and then unmaking illegal moves.

Rustic doesn't do this (yet), so when the king is in check or if pieces are pinned, it still generates huge amounts of illegal moves. If you're in double check, all moves except king moves are illegal. These are optimizations that are going to be done later. First finish the search and uci protocol (and making my Rust code more idiomatic )

This evening I'll post some outputs.

@Terje:

In my case, Rustic is now slower when using the GNU toolchain for Rust. In the beginning when I started perft testing, it didn't matter, but for some reason (there were several Rust updates for example, since I neglected updating for some time), a GNU-toolchain compile now takes 120 seconds for Perft 7 from the starting position, while an MSVC compile takes 114 seconds. As the rustc compiler is the same in both toolchains, it seems MSVC's "link.exe" is more efficient in doing link-time optimization than GNU's ld.exe, on Windows.

Next to that, I probably also lost some speed, due to:
- Fixing two zobrist hash bugs (so there are now more zobrist XOR commands)
- Splitting a very long and ugly make_move() into several functions. To make this possible, it was necessary to take out a reference that went directly into the Board-struct. (Rust doesn't allow you to take mutliple mutable references into the same entity at the same time.) Therefore I now have to decide each time (in the new function) if I need to use the black or white pieces bitboard when setting or removing a piece. That check that's now done a huge amount of times as compared to just checking once and taking the pointer, will probably be responsible for some of the speed loss.

In this case, it's a speed vs. code readability/maintainability issue. The above two points lost about a second (113.2 vs 114.3) in perft 7 from the starting position, which at this point, is a speed loss of about 1%. I can live with that, getting much clearner code and better maintainability in return. In the end, the optimizations to the move generator itself will have a much bigger impact than individual statements in a function.

Also: was it difficult to switch from Fancy Magic bitboards to PEXT bitboards (didn't look into them yet), and was it worth it? Do you do one or more optimizations in the move generator? As Crafty runs slower on your system than it does on mine, I assume you have slower CPU; but still, on that slower CPU, Weiss is 10 seconds faster than Rustic is on my CPU.

I wrote a Chess960 movegenerator tool called LastEmperor some time ago. It has exactly the same movegenerator as in Sapeli. But score and other game playing stuff is cleaned out. It uses hash+bulk counting: I don't even incrementally update hash.

Startpos perft 8:

Code: Select all

lastemperor -hash 1024 -perft 8
[ rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1 ]
Depth               Nodes        Mnps        Time
    0                   1       0.000       0.000
    1                  20       0.000       0.000
    2                 400       0.000       0.000
    3               8,902       8.902       0.001
    4             197,281      10.383       0.019
    5           4,865,609      19.619       0.248
    6         119,060,324      42.041       2.832
    7       3,195,901,860      97.998      32.612
    8      84,998,978,956     188.695     450.456
=================================================
                    Nodes        Mnps        Time
Total      88,319,013,353     181.664     486.168

Kiwipete:

Code: Select all

lastemperor -fen "r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10 " -hash 1024 -perft 6
[ r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1 w - - 0 10  ]
Depth               Nodes        Mnps        Time
    0                   1       0.000       0.000
    1                  46       0.000       0.000
    2               2,079       2.079       0.001
    3              89,890      22.472       0.004
    4           3,894,594      21.758       0.179
    5         164,075,551      58.452       2.807
    6       6,923,051,137     110.546      62.626
=================================================
                    Nodes        Mnps        Time
Total       7,091,113,298     108.068      65.617

Some test too:

Code: Select all

lastemperor -fen "rnbq1k1r/pp1Pbppp/2p5/8/2B5/8/PPP1NnPP/RNBQK2R w KQ - 1 8  " -hash 1024 -perft 6
[ rnbq1k1r/pp1Pbppp/2p5/8/2B5/8/PPP1NnPP/RNBQK2R w KQ - 1 8   ]
Depth               Nodes        Mnps        Time
    0                   1       0.000       0.000
    1                  44       0.000       0.000
    2               1,486       0.000       0.000
    3              62,379      12.476       0.005
    4           2,103,487      17.529       0.120
    5          89,941,194      41.409       2.172
    6       3,048,196,529      68.593      44.439
=================================================
                    Nodes        Mnps        Time
Total       3,140,305,120      67.192      46.736

mvanthoor · Post by **mvanthoor** » Thu Apr 09, 2020 11:18 pm

That is fast

But... in this thread, not really relevant unfortunately.

We / I'm trying to get move generation, make_move, unmake_move and square_attacked to run as fast as possible, using a pseudo-legal move generator, and no tricks like hash or bulk-counting in perft. Also, full hash and piece-list updates and such, because make/unmake use the normal board routines that are also used when playing.

Terje · Post by **Terje** » Mon Apr 20, 2020 1:10 pm

Terje wrote: ↑Tue Apr 07, 2020 12:33 pm Had a go. Here Crafty took 68 sec, Weiss 104 sec, and Minic 154 sec (minic_2.05_mingw_x64_sse4.2.exe).

Went over the make/takemove code and dropped the time from 104 to 96 seconds

mvanthoor · Post by **mvanthoor** » Mon Apr 20, 2020 6:25 pm

Terje wrote: ↑Mon Apr 20, 2020 1:10 pm
Terje wrote: ↑Tue Apr 07, 2020 12:33 pm Had a go. Here Crafty took 68 sec, Weiss 104 sec, and Minic 154 sec (minic_2.05_mingw_x64_sse4.2.exe).
Went over the make/takemove code and dropped the time from 104 to 96 seconds

That's a 7-8% improvement. If it holds on my machine, your engine would run Perft at 82 seconds. Nice improvement. What did you do?

I don't know if I can match that. The only thing I could optimize would be to rip out the castling function and replace it with the castling perms array as Weiss does. I've tried it before, but it didn't improve speed back then, because after the engine castles, that function is never called again. For the rest, my make/unmake are basically carbon copies of yours. (Which isn't too strange; there aren't too many correct ways of doing make/unmake, and after chipping away at it for so long, both functions are bound to converge

)

I now have a weird problem. I wrote a (very simple) command line to assist in "playing" against the engine by putting in algebraic notation, and to run functions such as perft. As soon as I include the function "parse_input", which basically doesn't do anything but start functions based on the word I type, the engine becomes very slow. It is even slow when I don't start perft on the command line. For example:

- I start the engine. The prompt appears:

Rustic > _

- I type perft (enter), and the function starts. It takes 97-98 seoncds (!).

If I strip out the command line and call perft from main(), it takes the normal 88-89 seconds. Even if I put perft() *after* the call to the command line, it's slow. So:

- I start the engine and the prompt appears.
- I type "exit".
- The perft function in main() starts, and it takes 98 seconds.

When I replace parse_input() with perft() (so perft will always start, regardless of what I type), it runs in the normal 88-89 seconds. So even parse_input() just being there and used to parse "exit" makes the rest of the engine very slow after the command line function exits.

For now, the command line / text user interface is just a very basic infinite loop that exists as soon as I type "exit" (which yields 0, as a result, obviously.)

Code: Select all

while console::get_input(&mut board) != 0 {}

I'll have to look into parse_input() and the other functions it calls. If I fail to resolve it, I'll feature-gate the text UI. ("Feature gating" is Rust's version of conditional compilation, where you compile a program with the --features=["...", "..."] flag to include features/modules/functions etc in your program.)

Terje · Post by **Terje** » Mon Apr 20, 2020 7:08 pm

mvanthoor wrote: ↑Mon Apr 20, 2020 6:25 pm
Terje wrote: ↑Mon Apr 20, 2020 1:10 pm
Terje wrote: ↑Tue Apr 07, 2020 12:33 pm Had a go. Here Crafty took 68 sec, Weiss 104 sec, and Minic 154 sec (minic_2.05_mingw_x64_sse4.2.exe).
Went over the make/takemove code and dropped the time from 104 to 96 seconds

That's a 7-8% improvement. If it holds on my machine, your engine would run Perft at 82 seconds. Nice improvement. What did you do?

I made the add/clear/movepiece functions only update hash in MakeMove; in TakeMove the final hash is restored from history anyway so the updates were pointless. I dislike the added parameter, but I don't feel like adding 2 wrapper functions for each of them either.

I still used the SETBIT and CLRBIT macros from VICE, slightly changed, for updating the bitboards. I changed CLRBIT to use ^ instead of & (working under the assumption that it should only be used to unset a bit that is actually set). SETBIT used |. Changing SETBIT to also use ^ (working under the same assumption) resulted in less machine code (in MovePiece where the CLR and SETBITs were then combined by gcc) and faster perft. In the end I decided to get rid of the macros alltogether.

I converted a conditional increment to an unconditional += <condition>.

Details in commits 2-4: https://github.com/TerjeKir/weiss/pull/250

I'd love to hear if it holds on your machine

mvanthoor · Post by **mvanthoor** » Mon Apr 20, 2020 8:20 pm

Terje wrote: ↑Mon Apr 20, 2020 7:08 pm I made the add/clear/movepiece functions only update hash in MakeMove; in TakeMove the final hash is restored from history anyway so the updates were pointless. I dislike the added parameter, but I don't feel like adding 2 wrapper functions for each of them either.

I still used the SETBIT and CLRBIT macros from VICE, slightly changed, for updating the bitboards. I changed CLRBIT to use ^ instead of & (working under the assumption that it should only be used to unset a bit that is actually set). SETBIT used |. Changing SETBIT to also use ^ (working under the same assumption) resulted in less machine code (in MovePiece where the CLR and SETBITs were then combined by gcc) and faster perft. In the end I decided to get rid of the macros alltogether.

I converted a conditional increment to an unconditional += <condition>.

Details in commits 2-4: https://github.com/TerjeKir/weiss/pull/250

I'd love to hear if it holds on your machine

Somewhere in the coming days I'll compile Weiss again and see how it runs. I don't use macro's; I use functions that are inlined. (Probably; they're only one line long.) With regard to the hashing and other states: I don't reverse them. I store them in a history array and then restore them all at once. Therefore I have clear_piece, set_piece and move_piece in the board, but also locally in unmake; the ones in the board do all the incremental updates for the "game_state" struct, while the ones in unmake only reverse the move. The function restores the entire game_state struct at the end.

mvanthoor · Post by **mvanthoor** » Mon Apr 20, 2020 8:34 pm

Found the issue of the slow perft wen running through the text UI. It's indeed in the parse_move() function:

Slow:

Code: Select all

    match command {
        "quit" | "exit" => CMD_QUIT,
        "perft" => cmd_perft(board),
        "suite" => cmd_suite(),
        "clear" => cmd_clear(),
        "t" => cmd_take_move(board),
        _ => cmd_make_move(board, input),
    }

Fast:

Code: Select all

    if command == "perft" {
        cmd_perft(board);
    }

So when I call cmd_perft (which then just calls perft(), as each command is in its own function) from "match", the perft routine loses around 10% speed. When I call it from an if-statement, it runs normally. Do not ask me why. The only thing I can think of is that the processor is still partly 'hogged' or occupied with parts of the match statement or something.

I'll convert the match into a function pointer lookup table and see what the result is.

ps: I don't use the GUI for actual playing; I'm using it for testing the search function, will later use it to test the UCI functionality, etc... as said, I can easily put this behind a feature gate as to not have it compiled into the engine. At some point, I'll probably implement xboard as well, which can then also be included or excluded as a feature.

mvanthoor · Post by **mvanthoor** » Mon Apr 20, 2020 11:11 pm

I found nothing it seems

The problem isn't the match. The if-statement or function pointer solution doesn't work. The only reason that perft is slow when run from the text UI is because of the extra code that's in the executable for actually executing the move.

Code: Select all

    match &input[..] {
        "quit" | "exit" => CMD_QUIT,
        "perft" => cmd_perft(board),
        "suite" => cmd_suite(),
        "clear" => cmd_clear(),
        "t" => cmd_take_move(board),
        _ => cmd_make_move(board, input),  <== problem there
    }

When I remove cmd_make_move and replace it with CMD_CONTINUE (which is a constant that just makes the text ui ask for the next command), the compiler obviously drops all the code for parsing and making the move, and the Perft run is at normal speed. If the code for cmd_make_move is compiled into the executable, perft is 10% slower.

I can't believe that memory layout or whatever can have THAT big an effect. If that's true and a program can become 10% slower just because of code that's in there but not actually executed, then basically all optimizations except for unnecessary memory allocations become completely pointless. It seems to be the case however, because even if I execute perft from main(), it is slower when the code for cmd_make_move() is merely in the binary, but not executed.

Terje · Post by **Terje** » Mon Apr 20, 2020 11:23 pm

mvanthoor wrote: ↑Mon Apr 20, 2020 11:11 pm I found nothing it seems

The problem isn't the match. The if-statement or function pointer solution doesn't work. The only reason that perft is slow when run from the text UI is because of the extra code that's in the executable for actually executing the move.
Code: Select all
    match &input[..] {
        "quit" | "exit" => CMD_QUIT,
        "perft" => cmd_perft(board),
        "suite" => cmd_suite(),
        "clear" => cmd_clear(),
        "t" => cmd_take_move(board),
        _ => cmd_make_move(board, input),  <== problem there
    }
When I remove cmd_make_move and replace it with CMD_CONTINUE (which is a constant that just makes the text ui ask for the next command), the compiler obviously drops all the code for parsing and making the move, and the Perft run is at normal speed. If the code for cmd_make_move is compiled into the executable, perft is 10% slower.

I can't believe that memory layout or whatever can have THAT big an effect. If that's true and a program can become 10% slower just because of code that's in there but not actually executed, then basically all optimizations except for unnecessary memory allocations become completely pointless. It seems to be the case however, because even if I execute perft from main(), it is slower when the code for cmd_make_move() is merely in the binary, but not executed.

Does cmd_make_move use the same makemove function used by perft?

mvanthoor · Post by **mvanthoor** » Mon Apr 20, 2020 11:43 pm

Terje wrote: ↑Mon Apr 20, 2020 11:23 pm Does cmd_make_move use the same makemove function used by perft?

Yes.

If no command matches the input, it is assumed that the user has typed a move such as "e2e4" or "b7b8q", and cmd_make_move() is called.

cmd_make_move uses:
-- "parse_move()" to convert the algebraic move into square and promotion piece numbers. If the squares and promotion piece numbers are all valid, we have a possible move. (The input could be "d2g8r", which are all legal squares and a legal promotion piece, but is never a legal move.)
-- "try_move()" to actually try and play the move that comes out of parse_move(). It uses the same make() function as perft() does. If this make() function returns "true", the move is legal, and made. If it returns "false", the move was illegal and has already been reversed.

All of this code to parse and try to make a move is not used when I type "perft" (enter). Then cmd_perft() is executed, which just calls "perft()".

If I change nothing else but just take out cmd_make_move() (by replacing it with either Rust's "do nothing", or CMD_CONTINUE), perft runs at its normal speed after compilation. The difference is that if I take out cmd_make_move(), then the code in cmd_make_move(), parse_move(), and try_move() is not compiled into the exe. When I type "perft" to run perft, that code _is_ in the exe (but not called touched during the perft run).

I wonder if the CPU could go on a tangent, trying to "predict" all of the parsing and move making and such from the match statement, and is then completely choked when I finally decide to type "perft" and it has to go a certain way. I don't know.

Perft speed optimization (renamed)

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?

Re: Can't get Minic to run perft?