Perft function too slow because of vcruntime140.dll

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Perft function too slow because of vcruntime140.dll

Post by mvanthoor »

Hi :)

I've been testing my engine's functionality by running some of the tests from testsuite.epd. Everything seems to check out with regard to correct moves being found, but I find the speed to be very disappointing.

Earlier this evening it took 10.8 seconds to run perft 6 from the starting position; the reason being that I was using a Vector to store my move list. The allocation of this vector in the perft function took a huge amount of time. I've removed it and replaced it with an array, dropping the perft time from 10.8 to 8.4 seconds.

8.4 seconds is still slow.

I switched my Rust toolchain from windows-gnu to windows-msvc (microsoft), to make sure it wasn't a problem with gnu-part of the toolchain. The Rust compiler uses LLVM to compile (just like CLANG does), and then links the files using either GNU's or Microsoft's linker. (Rust doesn't have its own linker.)

No difference; when using the MSVC toolchain, the perft 6 run from the starting position still took 8.4 seconds.

Then I installed Intel VTune to profile the program (when linked with MSVC; VTune under Windows doesn't seem to understand the GNU debug symbols so all functions are mangled.)

See the attached screenshot. (edit: can't attach anything...)

Short explanation then:

Some function in the MSVC runtime is taking half the time of this run, or 4.2 seconds. When clicking through the results and opening the source code in the profiler, the next most used functions are the make/unmake functions. They take the most amount of time, followed by square_attacked. (After the MSVC function...) High usage of make_move, unmake_move and square_attacked() is to be expected of course. (I still find it strange that make_move() doesn't make an appearance, but perft() does; when clicking the result for perft(), it highlights the make_move() line as the one used most.)

The only thing I can imagine that is taking so much time is the memory allocation of the array in the perft function:

Code: Select all

let mut move_list = MoveList::new();
The "new()" function is just an (inlined) associated function which returns a struct holding the array and a count variable for the index. The array is 256 elements long (0-255).

Does anyone have any idea how I can speed up this perft function by bringing down the time spent in code that is not mine? If half of the time is going to be spent in MSVC libraries (in particular that one function shown in the VTune screenshot), this is going to become one slow engine.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Perft function too slow because of vcruntime140.dll

Post by jdart »

The general rule is, don't do dynamic memory allocation (on the heap) for anything. Especially not anything that might be time-critical.

--Jon
JohnWoe
Posts: 491
Joined: Sat Mar 02, 2013 11:31 pm

Re: Perft function too slow because of vcruntime140.dll

Post by JohnWoe »

clang has never been fast for Sapeli on my Linux boxes. Something is slowing down "clang -> LLVM -> asm -> elf" at least for Sapeli.

gcc has 7x faster perft compared to clang.

clang + Ofast + strip

Code: Select all

./sapeli -bench
...
= nodes 26975836 mnps 0.899 time 30.001

./sapeli -perft 6
= nodes 124132537 mnps 2.802 time 44.301
clang + O2 + strip

Code: Select all

./sapeli -bench
...
= nodes 24047831 mnps 0.801 time 30.006

./sapeli -perft 6
= nodes 124132537 mnps 2.500 time 49.654
gcc + Ofast + strip

Code: Select all

./sapeli -bench
...
= nodes 121353986 mnps 4.045 time 30.000

./sapeli -perft 6
= nodes 124132537 mnps 19.795 time 6.271
abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Perft function too slow because of vcruntime140.dll

Post by abulmo2 »

JohnWoe wrote: Tue Mar 24, 2020 9:21 am clang has never been fast for Sapeli on my Linux boxes.
On my ryzen 1700x, running fedora 31, Sapeli 1.76 is faster when compiled with clang.

With gcc 9.2.1 and build-profile from your Makefile:

Code: Select all

$ make build-profile
gcc -Wall -pedantic -Wextra -Wshadow -Ofast -march=native -DNDEBUG -fprofile-generate -lgcov Sapeli.c
...
gcc -Wall -pedantic -Wextra -Wshadow -Ofast -march=native -DNDEBUG Sapeli.c -o sapeli -fprofile-use

$ sapeli -perft 6
= nodes 124132537 mnps 35.225 time 3.524

$ sapeli bench
= nodes 246701578 mnps 8.223 time 30.000
with clang 9.0.1

Code: Select all

clang -O3 -flto -march=native Sapeli.c -o sapeli

$sapeli -perft 6
= nodes 124132537 mnps 37.446 time 3.315

$ sapeli bench
= nodes 280247079 mnps 9.342 time 30.000
Richard Delorme
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Perft function too slow because of vcruntime140.dll

Post by mvanthoor »

Thanks for the feedback guys :)

@JohnWoe: With Rust, I don't have a choice with regard to the compiler. "rustc" is the Rust compiler. It's a front-end to LLVM, just like "clang" is. Basically, rustc/LLVM produces code very similar to clang/LLVM. After compilation, there are two possibilities to link the code: msvc (link.exe) or gnu (ld.exe). The difference is that the first option creates an executable that looks just like as if it was compiled and linked with MSVC, and the second option creates a file that looks as if it was compiled with MinGW/gcc.

Unfortunate enough, Intel VTUNE is unable to read the debug symbols in the compile made by the gnu toolchain, even though the manual says it should be able to. (But, the manual talks about MinGW/GCC; maybe Rust is not 100% compatible in that regard.)

I've also resolved the issue. Instead of allocating a new move list array in the perft function, I tried allocating the array outside of the function and then passing it by reference; which obviously doesn't work because each instance of the recursive perft function needs its own move list. (Don't program too late in the evening...)

So I just created a wrapper around my move list called "MoveListPool", which instantiates 128 move lists before the perft function begins. I then pass this pool to the perft function by reference, so each instance (depth) of the perft function can have its own list.

This works: perft speed picked up to around +/- 30 million moves/sec. (That is, counting a correct leaf node as "a move".)

There are no optimizations: full make/unmake for all leaves to depth 0. No bulk counting because this is a pseudo-legal move generator. No hash. Single thread.

Perft 6 from the starting position dropped to 4.2 seconds. Of those 4.2 seconds, only in the order of 0.0xxx seconds are spent in code that is not that of the engine. The most used function, in this order, are make_move, unmake_move, and square_attacked, which is to be expected.

(PS: I also looked at FabChess' perft function, because that engine is written in Rust as welll... and it uses the exact same trick and almost the same names. I'm calling this MoveListPool, FabChess calls it ReservedMoveList, and we both have MAX_DEPTH_SEARCH for the number of lists.)

Speed progression of the perft function was as follows (where a "move" is one of the leaf nodes):

- Initial working version: finds 10 million moves/sec. "Architecture usage" in the profiler: 50%.
- Drop the Vector, replace with array: finds 14 million moves/sec. "Architecture usage": 60%
- Create a move list pool, passed by reference: finds 28 million moves/sec. "Architecture usage": 85%.
- Compile using "target-cpu=skylake" (which is my CPU): finds 30 million moves/sec. "Architecture usage": ~90%.

In that last option, about 99.5% of all time is spent in the engine's functions instead of spending 50% in MSVC++ runtime library code as it was in the first version.

I think I'm going to create a test suite based on perftsuite.epd (built into the "extra" unit), and then leave it be, as I'm wanting to create a chess engine, not a perft tool. The only thing I'll be adding to perft much later is multi-threading, and future improvements will come from improving the move generator. That'll be improvements such as generating less illegal moves by detecting pinned pieces etc, and thus less make/unmake/square_attacked calls. If perft improves because the move generator improves, that's fine; but I'm not going to make changes to the move generator to cater to specific needs of the perft function. (Making a fully legal move generator so I can use bulk counting in perft is one of those.)

Starting to write the alpha/beta-search now...
Last edited by mvanthoor on Tue Mar 24, 2020 6:23 pm, edited 2 times in total.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
JohnWoe
Posts: 491
Joined: Sat Mar 02, 2013 11:31 pm

Re: Perft function too slow because of vcruntime140.dll

Post by JohnWoe »

abulmo2 wrote: Tue Mar 24, 2020 4:09 pm
JohnWoe wrote: Tue Mar 24, 2020 9:21 am clang has never been fast for Sapeli on my Linux boxes.
On my ryzen 1700x, running fedora 31, Sapeli 1.76 is faster when compiled with clang.

With gcc 9.2.1 and build-profile from your Makefile:

Code: Select all

$ make build-profile
gcc -Wall -pedantic -Wextra -Wshadow -Ofast -march=native -DNDEBUG -fprofile-generate -lgcov Sapeli.c
...
gcc -Wall -pedantic -Wextra -Wshadow -Ofast -march=native -DNDEBUG Sapeli.c -o sapeli -fprofile-use

$ sapeli -perft 6
= nodes 124132537 mnps 35.225 time 3.524

$ sapeli bench
= nodes 246701578 mnps 8.223 time 30.000
with clang 9.0.1

Code: Select all

clang -O3 -flto -march=native Sapeli.c -o sapeli

$sapeli -perft 6
= nodes 124132537 mnps 37.446 time 3.315

$ sapeli bench
= nodes 280247079 mnps 9.342 time 30.000
Thanks!
Good to see clang making faster binaries!
So nothing wrong with clang and LLVM.
I have been lately using clang to hunt down bugs in Sapeli. Very good compiler.
I need to upgrade my HW soon.
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Perft function too slow because of vcruntime140.dll

Post by Ras »

mvanthoor wrote: Tue Mar 24, 2020 6:08 pmSo I just created a wrapper around my move list called "MoveListPool", which instantiates 128 move lists before the perft function begins. I then pass this pool to the perft function by reference, so each instance (depth) of the perft function can have its own list.
Sounds like a hack which works around the Rust language. Also, if you ever go multi-threaded, that will go right out of the window - despite a language that was designed to avoid race conditions. You might as well go for unsafe, but clean code. Isn't it possible to have an array right on the stack with the boxed bracket notation? Although there was some issue that something doesn't work for static arrays of more than 32 elements.
Rasmus Althoff
https://www.ct800.net
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Perft function too slow because of vcruntime140.dll

Post by mvanthoor »

Ras wrote: Tue Mar 24, 2020 6:48 pm
mvanthoor wrote: Tue Mar 24, 2020 6:08 pmSo I just created a wrapper around my move list called "MoveListPool", which instantiates 128 move lists before the perft function begins. I then pass this pool to the perft function by reference, so each instance (depth) of the perft function can have its own list.
Sounds like a hack which works around the Rust language. Also, if you ever go multi-threaded, that will go right out of the window - despite a language that was designed to avoid race conditions. You might as well go for unsafe, but clean code. Isn't it possible to have an array right on the stack with the boxed bracket notation? Although there was some issue that something doesn't work for static arrays of more than 32 elements.
FabChess uses this same technique, and it's multi-threaded.

In Rust, a Box<T> is used to create things on the heap, that'd normally be created on the stack. An array (or any other type consisting of one or more primitive types which have size known at compile time) are created on the stack by default.

Creating a move list OUTSIDE the perft function and then referring it doesn't work because perft is recursive.

Normally it does this (simplified pseudo-code):

Code: Select all

create a move_list
for each move in move_list do
	if make_move is legal
		perft at depth - 1
...
So after a legal move, perft calls itself, to iterate over the moves that can be done AFTER the move it just made. Thus it needs a new move list. That is what "create move_list" is for. Passing a move list from outside doesn't work, because when the perft function returns from a call to itself, it needs the move list of the previous call.

That's why the pool of move lists works; it's basically a home-built stack of move lists.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Perft function too slow because of vcruntime140.dll

Post by Ras »

mvanthoor wrote: Tue Mar 24, 2020 7:42 pmIn Rust, a Box<T> is used to create things on the heap, that'd normally be created on the stack.
I meant the other notation, with the square brackets and right in the function, with type and static element count. I mean, a self-rolled, manually indexed stack is hardly a clean solution. That's actually a step backwards even from C.
Rasmus Althoff
https://www.ct800.net
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Perft function too slow because of vcruntime140.dll

Post by mvanthoor »

Ras wrote: Tue Mar 24, 2020 7:52 pm
mvanthoor wrote: Tue Mar 24, 2020 7:42 pmIn Rust, a Box<T> is used to create things on the heap, that'd normally be created on the stack.
I meant the other notation, with the square brackets and right in the function, with type and static element count. I mean, a self-rolled, manually indexed stack is hardly a clean solution. That's actually a step backwards even from C.
You mean an array?

Code: Select all

let mut list = [Move; 256]
That is an array of Moves, 256 elements long, and if you do this, it'll be created in the function, at the beginning of the instance.

That is what I started out with, and assigning new memory over and over again in each instance of the function is brutally slow.

This is the C equivalent:

Code: Select all

Move list [256]
(Which will also assign 256 moves on the stack.)

The memory assignment is just very slow.

Maybe I shouldn't have called my structure a stack. It's just 128 move lists, all assigned before the perft function starts. I could also have used an array of 128 * 256 moves, pass that to perft, and then use an offset of depth * 128 to address it.

The point is, each perft instance needs its own space to put the generated moves, and you *don't* want to assign that space in the perft function itself. It takes too long.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL