Perft speed optimization (renamed)

mvanthoor · Post by **mvanthoor** » Fri Oct 09, 2020 12:12 pm

LOL. This is becoming completely impossible to understand.

I have my engine, on Rust 1.46, running at speed X.
Then I removed incremental material update, and now the engine is obviously faster... so this is speed Y.

I compile engine Y with Rust 1.47 => the performance takes a hit. The engine becomes slower by about 4%.

Because removing the incremental material update was stupid (don't ask), I reverted that change.

I compile the engine with Rust 1.47, and expect it to be about 4% slower than speed X. Nope... now the engine on Rust 1.47 is about 2% FASTER than it was before on Rust 1.46.

Strange. How can the change from Rust 1.46 -> 1.47 be both slower AND faster with the same code base? It must be that it has something to do with the data layout in memory that was mentioned earlier. It seems that speed changes of up to 5% or thereabouts are just normal as you change either the program or the compiler. (Olivier noted the same thing in another topic: his engine got slower when he removed an unused function; it got faster after he added TWO dummy functions.)

Joost Buijs · Post by **Joost Buijs** » Fri Oct 09, 2020 12:29 pm

mvanthoor wrote: ↑Fri Oct 09, 2020 11:01 am I'm still miffed about the fact that a performance drop always seems to be in the cards when using Rust, because of changing LLVM versions. (AFAIK, they're now focusing on compilation speed and stabilizing features.)

The performance difference you get between different compilers has IMHO nothing to do with the code generation/optimization of the compilers as such. I see this things here all the time, especially with Perft, the things you see you see are for a large part cache-effects and alignment-effects.

Of course, when you optimize everything to the fullest with compiler A and compiler B behaves a bit differently, it could be that you have to start all over again to get the same performance back, this is why it is a complete waste of time to try squeezing a few percent extra performance out of Perft. Perft is a small function and runs largely from the cache, in a full fledged engine things will behave differently anyway.

There is also the difference between processors, when you optimize everything for an Intel processor it is usually not the best for an AMD and vice versa. Optimization is a never ending story.

mvanthoor · Post by **mvanthoor** » Fri Oct 09, 2020 12:44 pm

Joost Buijs wrote: ↑Fri Oct 09, 2020 12:29 pm
mvanthoor wrote: ↑Fri Oct 09, 2020 11:01 am I'm still miffed about the fact that a performance drop always seems to be in the cards when using Rust, because of changing LLVM versions. (AFAIK, they're now focusing on compilation speed and stabilizing features.)
The performance difference you get between different compilers has IMHO nothing to do with the code generation/optimization of the compilers as such. I see this things here all the time, especially with Perft, the things you see you see are for a large part cache-effects and alignment-effects.

Of course, when you optimize everything to the fullest with compiler A and compiler B behaves a bit differently, it could be that you have to start all over again to get the same performance back, this is why it is a complete waste of time to try squeezing a few percent extra performance out of Perft. Perft is a small function and runs largely from the cache, in a full fledged engine things will behave differently anyway.

There is also the difference between processors, when you optimize everything for an Intel processor it is usually not the best for an AMD and vice versa. Optimization is a never ending story.

Yes, I know; I've learned this lesson in the meantime. Therefore I'm not trying to micro-optimize anymore; the only optimizations I'm (still) doing is trying to cut out unnecessary code, try to avoid unneeded memory allocations, or use better algorithms. I'm not shifting and moving variables anymore.

Still, it irritates me if a newer compiler makes a program slower than it was with the old version

BlueStar · Post by **BlueStar** » Fri Oct 09, 2020 11:18 pm

Ok, now I understand partly why the PERFT in NADYA2.0 is so much slower than others'. My move generator is a 100% legal move generator. When I started my engine I decided not to examine any of the open source engine source code (although I spend time reading and researching for each new feature, with is why I have a TT, heuristics, killers, etc... Now I'm reading about psuedo moves, vs legal moves. My engine is close to running under XBoard, so I am pressing forward with my November deadline. NADYA3.0 is going to be so much faster...

mvanthoor · Post by **mvanthoor** » Sat Oct 10, 2020 1:58 am

BlueStar wrote: ↑Fri Oct 09, 2020 11:18 pm Ok, now I understand partly why the PERFT in NADYA2.0 is so much slower than others'. My move generator is a 100% legal move generator. When I started my engine I decided not to examine any of the open source engine source code (although I spend time reading and researching for each new feature, with is why I have a TT, heuristics, killers, etc... Now I'm reading about psuedo moves, vs legal moves. My engine is close to running under XBoard, so I am pressing forward with my November deadline. NADYA3.0 is going to be so much faster...

Hi,

Have you never released Nadya 2.0? I'm now in the process of finishing the search function for Rustic 1. (And apart from the search, basic move ordering and PSQT's, it won't have any other functionality... so it won't be strong yet. But I'm curious to see what the rating of such a basic engine is. Then I'll start adding functionality... with the TT as one of the first ones.)

I think you got this wrong. If you don't do any tricks such as bulk-counting, it doesn't matter if your engine uses a legal or pseudo-legal move generator. The reason is that in perft, you have to make ALL moves, and you have to check them ALL for legality. It doesn't matter if you do this in the move generator, or in make_move().

When playing chess however, the pseudo-legal move generator is faster. The reason is that you just generate all possibly legal moves, and check them when they are made, during the search. If your move ordering is good, many moves will NEVER be made due to cutoffs in the tree. Thus, they don't need to be checked for legality. If you check all moves for legality during move generation, you are also checking moves that possibly will never be considered by the search, and thus you waste time.

The next step (in the chess engine, not for perft) would be staged move generation: generate NO moves in your search, until you actually need them. First try all the moves such as PV, hash, killers, history heuristic moves to see if they generate a cutoff. If they all don't, only then start generating moves. First, only the captures. Then the quiet moves.

PS: If you have a legal move generator, you can make perft much faster by not going all the way to the leaves; you can stop at depth = 1, generate all the moves, count the length of the move list, and add that to your outcome. You don't have to actually make the move (which would put you at the actual leaf node, or depth = 0) to check for legality; you KNOW it's legal because you have a legal move generator. This is a perft trick though; for the chess engine itself, it's actually worse as described above.

===

I've just added incremental PSQT evaluation to Rustic, so the influence of the PSQT tables doesn't have to be calculated from scratch during the evaluation. As only one piece moves, you need to do at most three things: substract the PSQT value for a piece/square for captured pieces, substract the value for a piece/square for the square a piece leaves, and add it for the square where the piece ends up. That saves lots of time in the evaluation function later on, because you don't need to check up to 32 values.

However, keeping this value incrementally necessitates adding the calculation to the functions that actually move pieces, and thus perft will become slower. Perft performance dropped by 10-12%. That doesn't matter though, because during play, the evaluation function is the one that takes the moest time, and incrementally keeping evaluation terms like this gains a huge speedup there. (As said: make 3 additions instead of up to 32; so it cuts down PSQT evaluation time by about 90%.)

===

Maybe at some point I'm going to split off RustPerft from Rustic, where I remove everything that is not needed for perft, and then add all the tricks such as bulk counting, hashing, and multi-threading to RustPerft to see how fast I can make it. If Rustic itself gains a speedup of the move generator (such as check evasion etc), i can then always backport it.

BlueStar · Post by **BlueStar** » Sat Oct 10, 2020 3:00 am

mvanthoor wrote: ↑Sat Oct 10, 2020 1:58 am Have you never released Nadya 2.0? I'm now in the process of finishing the search function for Rustic 1. (And apart from the search, basic move ordering and PSQT's, it won't have any other functionality... so it won't be strong yet. But I'm curious to see what the rating of such a basic engine is. Then I'll start adding functionality... with the TT as one of the first ones.)

Not yet. I've been targeting November since I started. I have to finish my Xboard interface, opening book (my code), and end game database, so it will be close. However, NADYA2.0 clobbers me every time I play it without those features. Nadya also plays completely standalone with a pretty console mode.

It kind of cool, you and I have approached the same problem from two opposite directions. You focused on speed and optimization, when I focused on improving game play. My next direction is to optimize, and yours is to improve game play. It would be funny if after I optimize, and you implement the TT, etc, we ended up in the same place.

mvanthoor wrote: ↑Sat Oct 10, 2020 1:58 am I think you got this wrong. If you don't do any tricks such as bulk-counting, it doesn't matter if your engine uses a legal or pseudo-legal move generator. The reason is that in perft, you have to make ALL moves, and you have to check them ALL for legality. It doesn't matter if you do this in the move generator, or in make_move().

I had bulk-counting, and ripped it back out as my search program was kind of birth out of my perft, and I felt bulk-counting was kind of a gimic. I guess when I'm talking speed and NADYA, I'm more referring to the move generator. I really don't care about "perft" speed per say as long as it is "perfect", as I do the move generator, which does not use things like pseudo moves.

mvanthoor wrote: ↑Sat Oct 10, 2020 1:58 am The next step (in the chess engine, not for perft) would be staged move generation: generate NO moves in your search, until you actually need them. First try all the moves such as PV, hash, killers, history heuristic moves to see if they generate a cutoff. If they all don't, only then start generating moves. First, only the captures. Then the quiet moves.

My engine already does this, and has a customized MVVLVA move sorter.

mvanthoor wrote: ↑Sat Oct 10, 2020 1:58 am PS: If you have a legal move generator, you can make perft much faster by not going all the way to the leaves; you can stop at depth = 1, generate all the moves, count the length of the move list, and add that to your outcome. You don't have to actually make the move (which would put you at the actual leaf node, or depth = 0) to check for legality; you KNOW it's legal because you have a legal move generator. This is a perft trick though; for the chess engine itself, it's actually worse as described above.

My perft makes the leaf call to increment a counter (rather than bulk count the size of the move list) but does not in this case evaluate the board at the leaf, precisely because my move generator only produces legal moves. I've keep this to keep my perft function as close to my search code. For at least my first release, I've wanted to try and keep my perft/search functions as close in structure as possible, so I know that if my perft is perfect, my search will be as well. It will be NADYA3.0 that starts to make bigger changes.

Sounds like we are still so close! I know you are planning on supporting UCI, I'll have XBoard done soon, when are you planning your first release?

Cheers,
CHoibakk

mvanthoor · Post by **mvanthoor** » Sat Oct 10, 2020 3:44 am

BlueStar wrote: ↑Sat Oct 10, 2020 3:00 am Not yet. I've been targeting November since I started. I have to finish my Xboard interface, opening book (my code), and end game database, so it will be close. However, NADYA2.0 clobbers me every time I play it without those features. Nadya also plays completely standalone with a pretty console mode.

It kind of cool, you and I have approached the same problem from two opposite directions. You focused on speed and optimization, when I focused on improving game play. My next direction is to optimize, and yours is to improve game play. It would be funny if after I optimize, and you implement the TT, etc, we ended up in the same place.

There's a good chance we will indeed end up in roughly the same place, assuming the Prolog compiler creates code as fast as the Rust compiler.

I worked as a software engineer in industrial engineering positions for quite some time. If you've been writing software for machinery that have "can package 600 jars per minute", or "seals 5000 pots per hour" as a selling feature, you understand that I can't live with writing slow software or use slow programming languages. My chess engine may not (yet) be the fastest around, but I try.

After implementing the PSQT's, I calculated them each time during eval. I could have been satisfied with that, but I then made them incremental. That gained a lot of speed during eval, but dropped make and unmake performance by about 12%. (Obviously, keeping incremental values during the making and unmaking of moves makes those functions slower.) I could have again left it at that, but by moving the incremental updates into the game_state struct (so I can just copy the values back during unmake instead of calculating them backward), the speed loss in make/unmake (and thus Perft) has been mitigated to 6%.

Even though this iterative way of development slows down progression speed by a lot, I'm of the opinion that this is the only way to keep a chess engine "in check". There's too much stuff going on to just code haphazardly, and one mistake somewhere can make the whole thing blow up in your face. Wrong moves, crashes, or the speed losses here and there slow the engine down to a crawl.

I'm -convinced- that the work I put in now, will earn itself back twice, triple, or more, with the time I do NOT have to debug some sort of weird stuff down the line, and NOT having to go back later, to optimize features I put in two years ago. I try to write code in such a way that the only reason to touch it is to add new features.

Sounds like we are still so close! I know you are planning on supporting UCI, I'll have XBoard done soon, when are you planning your first release?

Actually, I intend to support both UCI and XBoard (but UCI first), and a console mode, by implementing three different communication modules. The console is already able to play a game, if I had a search.

I don't really have a release date planned, but it is in the near future. Somewhere next month should be possible, if nothing strange happens. The things left to do are:

- Alpha/Beta
- Quiescence search

(For the above two, the skeleton functions already exist, and are hooked into the thread/communication system.)

- MVV-LVA move sorting
- (Simple) time management
- Enough UCI to connect to a GUI and play games (shouldn't be too much of a problem)
- Some testing on my own computer.

And maybe look into writing a script to produce 37 different executables for different CPU's (and do a profile-guided compile).

BlueStar · Post by **BlueStar** » Sat Oct 10, 2020 5:13 am

mvanthoor wrote: ↑Sat Oct 10, 2020 3:44 am Even though this iterative way of development slows down progression speed by a lot, I'm of the opinion that this is the only way to keep a chess engine "in check". There's too much stuff going on to just code haphazardly, and one mistake somewhere can make the whole thing blow up in your face. Wrong moves, crashes, or the speed losses here and there slow the engine down to a crawl.

I'm -convinced- that the work I put in now, will earn itself back twice, triple, or more, with the time I do NOT have to debug some sort of weird stuff down the line, and NOT having to go back later, to optimize features I put in two years ago. I try to write code in such a way that the only reason to touch it is to add new features.

I completely agree with this, just from a different perspective since I'm working with a logic language rather than procedural language.

In RL with health issues, even though I'm slowing down, I work with NN's, and interpreting image data real time, so I understand performance issues (I don't use Prolog for that

). With NADYA chess, my focus is on elegance, since I don't have a lot of time. That is why I named it after my daughter

Cheers,
CHoibakk

I'm so itching to play Rustic now lol. I don't care about the outcome even, I just think it is wonderful, we have been on almost the same pace for so long.

hgm · Post by **hgm** » Sat Oct 10, 2020 9:43 am

mvanthoor wrote: ↑Sat Oct 10, 2020 1:58 amI think you got this wrong. If you don't do any tricks such as bulk-counting, it doesn't matter if your engine uses a legal or pseudo-legal move generator. The reason is that in perft, you have to make ALL moves, and you have to check them ALL for legality. It doesn't matter if you do this in the move generator, or in make_move().

That isn't really true. Your move generator can still do 'bulk legality checking' on the moves of non-royal pieces. E.g. when generating Queen moves, you could first test whether removing the Queen puts you in check. If it doesn't, none of the actual Queen moves will put you in check, so you won't have to test those. If it does, you know the Queen is pinned, and the check test you already did probably tells you along which ray it is pinned. You can then resort to a special generator code section that only generates the moves along that ray, and you would not have to test any of these for legality either. You don't only save the effort of the tests, but also the effort of generating the moves that would test negative.

This is roughly how qperft does it: before move generation it determines which pieces are pinned (by running through the slider section of the opponent's piece list, using alignment with the King as a first fast filter to rule out most sliders from being pinners). It then applies the on-ray generation to these pieces, and excludes them from the normal move generation.

Only the King moves have to be tested for stumbling into check. Even here you could do some form of bulk testing; with most board representations a test whether a given piece attacks the square the King is on would with virtually no extra work reveal all squares that are attacked by that piece in the King neighborhood. So instead of testing for attacks on the King after each of its moves you could test for attacks on the King neighborhood in absence of the King, and then only generate the King moves to the safe squares. Qperft doesn't use this. In positions where the King is boxed in by frienly pieces it wouldn't save very much anyway.

This all assumes you are not in check to begin with; for in-check positions other considerations apply, and many engines have a separate move generator for those. A cheap filter there is to test first whether the moves resolve the existing check (which most will not), and don't bother with those that don't.

hgm · Post by **hgm** » Sat Oct 10, 2020 10:08 am

BlueStar wrote: ↑Sat Oct 10, 2020 3:00 amMy perft makes the leaf call to increment a counter (rather than bulk count the size of the move list) but does not in this case evaluate the board at the leaf, precisely because my move generator only produces legal moves. I've keep this to keep my perft function as close to my search code. For at least my first release, I've wanted to try and keep my perft/search functions as close in structure as possible, so I know that if my perft is perfect, my search will be as well. It will be NADYA3.0 that starts to make bigger changes.

The disadvantage of this is that the perft speed no longer is related to the speed of the corresponding engine: in search the leave nodes typically do move generation, but make no moves. So for every MakeMove you do one MoveGen. Your perft, OTOH, will do some 30 MakeMoves for every MoveGen, so that the speed gets heavily dominated by that of MakeMove, which is usually completely irrelevant for the engine.

Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)

Re: Perft speed optimization (renamed)