Hitting a wall at ~1860 Elo

mvanthoor · Post by **mvanthoor** » Wed Jun 02, 2021 2:15 pm

Hi,

As you may have noticed, I've been running tests with Rustic, after adding killer moves, history heuristics, and aspiration windows. I ran tests in two gauntlets: 500 games/engine in a 5s+0.1s gauntlet, and 100 games/engine in a 1m+0.6s.

In the first, the results where promising, in comparison to Alpha 2 (each feature is built on top of the previous):
killer moves: +43 Elo
history heuristic: +7-10 Elo depending on list calibration/anchors (OK; at least it's not negative)
aspiration windows: +20 Elo

Total gain: 70 Elo

My expectation would be that Rustic would score around 1890 Elo in the longer gauntlet, but that was a surprise...
killer moves: +45 Elo
history: 0 Elo
aspiration window: 0 Elo

No gain?? (For history I could believe it, but not the +20 for aspiration windows.)

The engine was searching deeper than Alpha 2, and it had a big impact in the superfast time control, where Alpha 2 caught up with engines that have more features already (all of the above + LMR + Null move for example). Some engines massively tanked in the superfast time control; apart from MinimalChess, Rustic actually became the strongest engine in the gauntlet, outperforming engines that are rated +100 higher in the CCRL list.

Even though the extra speed/search depth is still there in the longer time controls, all the engines are now much closer with regard to the depth they can search, so I expect the impact to be smaller. What I mean is: a difference between an engine that searches 5 ply vs one that searches 7, is massive. The difference between an engine that searches 8 and one that searches 10 ply, is probably less.

Rustic still only has one set of PST's, which I hand-wrote from scratch myself. There is no other evaluation. I see the engine making stupid moves like creating double, triple and quadruple pawns, not using open files, and so on.

Could it be that Rustic is now just finding stupid/weak moves faster, deeper in the tree, so that improving the speed will never gain any more Elo if I don't improve the evaluation?

Lithander already mentioned (in a PM) that I may be unbalancing the engine by improving the search and not the evaluation, and I'm starting to believe it. It doesn't matter how fast I make the engine: MinimalChess scores +100 Elo against Alpha 2, and it still scores +100 against the latest dev version. (In the superfast gauntlet.)

If this is the case, it would be an improvement to do the eval tapering and tuning first, and THEN add the killers, heuristics, aspiration window (and PVS) back in for the version after that.

What do you think?

Joost Buijs · Post by **Joost Buijs** » Wed Jun 02, 2021 5:44 pm

Search and evaluation are closely related to each other. You cannot expect that a search tuned with PST only will be optimal after adding other evaluation features. Beside material and PST you need at least pawn structure evaluation (including passed pawns), king safety and mobility to have the engine play reasonable moves. PST only just doesn't work, this has been proven many times in the past.

In my experience, with a basic search and a simple HCE it is not difficult to get an engine perform at 2500 CCRL level.

amanjpro · Post by **amanjpro** » Wed Jun 02, 2021 5:45 pm

mvanthoor wrote: ↑Wed Jun 02, 2021 2:15 pm Hi,

As you may have noticed, I've been running tests with Rustic, after adding killer moves, history heuristics, and aspiration windows. I ran tests in two gauntlets: 500 games/engine in a 5s+0.1s gauntlet, and 100 games/engine in a 1m+0.6s.

In the first, the results where promising, in comparison to Alpha 2 (each feature is built on top of the previous):
killer moves: +43 Elo
history heuristic: +7-10 Elo depending on list calibration/anchors (OK; at least it's not negative)
aspiration windows: +20 Elo

Total gain: 70 Elo

My expectation would be that Rustic would score around 1890 Elo in the longer gauntlet, but that was a surprise...
killer moves: +45 Elo
history: 0 Elo
aspiration window: 0 Elo

No gain?? (For history I could believe it, but not the +20 for aspiration windows.)

The engine was searching deeper than Alpha 2, and it had a big impact in the superfast time control, where Alpha 2 caught up with engines that have more features already (all of the above + LMR + Null move for example). Some engines massively tanked in the superfast time control; apart from MinimalChess, Rustic actually became the strongest engine in the gauntlet, outperforming engines that are rated +100 higher in the CCRL list.

Even though the extra speed/search depth is still there in the longer time controls, all the engines are now much closer with regard to the depth they can search, so I expect the impact to be smaller. What I mean is: a difference between an engine that searches 5 ply vs one that searches 7, is massive. The difference between an engine that searches 8 and one that searches 10 ply, is probably less.

Rustic still only has one set of PST's, which I hand-wrote from scratch myself. There is no other evaluation. I see the engine making stupid moves like creating double, triple and quadruple pawns, not using open files, and so on.

Could it be that Rustic is now just finding stupid/weak moves faster, deeper in the tree, so that improving the speed will never gain any more Elo if I don't improve the evaluation?

Lithander already mentioned (in a PM) that I may be unbalancing the engine by improving the search and not the evaluation, and I'm starting to believe it. It doesn't matter how fast I make the engine: MinimalChess scores +100 Elo against Alpha 2, and it still scores +100 against the latest dev version. (In the superfast gauntlet.)

If this is the case, it would be an improvement to do the eval tapering and tuning first, and THEN add the killers, heuristics, aspiration window (and PVS) back in for the version after that.

What do you think?

I don't believe in the search/evaluation imbalance. RofChade is a 3000+ engine and it has nothing but Pesto, so is Pesto engine

mvanthoor · Post by **mvanthoor** » Wed Jun 02, 2021 6:02 pm

amanjpro wrote: ↑Wed Jun 02, 2021 5:45 pm I don't believe in the search/evaluation imbalance. RofChade is a 3000+ engine and it has nothing but Pesto, so is Pesto engine

PeSTO has a tapered and heavily tuned set of PST's, one for middlegame and one for the endgame. Therefore it has a _massive_ amount of positional knowledge encoded in its PST's. Rustic only has one set of handcrafted PST's, where there is no difference between the middlegame and the endgame.

MinimalChess had a boost of 280 Elo after tapering and tuning its evaluation (and then some because of staged move generation), and I can see it in its playing style. It has _much_ better positional play than Rustic.

Otherwise there is no reason how I could explain that Rustic suddenly hits a wall against engines like MinimalChess: Rustic clearly becomes faster (I can observe it searching deeper than Alpha 2), but its results don't improve.

My results did improve against both Zahak 0.2.1 and 0.3.0. They were much stronger than Rustic Alpha 2 (especially 0.3.0), but with the killers, history and aspiration windows in place, both Zahak versions get totally clobbered. 0.3.0 dropped from over +100 Elo against Alpha 2 to -30 against the current test version. MinimalChess, on the other hand, just keeps chugging along at around +100 Elo against any Rustic version after Alpha 2.

niel5946 · Post by **niel5946** » Wed Jun 02, 2021 6:06 pm

mvanthoor wrote: ↑Wed Jun 02, 2021 2:15 pm Hi,

As you may have noticed, I've been running tests with Rustic, after adding killer moves, history heuristics, and aspiration windows. I ran tests in two gauntlets: 500 games/engine in a 5s+0.1s gauntlet, and 100 games/engine in a 1m+0.6s.

In the first, the results where promising, in comparison to Alpha 2 (each feature is built on top of the previous):
killer moves: +43 Elo
history heuristic: +7-10 Elo depending on list calibration/anchors (OK; at least it's not negative)
aspiration windows: +20 Elo

Total gain: 70 Elo

My expectation would be that Rustic would score around 1890 Elo in the longer gauntlet, but that was a surprise...
killer moves: +45 Elo
history: 0 Elo
aspiration window: 0 Elo

No gain?? (For history I could believe it, but not the +20 for aspiration windows.)

The engine was searching deeper than Alpha 2, and it had a big impact in the superfast time control, where Alpha 2 caught up with engines that have more features already (all of the above + LMR + Null move for example). Some engines massively tanked in the superfast time control; apart from MinimalChess, Rustic actually became the strongest engine in the gauntlet, outperforming engines that are rated +100 higher in the CCRL list.

Even though the extra speed/search depth is still there in the longer time controls, all the engines are now much closer with regard to the depth they can search, so I expect the impact to be smaller. What I mean is: a difference between an engine that searches 5 ply vs one that searches 7, is massive. The difference between an engine that searches 8 and one that searches 10 ply, is probably less.

Rustic still only has one set of PST's, which I hand-wrote from scratch myself. There is no other evaluation. I see the engine making stupid moves like creating double, triple and quadruple pawns, not using open files, and so on.

Could it be that Rustic is now just finding stupid/weak moves faster, deeper in the tree, so that improving the speed will never gain any more Elo if I don't improve the evaluation?

Lithander already mentioned (in a PM) that I may be unbalancing the engine by improving the search and not the evaluation, and I'm starting to believe it. It doesn't matter how fast I make the engine: MinimalChess scores +100 Elo against Alpha 2, and it still scores +100 against the latest dev version. (In the superfast gauntlet.)

If this is the case, it would be an improvement to do the eval tapering and tuning first, and THEN add the killers, heuristics, aspiration window (and PVS) back in for the version after that.

What do you think?

I just looked through you move ordering code, and noticed the following things:

I don't know if this makes a huge difference - or any at all for that matter - but I see that you're indexing your history table as side to move, piece moved and destination square for that piece. I do it with side to move, origin square and destination square. If everything else fails, try to see if this makes a difference.
You don't bound your history heuristic - or at least not in utils.rs update_history_heuristic. This makes it very likely that it will overflow in deep searches, making history more important than killers. This should be avoided, so try to set the bound to lowest_killer_score - 1, and if an updated history entry exceeds this score, halve all entries in the table. This will still keep the differences between different non-killer quiet moves, but it will never surpass killer moves's importance.
You are not bounding your history increment either. This will result in extremely large scores at high depths, which isn't good. Higher scores for the higher depths are good, but it should be limited. I set the maximum history increment to be 400.
You only give bonuses in the history table. This can be done, but it is not very good. What I do is that if a quiet move beats beta, I update the history table with the entire move list. I increase the history score for the move that failed high, and I decrease all other quiet moves (by the same increment). See my history update.

Please correct me if i missed something.

Regarding the evaluation/history question: Since the update of the history table is largely dependent on the returned score from the search, which as you know is an evaluation score at the end, the former is very sensitive to the evaluation. If your evaluation can't differentiate between good and bad positions, the history table will be updated according to that. This means that bad moves, or sub-optimal moves, will risk being sorted highly, decreasing the move ordering.

What I would do is to invest some time in making the evaluation more detailed. History heuristic will work way better if you implement something like mobility and king safety. All evaluation improvements are going to help the history table do its job, but I think these two are best since quiet moves that beat beta are usually ones that make very big or even unstoppable threats to the opponent. The biggest kinds of threats are checkmate threats and (IMHO) mobility-threats. By the latter, I mean moves that will put the opponent into defensive, passive positions.

P.S. I just noticed you mentioned tapered eval and untuned psqt. Before doing anything with mobility/king safety, I would highly advice you to get a working tapered eval, and fine tune all piece square tables (and material values for that matter).

mvanthoor · Post by **mvanthoor** » Wed Jun 02, 2021 6:15 pm

niel5946 wrote: ↑Wed Jun 02, 2021 6:06 pm I just looked through you move ordering code, and noticed the following things:

I don't know if this makes a huge difference - or any at all for that matter - but I see that you're indexing your history table as side to move, piece moved and destination square for that piece. I do it with side to move, origin square and destination square. If everything else fails, try to see if this makes a difference.

You don't bound your history heuristic - or at least not in utils.rs update_history_heuristic. This makes it very likely that it will overflow in deep searches, making history more important than killers. This should be avoided, so try to set the bound to lowest_killer_score - 1, and if an updated history entry exceeds this score, halve all entries in the table. This will still keep the differences between different non-killer quiet moves, but it will never surpass killer moves's importance.

You are not bounding your history increment either. This will result in extremely large scores at high depths, which isn't good. Higher scores for the higher depths are good, but it should be limited. I set the maximum history increment to be 400.

You only give bonuses in the history table. This can be done, but it is not very good. What I do is that if a quiet move beats beta, I update the history table with the entire move list. I increase the history score for the move that failed high, and I decrease all other quiet moves (by the same increment). See my history update.

Please correct me if i missed something.

Regarding the evaluation/history question: Since the update of the history table is largely dependent on the returned score from the search, which as you know is an evaluation score at the end, the former is very sensitive to the evaluation. If your evaluation can't differentiate between good and bad positions, the history table will be updated according to that. This means that bad moves, or sub-optimal moves, will risk being sorted highly, decreasing the move ordering.

What I would do is to invest some time in making the evaluation more detailed. History heuristic will work way better if you implement something like mobility and king safety. All evaluation improvements are going to help the history table do its job, but I think these two are best since quiet moves that beat beta are usually ones that make very big or even unstoppable threats to the opponent. The biggest kinds of threats are checkmate threats and (IMHO) mobility-threats. By the latter, I mean moves that will put the opponent into defensive, passive positions.

P.S. I just noticed you mentioned tapered eval and untuned psqt. Before doing anything with mobility/king safety, I would highly advice you to get a working tapered eval, and fine tune all piece square tables (and material values for that matter).

Hi, and thanks for the detailed answer. About the history:

- The way I'm doing it now, seemed the strongest with the current implementation. Also, it feels the most logical. side to move, piece, and to-square will make for a history of "at some time, there was a white knight here". side to move, from-square and to-square makes for a history of "at some point, some white piece moved from here to there". This second option just doesn't feel logical, as a queen can do the same moves as any other piece except the knight, so if a rook move increases history, it can cause the queen to move later. "My" version just feels more logical. (But it may not be better.)

- Bounds: indeed, I have none. I actually had them, and then removed them. I'll add both a bound below the killers, and a bound on the increment. 400 would cap out at a depth of 20.

- Also, history is rebuilt with every search, just like the killers. The reason is that I have a search that is prepared to be multi-threaded in the future. If I want to keep the history and killers the entire game, it has to move "up" from SearchInfo (created at each search) to "Search" (the Search object), and thus they become shared between threads and need to be mutex'd. Tried that: failed due to slowness as expected. I'm not (yet) prepared to introduce unsafe "racey" access in the engine as long as it is not multi-threading. I'll try to do an "unsafe share" of the history/killers later, after the engine becomes successfully multi-threaded.

- Yes, I only give a bonus in the history table. Didn't encounter any information about penalizing all the other moves, but I'll look into this. Thanks. There also seem to be conflicting information about where to change history. Many engines only do this when alpha is raised. Some do it only during a bèta cutoff. Some do both. I tried all three, and changing the history both in a beta cutoff and when alpha is raised seemed to be best.

I have to retest all of this when I change how the history is handled.

However, it could be that history messes up the move ordering.... even though I tried to take this into account by moving MVV_LVA and Killers up to _almost_ the max of int32. I actually tested if the history ever exceeds int32, and it doesn't. I'll have to look into this point. I think I'm going to remove history for now, and redo that. I'll test killers, aspiration windows, and pvs, and then try to re-add history.

Also: the first thing after these steps will be adding a tapered and tuned evaluation, before I do anything else.

Thanks for the insights.

Steve Maughan · Post by **Steve Maughan** » Wed Jun 02, 2021 6:48 pm

From my experience with Monarch and Maverick, you should be able to get to about 2000 ELO with quiescent search, PST, killers, basic move ordering, and null move. The next big jumps are by adding passed-pawn detection, mobility and then Late Move Reduction. Each of these is worth 50 ELO to 100 ELO. I suspect the roadblock may be due to bugs. This was the big learning from Fruit 1.0 back in 2004 — "shockingly" bug free code plays stronger chess. As a result I try to be as defensive as possible and use "asserts" as much as possible.

Hope this helps,

Steve

mvanthoor · Post by **mvanthoor** » Wed Jun 02, 2021 7:05 pm

Steve Maughan wrote: ↑Wed Jun 02, 2021 6:48 pm From my experience with Monarch and Maverick, you should be able to get to about 2000 ELO with quiescent search, PST, killers, basic move ordering, and null move. The next big jumps are by adding passed-pawn detection, mobility and then Late Move Reduction. Each of these is worth 50 ELO to 100 ELO. I suspect the roadblock may be due to bugs. This was the big learning from Fruit 1.0 back in 2004 — "shockingly" bug free code plays stronger chess. As a result I try to be as defensive as possible and use "asserts" as much as possible.

Hope this helps,

Steve

Hi Steve

With "PST's", do you mean just one set of PST's, or a tapered and tuned evaluation? I have MVV_LVA ordering (and killers, if I enable them: +45 Elo), but no NULL-move. Maybe it would be useful to implement this before the tuning instead of after.

With regard to bugs: Up until now, I've gotten exactly from each feature (sometimes with some help from the community) what I expected. The only feature Alpha 2 has, on top of PST's and MVV_LVA, is a transposition table. That brings around +150 (+/- 10) Elo, which is similar to what other people are getting. (The author of Dumb, for example, ran a test to assess the TT impact.) And I do the same as you: I use lots of asserts during development of a function. (And Rustic has a test built-in when compiled in debug mode, which checks consistency of all incrementally kept values, against values created from scratch. This test succeeds.)

Therefore I assume that there are no bugs in Alpha 2 that could influence game play. Any bugs in the TT or MVV_LVA would probably massively influence the playing strength, and bugs in the move generator would make the engine crash or play illegal moves.

I think Niels is on the right track: that the lack of bounding in my history is probably messing up move ordering, which negates a lot of the other speed gains. I'm now retesting all Rustic versions to get a consistent 10s+0.1 gauntlet, and then I'll remove/disable everything but killers, and test those seperately. Then I'll re-enable aspiration windows, and add PVS, and test both; before I re-implement the history heuristic.

Joost Buijs · Post by **Joost Buijs** » Wed Jun 02, 2021 7:10 pm

mvanthoor wrote: ↑Wed Jun 02, 2021 6:02 pm PeSTO has a tapered and heavily tuned set of PST's, one for middlegame and one for the endgame. Therefore it has a _massive_ amount of positional knowledge encoded in its PST's. Rustic only has one set of handcrafted PST's, where there is no difference between the middlegame and the endgame.

Really? My HCE engine has a tapered and heavily tuned set of PST's too, when I remove al the other features besides material and PST it plays like an idiot. The only difference is that I use symmetric PST's but this doesn't make much difference, with asymmetric PST's you also have to take castling into account, e.g. 4 different sets of PST's.

mvanthoor · Post by **mvanthoor** » Wed Jun 02, 2021 7:12 pm

Joost Buijs wrote: ↑Wed Jun 02, 2021 7:10 pm Really? My HCE engine has a tapered and heavily tuned set of PST's too, when I remove al the other features besides material and PST it plays like an idiot. The only difference is that I use symmetric PST's but this doesn't make much difference, with asymmetric PST's you also have to take castling into account, e.g. 4 different sets of PST's.

As far as I've always read about PeSTO: yes... it's said it has only search optimizations, and a tapered and tuned evaluation, with no eval terms. Maybe Ronald can confirm (or deny) this.

Hitting a wall at ~1860 Elo

Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo

Re: Hitting a wall at ~1860 Elo