Limiting nodes to reduce engine strength

Patrice Duhamel · Post by **Patrice Duhamel** » Sat Oct 26, 2019 5:22 pm

I'm trying to use the method explained by Ferdinand Mosca here :
viewtopic.php?f=2&t=71199

Limiting nodes and running 1m+1s games based on Cheese 2.1 = 2811 ELO CCRL :

Code: Select all

engine			nodes	elo	elo from formula
Cheese-22-64-test1	500	472	668
Cheese-22-64-test2	1000	693	836
Cheese-22-64-test3	2000	958	1004
Cheese-22-64-test4	4000	1193	1173
Cheese-22-64-test5	8000	1426	1341
Cheese-22-64-test6	16000	1611	1509
Cheese-22-64-test7	32000	1798	1677
Cheese-22-64-test8	64000	1970	1845
Cheese-22-64-test9	128000	2132	2013
Cheese-22-64-test10	256000	2288	2181
Cheese-22-64-test11	512000	2419	2350
Cheese-22-64-test12	1024000	2536	2518
Cheese-22-64-test13	2048000	2639	2686
Cheese-22-64-test14	4096000	2746	2854
Cheese-22-64-test15	8192000	2803	3022

Using logarithmic regression :
maximum nodes = e ^ ((elo + 838.8755655) / 242.5737137)

But when I run games with other engines I have different results :

Code: Select all

Rank Name                 Elo    +    - games score oppo. draws 
   1 Cheese-22-64-test5  1618   16   16   400   66%  1501   18% 
   2 Piranha 0.5         1501   16   16   400   35%  1618   18% 

Rank Name                 Elo    +    - games score oppo. draws 
   1 CDrill 1800         1786   16   16   400   55%  1747   15% 
   2 Cheese-22-64-test6  1747   16   16   400   45%  1786   15%

What engine to use as base to calculate ELO ratings ? or I have to find the rating of each test version one by one ?

The curve from the equation is not perfect compared to the game results, there is a better equation, or should I do a linear interpolation between result at different ELO ratings (ex: 1500->1600, then 1600->1700 ...) ?

bob · Post by **bob** » Sun Oct 27, 2019 6:01 am

This is very difficult. Problem starts with different hardware platforms that will make the same program perform at different Elo levels. Just reducing NPS produces a very odd chess opponent. IE a program might have a GM-level evaluation that really understands pawn structure and such, but it will be tactically less skillful. That doesn't feel right as an opponent. Doing the opposite also doesn't work well at all. IE leave the search alone but phase out chess knowledge. Now you get a program that is positionally weak, but tactically a GM. We (Crafty group) spent a lot of time playing with this. I would LOVE to have a command something like "elo 1500" and suddenly Crafty would play like the average 1500 player. Easy to say. Very difficult to do and avoid the pitfalls above.

It is certainly a worthy topic for investigation, but the solution will likely be much more than a few lines of code. And there are oddities. IE if you take a good search + a purely random evaluation, Don Beal wrote a paper on this. I have label the result "the Beal effect." Bottom line is that on good hardware, a purely random evaluation will still play at a 2000 level or so. Why? This turns into a cheap mobility evaluation. At any node in the tree where you have lots of legal moves, you end up with lots of opportunity to get a larger random score returned. If you have few moves, you have a lower probability of getting a large number. I did exactly this several years ago and got LOTS of complaints from Crafty users saying "Bob, even with a purely random evaluation, I can't beat this thing ever. And as I phase out the random eval, it only gets stronger.

Current approach does both proportionally. I did three things. The command is "skill N" where N is a number between 0 and 100.

1. The skill number is taken as a percentage, so that the evaluation looks like this:

val = Evaluate(...)
return (skill * val / 100 + (100 - skill) * random() / 100);

IE skill 0 is purely random, skill 100 is purely total eval.

2. The skill number is used to lower the NPS. But rather than moving quicker (which would be the result of reducing the nodes searched) I introduced a "busy loop" so that for each node searched, the computational cost would ramp up as skill is lowered, so that as skill reduces, search depth reduces.

3. Finally, skill was applied to the various search extensions and reductions so that the clever search stuff phases out as skill lowers. Overall effect was actually pretty good (still in current Crafty versions, but has to be compiled in with -DSKILL). But the problem was, skill 50 means something different on each different platform since faster hardware will ramp up NPS.

4. And then there is parallel search. I did not disable parallel search, but if you think about it, before playing with the skill command, you would obviously NOT use multiple threads which only ramps search speed up.

I would not begin to claim what I did was optimal. But at least the complaints stopped. Just be aware that if you find a skill setting that gives you a good game, but with chances to win, the next time you change hardware, you will have to adjust skill again. If you do as I do and run on different machines (IE a bigger machine for testing, but my MacBook more frequently since I carry it with me. The same skill can't be used on both machines...

Ras · Post by **Ras** » Sun Oct 27, 2019 10:04 am

bob wrote: ↑Sun Oct 27, 2019 6:01 amThe same skill can't be used on both machines...

As for the search speed dependent part of the throttling: If you implement the waiting loop not to reduce NPS by a percentage, i.e. not relative to the machine's full speed, but instead configure NPS directly in throttled mode, then the same setting works regardless of the machine. Of course, a slow machine won't have the higher end of the node throttling available.

In my throttling implementation, I calibrate the machine speed at engine startup and determine the upper end of the available Elo throttling range that the UCI_Elo command offers.

The NPS throttling works by dividing every second into slices of 10ms each. Since I check the clock every so many nodes that I check about once per millisecond, I know how many nodes have been spent in the current second. If the configured NPS budget has been reached in the current second, I can sleep for the rest of the second. That's actual thread sleep, not busy waiting, because that nicely reduces energy consumption especially on mobiles.

There's one catch for the last part of the thinking time - if that is say half a second, then the node budget here is only half of the NPS. Otherwise, the NPS rate would seem to suddenly jump up.

Patrice Duhamel · Post by **Patrice Duhamel** » Sun Oct 27, 2019 1:45 pm

bob wrote: ↑Sun Oct 27, 2019 6:01 am I would LOVE to have a command something like "elo 1500" and suddenly Crafty would play like the average 1500 player. Easy to say. Very difficult to do and avoid the pitfalls above.

Thanks, I understand it's a difficult problem.

In current released Cheese version, I reduce the speed (making small delay every X nodes) and add randomness in evaluation function for lower levels, but when you give more time to the engine the strength increase.
This is why I'm trying to limit the number of nodes searched instead of the speed, and it should works the same on all computers.

My problem is to choose the right number of nodes to correspond (roughly) to CCRL ratings.

PK · Post by PK » Sun Oct 27, 2019 4:33 pm

Weakening a chess engine in a realistic way might well be the single most difficult task in our domain. What's more, different methods are not really comparable with each other. And even if you succeed in calibrating engine's strength in a way that satisfies you, it's usually enough to change the time control to see significant differences between other approaches emerge.

A few hints: if You add random component to the evaluation, don't use random(). To achieve consistency, use a value derived from transposition table key. To achieve diversity between games, use (positionKey ^ gameKey), gameKey being a random value calculated on the game start.

To achieve reasonable performance accross different platforms, either determining max nodes per second or max nodes allocated for a search is probably best. Here let me state the obvious: first approach lets Your engine play better on longer time controls, the second gives it fixed strength. Not so obvious, though, is which approach is preferable.

I have tried two other approaches. One was "not noticing" certain moves within search, depending on depth, in check status, move properties... I think Phalanx did something like that earlier. This approach is workable, meaning that end product is fine, but in the meantime you go straight into developer's hell: condition after condition and exception after exception. For example, random pruning while in check may lead to a false checkmate evaluation, random pruning of promotions would push any endgame knowledge out of the window etc.

If you expect your users to play only blitz, you may go away with manipulating search depth. The differences between depths are too steep, but something like base depth + probability of searchnig 1 more ply may be workable.

bob · Post by **bob** » Sun Oct 27, 2019 9:06 pm

Ras wrote: ↑Sun Oct 27, 2019 10:04 am
bob wrote: ↑Sun Oct 27, 2019 6:01 amThe same skill can't be used on both machines...
As for the search speed dependent part of the throttling: If you implement the waiting loop not to reduce NPS by a percentage, i.e. not relative to the machine's full speed, but instead configure NPS directly in throttled mode, then the same setting works regardless of the machine. Of course, a slow machine won't have the higher end of the node throttling available.

In my throttling implementation, I calibrate the machine speed at engine startup and determine the upper end of the available Elo throttling range that the UCI_Elo command offers.

The NPS throttling works by dividing every second into slices of 10ms each. Since I check the clock every so many nodes that I check about once per millisecond, I know how many nodes have been spent in the current second. If the configured NPS budget has been reached in the current second, I can sleep for the rest of the second. That's actual thread sleep, not busy waiting, because that nicely reduces energy consumption especially on mobiles.

There's one catch for the last part of the thinking time - if that is say half a second, then the node budget here is only half of the NPS. Otherwise, the NPS rate would seem to suddenly jump up.

Actually sounds pretty promising, if difficult. IE if you could calibrate Elo at a few points based on NPS, say .5M, 1M, 5M, 10M, 20M, and really accurate Elo numbers, you ought to be able to make the program search at that speed by using a variable busy loop that is adjusted periodically to throttle the NPS to the right range. However, the other two parts are also important.

The real problem I had (I wanted to use a sort of Elo command rather than skill) was how to calibrate it. IE how can you really determine that the Elo of the program is X, with {a,b,c} settings? You need to play against a variety of opponents with known Elo, to use the traditional Elo calculations for the program of interest. Most programs have no valid Elo as they don't play in human tournaments. Elo between programs is simply a statistical measure used to predict results between any two players in that rating pool. However, a human might be surprised to find that an Elo setting of 1500 blows him away even though his USCF or FIDE rating is 1700. That's the thing that always frustrated any attempt to vary the Elo of Crafty in the traditional sense. IE I could make the Elo setting do "something" but "something" was really not very close to reality. We could play hundreds of thousands of games on our cluster to make the Elo estimations quite accurate. But, unfortunately, we didn't have accurate Elo ratings for the opponents on the hardware we were using. We didn't even have accurate Elo estimates for ANY hardware, since computers and humans are not equivalent. My intent was accuracy, which I found practically impossible to produce. In the absence of any accuracy, I just accepted the "skill" setting to give roughly 101 different skill levels, even if the Elo comparison was unknown.

But a really interesting topic.

bob · Post by **bob** » Tue Oct 29, 2019 10:59 pm

OK, this topic tweaked my interest, so I am re-visiting this a bit. I eventually will get rid of the skill command and replace it with an "elo xxxx" command. I am currently working to establish the rating at 6M nodes per second, which is what Crafty runs on my MacBook using just one thread. I am going to initially guess that this value will represent an Elo of 2800. I might be a little low as Crafty was clearly playing equally with strong GM players at a speed slower than that, at 40 moves in 60 minutes (we played a round-robin on chess.net with GMs Dzhindi, Christiansen, Karpov and one other I do not remember. Bruce (Ferret) and I were two of the computer entrants. I do not recall the other two. Bottom line was all the computers finished above all the GM players in this 4-round round-robin event. So 2800 ought to be within reason, at least. I am then going to test with skill 90, which says use 10% randomness in the evaluation, and reduces the NPS by 50%. Will run against the same group of opponents to see what that change does to Elo. If I am incredibly lucky, that will drop Elo by 200. I know that is not going to happen, it will be more or less.

So I am going to try to find out a couple of ratings, then use the Elo command to adjust NPS and randomness to fit whatever value is given (hopefully nobody will try 1901 and 1902 and complain that they seem to play the same.

)

More later, these tests will be slow going on my MacBook.

bob · Post by **bob** » Mon Nov 04, 2019 3:45 am

Been playing with this for a couple of days. Using the OPs idea of dynamically adjusting the NPS, plus my original approach of randomizing the evaluation, I have gotten something that works. Nowhere near perfect yet, and calibrating it is a bit difficult. But I took a couple of open-source programs, hacked 'em to make them play much weaker, and then added them to my test opponents. Idea being that it is very difficult to measure Elo if Crafty is well below the worst program in the set, yet I want to make it play successively weaker games as I drop the Elo number. So this has given me some head room, but I am probably going to need more.

Once I get "elo 2000" to produce an elo of 2000 in my testing, the hard part will start, trying to sync up some of the elo points with humans, That is a much harder problem...

D Sceviour · Post by **D Sceviour** » Mon Nov 04, 2019 4:39 am

Certainly, it is time for engines to produce more features for human players rather than increases in elo strength. The Deuterium method is to use an exponential function to calculate the rating based on nodes:

rating = 297.12 x Ln(nodes) - 976.7

This does not seem correct. Rather, the elo performance might more realistically follow a Sigmoid S-curve pattern against ply depths and node counts. It might be an interesting approach to consider. A curve defined as:

#define sigmoid(K, S) ( 1.0 / (1.0 + pow(20.0, -K * S / 2000.0)) )

can produce a scale something like this:

elo ply depth
---------------
1000 1
1500 2
2000 4
2500 9
3000 18
3500 32
4000 50

With adjustment to the constants, the formula can be fined tuned for performance. I might look at this further when the opportunity arises.

PK · Post by PK » Mon Nov 04, 2019 8:30 am

Denis, does Your formula allow to calculate percentage of depth 1 and depth 2 searches needed to achieve, say, 1200 Elo?

Edit: stupid me, it's pretty obvious when you see that Elo maps to fractional depth.

Limiting nodes to reduce engine strength

Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength

Re: Limiting nodes to reduce engine strength