Diminishing returns of increasing search depth

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Diminishing returns of increasing search depth

Post by hgm »

Laskos wrote:Also, the error bars in Elo points would become large for very skewed 99:1 matches. A round robin with different engines would take too much time, I guess, so do it just with Crafty.
Note that I did _not_ suggest to play 1*t vs 16*t; only upto 1*t vs 8*t (N ply vs N+3 ply, assuming an EBF of 2). With the usual assumption of 70 Elo gain per doubling a time odds factor 8 only produces 210 rating point. That should produce a 75% result.

The increased steepness of the Elo vs score curve when you move away from 50% score is partly offset by the fact that the statistical noise is sqrt(N*s*(1-s)) (forgetting about draws for simplicity), which decreases if the score s approaches 0 or 1.

The problem with only playing N vs N+1 is that a single fluke affects all subsequent Elo estimates, and thus has a huge effect on the fit.

As a last remark: Playing different engines does not take anymore time than playing just Crafty. In stead of 30,000 games Crafty vs Crafty you would play 2000 games of each engine against each other (with 6). The aim might be to see how much strength Crafty gains on doubling the time, but te Crafty strength is measured much more accurately by playing it against a mix of other engines than in self play.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Diminishing returns of increasing search depth

Post by Laskos »

hgm wrote:
Laskos wrote:Also, the error bars in Elo points would become large for very skewed 99:1 matches. A round robin with different engines would take too much time, I guess, so do it just with Crafty.
Note that I did _not_ suggest to play 1*t vs 16*t; only upto 1*t vs 8*t (N ply vs N+3 ply, assuming an EBF of 2). With the usual assumption of 70 Elo gain per doubling a time odds factor 8 only produces 210 rating point. That should produce a 75% result.
True, a 75% result will not deprive us of so much Elo precision. But, I have a problem even with 2,4,8, as I do not know if the results are linearly additive, i.e. that 2*2 gives equal results to 1*4 in Elo points. I suspect that we will be even more confused by the issue if we do 2,4,8 multiplicants of "t", compared to simple doubling at each step. Elo curve pretends to have this additivity, but at least Sonas results with human players are disagreeing.
The increased steepness of the Elo vs score curve when you move away from 50% score is partly offset by the fact that the statistical noise is sqrt(N*s*(1-s)) (forgetting about draws for simplicity), which decreases if the score s approaches 0 or 1.
Yes, but it is a square root damping, I do not know if it could catch with the steepness of the Elo curve around 0.99. But it surely is ok at 0.75, the value you propose at N+3. If only 8*t is proposed, I see no problems with error bars. I was worried about crazy 512:1 results and such.


The problem with only playing N vs N+1 is that a single fluke affects all subsequent Elo estimates, and thus has a huge effect on the fit.
I do not understand. The results are not cumulative, each doubling in time is independent from another, and we are not interested in the final Elo, only in Elo difference from doubling at different time controls. A fluke certainly will affect the fit, but a fluke at N+3 will affect the same, or we will make some corrections to be in line with N+1 and N+2?
As a last remark: Playing different engines does not take anymore time than playing just Crafty. In stead of 30,000 games Crafty vs Crafty you would play 2000 games of each engine against each other (with 6). The aim might be to see how much strength Crafty gains on doubling the time, but te Crafty strength is measured much more accurately by playing it against a mix of other engines than in self play.
Yes, I remember one your post about that, but I forgot your argument :(. Crafty will play a total of 10,000 games. Isn't it easier just to play a gauntlet with Crafty playing all 30,000 games (with 5 different engines), as we do not need the "absolute" rating, only that relative to doubling in "t" for Crafty only? Also, the engines should be of similar strength, as we don't like 0.90+ results for Elo error bars concerns.

Kai
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Diminishing returns of increasing search depth

Post by hgm »

My remark aout the additivity was triggered by the fact that in this thread there do appear some plots of total Elo vs depth, and fits of those points. Wth only an (N vs N+1) data set that would be a very misleading way to do it.

You are correct that there is no a-priori guarantee that the Elo curve is correct. OTOH the rating of an engine is not only determined by its score against near equals, but also on how tough a resitance it puts up against its betters, or how adept it is at massacring weaker opponents.

What you propose would work for answering only the question of diminishing returns, but I think that in the way I propose it we in addition get a well-calibrated value for Elo per time doubling, and a measurement of the shape of the Elo curve, both essentially for free.

Note that if you organize the sample as a number of round-robins (1, 2, 4, 8), (2, 4, 8, 16), (4, 8, 16, 32), (8, 16, 32, 64), you will still play 3 times as many t vs 2t games as you will play t vs 8t games. And it is not a very big hassle to set up a number of round-robins.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Diminishing returns of increasing search depth

Post by Laskos »

hgm wrote: What you propose would work for answering only the question of diminishing returns, but I think that in the way I propose it we in addition get a well-calibrated value for Elo per time doubling, and a measurement of the shape of the Elo curve, both essentially for free.
Yes, my goal was more modest, diminishing results, as stated in this thread. But your proposal is really much more further-reaching, and I would be very happy if Bob could pull this out. If not, at least my proposal (I tried to make it easier for Bob, the shape of the curve would have to be heavily analysed).

Kai