When there is no progress


mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Razoring

Post by mcostalba »

Eelco de Groot wrote: The adjusted alpha and beta parameters for calling qsearch() - different from Stockfish - I had almost forgotten, but I think they are important in cases like this where you use a futility margin; well, I just hope I got that right.
Yes, I saw them. I did not comment because I am still thinking about them.

The point is that you could return before doing a full qsearch. If you set the limits too low you can easily return from qsearch on a stand pat evaluation (not the first one, because in your implementation it is already verified to be lower than the limit).

In the current SF, with the qsearch limits set that high, at the real beta value, I am sure the qsearch will be done fully and we will not return from a stand pat somewhere.

It is still not clear to me whether calling qsearch with reduced limits is too risky or not; probably a test will be needed in this case too ;-)
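
For illustration, here is a minimal sketch of the two variants under discussion. All the names (evaluate_quick, razor_margin, qsearch) and the margin numbers are placeholders, not the actual Stockfish code:

Code: Select all

// Hedged sketch only: placeholder names and margins, not Stockfish's implementation.
using Value = int;
struct Position;                                    // engine position type (stub)

Value evaluate_quick(const Position&);              // cheap approximate eval (stub)
Value qsearch(Position&, Value alpha, Value beta);  // quiescence search (stub)
Value razor_margin(int depth) { return 300 + 60 * depth; }  // made-up numbers

// Variant A: verify with the real window. The qsearch cannot stop early on a
// stand-pat below beta, so a fail-low result is trustworthy.
bool razor_full_window(Position& pos, Value alpha, Value beta, int depth, Value& result)
{
    if (evaluate_quick(pos) >= beta - razor_margin(depth))
        return false;                               // precondition not met, keep searching
    Value v = qsearch(pos, alpha, beta);
    if (v < beta) { result = v; return true; }      // confirmed fail low
    return false;
}

// Variant B: verify with a reduced, zero-width window around a lowered beta.
// Cheaper, but a stand-pat inside the qsearch can now end it early, which is
// exactly the risk discussed above.
bool razor_reduced_window(Position& pos, Value beta, int depth, Value& result)
{
    Value rbeta = beta - razor_margin(depth);
    if (evaluate_quick(pos) >= rbeta)
        return false;
    Value v = qsearch(pos, rbeta - 1, rbeta);
    if (v < rbeta) { result = v; return true; }
    return false;
}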


BTW, the overhead of calling qsearch and returning just after the first stand pat is (IMHO) much lower than the cost of another dummy evaluation as you do it. But of course it is all up to you :-) mine is just an opinion.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Razoring

Post by Michael Sherwin »

Eelco de Groot wrote: I hope I am recalling my considerations correctly, but I did think about it a bit! Maybe I got it wrong... These razoring experiments were really about the only changes to the Stockfish code so far; it is just a coincidence that Michael Sherwin mentioned razoring in Stockfish.
Hi Michael,

Glaurung is not just my program, it is also yours. I have only a small part of the honor. Without the help from you (and others, of course), Glaurung wouldn't be nearly as strong as it is.
:D

I guess Tord has availed himself of some of my ideas. I just wish that I knew which, if any, he did use. I am honored if there are some of my ideas in Glaurung/Stockfish! The only ones that I've seen on a cursory examination that may have come from me are late_move_pruning, feedback (different from mine) and now maybe razoring. Although Tord does things better than I do!
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Razoring

Post by mjlef »

BTW calling evaluation just before qsearch seems redundant to me, because the evaluation is the first thing that qsearch will do, and if it is above the limit it will return immediately; so with your code you end up calling evaluation a second time, without any need and without result, because it will fail low for sure given the pre-qsearch condition.

I think the code fragment from Stockfish left out the test of the eval. If it is >= beta, a null search is tried. So to get to this futility code, the score would need to be < beta (which is alpha, in a null-window search). So it will not fail high based on an eval at the next ply.

One stupid trick I do is this: when making a move, my position structure has a variable to hold the score. I set it to a bogus value, like -50000. Whenever I evaluate, it puts the real score there. So at the top of the eval routine, I just check whether the stored eval is != -50000, and if so return the score. This lets me do a full eval at the ply before, just after making the move on the board, and prevents the program from re-evaluating at the top of qsearch() or search(). The same trick can apply to null move. Some programs eval, then try a null move, and end up evaluating again. Instead, change the score in the position structure to -score, fix it for side to move, and then the program need not eval twice in a row with the only difference being the side to move.
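
For what it's worth, here is a rough sketch of the caching trick described above; the field and function names (cachedEval, EVAL_NONE, full_evaluate) are illustrative only:

Code: Select all

// Sketch of the cached-eval trick; names are illustrative, not from any engine.
constexpr int EVAL_NONE = -50000;      // bogus marker meaning "not evaluated yet"

struct Position {
    int cachedEval = EVAL_NONE;        // reset to EVAL_NONE inside make_move()
    // ... board state, side to move, etc. ...
};

int full_evaluate(const Position& pos);        // the expensive evaluation (stub)

int evaluate(Position& pos)
{
    if (pos.cachedEval != EVAL_NONE)           // already evaluated at the previous ply?
        return pos.cachedEval;
    pos.cachedEval = full_evaluate(pos);
    return pos.cachedEval;
}

// Null-move variant of the same idea: only the side to move changes, so the
// stored score just flips sign (adjust for any side-to-move/tempo terms).
void make_null_move(Position& pos)
{
    if (pos.cachedEval != EVAL_NONE)
        pos.cachedEval = -pos.cachedEval;
    // ... flip side to move ...
}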
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Razoring

Post by mcostalba »

mjlef wrote: I think the code fragment from Stockfish left out the test of the eval. If it is >= beta, a null search is tried. So to get to this futility code, the score would need to be < beta (which is alpha, in a null-window search). So it will not fail high based on an eval at the next ply.

One stupid trick I do is this: when making a move, my position structure has a variable to hold the score. I set it to a bogus value, like -50000. Whenever I evaluate, it puts the real score there. So at the top of the eval routine, I just check whether the stored eval is != -50000, and if so return the score. This lets me do a full eval at the ply before, just after making the move on the board, and prevents the program from re-evaluating at the top of qsearch() or search(). The same trick can apply to null move. Some programs eval, then try a null move, and end up evaluating again. Instead, change the score in the position structure to -score, fix it for side to move, and then the program need not eval twice in a row with the only difference being the side to move.
The condition on the eval is this:

Code: Select all

&&  approximateEval < beta - RazorApprMargins[int(depth) - 2] 
namely, the approximate (quick) evaluation of the node should be very low; the RazorApprMargins are high, so if the condition is met the qsearch is tried.

This is the only condition in Stockfish before the qsearch; Eelco instead does a full evaluation and then checks against the same limit.


Your trick is nice and I think also effective. If a chess engine evaluates the position at each node, it is certainly worth trying. I am not so sure for SF, given that we don't evaluate the position at every node; we use an approximate evaluation instead, which can indeed be quite wrong, but it is fast and is normally used only as a pre-condition: if the condition is met we go on with a more precise evaluation, for instance in this case with a qsearch(), which is even more precise than a pure evaluation.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Razoring

Post by jwes »

mjlef wrote: BTW calling evaluation just before qsearch seems redundant to me, because the evaluation is the first thing that qsearch will do, and if it is above the limit it will return immediately; so with your code you end up calling evaluation a second time, without any need and without result, because it will fail low for sure given the pre-qsearch condition.
Or you could do the stand-pat test before calling qsearch and not call eval as the first thing in qsearch. This would save a lot of function calls at the cost of a few lines of code.
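
As a rough sketch of that arrangement, assuming qsearch is given an extra parameter for the already-computed score (the names and signature are made up for illustration):

Code: Select all

// Sketch of doing the stand-pat test in the caller; names are illustrative.
#include <algorithm>

struct Position;                         // engine position type (stub)

int evaluate(Position& pos);             // static evaluation (stub)
int qsearch(Position& pos, int alpha, int beta, int standPat);  // reuses standPat instead of re-evaluating

int enter_qsearch(Position& pos, int alpha, int beta)
{
    int standPat = evaluate(pos);
    if (standPat >= beta)
        return standPat;                 // stand-pat cutoff; qsearch is never called
    alpha = std::max(alpha, standPat);
    return qsearch(pos, alpha, beta, standPat);
}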
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: When there is no progress

Post by bob »

Kempelen wrote: Hello,

This post is to ask for a little help. I have been improving Rodin for two years now. It has all the major features an engine can have, and I have arrived at a point where new problems appear. First, an introduction: these are the engines I am currently testing my engine against:

Code: Select all

Rank Name            Elo    +    - games score oppo. draws 
   1 Danasah        2481   41   39   200   67%  2362   21% 
   2 RomiChess P3K  2472   42   40   200   64%  2362   13% 
   3 Eeyore         2447   41   40   200   61%  2362   13% 
   4 ThorsHammer22  2418   41   40   200   57%  2362   11% 
   5 Bruja          2391   41   41   200   54%  2362    9% 
   6 Knightx192     2386   39   39   200   53%  2362   18% 
   7 Rodin v2.8b    2362   13   13  2200   51%  2349   14% 
   8 Dirty          2342   39   39   200   47%  2362   19% 
   9 Scidlet        2288   39   39   200   40%  2362   22% 
  10 ZCT            2282   40   41   200   40%  2362   13% 
  11 RattateChess   2229   41   43   200   34%  2362   11% 
  12 BlackBishop    2102   46   50   200   20%  2362    9% 
These are the results of my last test tournament, at 1 min + 0 sec. My main problem is that I have been testing new features in this format for around two months and have never obtained any gain. What I have been measuring are things like futility margins, passed pawn bonuses, king safety adjustments, ... Whatever tournament I test, I get the same result: 51%/52%.

I don't know how you deal with it when this happens to you. For me it is quite discouraging, and it leads me to ask myself questions like: Do I have a bug which hides good results? Are those engines suitable for testing my engine? Does it mean I need to change my testing scheme?

Well, this post is only to ask you for advice and tips on how to deal with this situation. What I mainly fear is that something is broken. Many of you have more expertise than I do, so your comments will be welcome...

Greetings,
Fermin

P.S.: Situations like the one I have just described show how difficult it is to reach the top of the rating list. It is like playing the piano or any other skill... you can be good, but it is impossible to reach the top with your first released engine. You need training and a lot of experience. Rybka, Ippolit and a few others may be good programs, but they are very much suspect... Nobody is born knowing how to play the piano, paint, or play chess; it takes considerable training time.
Here's some advice...

If you run tests, and change a feature in different ways, and it makes no difference at all, you absolutely must determine why. For example, when I started with the old "history pruning" idea from Fruit, I got it working, and then started to play with the "history threshold" after reading Uri's report that he had found a better value than the Fruit default. When I tested, the threshold made no difference whatsoever until you set it so high that you disabled history pruning altogether. What you'd like to see is to try values from 0% to 10% to ... to 100% and see some sort of bell-shaped curve that peaks at one value, which shows that value is clearly the best. What I saw were random results as I went from 30 to 40 to 50. Once I hit 100 it turned history off and hurt, obviously, but there was no single "best value". I then tested Fruit in the same way and found it behaved in exactly the same way. That ultimately led me to conclude that the history counters are completely worthless for this kind of decision, so I removed them completely and just kept the static rules for when not to reduce. And the program played no better or worse with the counters completely removed.

The bottom line is: if something can't be adjusted, then something is wrong. It might simply be an idea that is no good, or it might be an implementation with a bug you have not noticed, so that something else prevents that code from influencing the game results at all. For example, if your "in check" function is broken, it will prevent you from reducing some moves if you have the rule "don't reduce when in check". So moves that appear to be escaping check, when you are not really in check to start with, do not get reduced, which weakens the effect of your other reduction rules.

If you can't tune something, even that should tell you something. Since adjusting a value makes no difference, how can that value be important in the first place? If you are sure it is, then why isn't it in this instance? If you explain all such impossible-to-tune cases, you make progress each time.

In my testing, I have seen four cases happen when tuning something.

(1) Nothing makes a difference. Either the code is worthless, or it is broken so that your tuning experiments are flawed.

(2) You find a nice peak, where values smaller than optimal lower the results, and values larger than optimal also lower the results. This is the perfect case, as you can very accurately zero in on the best value.

(3) You find a point where, as you increase the value, the results improve, up to a point. Going beyond that point produces steady results, no further improvement and no degradation either. Picking the best point is a little harder.

(4) You find a point where, as you increase the value, the results get worse. But up to the point where the results start to decline, all values are pretty much equal. This is the opposite of (3) above.

(3) and (4) are the interesting cases. For (3) clearly the scoring term (or whatever it is) works, because increasing the value helps, to a point, and then performance levels off. Do you stop at the point where results level off or go a little further to choose the final value? (4) is more complicated, because smaller values don't help, but bigger values actually hurt. Should you completely remove the code?

Lots of fun... :)
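
As an illustration of the kind of 0%-to-100% sweep described above, a toy harness might look like the sketch below; play_match() is only a placeholder for whatever actually runs the games:

Code: Select all

// Toy parameter sweep: play a fixed number of games at each candidate value
// and look at the shape of the score curve (a single peak vs. flat noise).
#include <cstdio>
#include <vector>

double play_match(int thresholdPercent, int games)
{
    // Placeholder: hook this up to a real match harness. A dummy value is
    // returned here only so the sketch compiles and runs.
    (void)thresholdPercent; (void)games;
    return 0.5;
}

int main()
{
    const int games = 2000;
    std::vector<int> thresholds = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
    for (int t : thresholds)
    {
        double score = play_match(t, games);
        std::printf("threshold %3d%%  score %.1f%%\n", t, 100.0 * score);
    }
    return 0;
}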
User avatar
WinPooh
Posts: 276
Joined: Fri Mar 17, 2006 8:01 am
Location: Russia
Full name: Vladimir Medvedev

Re: When there is no progress

Post by WinPooh »

Just an idea: try to measure the importance of the different parts of your program. First, reduce the search to plain alpha-beta plus the simplest qsearch and measure the program's rating. Second, restore the normal search, but reduce the evaluation to material + PSQ, and measure the rating again. After that you will see that, say, improvements in search provide 200 points and the various eval features provide 120 points. Maybe this can give some ground for thinking about possible improvements...
User avatar
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: When there is no progress

Post by Kempelen »

Thanks, Bob, for your answer. I keep trying to figure out why my engine is not progressing, and as I re-test everything I am learning a lot...
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/