Kempelen wrote: Hello,
This post is to ask for a little help. I have been improving Rodin for two years now. It has all the major features an engine can have, and I have arrived at a point where new problems appear. First, an introduction; these are the engines I am currently testing my engine against:
Code:
Rank Name             Elo    +    -  games  score  oppo.  draws
   1 Danasah         2481   41   39    200    67%   2362    21%
   2 RomiChess P3K   2472   42   40    200    64%   2362    13%
   3 Eeyore          2447   41   40    200    61%   2362    13%
   4 ThorsHammer22   2418   41   40    200    57%   2362    11%
   5 Bruja           2391   41   41    200    54%   2362     9%
   6 Knightx192      2386   39   39    200    53%   2362    18%
   7 Rodin v2.8b     2362   13   13   2200    51%   2349    14%
   8 Dirty           2342   39   39    200    47%   2362    19%
   9 Scidlet         2288   39   39    200    40%   2362    22%
  10 ZCT             2282   40   41    200    40%   2362    13%
  11 RattateChess    2229   41   43    200    34%   2362    11%
  12 BlackBishop     2102   46   50    200    20%   2362     9%
These are the results of my last test tournament, at a 1 min + 0 sec time control. My problem is mainly that I have been testing new features in this format for around two months and I have never obtained any gain. What I have been measuring are things like futility margins, passed pawn bonuses, king safety adjustments... every tournament I run gives the same result: 51%/52%.
I don't know how you deal with this when it happens to you. For me it is quite discouraging, and it leads me to ask myself questions like: Do I have a bug which hides good results? Are those engines suitable for testing my engine? Does it mean I need to change my testing scheme?
Well, this post is only to ask you for advice and tips on how to deal with this situation. What I mainly fear is that something may be broken. Many of you have more expertise than me, so your comments will be welcome...
Greetings,
Fermin
P.S.: Situations like the one I have just described show how difficult it is to get to the top of the rating list. It is like playing the piano or any other skill... you can be good, but it is impossible to reach the top with your first released engine. You need training and a lot of experience. Rybka, Ippolit and a few others may be good programs, but they are very suspect... Nobody is born knowing how to play the piano, paint, or play chess well without considerable training time.
Here's some advice...
If you run tests, and change a feature in different ways, and it makes no difference at all, you absolutely must determine why. For example, when I started with the old "history pruning" idea from Fruit, I got it working, and then started to play with the "history threshold" after reading Uri's report that he had found a better value than the Fruit default. When I tested, the threshold made no difference whatsoever, until I set it so high that it disabled history pruning altogether. What you'd like to see is that trying values from 0% to 10% to ... to 100% produces some sort of bell-shaped curve that peaks at one value, which shows that value is clearly the best. What I saw instead was random results when I went from 30 to 40 to 50. Once I hit 100 it turned history off and hurt, obviously, but there was no single "best value". I then tested Fruit in the same way, and found it behaved in exactly the same way. That ultimately led me to conclude that the history counters are completely worthless for this kind of decision, so I removed them completely and just kept the static rules for when not to reduce. And the program played no better or worse with the counters removed.
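For anyone who has not seen the idea, here is a minimal sketch of the kind of threshold being swept, assuming a simple hit/tried history table. The names (history_hit, history_tried, HIST_THRESHOLD_PCT) are my own illustration, not Fruit's or Crafty's actual code.

Code:
/* Illustrative sketch only, not real engine code.
 * Fruit-style rule: allow reducing a quiet move only if its history
 * "success rate" falls below a tunable percentage threshold. */
#define HIST_THRESHOLD_PCT 70           /* the value being tuned, 0..100 */

extern int history_hit[64][64];         /* times move [from][to] raised alpha */
extern int history_tried[64][64];       /* times move [from][to] was searched */

int history_allows_reduction(int from, int to)
{
    int tried = history_tried[from][to];
    if (tried == 0)
        return 1;                       /* no data yet: treat as reducible */
    /* success rate in percent, compared against the tunable threshold */
    return 100 * history_hit[from][to] / tried < HIST_THRESHOLD_PCT;
}

Sweeping HIST_THRESHOLD_PCT from 0 to 100 is the experiment described above; at the extremes the counters stop discriminating at all, so the decision no longer depends on the history data.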
The bottom line is, if something can't be adjusted, then something is wrong. It might simply be an idea that is no good, or it might be an implementation with a bug you have not noticed, so that something else prevents that code from influencing the game results at all. For example, if your "in check" function is broken and you have the rule "don't reduce if in check", then moves that appear to be escaping check (when you are not really in check to start with) never get reduced, which weakens the effect of every other reduction rule.
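To make that failure mode concrete, here is a sketch of the kind of static "don't reduce" rules being talked about; the type and function names are hypothetical, not from any particular engine.

Code:
/* Hypothetical sketch of static reduction rules.  If in_check() has a bug
 * and sometimes reports check when there is none, the first test fires too
 * often and quietly exempts moves from reduction -- masking the effect of
 * every other rule (and of any tuning you do on them). */
typedef struct position POSITION;      /* stand-in for the engine's position type */

extern int in_check(const POSITION *pos);
extern int is_capture(const POSITION *pos, int move);
extern int gives_check(const POSITION *pos, int move);

int ok_to_reduce(const POSITION *pos, int move)
{
    if (in_check(pos))                 /* never reduce when escaping check */
        return 0;
    if (is_capture(pos, move))         /* never reduce captures */
        return 0;
    if (gives_check(pos, move))        /* never reduce checking moves */
        return 0;
    return 1;                          /* quiet, non-checking move: reducible */
}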
If you can't tune something, even that should tell you something. If adjusting a value makes no difference, how can that value be important in the first place? And if you are sure it is important, then why isn't it in this instance? Every such impossible-to-tune case you manage to explain is progress.
In my testing, I have seen four cases happen when tuning something.
(1) Nothing makes a difference. Either the code is worthless, or it is broken so that your tuning experiments are flawed.
(2) You find a nice peak, where values smaller than optimal lower the results, and values larger than optimal also lower the results. This is the perfect case, as you can very accurately zero in on the best value.
(3) You find a point where, as you increase the value, the results improve, up to a point. Going beyond that point produces steady results: no further improvement and no degradation either. Picking the best point is a little harder.
(4) You find a point where, as you increase the value, the results get worse. But up to the point where the results start to decline, all values are pretty much equal. This is the opposite of (3) above.
(3) and (4) are the interesting cases. For (3) clearly the scoring term (or whatever it is) works, because increasing the value helps, to a point, and then performance levels off. Do you stop at the point where results level off or go a little further to choose the final value? (4) is more complicated, because smaller values don't help, but bigger values actually hurt. Should you completely remove the code?
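As a concrete picture of the sweep that separates these cases, here is a rough harness sketch; play_gauntlet() is a hypothetical stand-in for however you actually run your test games and report the score.

Code:
#include <stdio.h>

/* Hypothetical: plays a fixed gauntlet with the engine configured to use
 * the given threshold, and returns its score as a percentage. */
extern double play_gauntlet(int threshold_pct, int games);

int main(void)
{
    const int games = 200;                  /* per data point */
    for (int pct = 0; pct <= 100; pct += 10) {
        double score = play_gauntlet(pct, games);
        printf("threshold %3d%%  score %5.1f%%\n", pct, score);
    }
    /* Case (2) shows up as a single peak, (3) as a rise then a plateau,
     * (4) as a plateau then a decline, and (1) as noise around 50%. */
    return 0;
}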
Lots of fun...
