New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: 4 sets of data

Post by tvrzsky »

Uri Blass wrote:
My point is about being able to reproduce moves, not about reproducing results, when I wrote that evaluation is different from my point of view.

I was not talking about testing small changes in many games but about answering the question of why the program played a specific move in a game.

If I do not clear the hash it may be harder to reproduce the move (as was explained elsewhere in this thread), so it may be harder to fix bugs.

If I clear the hash after every move and see a move that I do not like, then I can simply run the program again from the same position to get the same move, so it can help to detect bugs.

Uri
I implemented the tools for reproducing games only recently; before that I was restricted to clearing the hashtable. And I have to say that it was really worth the effort: the engine now looks much more intelligent or mature (or however you want to express it) and is definitely stronger. Debugging is not _much_ harder, it only costs some additional time (exactly the time your engine spent thinking in the game up to the point in question), and perhaps you do not do it so frequently? So I can only recommend that you revisit your decision about it.

Filip
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: Correlated data discussion - Another Experiment...

Post by tvrzsky »

MartinBryant wrote:Well I'm afraid Glaurung failed the medical! :lol:
Although it appears to support a fixed nodes type of search, when you tell it to search X nodes it actually searches X+a few more, where 'a few more' seems to vary each run. So it was non-deterministic in the fixed nodes game re-runs.

But do not dismay, Twisted Logic has valiantly stepped into the fray and passed all the drugs tests. :)

The first match is underway!
I ran into just the same problem recently. I tried to implement a fixed-nodes search in order to exactly reproduce games previously played under a time control. That means the engine should not break off the search at an arbitrary node, but only at the node where it had previously done so on a timer request. I simply passed in the command the total number of nodes searched (in the previous, timed search) and was surprised that the engine did not always break off the search at the given node count; sometimes it added a few nodes, and sometimes even a few hundred. Obviously this was an unexpected effect of how the engine unwinds the search tree back to the root after the break request. I was too lazy to think it through and did not want to change that part, because it works fine otherwise, so I had to change the node command to take instead the number of the node at which the request to break off the search occurred. Now it works, and the total number of searched nodes is the same in both cases.
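A minimal sketch of the replay mechanism described above, with illustrative names rather than tvrzsky's actual code: in a timed game the engine logs the node count at which the abort request was raised, and on replay it raises the request at exactly that node, so the unwinding back to the root (and therefore the total node count) comes out identical.

Code:

#include <stdint.h>

static uint64_t nodes;          /* nodes searched so far in the current move     */
static uint64_t abort_at_node;  /* 0 in a timed game; the logged value on replay */
static int      abort_search;   /* once set, the search unwinds back to the root */

/* stubs standing in for the engine's own timer and game log */
static int  time_is_up(void)           { return 0; }
static void log_abort_node(uint64_t n) { (void)n; }

static void check_abort(void) {
    if (abort_search) return;
    if (abort_at_node) {           /* replay: request the break at the recorded node     */
        if (nodes >= abort_at_node)
            abort_search = 1;
    } else if (time_is_up()) {     /* timed game: remember where the break was requested */
        abort_search = 1;
        log_abort_node(nodes);     /* replaying this value reproduces the same tree      */
    }
}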
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
xsadar wrote:Like Bob said, there are too many factors, not just node types. It seems highly unlikely to me that computing the average speed of each node type would get me to within a 10% margin of error for a given position.
Well, then it might be 20%, so what? Have you ever tried how much the Elo of your engine suffers when you give it 20% extra time for half the game, and 20% less for the other half? It is probably less than you can measure.

The important artifact of 'st' type time controls is that you cannot give extra time when the engine really needs it (such as a disastrous fail low of the PV move in an iteration it is not allowed to complete). That is where you can earn a lot of Elo points with time management. How you globally divide the time over the game has far less effect, and certainly requires far less intelligence on the part of the time management. (Adapting a single parameter is usually enough to tune that behavior.)

So the most important parts of your time management are tested well if you play a nodes-based classical time control, unlike when you would just play a fixed maximum number of nodes per move. So I would never do the latter, and always recommend the former. It is equally easy to do anyway. Just use the node count that you already need for the WB thinking output, and divide it by the given nps value to convert it to time, instead of reading the clock. You would have to do that in both cases anyway.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

bob wrote:
hgm wrote:
xsadar wrote:Like Bob said, there are too many factors, not just node types. It seems highly unlikely to me that computing the average speed of each node type would get me to within a 10% margin of error for a given position.
Well, then it might be 20%, so what? Have you ever tried how much the Elo of your engine suffers when you give it 20% extra time for half the game, and 20% less for the other half? It is probably less than you can measure.

The important artifact of 'st' type time controls is that you cannot give extra time when the engine really needs it (such as a disastrous fail low of the PV move in an iteration it is not allowed to complete). That is where you can earn a lot of Elo points with time management. How you globally divide the time over the game has far less effect, and certainly requires far less intelligence on the part of the time management. (Adapting a single parameter is usually enough to tune that behavior.)

So the most important parts of your time management are tested well if you play a nodes-based classical time control, unlike when you would just play a fixed maximum number of nodes per move. So I would never do the latter, and always recommend the former. It is equally easy to do anyway. Just use the node count that you already need for the WB thinking output, and divide it by the given nps value to convert it to time, instead of reading the clock. You would have to do that in both cases anyway.
I agree that sn=n is a problem. And in Crafty, st=n is also a problem, because if you set a specific time, I can't override it on fail lows. But it would take a lot of thought to come up with an sn equivalent that would be fair to all...
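For what it's worth, a rough sketch of the node-count clock HGM describes above (illustrative names, not from any particular engine): the node counter already maintained for the WinBoard thinking output is divided by an agreed nps value, and the result is used wherever the engine would otherwise read the wall clock, so the normal time-management logic (fail-low extensions included) still runs, only deterministically.

Code:

#include <stdint.h>

static uint64_t nodes;        /* nodes searched this move (already maintained)           */
static uint64_t virtual_nps;  /* agreed conversion rate, e.g. 1000000; must be nonzero   */

/* centiseconds "elapsed" according to the virtual clock */
static uint64_t virtual_elapsed_cs(void) {
    return nodes * 100 / virtual_nps;
}

/* the usual time check, but against the virtual clock instead of the real one */
static int out_of_time(uint64_t target_cs, uint64_t hard_limit_cs, int failing_low) {
    uint64_t used = virtual_elapsed_cs();
    if (used >= hard_limit_cs) return 1;   /* never exceed the hard limit          */
    if (failing_low)           return 0;   /* allow extra time on a PV fail low    */
    return used >= target_cs;              /* otherwise stop at the nominal target */
}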
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
The performance penalty from searching only complete iterations is clearly less than 100 Elo (and I guess something near 20 Elo), and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so that they can reproduce everything. (My opinion is that it is better to sacrifice 20 Elo for being able to reproduce everything easily.)
I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output showing that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out when the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search, but with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and every other decision we make. But find me any _serious_ chess programmer that would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)".

You can be sure that not all programs at a level close to Crafty's exhibit the variability that you talk about.
When I first ran into this, I tried several: both Glaurungs, fruit, gnuchess, crafty, arasan 9/10, and a couple of others that I won't name but which I could get to run under linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, clears the hash between moves, and does anything else needed to minimize variability. But there certainly are not very many of 'em, because we all take any rating improvement we can get, and we aren't going to throw away 30 Elo here, 20 there; enough of that produces a pure patzer.



Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 Elo weaker than Rybka or 480 Elo weaker than Rybka.

I have not worked on Movei lately, but if I come back to it, then at levels more than 100 Elo weaker than Rybka I will certainly not care about this small improvement that makes results not reproducible and makes it harder to find bugs, because I see the program play a move that I cannot reproduce.

Uri
I'm the opposite, trying to take everything I can get. Most are...
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration, and of course on a ponder hit I may play in 0 seconds rather than at the end of an iteration).

Clearing the hash clearly can help to find bugs; it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and you have only the game.

You know that the program made some mistake that you simply cannot reproduce on the first tries in analysis, because you cannot reproduce the hash situation.

You will have a harder time catching the bug that caused the program to make the mistake.

If you need more time to discover bugs, then it means that you sacrifice future improvements.

Uri
That is an insane approach to designing the strongest possible chess program. Would you then choose not to add major blocks of eval code because they might be harder to debug than if you left them out? Even though the additions are not going to add 100 Elo, or even 20 Elo, do you add them or not? There is only one sane answer there, and it is the same sane answer to any question about sacrificing Elo for (possibly unnecessary) future debugging...
Evaluation is different because the results after adding evaluation terms can be reproduced.

Uri
Are you following this thread? A tiny eval change alters the shape of the game tree. A single node change can change a move or the game result. So no way is that last statement true. Just run a couple of positions, then make a tiny eval change and re-run 'em. And look at the node counts for the same fixed depth. Then this can be put to rest for all time.
My point is about being able to reproduce moves, not about reproducing results, when I wrote that evaluation is different from my point of view.

I was not talking about testing small changes in many games but about answering the question of why the program played a specific move in a game.

If I do not clear the hash it may be harder to reproduce the move (as was explained elsewhere in this thread), so it may be harder to fix bugs.

If I clear the hash after every move and see a move that I do not like, then I can simply run the program again from the same position to get the same move, so it can help to detect bugs.



Uri

You have to clear more than that. You have to clear _everything_ that is used in the search: hash, pawn hash, etc., killer moves and counts, history counts (if you use them). Anything that can change the order in which moves are searched...
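To make Bob's list concrete, here is a sketch with made-up structure names (not Crafty's actual data structures): every table consulted during move ordering gets wiped between searches, not just the main transposition table.

Code:

#include <stdint.h>
#include <string.h>

#define MAX_PLY 128

typedef struct { uint64_t key; int depth, score, move, flags; } hash_entry;

static hash_entry trans_table[1 << 20];   /* main hash                        */
static hash_entry pawn_table[1 << 16];    /* pawn hash                        */
static int        killers[MAX_PLY][2];    /* killer moves per ply             */
static int        history[64][64];        /* from-square x to-square counters */

static void clear_search_state(void) {
    memset(trans_table, 0, sizeof trans_table);
    memset(pawn_table,  0, sizeof pawn_table);
    memset(killers,     0, sizeof killers);
    memset(history,     0, sizeof history);
    /* ...plus anything else that can change the order in which moves are searched */
}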
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

Dirt wrote:
bob wrote:Unfortunately it won't work if the _other_ player also has the potential for varying.
I don't think you even need the other player, just what moves were made and the node counts from the original run.
If you are just debugging, I agree. If you are trying to test and use the results to evaluate changes, this doesn't particularly strike me as workable...
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: 4 sets of data

Post by tvrzsky »

bob wrote:
Uri Blass wrote: My point is about being able to reproduce moves, not about reproducing results, when I wrote that evaluation is different from my point of view.

I was not talking about testing small changes in many games but about answering the question of why the program played a specific move in a game.

If I do not clear the hash it may be harder to reproduce the move (as was explained elsewhere in this thread), so it may be harder to fix bugs.

If I clear the hash after every move and see a move that I do not like, then I can simply run the program again from the same position to get the same move, so it can help to detect bugs.



Uri

You have to clear more than that. You have to clear _everything_ that is used in the search: hash, pawn hash, etc., killer moves and counts, history counts (if you use them). Anything that can change the order in which moves are searched...
Do you really have to do so much cleaning? The main hashtable of course, but the pawn hash? Everything you can get from it (at least that is the case in my engine) you would compute on the fly in the evaluation anyway, so unlike the main hash it cannot transfer any information across the search tree and consequently cannot change it in any way (leaving aside some undetected hash collision). I assume that speed isn't important here.
And as for killer moves and history, would it make any sense _not_ to clear them between searches?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

tvrzsky wrote:
bob wrote:
Uri Blass wrote: My point is about being able to reproduce moves, not about reproducing results, when I wrote that evaluation is different from my point of view.

I was not talking about testing small changes in many games but about answering the question of why the program played a specific move in a game.

If I do not clear the hash it may be harder to reproduce the move (as was explained elsewhere in this thread), so it may be harder to fix bugs.

If I clear the hash after every move and see a move that I do not like, then I can simply run the program again from the same position to get the same move, so it can help to detect bugs.



Uri

You have to clear more than that. You have to clear _everything_ that is used in the search: hash, pawn hash, etc., killer moves and counts, history counts (if you use them). Anything that can change the order in which moves are searched...
Do you really have to do so much cleaning? The main hashtable of course, but the pawn hash? Everything you can get from it (at least that is the case in my engine) you would compute on the fly in the evaluation anyway, so unlike the main hash it cannot transfer any information across the search tree and consequently cannot change it in any way (leaving aside some undetected hash collision). I assume that speed isn't important here.
And as for killer moves and history, would it make any sense _not_ to clear them between searches?
Actually the pawn hash would be OK to leave alone. But the other things would be problematic, since anything that changes the move order changes the shape of the tree, and hence the node count needed to reach identically the same point.

I don't clear the killers. The idea is that a move good in position X is probably good in position Y in a full-width search. I'd rather try the killers from the previous search than nothing, any day. So I never clear them.
Uri Blass
Posts: 10816
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

I clear the killers and I do not think that it is very important.
I am going to get new killers very quickly in the first plies of the search.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

Uri Blass wrote:I clear the killers and I do not think that it is very important.
I am going to get new killers very quickly in the first plies of the search.

Uri
Suppose you do the logical thing, however: if you complete a 20-ply search and then start pondering, you start at ply=19 and not ply=1. Why? You made the first move in the PV and you are pondering the second, so you still have the result of an 18-ply search already computed for the remainder of the PV. Starting at 19 avoids repeated work. And now, suddenly, a good killer at ply=2 becomes very significant.
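As an illustration only (this is an assumption about the bookkeeping, not a description of what Crafty does): if the killers are indexed by ply from the root and are kept across searches, the ponder search can shift them down by the two half-moves just played, so a killer recorded a couple of plies into the previous search stays lined up with the new root area even though no shallow iterations are run to rebuild it.

Code:

#define MAX_PLY 128

static int killers[MAX_PLY][2];   /* two killer moves per ply, kept across searches */

/* after playing PV[0] and starting to ponder PV[1], the position that was at
 * old ply p is now at ply p-2, so shift the retained killers to match */
static void shift_killers_for_ponder(void) {
    for (int ply = 0; ply + 2 < MAX_PLY; ply++) {
        killers[ply][0] = killers[ply + 2][0];
        killers[ply][1] = killers[ply + 2][1];
    }
    killers[MAX_PLY - 2][0] = killers[MAX_PLY - 2][1] = 0;
    killers[MAX_PLY - 1][0] = killers[MAX_PLY - 1][1] = 0;
}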