An objective test process for the rest of us?


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:You are talking about one search tree in one particular situation. That has nothing to do with the strength of an engine. That is determined by the average, over billions of search trees, of move quality vs. time used.

I have measured the strength of many engines over a wide range of time controls. I have played time-odds matches between engines to systematically measure how the strength varies with thinking time. I have made many small changes to my engines, and tested them extensively.

How much have you done of all this? Are your statements based on anything at all, other than that this sounds plausible to you?
I know you are talking to hristo, but I would claim I have played probably 1,000 times as many games as you have when doing testing. I have played 8,000,000 games in the past 7 months. Have you played 8,000 during that period of time? And I don't mean games at 1 second each; I mean games long enough to at least give both sides a chance to understand tactically what is going on...

For the record, when I was testing my cluster code to make sure things looked ok and would run reliably, I played just over 1,000,000 games in one hour. I didn't draw any conclusions from those games; I just wanted to stress-test everything to be sure all matches were played the right number of times, each and every time, and that nobody was losing games on time (these were Crafty vs. Crafty games; most programs seem to break at game/1sec). My 8M doesn't count that run, which I also used as a demo on our graphics viswall, showing an xboard GUI for each of the 256 simultaneous games being played...


I too have tried various time controls, attempting to answer the question "Can I get a reliable comparison of A vs A' with very short games?" While the overall results change with the time control, I found that fast games are still OK when I want to know if A is worse than A', rather than "is A better or worse than each of these 'other' opponents?"
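
As a rough illustration (not data from this thread), here is a minimal sketch of one common way to turn such a fast A-vs-A' match into a better/worse verdict: a likelihood-of-superiority estimate. The win/loss counts below are made-up placeholders.

Code: Select all

# Minimal sketch (illustrative only): likelihood-of-superiority estimate for
# an A vs. A' head-to-head match. Draws cancel out of the win/loss comparison.
import math

def likelihood_of_superiority(wins, losses):
    """Normal approximation to P(A' is stronger) from a head-to-head match."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Made-up placeholder counts for a large fast-time-control match:
print(likelihood_of_superiority(460, 400))   # ~0.98, so A' is very likely the better version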

None of this is new, none of it is untried. I am simply trying to test my "game-day" engine as best as is possible, which means not stripping out any more than absolutely necessary (the opening book first, then pondering/SMP next), and I still test with both of those enabled a lot of the time to make sure all is still well after recent changes.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:That is certainly true. I don't know about those specific time controls, but I have run a bunch of different time tests on my cluster to accurately answer the question "do I need longer games, or will short games give me as reliable an answer?" Short games are fine. But I have examples where Crafty will roll over a program at very fast time controls, and they even up as the time control becomes more reasonable. The opposite is true also.
And this happens for a mere 10% change in time control? How much Elo per factor of two in time is the largest you have ever seen between two programs?
So yes, changing the time control can change the result. If you play enough games, the change is predictable also. For small numbers of games, the results are quite random.
I do not measure "Elo". In fact, I don't care about Elo since it is not clear to me that it applies directly to computer programs as opposed to humans. I only care about 'better/worse/equal". Which is true of most chess-related changes I make. Again, I trust alpha/beta in my search, and I trust a similar approach in the development. "best" is good enough, I don't always need to know "how much better"...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Well, one more reason not to test under SMP.

That you have a crappy SMP implementation that might at any moment reveal new, hitherto undiscovered bugs is another issue. There is no reason at all why you should hunt for those bugs under the same conditions as for testing eval changes.

This seems in fact a very obvious example why testing under tournament conditions is undesirable: under tournament conditions you would play in a configuration that makes it very unlikely that new bugs would reveal themselves, while under testing you would like them to manifest themselves abundantly. To catch the bugs I would play with hundreds of processors (virtual ones, of course, just run 128 processes per core) to force more split points, more communication, more of everything, to stress the system to breaking point. That is how you find bugs, so that they won't surprise you later.
Doesn't even begin to "stress things". In fact, the more "virtual processes" you run, the _less_ you stress things. The stress only comes from real concurrency, which that approach doesn't provide. From experience, I am more likely to find bugs when I have _exactly_ the right number of threads, which gives maximum concurrency and tends to produce those difficult-to-reproduce timing bugs more frequently. I've tried testing like that many times and it just doesn't help. It might find elementary bugs, but not very obscure ones.

You also apparently missed the note that I sometimes change the SMP algorithm itself, and that change can produce stronger or weaker play by itself. And non-SMP issues like breaking move ordering affect the SMP version far more than they affect the non-SMP version. These are not independent parts of the code that don't interact; a change to one can affect the others, so they all need testing together.
hgm
Posts: 27855
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:I too have tried various time controls, attempting to answer the question "Can I get a reliable comparison of A vs A' with very short games?" While the overall results change with time, I found that fast games are still OK when I want to know if A is worse than A', rather than "is A better or worse than each of these 'other' opponents."
Well, this is exactly what I claimed, and what Hristo denies. I suppose that when you say 'different time controls' you don't mean 40 min vs. 41 min, but something more like 5 min vs. 40 min.

I agree that the Elo model might not fit engine behavior very well, but it would be nice to be somewhat more quantitative than "at long time control A is better, and at short time control B". For one, how different are these time controls, and does the score flip from 900-100 to 200-800, or merely from 520-480 to 460-540? I would consider those totally different situations, and from your statements we have no clue at all.

Yes, I must have played 8,000 games, as my standard gauntlet is 480 games (24 opponents, a 20-game Nunn match), and I have run it many times, both on uMax and Joker. The longest time control I did was 40/10, though, and I play 40/2 more often (on a 2.4GHz Core 2). Joker is somewhat better at 2 min with respect to the others than at the 5x slower time control, but my experience is also that the Joker versions that do best at 10' also do best at 2'.

And yes, 8,000 is a lot less than 8,000,000, but good enough to see such trends. And my comment was indeed directed at Hristo, whose evidence I suspect to be entirely virtual.

Well, if you change the SMP, you of course test it. I normally test with ponder off, but if I change the ponder algorithm I of course run a test with ponder on to check if I messed up. But that doesn't mean I will from now on always test with ponder on.

I still would like to know how well your results for eval changes with SMP correlate with those on a single CPU (for equal search depth). Will you in the end have to supply a number of totally different evaluations, one for a single CPU, one for a dual, and one for quads/octals?
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:
bob wrote:I too have tried various time controls, attempting to answer the question "Can I get a reliable comparison of A vs A' with very short games?" While the overall results change with time, I found that fast games are still OK when I want to know if A is worse than A', rather than "is A better or worse than each of these 'other' opponents."
Well, this is exactly what I claimed, and what Hristo denies.
H.G,
I guess you have misunderstood me. I said that the relative strength of the engines depends on the time limits. You claimed
hgm wrote:(The strength difference between two engines is hardly a function of the amount of search time given to them.)
I took "strength difference" to indicate #games won vs #games lost, rather than simply "strength relation, where A is stronger than B".

I did try to be perfectly clear in expressing my concern, and I believe you understood my point, with which Bob seems to agree -- the strength difference between engines fluctuates with respect to time (per move).

Right now, I'm not sure what it is you claim I'm denying, since you already proposed an example that showed exactly such a fluctuation.
hgm wrote: And yes, 8,000 is a lot less than 8,000,000, but good enough to see such trends. And my comment was indeed directed at Hristo, whose evidence I suspect to be entirely virtual.
Here you go again ...
It is not clear that you understand the argument being made, and it isn't clear that you are trying to find out ... you "suspect". ;-)

At any rate,
I have run tests -- more than you and significantly fewer than Bob, although none of them were run with the intent to determine how the "strength difference" is influenced by time -- I only know that it is.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:I too have tried various time controls, attempting to answer the question "Can I get a reliable comparison of A vs A' with very short games?" While the overall results change with time, I found that fast games are still OK when I want to know if A is worse than A', rather than "is A better or worse than each of these 'other' opponents."
Well, this is exactly what I claimed, and what Hristo denies. I suppose that when you say 'different time controls' you don't mean 40min vs 41 min, but more something like 5min vs 40min.

I tried several combinations: 1+0, 1+1, 2+1, 3+2, 5+3, 5+5, 10+10, 20+20, 30+30 and 60+60. There might be one or two others (game in one second is not in there, as there is no way to represent that in the time+inc syntax of ICC). I found that all were pretty equal for my purposes. We ran A and A' at all of those time controls, and the results were uniform in that A' was always better. However, as we went to shorter time controls, the match results often had significant swings. There are some programs that Crafty rolls up in very short time control games, and there are some programs that roll it up in short games; things get closer at reasonable time limits. But my only interest was "does A' score better than A?" and I didn't find any cases where that flip-flopped (which doesn't mean they don't exist, however).

I agree that the Elo model might not fit engine behavior very well, but it would be nice to be somewhat more quantitative than "at long time control A is better, and at short time control B". For one, how different are these time controls, and does the score flip from 900-100 to 200-800, or merely from 520-480 to 460-540? I would consider those totally different situations, and from your statements we have no clue at all.
I don't save all of that kind of stuff, or else I would be swamped in data. But I do remember one example where at 1+0 (game in one minute) I really waxed one opponent, and by the time we got to 5+5 I was just beating it significantly. Perhaps something like +60 (in an 80-game match) at 1+0, to +30 or +40 at 5+5...

If I were trying to find if A was better than X, I would probably prefer time controls close to what I would expect to see in the kind of tournament we might meet in. But here the faster the better so I can complete the tests quicker...

Yes, I must have played 8,000 games, as my standard gauntlet is 480 games (24 opponents, a 20-game Nunn match), and I have run it many times, both on uMax and Joker. The longest time control I did was 40/10, though, and I play 40/2 more often (on a 2.4GHz Core 2). Joker is somewhat better at 2 min with respect to the others than at the 5x slower time control, but my experience is also that the Joker versions that do best at 10' also do best at 2'.

And yes, 8,000 is a lot less than 8,000,000, but good enough to see such trends. And my comment was indeed directed at Hristo, whose evidence I suspect to be entirely virtual.

Well, if you change the SMP, you of course test it. I normally test with ponder off, but if I change the ponder algorithm I of course run a test with ponder on to check if I messed up. But that doesn't mean I will from now on always test with ponder on.

I still would like to know how well your results for eval changes with SMP correlate with those on a single CPU (for equal search depth). Will you in the end have to supply a number of totally different evaluations, one for a single CPU, one for a dual, and one for quads/octals?
hgm
Posts: 27855
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

hristo wrote:At any rate,
I have run tests -- more than you and significantly fewer than Bob, although none of them were run with the intent to determine how the "strength difference" is influenced by time -- I only know that it is.
Well, then I am not saying too much, am I? How can you make statements on the influence of time on relative strength if you did not play games to determine this?

I guess you misunderstand the word "hardly" in my statement. Look at what Bob gives when I ask for an extreme example: he says the score drops from 87.5% to 69-75%. On an Elo scale that would mean a drop from +325 Elo to +140-200 Elo. So that is a drop of about 150 Elo when the time is varied over a factor of 10 (5'+5" = ~10', if you assume an average game lasts 60 moves). I would call that 'hardly a function of time': translated to a 10% time difference (assuming Elo is linear against log(time)), this is only a 6 Elo difference, i.e. unmeasurably small if you don't play at least 10,000 games.
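
For anyone who wants to check that arithmetic, here is a minimal sketch of the conversions, assuming the standard logistic Elo formula and that Elo is linear in log(time); the 87.5% and 69-75% scores are the ones quoted above from Bob's 80-game example.

Code: Select all

# Minimal sketch: convert match scores to Elo and scale the drop per factor
# of 10 in time down to a 10% time change (assumes Elo ~ linear in log(time)).
import math

def score_to_elo(score):
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

elo_fast = score_to_elo(0.875)                     # ~ +338 (the "+325" above is a rounder figure)
elo_slow = score_to_elo(0.75), score_to_elo(0.69)  # ~ +191 .. +139 (the "+200 .. +140" above)

drop_per_factor_10 = elo_fast - elo_slow[0]        # ~ 150 Elo for 10x more time
drop_per_10_percent = drop_per_factor_10 * math.log10(1.10)  # ~ 6 Elo for a 10% time change

print(round(elo_fast), [round(e) for e in elo_slow])
print(round(drop_per_factor_10), round(drop_per_10_percent, 1))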

And although this is a bit more than I expect (even for an extreme example picked out of a very large data set), you should realize that we are comparing here a +0-increment time control to a non-zero increment, which really makes it a completely different game. So you don't really measure the pure effect of increasing the time; it could very well be that this particular opponent has an algorithm to divide its time over the moves that fails badly at +0 time controls, and that it would get equally whacked at 10+0. I have seen that with my engines: some opponents just used a little bit less time per move, but in games that dragged out to an ending, where you really had to play fast, the saved time had accumulated to the point where it was large compared to the remaining time. And with a factor of 3 more time, they would win every game through a tactical shot.

Anyway, this is for very different engines, and even in a selected extreme example the dependence is not more than 6 Elo for a 10% change. From Bob's post you can see that for versions A and A' he sees no effect on their strength difference while varying the time over a factor of 120 (60'+60" = ~120', vs. 1'+0)!

This totally confirms my original statements: the relative strength difference between two versions of the same engine will not change measurably if you change the time at which you test by 10%. Even if you vary it 50 times more than that (120 ≈ 1.1^50), Bob is not able to measure the difference with the millions of test games he plays.

So why would your test accuracy suffer when 10% of the search time of the engines gets wasted on redoing the (N-2)-ply search that was already in the hash? Or when 15% gets wasted because once every 3 moves you continue searching 50% longer to finish the iteration while you already had a satisfactory move, and could have interrupted it?
hristo

Re: An objective test process for the rest of us?

Post by hristo »

hgm wrote:
hristo wrote:At any rate,
I have run tests -- more than you and significantly fewer than Bob, although none of them were run with the intent to determine how the "strength difference" is influenced by time -- I only know that it is.
Well, then I am not saying too much, am I? How can you make statements on the influence of time on relative strength if you did not play games to determine this?
I observed it as a side effect (repeatedly) but it never was important to me.
hgm wrote: I guess you misunderstand the word "hardly" in my statement. ...
hehe ... this is also possible.
hgm wrote: This totally confirms my original statements: the relative strength difference between two versions of the same engine will not change measurably if you change the time at which you test by 10%.
This is a measurably different statement when compared to the one I initially responded to. :-)

With respect to my experience:
The observed variance in relative playing strength between the engines was at shorter time controls (60/5min, 60/10min, 60/20min and a few at 60/60). I was also, similar to Bob, looking for a test that would answer "Is A stronger than B?" rather than one that would determine "How much stronger is A compared to B?". I don't have the ability to run a significant number of tests at longer time controls, and the largest time increment that I used is +2.

I don't have any particular reason to doubt that such variance exists at longer time controls, even though it was obvious that the variance decreased as the available time increased.

As a game progresses, the time available to the engine is reduced, and as a result the relative strength difference between the engines changes towards the end of the game, compared to what it was directly after leaving the opening book.

The 10% speed decrease that you speak of can become nearly 50% (or more) towards the end of the game, and all of this contributes to the observed variability in the results -- assuming the games are not completely decided before entering the 'blitz'-like state at the end of the game. Engines that are roughly equal in strength will often go all the way into the endgame and are more likely to enter 'blitz' mode.
Of course, all of this is contingent upon the time allocation algorithm; linear time allocation would help mitigate the situation described above, while at the same time it might make the engine weaker.
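
To make that concrete, here is a minimal sketch with made-up numbers (a 5-minute sudden-death game, 60 moves, half a second of fixed per-move overhead -- illustrative assumptions, not measurements from any engine) contrasting a "spend a fraction of the remaining time" policy, which ends in blitz mode, with a linear "remaining time divided by moves left" policy, which does not.

Code: Select all

# Minimal sketch (illustrative numbers only) of how per-move search time
# shrinks over a sudden-death game under two simple allocation policies,
# which is why a fixed per-move overhead hurts proportionally more late on.
def simulate(policy, total_seconds=300.0, moves=60, overhead=0.5):
    """Return the seconds actually available for search on each move."""
    remaining = total_seconds
    budgets = []
    for move in range(moves):
        if policy == "fraction":
            alloc = remaining / 30.0              # always spend ~1/30 of what is left
        else:  # "linear"
            alloc = remaining / (moves - move)    # spread what is left evenly
        budgets.append(max(alloc - overhead, 0.0))  # fixed overhead eats into the budget
        remaining -= alloc
    return budgets

for policy in ("fraction", "linear"):
    b = simulate(policy)
    # fraction: ~9.5s, ~3.1s, ~0.9s of real search; linear: ~4.5s throughout
    print(policy, round(b[0], 1), round(b[30], 1), round(b[-1], 1))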

Again, all of this is based on my limited ability to perform tests. If you demand exhaustive tests then Bob is the man. Besides, he has a lot more experience than me and can probably respond to some of these issues without having to perform 'new' tests.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
hristo wrote:I believe that most strong engines will show fluctuations in their relative strength to one another as a function of time given to them (some engines will perform relatively better depending on the time control).
You believe, you suspect...

Can't you be a bit more concrete?

You claim that there is a pair of engines that plays 54-46 at 40 moves/40 minutes, and that that result will change to 42-58 at 50 moves/45 min?

If so, (or something similar), which engines are this?
We are getting well out of reality. I can produce two engines that will play 54-46 at 40 moves in 40 minutes, then play 42-58 at the same time control the next time it is run...
hgm
Posts: 27855
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Of course. If you try often enough, you can get anything. So I would only be impressed if you got this on a first try, not if you were to dig it up from a million games...

The difference in scores I chose is actually ~2*sigma, meaning this would occur once every 50 times you try. So I am not surprised that you can show it from your significantly larger data set.
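
For reference, here is a minimal sketch of where the ~2*sigma figure comes from, assuming two evenly matched engines, independent 100-game matches, and (purely as an illustrative assumption) a 30% draw rate.

Code: Select all

# Minimal sketch: how unlikely is a 54-46 result turning into 42-58 on a rerun
# if the two engines are actually dead equal? (The 30% draw rate is assumed.)
import math

games = 100
draw_rate = 0.30                                   # assumption for illustration
var_per_game = (1.0 - draw_rate) * 0.25            # score variance of one game at a 50% expected score
sigma_match = math.sqrt(games * var_per_game)      # ~4.2 points per 100-game match
sigma_diff = math.sqrt(2.0) * sigma_match          # ~5.9 points between two independent matches

swing = 54 - 42                                    # the 54-46 -> 42-58 swing, in points
z = swing / sigma_diff                             # ~2 sigma
p = 0.5 * math.erfc(z / math.sqrt(2.0))            # one-sided tail: ~2%, i.e. roughly once in 50 tries

print(round(sigma_diff, 1), round(z, 1), round(1.0 / p))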

But I did not want to be too hard on Hristo (and at the same time I wanted to probe how many games he had real data for). :lol:

Of course, the fact that you can get it with the same engines does prove the point that even if he had actually observed this, it would be no proof whatsoever that the strength difference between the engines actually did vary with the time control.