hgm wrote:You are talking about one search tree in one particular situation. That has nothing to do with the strength of an engine. That is determined by the average over billions of search trees of move quality vs time used.
I have measured the strength of many engines over a wide range of time controls. I have played time-odds matches between engines to systematically measure how the strength varies with thinking time. I have made many small changes to my engines, and tested them extensively.
How much have you done of all this? Are your statements based on anything at all, other than that this sounds plausible to you?

I know you are talking to hristo, but I would claim I have probably played 1,000 times as many games as you have when doing testing. I have played 8,000,000 games in the past 7 months. Have you played 8,000 during that period of time? And I don't mean games in 1 second. I mean games long enough to at least give both sides a chance to tactically understand what is going on...
For the record, when I was testing my cluster code to make sure things looked OK and would run reliably, I played just over 1,000,000 games in one hour. I didn't draw any conclusions from the games. I just wanted to stress-test everything to be sure all matches were played the right number of times, each and every time, and that nobody was losing games on time (these were crafty vs crafty games; most programs seem to break at game/1sec). My 8M doesn't count that run, which I used as a demo on our graphics viswall, showing xboard GUIs for each of the 256 simultaneous games being played...
I too have tried various time controls, attempting to answer the question "Can I get a reliable comparison of A vs A' with very short games?" While the overall results change with time, I found that fast games are still OK when I want to know if A is worse than A', rather than "is A better or worse than each of these 'other' opponents."
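To make the "how many games is enough" point concrete, here is a rough sketch of how one might turn a win/draw/loss count from an A vs A' match into an Elo difference with an approximate 95% interval. This is not my actual cluster code: the function name, the normal approximation, and the 1.96 multiplier are all assumptions chosen for illustration.

```python
import math


def elo_diff_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for A' vs A from match counts.

    Hypothetical helper: uses the standard logistic Elo formula and a
    normal approximation for the error of the mean score per game.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("need at least one game")

    # Mean score per game: win = 1, draw = 0.5, loss = 0.
    score = (wins + 0.5 * draws) / n

    # Per-game variance of the score, then standard error of the mean.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)

    def to_elo(p):
        # Clamp so a 0% or 100% score does not divide by zero.
        p = min(max(p, 1e-9), 1.0 - 1e-9)
        return -400.0 * math.log10(1.0 / p - 1.0)

    return to_elo(score), to_elo(score - z * se), to_elo(score + z * se)


if __name__ == "__main__":
    for games in (300, 3000, 30000):
        third = games // 3
        elo, low, high = elo_diff_with_margin(third, third, third)
        print(f"{games:6d} games: {elo:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

The interval narrows only with the square root of the number of games, which is why a small change between A and A' needs a very large number of games before the result means anything.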
None of this is new, none of it is untried. I am simply trying to test my "game-day" engine as best as is possible, which means not stripping out any more than absolutely necessary (the opening book goes first, then pondering/SMP next), and I still test with both of those enabled a lot of the time to make sure all is still well after recent changes.