jwes wrote:
As we have mentioned repeatedly, the variability you are quoting is too high to be due to randomness. I believe this can be a result of the events in a trial not being independent, e.g. an engine being stronger or weaker than usual for all the games in a set. Do you keep track of the NPS or total nodes analyzed for each engine? Another idea is to put a large quantity of data into SYSTAT or SPSS and look for unexpected correlations. Ask your statistics person for ways to analyze your data for statistical anomalies.

bob wrote:
As I have mentioned repeatedly, you can see what causes the variability by running this test:

jwes wrote:
What they are saying is that the variances you are quoting are much higher than you would get if it were a stochastic process, e.g. if the probabilities of program A against crafty are 40% wins, 30% draws, and 30% losses, and you wrote a program that randomly generated sequences of 100 trials with the above probabilities, you would not have nearly the differences between these sequences that you have been getting. This would strongly suggest problems with the experimental design.

bob wrote:
And suppose the first 100 games end up 80-20, and the second 100 (which you choose not to play) end up 20-80? Then what?
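For concreteness, here is a minimal simulation of the stochastic model jwes describes. The 40/30/30 win/draw/loss probabilities and the 100-game match length come straight from the quote; the number of simulated matches, the seed, and the rest of the sketch are assumed purely for illustration.

Code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MATCHES 1000   /* number of simulated 100-game matches (assumed) */
#define GAMES   100    /* games per match, from the quote                */

/* Play one simulated 100-game match with fixed probabilities:
   40% win, 30% draw, 30% loss.  Win = 1 point, draw = 0.5.     */
static double play_match(void)
{
    double score = 0.0;
    for (int g = 0; g < GAMES; g++) {
        double r = (double) rand() / RAND_MAX;
        if (r < 0.40)
            score += 1.0;   /* win  */
        else if (r < 0.70)
            score += 0.5;   /* draw */
        /* else: loss, no points */
    }
    return score;
}

int main(void)
{
    double sum = 0.0, sumsq = 0.0, min = GAMES, max = 0.0;

    srand(12345);   /* fixed seed so the experiment is repeatable */
    for (int m = 0; m < MATCHES; m++) {
        double s = play_match();
        sum   += s;
        sumsq += s * s;
        if (s < min) min = s;
        if (s > max) max = s;
    }

    double mean = sum / MATCHES;
    double sd   = sqrt(sumsq / MATCHES - mean * mean);

    printf("mean %.1f  sd %.1f  min %.1f  max %.1f\n", mean, sd, min, max);
    return 0;
}

With independent games at those probabilities the match scores cluster around 55 with a standard deviation of only about 4 points, which is the point being made: an 80-20 result followed by a 20-80 result cannot come from a process like this.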
NPS is very consistent unless something bad happens, such as a user logging in and using a node allocated to me. But I have crafty monitor for that: it logs any sudden NPS drop (no tablebases are used, so there is nothing to slow the program down naturally), terminates the match, and informs the referee to abort the entire run. That happens maybe once every 2-3 months. Otherwise, there is nothing going on. There is certainly a very small random effect caused by the operating system itself, since it has to deal with network traffic, system accounting functions that happen at random times, etc. But that is all. When I run an important test, I simply lock the cluster down so that nothing is going on except my stuff.
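A rough sketch of that kind of NPS sanity check looks like this. This is not Crafty's actual code; the baseline speed and the 80% threshold are assumed values chosen only to illustrate the idea.

Code:
#include <stdio.h>
#include <stdlib.h>

#define NPS_BASELINE  2500000.0   /* assumed nodes/second on an idle node */
#define NPS_TOLERANCE 0.80        /* abort if we fall below 80% of it     */

/* Called after each search with the node count and elapsed time.
   A sustained drop means something else is running on the node
   (no tablebases are in use, so nothing slows the engine naturally). */
static void check_nps(unsigned long long nodes, double seconds)
{
    double nps = nodes / seconds;

    if (nps < NPS_BASELINE * NPS_TOLERANCE) {
        fprintf(stderr, "NPS dropped to %.0f (baseline %.0f), aborting run\n",
                nps, NPS_BASELINE);
        /* In the real setup the referee is told to discard the whole run. */
        exit(EXIT_FAILURE);
    }
}

int main(void)
{
    check_nps(240000000ULL, 100.0);   /* 2.4M nps: fine              */
    check_nps(150000000ULL, 100.0);   /* 1.5M nps: triggers an abort */
    return 0;
}

The result is that a contaminated match never makes it into the data; the whole run is thrown away instead of being kept with a bad result in it.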
jwes wrote:
One idea is to set the other engines (but not crafty) to search to a fixed depth. This should reduce variability and make those engines play at a more consistent strength level.

bob wrote:
If you claim that is a fault of the setup, then feel free to suggest a solution. But the solution has to involve not modifying all the programs to search in a way that is different from how they normally work.
What does that accomplish? It biases the results, because there is no single depth that is appropriate for the entire game. I tried that, and other things, while trying to understand the randomness; I originally modified several programs to stop after a fixed number of nodes (far better than fixed depth). And surprise, the randomness went away when all players did that (including crafty). But making such a change then gives significantly different results, because even a simple change alters the shape of the tree.
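To make the fixed-nodes point concrete, here is a toy sketch (not any real engine's search driver; the budgets and speeds are made-up numbers). With a node budget the stop condition does not depend on how fast the node happens to be running, so the tree searched is identical from run to run; with a time budget it is not.

Code:
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct limits {
    bool     use_nodes;    /* true: stop on a node budget */
    uint64_t max_nodes;    /* deterministic budget        */
    double   max_seconds;  /* wall-clock budget           */
};

/* Simulate a search on a machine running at `speed` nodes/second and
   return how many nodes were searched before the stop condition hit. */
static uint64_t simulate_search(const struct limits *lim, double speed)
{
    uint64_t nodes = 0;

    for (;;) {
        nodes += 1000;                       /* pretend we searched 1000 nodes */
        double seconds = nodes / speed;
        if (lim->use_nodes ? nodes >= lim->max_nodes
                           : seconds >= lim->max_seconds)
            return nodes;                    /* tree size at the moment we stop */
    }
}

int main(void)
{
    struct limits by_time  = { false, 0,        10.0 };
    struct limits by_nodes = { true,  5000000,  0.0  };

    /* The "same" game on an idle node vs. a slightly loaded one: */
    printf("time limit : %llu vs %llu nodes\n",
           (unsigned long long) simulate_search(&by_time, 1000000.0),
           (unsigned long long) simulate_search(&by_time,  950000.0));
    printf("node limit : %llu vs %llu nodes\n",
           (unsigned long long) simulate_search(&by_nodes, 1000000.0),
           (unsigned long long) simulate_search(&by_nodes,  950000.0));
    return 0;
}

The time-limited runs stop at different tree sizes and so can return different moves; the node-limited runs stop at exactly the same point every time, which is why the randomness disappears, but it is also a different game from the one the engines would play on the clock.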
I am trying to test the way I have to play. That is the only kind of test that is useful for improving my performance against other programs in real events.