An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

jwes wrote:
bob wrote:
jwes wrote:
bob wrote:
And suppose the first 100 games end up 80-20, and the second hundred (which you choose not to play) end up 20-80? Then what?
What they are saying is that the variances you are quoting are much higher than you would get if it were a stochastic process, e.g. if the probabilities of program A against crafty are 40% wins, 30% draws, and 30% losses, and you wrote a program that randomly generated sequences of 100 trials with the above probabilities, you would not have nearly the differences between these sequences that you have been getting. This would strongly suggest problems with the experimental design.
As I have mentioned repeatedly, you can see what causes the variability by running this test:
As we have mentioned repeatedly, the variability you are quoting is too high to be due to randomness. I believe this can be a result of the events in a trial not being independent, e.g. an engine being stronger or weaker than usual for all the games in a set. Do you keep track of the NPS or total nodes analyzed for each engine? Another idea is to put a large quantity of data into SYSTAT or SPSS and look for unexpected correlations. Ask your statistics person for ways to analyze your data for statistical anomalies.
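
For illustration, the experiment jwes describes takes only a few lines. A minimal Python sketch, using his hypothetical 40/30/30 probabilities (names and counts here are illustrative, not anyone's actual test harness):

Code: Select all

import random
import statistics

P_WIN, P_DRAW = 0.40, 0.30      # jwes's hypothetical probabilities; the rest are losses
GAMES, MATCHES = 100, 1000

def match_score():
    """Total score of one simulated 100-game match: win = 1, draw = 0.5, loss = 0."""
    score = 0.0
    for _ in range(GAMES):
        r = random.random()
        if r < P_WIN:
            score += 1.0
        elif r < P_WIN + P_DRAW:
            score += 0.5
    return score

scores = [match_score() for _ in range(MATCHES)]
print(f"mean score over {MATCHES} matches: {statistics.mean(scores):.1f} / {GAMES}")
print(f"standard deviation              : {statistics.stdev(scores):.1f}")
# Analytic check: per game E[x] = 0.55 and E[x^2] = 0.475, so the per-game
# variance is 0.475 - 0.55**2 = 0.1725 and the stdev of a 100-game score is
# sqrt(100 * 0.1725), about 4.2 points. Swings from 80-20 to 20-80 are some
# six standard deviations out on each side, far beyond pure randomness.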

NPS is very consistent unless something bad happens, such as a user logging in and using a node allocated to me. But I have crafty monitor for that: it logs a sudden NPS drop (no tablebases are used, so there is nothing to slow the program naturally) and then terminates the match and informs the referee to abort the entire run. That happens maybe once every 2-3 months. Otherwise, there is nothing going on. There is certainly a very small random effect caused by the operating system itself, since it has to deal with network traffic, system accounting functions that happen at random times, etc. But that is all. When I run an important test, I simply lock the cluster down so that nothing is going on except my stuff.
bob wrote:If you claim that is a fault of the setup, then feel free to suggest a solution. But the solution has to involve not modifying all the programs to search in a way that is different from how they normally work.
One idea is to set the other engines (but not crafty) to search to a fixed depth. This should reduce variability and make those engines play at a more consistent strength level.


What does that accomplish? It biases the results, because there is no one depth that is appropriate for the entire game. I did that and other things while trying to understand the randomness, because I originally modified several programs to stop after a fixed number of nodes (far better than fixed depth). And surprise: the randomness went away when all players did that (including crafty). But making such a change gives significantly different results, because even a simple change alters the shape of the tree.
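
A fixed-node cutoff of the kind described is typically just a counter check inside the search. A minimal sketch (entirely hypothetical code, not Crafty's actual implementation; the evaluation and move loop are stand-ins):

Code: Select all

import random

random.seed(0)                  # fixed seed: the stand-in evaluation repeats exactly
NODE_LIMIT = 100_000
nodes = 0

class SearchAborted(Exception):
    """Raised the moment the node budget is exhausted."""

def search(depth, alpha=-10_000, beta=10_000):
    global nodes
    nodes += 1
    if nodes >= NODE_LIMIT:     # depends only on the count, never on machine speed
        raise SearchAborted
    if depth == 0:
        return random.randint(-100, 100)    # stand-in for a real static evaluation
    for _ in range(30):                     # stand-in for a legal-move loop
        score = -search(depth - 1, -beta, -alpha)
        if score > alpha:
            alpha = score
        if alpha >= beta:
            break
    return alpha

try:
    search(depth=8)
except SearchAborted:
    print(f"stopped after exactly {nodes} nodes, identical on every run")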

I am trying to test like I have to play. That's the only test that is useful for trying to improve my performance against other programs in real events.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
"most". I'm more pessimistic having done this for so long. I can't count the number of advantage players that believe they are being cheated. I have seen people post that Norm Wattenberger's CVBJ program is "biased" (some say it gives you better cards than expected, some say worse). All because they assume those "extremely rare events" will never happen. I've lived thru the incredible negative variance streaks too many times to count, as well as the incredible win streaks that also come along.
Well, I just had my first royal flush the other day, and I haven't played that many hands yet. Just shrugged it off...

Perhaps if I had been able to pay someone off with it I would have been more excited...
I think that is once every 40,000 hands. That is what I remember from AP friends that beat up on video poker machines.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote:
"most". I'm more pessimistic having done this for so long. I can't count the number of advantage players that believe they are being cheated. I have seen people post that Norm Wattenberger's CVBJ program is "biased" (some say it gives you better cards than expected, some say worse). All because they assume those "extremely rare events" will never happen. I've lived thru the incredible negative variance streaks too many times to count, as well as the incredible win streaks that also come along.
Well, I just had my first royal flush the other day, and I haven't played that many hands yet. Just shrugged it off...

Perhaps if I had been able to pay someone off with it I would have been more excited...
I think that is once every 40,000 hands. That is what I remember from AP friends that beat up on video poker machines.
Well, this was Texas hold'em (and there you don't always get to see the river), so the numbers will be different from video poker. Could be higher, could be lower. Perhaps I should really be complaining that I have gotten fewer than my share, and accuse someone of cheating :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: My point is this. In this context, correlated means that the games are somehow connected, that the result of one is somehow related to the result of another. That is impossible.
No it doesn't mean that.
Let's come back to reality. Define "correlation". To say that two results are correlated means you can use one to predict the other. There are tons of data values that appear to be correlated, but are not except in the pure mathematical sense.

For example, I challenge you to take any game of an 80-game match and predict the outcome of any other single game in a significant way. And then you are going to bet a lot of money on that, since you are sure about what "correlation" means.

These games are as independent as two trials can be. Between two equal opponents, using one result to predict another would be dangerous and you'd be flipping a coin.

If you want to say some positions are correlated because of the skill of one of the players in certain types of positions, who cares? I can't control which opening I play when my opponent gets to choose moves on his turn.

I only care about effects that I can control and use...

When I try several different sets of test positions and all say "these two programs are pretty equal" then I trust that result. And when one fairly broad test set gives the same information, I trust it as well after having gone through it with a careful eye...

It would only mean that the results are correlated. I already gave an example besides my extreme illustration:

If one starting position comes from the Sicilian Dragon and the other occurs in the King's Indian, it is not completely out of the question that engines that score well in one position will also do so in the other, while for other engines it will be the opposite (I mean they will do relatively better in other positions).

(Feel free to include better examples, please don't latch onto this particular example.)

So if you analyse those positions statistically (and not by qualitative reasoning like the one I used to come up with the example), it is possible that some positions within the test suite will show such a correlation. I am not saying that it will be the case for sure, just that it is possible.

And if such a correlation can be discovered, one of the positions can be eliminated from the suite, because it does not add significant information in relation to the effort it takes to include it in the test. Perhaps 40 positions is the sweet spot for you, but if it were 4000 positions that had been selected using just the same care but no analysis of the kind I am describing, surely you would have to find a way to reduce the number of positions to make them more manageable? Or is 40 some kind of number passed down from heaven that is guaranteed to be the exact number you need?
I don't buy that at all. Say I lose two different positions regularly because I don't have a key piece of evaluation, so I cull one of them as redundant. Had I kept it, I would have discovered later that an eval change changes the result of one of them but not the other, because they are not the same identical position. For a primitive program, you might cull 90% of the positions using that kind of analysis.

Statistics is not the end-all when choosing how to test a program. Experience helps a _whole_ lot.



Put another way, if you had to start just with the starting position, or, without loss of generality, any given position from the set of 40:
Then you would argue that you would need more positions to get more variety, for example, one that is "sharp" in nature and one that is "quiet" in nature. And then you would break it down further. And at some level you would find that the next position will not have positive marginal utility, so to speak.
Except for the math. The tree is huge. One doesn't have to travel down the tree very deeply to reach billions of different positions. And choosing a representative sample of those eliminates the need to test with full opening books and to deal with that level of non-determinism.
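
For what it is worth, the correlation screen nczempin proposes is straightforward to sketch. The scores below are invented purely for illustration (each key is a test position, each list holds one score per hypothetical engine):

Code: Select all

import statistics

# Invented data: per-position scores for five hypothetical engines.
scores = {
    "sicilian_dragon": [0.61, 0.48, 0.55, 0.39, 0.52],
    "kings_indian":    [0.58, 0.45, 0.57, 0.41, 0.50],
    "qgd":             [0.44, 0.60, 0.38, 0.63, 0.49],
}

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

names = list(scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = pearson(scores[a], scores[b])
        note = "  <- adds little information, candidate to cull" if r > 0.9 else ""
        print(f"{a} vs {b}: r = {r:+.2f}{note}")
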
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: But every program I have personally tested and use in my tests certainly does do this. So for _most_ of us, the number of games needed to predict progress with high confidence is quite large. And all the arguing in the world is not going to change that.
Let me try to follow your logic here:

"every program I have tested does this, therefore for most of us..."

For me there is a huge jump in reasoning that I don't understand.

It would mean that you are somehow equivalent to most of us. Just looking at the members active in this discussion would show you that we are "most of us", not you.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote: My point is this. In this context, correlated means that the games are somehow connected, that the result of one is somehow related to the result of another. That is impossible.
No it doesn't mean that.
Let's come back to reality. Define "correlation". To say that two results are correlated means you can use one to predict the other. There are tons of data values that appear to be correlated, but are not except in the pure mathematical sense.

For example, I challenge you to take any game of an 80-game match and predict the outcome of any other single game in a significant way. And then you are going to bet a lot of money on that, since you are sure about what "correlation" means.

These games are as independent as two trials can be. Between two equal opponents, using one result to predict another would be dangerous and you'd be flipping a coin.

If you want to say some positions are correlated because of the skill of one of the players in certain types of positions, who cares? I can't control which opening I play when my opponent gets to choose moves on his turn.

I only care about effects that I can control and use...

When I try several different sets of test positions and all say "these two programs are pretty equal" then I trust that result. And when one fairly broad test set gives the same information, I trust it as well after having gone through it with a careful eye...

It would only mean that the results are correlated. I already gave an example besides my extreme illustration:

If one starting position comes from the Sicilian Dragon and the other occurs in the King's Indian, it is not completely out of the question that engines that score well in one position will also do so in the other, while for other engines it will be the opposite (I mean they will do relatively better in other positions).

(Feel free to include better examples, please don't latch onto this particular example.)

So if you analyse those positions statistically (and not by qualitative reasoning like the one I used to come up with the example), it is possible that some positions within the test suite will show such a correlation. I am not saying that it will be the case for sure, just that it is possible.

And if such a correlation can be discovered, one of the positions can be eliminated from the suite, because it does not add significant information in relation to the effort it takes to include it in the test. Perhaps 40 positions is the sweet spot for you, but if it were 4000 positions that had been selected using just the same care but no analysis of the kind I am describing, surely you would have to find a way to reduce the number of positions to make them more manageable? Or is 40 some kind of number passed down from heaven that is guaranteed to be the exact number you need?
I don't buy that at all. Say I lose two different positions regularly because I don't have a key piece of evaluation, so I cull one of them as redundant. Had I kept it, I would have discovered later that an eval change changes the result of one of them but not the other, because they are not the same identical position. For a primitive program, you might cull 90% of the positions using that kind of analysis.

Statistics is not the end-all when choosing how to test a program. Experience helps a _whole_ lot.



Put another way, if you had to start just with the starting position, or, without loss of generality, any given position from the set of 40:
Then you would argue that you would need more positions to get more variety, for example, one that is "sharp" in nature and one that is "quiet" in nature. And then you would break it down further. And at some level you would find that the next position will not have positive marginal utility, so to speak.
Except for the math. The tree is huge. One doesn't have to travel down the tree very deeply to reach billions of different positions. And choosing a representative sample of those eliminates the need to test with full opening books and to deal with that level of non-determinism.

Sigh. I find it really impossible to get my point across. I don't know how to say it any differently. I would like others to try to describe my idea and see if they understand it. If they don't, I guess the fault is with me.

You are not getting at all what I'm trying to say. It could be me or it could be you.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:
bob wrote:
"most". I'm more pessimistic having done this for so long. I can't count the number of advantage players that believe they are being cheated. I have seen people post that Norm Wattenberger's CVBJ program is "biased" (some say it gives you better cards than expected, some say worse). All because they assume those "extremely rare events" will never happen. I've lived thru the incredible negative variance streaks too many times to count, as well as the incredible win streaks that also come along.
Well, I just had my first royal flush the other day, and I haven't played that many hands yet. Just shrugged it off...

Perhaps if I had been able to pay someone off with it I would have been more excited...
I think that is once every 40,000 hands. That is what I remember from AP friends that beat up on video poker machines.
Well, this was Texas hold'em (and there you don't always get to see the river), so the numbers will be different from video poker. Could be higher, could be lower. Perhaps I should really be complaining that I have gotten fewer than my share, and accuse someone of cheating :-)
No. Probability is probability here. A royal flush requires 10-J-Q-K-A of the same suit. If you get one, you will see your 7 cards (I assume you don't fold if you have one after the flop, for example, and once you have one you are at least going to play all-in rather than folding).

Royal flush odds are quoted for getting 5 cards, and with video poker you can throw out up to 4 cards and replace them, increasing your odds. With holdem it is lower, since you only have 7 cards total to choose 5 from (including your original two, which you must use, of course).

They are rare. But it is not a once in a lifetime event if you play enough.

A good video poker player can play a round every 2-3 seconds. 20-30 per minute. 1200-1800 per hour. Doesn't take them that long to hit a royal, and in fact that is one of the hands that makes VP a positive expectation game if you find machines with the right payout.
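
The numbers being traded here are easy to check with standard combinatorics; a quick back-of-envelope sketch:

Code: Select all

from math import comb

# 5 random cards, no draw: 4 possible royals out of C(52,5) hands.
five_card = 4 / comb(52, 5)
# All 7 cards seen in holdem: the royal's 5 cards plus any 2 of the other 47.
seven_card = 4 * comb(47, 2) / comb(52, 7)

print(f"5-card royal : 1 in {1 / five_card:,.0f}")      # 1 in 649,740
print(f"7-card royal : 1 in {1 / seven_card:,.0f}")     # 1 in 30,940
# The 1-in-40,000 video poker figure quoted above falls between the two:
# the draw improves on 5 random cards, but optimal play does not always
# chase the royal. And a holdem player who folds early never sees all
# 7 cards, so the realized frequency at the table is lower than 1 in 30,940.
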
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:Statistics is not the end-all when choosing how to test a program. Experience helps a _whole_ lot.
Okay, finally we have come to the conclusion that your experience is worth more than Statistics, so whatever you say will always be right. I am just wasting my time, and probably yours. I give up.

I will try to discuss my ideas in a forum where people understand me and seem to have no problem using standard Statistics methodology to help me solve my problem.

Since you don't have a problem to solve, everybody can be happy.

Your point about people who should be more careful in their statements regarding engines is taken (not that I have ever disputed it), and I will continue to point out such mistakes when I see them and when they are supported by standard methodology.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: But every program I have personally tested and use in my tests certainly does do this. So for _most_ of us, the number of games needed to predict progress with high confidence is quite large. And all the arguing in the world is not going to change that.
Let me try to follow your logic here:

"every program I have tested does this, therefore for most of us..."

For me there is a huge jump in reasoning that I don't understand.

It would mean that you are somehow equivalent to most of us. Just looking at the members active in this discussion would show you that we are "most of us", not you.
Why? I know and have conversed with many program authors over the years. We _all_ time our searches in the same way, with different tweaks. But nearly every one I have _seen_ does not simply complete an iteration and stop. You could go back thru years of r.g.c.c posts to see the discussions there. And eventually you realize that most do it as I do it, not as you/hgm do it. Most have read the various journal articles on using time, and understand why the ideas work. So concluding that your approach is the center of the universe is wrong. How many different programmers have you discussed timing issues with? I count hundreds in my past... so there is some weight to my statement, and I am not making an unsupported assumption.
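
The timing policy described above (abandon the current iteration the moment the budget is spent, keeping the best move from the last completed depth) might look like this; all names here are hypothetical, and the root search is a stand-in:

Code: Select all

import time

class TimeUp(Exception):
    """Raised inside the search when the clock runs out."""

def search_root(moves, depth, deadline):
    # Stand-in for a real root search; a real engine polls the clock every
    # few thousand nodes rather than once per root move.
    best = None
    for move in moves:
        if time.monotonic() > deadline:
            raise TimeUp
        time.sleep(0.01)        # pretend to search this move to 'depth'
        best = best or move
    return best

def iterative_deepening(moves, target_seconds):
    deadline = time.monotonic() + target_seconds
    best_move = None
    for depth in range(1, 64):
        try:
            best_move = search_root(moves, depth, deadline)
        except TimeUp:
            break               # do not finish the iteration; just stop
    return best_move

print(iterative_deepening(["e4", "d4", "c4", "Nf3"], target_seconds=0.05))
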
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: No. Probability is probability here. A royal flush requires 10-J-Q-K-A of the same suit. If you get one, you will see your 7 cards (I assume you don't fold if you have one after the flop, for example, and once you have one you are at least going to play all-in rather than folding).
Why do you assume away the only thing I am talking about? Could you please also try to credit me with not just talking garbage all the time?

If I have 10-J I could easily fold before the flop. Even if my ace then comes on a rainbow flop, I have many reasons to fold, and if I then get the K and Q after I folded, it would still have been the correct decision against e.g. AA.

Note that I also didn't mention whether it was limit or no-limit.