michiguel wrote: Don wrote: michiguel wrote: Don wrote: I think your comments are insightful. My own test book is limited to a little less than 4000 positions, but I have arranged things so that I test a different subset of them each time. In order to get thousands of games I have to have more than just 2 or 3 players, but I think I'm pretty much getting the variety I want.
I think that having 2 or 3 players is not good at all, but I know that we do what we can, not what we want. I also think that positions should not be played with both colors. Both things drop the statistical independence of the set way down. That is why having a large set is important.
That's an interesting thought about both colors. I'm not sure I understand what you are saying or your reasoning on that one.
Statistically, the bigger the sample the better. That is true if the events (games) are completely independent. If the statistical events are not independent, it is like having a smaller sample. Each opening position tests, among other things, how well the engine behaves in that type of position. If you switch colors, the "new" position is technically different but strongly correlated with the previous one. In other words, there are certain positions that certain engines do not get, playing either white or black, so the result of one game correlates with the result of the color-switched one. So if you play 2000 games (1000 as White and 1000 as Black), the real standard deviation is not the one you "calculate" with N=2000. It corresponds to an effective N somewhere between 1000 (perfect correlation = 1.00 between the white and black games) and 2000 (correlation = 0.00).
I think that switching colors is a big[1] mistake. We do not want "fairness", we want randomness (or pseudo randomness in our case).
I think I agree with what you say next, that in practice this may not be noticeable.
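Miguel's effective-sample-size point can be checked numerically. The sketch below is my own toy model, not anything from the thread: each opening is played from both colors, a fraction `rho` of the color-reversed pairs is made perfectly correlated (both games get the same result), and the measured standard deviation of the match score is compared with the naive sqrt(p(1-p)/N) figure. The function name and the pairing scheme are assumptions for illustration.

```python
import random
import statistics

def paired_match_sd(n_pairs, rho, trials=2000, seed=1):
    """Standard deviation of the match score when each opening is played
    from both colors and a fraction `rho` of the pairs is perfectly
    correlated (toy model: draws are ignored, results are fair coin flips)."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        score = 0
        for _ in range(n_pairs):
            if rng.random() < rho:
                # correlated pair: both colors get the same result
                r = rng.random() < 0.5
                score += 2 * r
            else:
                # independent games
                score += (rng.random() < 0.5) + (rng.random() < 0.5)
        means.append(score / (2 * n_pairs))
    return statistics.pstdev(means)

n_pairs = 500                                 # 1000 games total
naive_sd = (0.25 / (2 * n_pairs)) ** 0.5      # what you'd quote with N=1000
for rho in (0.0, 0.5, 1.0):
    sd = paired_match_sd(n_pairs, rho)
    # theory: variance inflated by (1 + rho), i.e. effective N = 1000/(1+rho)
    print(f"rho={rho:.1f}  measured sd={sd:.4f}  naive sd={naive_sd:.4f}  "
          f"effective N={2 * n_pairs / (1 + rho):.0f}")
```

With full correlation the measured standard deviation comes out about sqrt(2) times the naive one, matching Miguel's claim that 2000 paired games can be worth as few as 1000 independent ones.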
There is not very much correlation between playing the opposite sides of the same opening with a different opponent, even a self-test opponent with minor changes. I will give you a VERY informal proof.
Imagine that your book is 10 ply deep (like mine is) and I play one of those openings as white against a given opponent, and then in the next game I play the same opening against the same opponent, but from the black side. We can imagine that these games are testing the exact same ideas regardless of the color switch, and that for every 1000 games I am wasting half of my testing time. That is your basic premise, although you admit it may not be that big a deal.
It is very well known (and I don't have figures to back this up, but I think everyone would agree with this) that the most minor of changes has a pretty chaotic effect on the moves you play. For instance, if I play the Ruy Lopez exchange variation with white, then play it again as black against the same opponent except that the hash table size is doubled, the game will vary almost immediately, certainly within a very few ply, unless the position is ridiculously forced. Someone here a few days ago mentioned that if you add 1 node to the search the games will start to vary.
Now, you say it's good not to start from the same position as your opponent and I agree. But I would like to suggest that even with my 10 ply book you have not done that. Just pretend that you are using a 20 ply book like Bob Hyatt uses and that the game really started at ply 20 instead of ply 10. If you have 4000 openings, you probably have the equivalent of 8000 unique starting positions, where each player gets a different one for each opponent.
You can work this backwards too. My book is 10 ply deep, but some of those openings have the first 8 ply in common, and many of those have the first 6 ply in common. They ALL have the same starting position in common: the opening position. So to turn this around, one could claim that you are really starting all the games from the same position; they just happen to vary early.
There is a point where you cross the line of "specificity", where you want your starting positions to at least resemble the general type of positions that you will see in real games. You could basically just take a chess set, dump it on the board, and put all the pieces where they land, giving a kind of random setup, and test from that, but it would be too artificial. It would be like spending all your time playing golf to train for a tennis match, just so that you don't get into bad habits or some kind of rut.
Miguel
[1] conceptually, in practice may not be noticeable.
When I test, a given player eventually faces each other player on both sides of each given opening. But not consecutively. For instance, if you and I were computers and I played the white side of the Ruy Lopez exchange variation, the very next game could be anything we haven't already played. It's not like you now have to play the white side of that exact opening in the very next game.
In practice, unless you exhaust the openings, which is hard to do if you have a large number of players, it is as if you are playing one random sampling of openings as white and a completely independent random sampling as black. Of course, given enough games you will face both sides of every opening against every opponent.
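A pairing loop of the kind Don describes might look like the sketch below. This is my own illustration, not his actual harness, and the names are made up: every (opponent, opening, color) combination occurs exactly once over a full cycle, but the order is shuffled so a color-reversed rematch of an opening almost never directly follows it.

```python
import random

def schedule_games(openings, opponents, seed=0):
    """One full test cycle: each opponent faces each opening from both
    colors exactly once, in a shuffled order rather than as
    back-to-back color-reversed pairs."""
    rng = random.Random(seed)
    games = [(opponent, opening, color)
             for opponent in opponents
             for opening in openings
             for color in ("white", "black")]
    rng.shuffle(games)
    return games

cycle = schedule_games([f"opening{i}" for i in range(4000)],
                       ["engineA", "engineB", "engineC"])
print(len(cycle))  # 4000 openings * 3 opponents * 2 colors = 24000
```

Until the cycle is exhausted, the games played so far look like two independent random samples of openings for white and for black, which is the behavior described above.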
Miguel
I once considered the possibility of throwing in a few random positions, generated with random (but legal) moves from the opening position. It sounds insane, and it probably is. The idea is that it might make the evaluation function more robust. Or you could start with Fischer Random openings and some descendant positions of each. In principle, a good evaluation function should play well no matter what is thrown at it. In practice, I don't think it works that way. Like it or not, I'll bet every chess program is tuned to play "normal" positions well, and it would probably be difficult to have a strong chess program if you didn't do that.
- Don
Eelco de Groot wrote: Are positions like that intended to be in the test set?
It seems that the majority of the positions are pretty nearly balanced.
I actually think there is no need at all to look only for balanced positions. It conflicts with the need for a random set, one without, as much as possible, any bias toward a certain type of positions, unless you want to train a program to use its strong points better.
"Training" is not the same as testing I presume and in training you should not ignore the weak points either, so I assume you want no bias.
You just have to make sure you introduce no bias and the best way I can think of doing that is not having a constant set, but periodically pick a new set.
Think of it as programming a semi-random number generator. It is really much the same: it is very easy to introduce bias, and if you are going to test this way, you have to do your utmost as a tester to avoid it.
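Periodically picking a new set rather than keeping a constant one, as suggested above, could be as simple as the sketch below. This is my own illustration; the function name and the per-run seeding scheme are assumptions, not anything from the thread.

```python
import random

def pick_test_set(all_positions, k, run_seed):
    """Draw a fresh pseudo-random subset of k positions for one test run,
    reproducible from that run's seed so results can be rechecked later."""
    return random.Random(run_seed).sample(all_positions, k)

pool = [f"position{i}" for i in range(4000)]
run1 = pick_test_set(pool, 500, run_seed=1)
run2 = pick_test_set(pool, 500, run_seed=2)
# different runs draw different subsets, so no single fixed set gets
# overfit to, while any given run is still repeatable
print(len(run1), len(run2))
```

Changing the seed each run avoids tuning to one frozen subset, while keeping each individual run reproducible.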
I think testers do much the same in that they do not always test with the same book. The test set has to be a reflection of the types of positions the program will encounter when actually playing tournaments. Unbalanced positions are okay if the program will also encounter unbalanced positions in practice, which you may hope it does, or it will only ever produce draws in competition.
Eelco