different kinds of testing


michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: different kinds of testing

Post by michiguel »

Don wrote:
I think your comments are insightful. My own test book is limited to a little less than 4000 positions, but I have arranged things so that I test a different subset of them each time. In order to get thousands of games I have to have more than just 2 or 3 players, but I think I'm pretty much getting the variety I want.
I think that having 2 or 3 players is not good at all, but I know that we do what we can, not what we want. I also think that positions should not be played with both colors. Both things drop the statistical independence of the set way down. That is why having a large set is important.

Miguel

I once considered the possibility of throwing in a few random positions, generated with random (but legal) moves from the opening position. It sounds insane, and it probably is. The idea is that it might make the evaluation function more robust. Or you could start with Fischer Random openings and some descendant positions from each. In principle, a good evaluation function should play well no matter what is thrown at it. In practice, I don't think it works that way. Like it or not, I'll bet every chess program is tuned to play "normal" positions well, and it would probably be difficult to have a strong chess program if you didn't do that.

- Don
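A minimal sketch of that random-walk generation, using the python-chess library (the library choice and the ply count are assumptions for illustration, not part of Don's setup):

Code:

import random
import chess  # python-chess

def random_position(plies=8, seed=None):
    """Play `plies` random legal moves from the standard opening position."""
    rng = random.Random(seed)
    board = chess.Board()
    for _ in range(plies):
        moves = list(board.legal_moves)
        if not moves:  # checkmate/stalemate reached early
            break
        board.push(rng.choice(moves))
    return board.fen()

print(random_position(plies=8, seed=42))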


Eelco de Groot wrote:
Are positions like that intended to be in the test set?
It seems that the majority of the positions are pretty nearly balanced.
I actually think there is no need at all to look only for balanced positions. It conflicts with the need for a random set, without, as far as possible, any bias toward a certain type of position, unless you want to train a program to use its strong points better.

"Training" is not the same as testing, I presume, and in training you should not ignore the weak points either, so I assume you want no bias.

You just have to make sure you introduce no bias, and the best way I can think of to do that is not having a constant set, but periodically picking a new set.

Think of it as programming a semi-random number generator. It is really just the same: it is very easy to introduce bias, and if you are going to test this way, you have to do your utmost as a tester to avoid it.

I think testers do the same in that they do not always test with the same book. The test set has to be a reflection of the types of positions the program will encounter when actually playing tournaments. Unbalanced positions are okay if the program will also encounter unbalanced positions in practice, which you may hope it does, or it will only produce draws in every competition ever :)

Eelco
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

michiguel wrote:
I think that having 2 or 3 players is not good at all, but I know that we do what we can, not what we want. I also think that positions should not be played with both colors. Both things drop the statistical independence of the set way down. That is why having a large set is important.
That's an interesting thought about both colors. I'm not sure I understand what you are saying or your reasoning on that one.

When I test, a given player faces each other player on both sides of each given opening eventually, but not consecutively. For instance, if you and I were computers and I play the white side of the Ruy Lopez Exchange Variation, the very next game could be anything we didn't already play. It's not like you now have to play the white side of that exact opening in the very next game.

In practice, unless you exhaust the openings, which is hard to do if you have a large number of players, it's as if you are playing one random sampling of openings for white and a completely independent random sampling for black. Of course, given enough games, you will face both sides of every opening against every opponent.

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: different kinds of testing

Post by michiguel »

Don wrote:
That's an interesting thought about both colors. I'm not sure I understand what you are saying or your reasoning on that one.
Statistically, the bigger the sample the better. That is true if the events (games) are completely independent; if they are not, it is like having a smaller sample. Each opening position tests, among other things, how well the engine behaves in that type of position. If you switch colors, the "new" position is technically different but strongly correlated with the previous one. In other words, there are certain positions that certain engines do not get, whether playing white or black, so the result of one game correlates with the result of the color-switched one. So if you play 2000 games (1000 W and 1000 B), the real standard deviation is not the one you "calculate" with N=2000; the effective N is somewhere between 1000 (perfect correlation = 1.00 between the B&W games) and 2000 (correlation = 0.00).

I think that switching colors is a big[1] mistake. We do not want "fairness"; we want randomness (or pseudo-randomness in our case).

Miguel
[1] Conceptually; in practice it may not be noticeable.
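A small sketch of that effective-sample-size arithmetic (the grouping into 1000 color-switched pairs and the correlation values are illustrative assumptions):

Code:

# Effective number of independent games when each opening is played with
# both colors: n_pairs color-switched pairs, within-pair correlation rho,
# pairs independent of each other.
def effective_n(n_pairs, rho):
    # Var(mean) = sigma^2 * (1 + rho) / (2 * n_pairs), so 2*n_pairs games
    # carry only as much information as N_eff independent ones.
    return 2 * n_pairs / (1 + rho)

for rho in (0.0, 0.25, 0.5, 1.0):
    print("rho=%.2f  N_eff=%.0f" % (rho, effective_n(1000, rho)))
# rho=0.00 gives 2000, rho=1.00 gives 1000 -- the 1000 < N < 2000 range above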

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

michiguel wrote:
I think that switching colors is a big[1] mistake. We do not want "fairness"; we want randomness (or pseudo-randomness in our case).
[1] Conceptually; in practice it may not be noticeable.
I think I agree with what you say next, that in practice this may not be noticeable.

There is not very much correlation between playing the opposite sides of the same opening against an opponent, even a self-test opponent with minor changes. I will give you a VERY informal proof.

Imagine that your book is 10 ply deep (like mine is), and I play one of those openings as white against a given opponent; then in the next game I play the same opening against the same opponent, but on the black side. We can imagine that these games are testing exactly the same ideas regardless of the color switch, and that for every 1000 games I am wasting half of my testing time. That is your basic premise, although you admit it may not be that big a deal.

It is very well known (and I don't have figures to back this up, but I think everyone would agree) that the most minor of changes has a pretty chaotic effect on the moves played. For instance, if I play the Ruy Lopez Exchange Variation with white, then play it again as black against the same opponent except that the hash table size is doubled, the game will vary almost immediately, certainly within a very few ply, unless the position is ridiculously forced. Someone here mentioned a few days ago that if you add even 1 node to the search, the games will start to vary.

Now, you say it's good not to start from the same position as your opponent, and I agree. But I would like to suggest that even with my 10-ply book you have not done that. Just pretend that you are using a 20-ply book like Bob Hyatt uses, and that the game really started at ply 20 instead of ply 10. If you have 4000 openings, you probably have the equivalent of 8000 unique starting positions, where each player gets a different one against each opponent.

You can work this backwards too. My book is 10 ply deep, but some of those openings have the first 8 ply in common, and many of those have the first 6 ply in common. They ALL have the starting position in common. So to turn this around, one could claim that you are really starting all the games from the same position; they just happen to vary early.

There is a point where you cross the line of "specificity": you want your starting positions to at least resemble the general type of positions that you will see in real games. You could basically just take a chess set, dump it on the board, and put all the pieces where they land, giving a kind of random setup to test from, but it would be too artificial. It would be like spending all your time playing golf to train for a tennis match, just so that you don't get into bad habits or some kind of rut.
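A quick sketch of the shared-prefix observation above: given a plain-text book with one space-separated move list per line (the file name and format are assumptions), one can count how many lines coincide for their first k plies:

Code:

from collections import Counter

def prefix_counts(lines, k):
    # Group opening lines by their first k plies (space-separated moves).
    return Counter(tuple(line.split()[:k]) for line in lines)

book = [ln.strip() for ln in open("openings.txt") if ln.strip()]
for k in (6, 8, 10):
    shared = sum(c for c in prefix_counts(book, k).values() if c > 1)
    print("%d of %d lines share their first %d plies with another line"
          % (shared, len(book), k))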

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: different kinds of testing

Post by michiguel »

Don wrote:
It is very well known (and I don't have figures to back this up, but I think everyone would agree) that the most minor of changes has a pretty chaotic effect on the moves played.
Yes, I agree. The game will vary, and you may lose both times in a different way if you do not understand the position :-). Seriously, you are right for most positions, where the chaos introduced is bigger than the consistent problems. BUT there are positions in which the chaos introduced is not enough to overcome the other factors. Those positions may be few, but that does not mean they do not exist. For instance, I had a position (Sicilian Sveshnikov) in which my engine insisted on sacrificing a bishop on b5. It did not understand the position and got hammered with both white and black; it rarely got a draw against engines of similar strength until I tuned a parameter in the eval. The score improved in both the white and the black games, so the results of both games were strongly correlated and, statistically, were worth only one game. The question is, how many of these do you have in a set? Hard to know. I am not saying that switching colors wastes half of the time; I am saying that it wastes some fraction of that time, and it is possible that the fraction is too small to matter. So if you play 2x500 games, I am saying that the effective number is 500 < n < 1000, possibly close to 1000.

OTOH, it is good to pay attention in order to pick up potential problems. If the results are correlated, there must be a parameter that needs to be fixed. But you need many players to make this significant, i.e., you lose with both white and black consistently against many players. If you play against only 2 or 3, you cannot pick this up.

Miguel
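A rough sketch of how such color-correlated failures could be scanned for automatically; the (position, opponent, color, score) result format is an assumption for illustration:

Code:

from collections import defaultdict

# results: iterable of (position_id, opponent, color, score) tuples,
# score in {1.0, 0.5, 0.0} from the tested engine's point of view.
def correlated_failures(results, min_opponents=4, max_avg=0.25):
    by_pos = defaultdict(lambda: defaultdict(list))
    for pos, opp, _color, score in results:
        by_pos[pos][opp].append(score)
    flagged = []
    for pos, per_opp in by_pos.items():
        # Opponents against whom this opening scored badly over all of
        # its games (both colors included).
        bad = [opp for opp, scores in per_opp.items()
               if len(scores) >= 2 and sum(scores) / len(scores) <= max_avg]
        # Failing with both colors against many different opponents
        # suggests an evaluation hole rather than noise.
        if len(bad) >= min_opponents:
            flagged.append((pos, sorted(bad)))
    return flagged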
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

michiguel wrote: Yes, I agree. The game will vary, and you may lose both times in a different way if you do not understand the position :-). Seriously, you are right for most positions, where the chaos introduced is bigger than the consistent problems. BUT there are positions in which the chaos introduced is not enough to overcome the other factors. Those positions may be few, but that does not mean they do not exist. For instance, I had a position (Sicilian Sveshnikov) in which my engine insisted on sacrificing a bishop on b5. It did not understand the position and got hammered with both white and black; it rarely got a draw against engines of similar strength until I tuned a parameter in the eval. The score improved in both the white and the black games, so the results of both games were strongly correlated and, statistically, were worth only one game.
That's a good example, and it illustrates your point well.

I think we can agree that, because of this phenomenon, not all positions have the same statistical relevance.

This gives me an idea. Why not keep statistics on which specific starting positions produce results that are too consistent? For instance, white always wins, or it's always a draw. Then look at the games and see if there is something to be improved in the program (or whether the book line ends in a position that is won for one side, or in a perpetual).
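A minimal sketch of that bookkeeping, flagging openings whose outcomes are suspiciously uniform (the thresholds and the game-tuple format are arbitrary assumptions):

Code:

from collections import Counter, defaultdict

# games: iterable of (position_id, result), result in {"1-0", "1/2-1/2", "0-1"}
def too_consistent(games, min_games=20, threshold=0.95):
    tally = defaultdict(Counter)
    for pos, result in games:
        tally[pos][result] += 1
    flagged = {}
    for pos, counts in tally.items():
        n = sum(counts.values())
        result, top = counts.most_common(1)[0]
        if n >= min_games and top / n >= threshold:
            flagged[pos] = (result, top, n)  # e.g. white nearly always wins
    return flagged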


michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: different kinds of testing

Post by michiguel »

Don wrote:
This gives me an idea. Why not keep statistics on which specific starting positions produce results that are too consistent? For instance, white always wins, or it's always a draw. Then look at the games and see if there is something to be improved in the program (or whether the book line ends in a position that is won for one side, or in a perpetual).
Exactly my point, but you need many other sparring engines to make it significant without "manual inspection".

Miguel



jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: different kinds of testing

Post by jwes »

Don wrote:
This gives me an idea. Why not keep statistics on which specific starting positions produce results that are too consistent? For instance, white always wins, or it's always a draw. Then look at the games and see if there is something to be improved in the program (or whether the book line ends in a position that is won for one side, or in a perpetual).

michiguel wrote:
Exactly my point, but you need many other sparring engines to make it significant without "manual inspection".
If a position is always a win or always a draw, it doesn't add any information, so the position should be discarded from the test. I think a more interesting statistic would be whether a particular engine does relatively worse from both sides of a given position. That would indicate that the program has a hole in its evaluation in that position or in succeeding positions.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: different kinds of testing

Post by michiguel »

jwes wrote:
If a position is always a win or always a draw, it doesn't add any information, so the position should be discarded from the test. I think a more interesting statistic would be whether a particular engine does relatively worse from both sides of a given position. That would indicate that the program has a hole in its evaluation in that position or in succeeding positions.
Yes, that is exactly what I saw with one position in my test set, and I was lucky that it was the first one. Every time I started the gauntlets, I watched the first handful of quick games, and I could not help but notice that Gaviota always lost the first two games (B and W from the same position). I am sure there must be more like this.

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

michiguel wrote:
bob wrote:
jesper_nielsen wrote:
bob wrote:
jesper_nielsen wrote:
Speaking as a very resource-limited (un-)happy amateur, the question of testing for me becomes "how do I get the most bang for my buck?"

I have recently started using the "cutechess-cli" tool for running very fast games. Great tool by the way! :)

I am currently running tests with 40 Noomen positions against 8 different opponents at a 5+0.4 time control, giving 40*2*8 = 640 games. This is clearly not enough games to draw any kind of conclusion, except for very big Elo jumps.

There are (at least! :) ) three ways to increase the number of games played.

1. Add more repetitions of the test runs.
Do _NOT_ do this. It is fraught with problems. Use more positions or more opponents, but do not play more games with the same positions.
2. Add more opponents.
3. Add more starting positions.

Which one of these options is "better"?
Or are they equal, meaning that the real value is only in the higher number of games?
More positions is easier. Finding more opponents can be problematic, in that some programs do _very_ poorly at fast time controls, which even I use on the cluster to get results faster. So fast games limit the pool of potential testing opponents. But clearly more is better, and I am continually working on this issue. One important point is that you really want to test against stronger opponents for the most part, not weaker ones. That way you can recognize gains more quickly than if you are way ahead of your opponents. That is another limiting factor in choosing opponents.
That to me is an interesting question. :D

Kind regards,
Jesper

P.S.
I would hate to see anyone withdraw their contribution to this forum.
The diverse inputs are, I believe, one of the strengths of this place.
Even if the going gets a bit rough from time to time.
Ok! Thanks!

The reason option 1 looks tasty to me is that it gives the option of iteratively adding more precision.

So you can run the test, look at the results, and decide if you think the change is good, bad or uncertain. Then if uncertain run the test again.
You can do this with several thousand positions. Just use the first N, and if you want more accuracy, run the next N, and the next N, etc. Much safer than running the same positions over and over. For example, suppose you somehow get perfect reproduction of each move played. If you replay every game, you have 2x as many games, but every pair of results is identical. BayesElo will report a lower error bar, but it is wrong, because there is perfect correlation between the pairs of games. Using more positions rather than repeating old ones eliminates this.
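A quick numeric illustration of that error-bar point, with made-up scores: duplicating every result shrinks the naively computed standard error by about sqrt(2) while adding no information at all.

Code:

import math, random

random.seed(1)
scores = [random.choice((0.0, 0.5, 1.0)) for _ in range(1000)]  # fake results

def naive_se(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var / n)

print(naive_se(scores))      # error bar from 1000 distinct games
print(naive_se(scores * 2))  # same games replayed exactly: bar shrinks, wrongly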

In this way there is an option to "break off" early, if a good or bad change is spotted, thereby saving some time.

But maybe, having a large number of start positions, you can break them up into chunks of a manageable size and then run the tests as needed?!

How to pick the positions to use in the tests?
One idea could be to take the positions where your program left book in the tournaments it has played in.

Another idea could be to randomly generate them by using your own book. So basically let your program pick a move, like it would in a real game, and then follow the line to the end of the book.
The pro is that the positions are biased towards positions your program is likely to pick in a tournament game.
The con is that the testing then inherits the blind spots of the opening book.

Thanks for the ideas! :)

Kind regards,
Jesper
Here is what I did. I took a million or so high-quality games and had Crafty read through them. At ply=24, where it is always white to move, I spit out the FEN from each game and then go on to the next. I end up with a million positions (or whatever number of games you have in your PGN collection). I sort them and then use "uniq" to eliminate duplicates and add a count of the number of times each position was duplicated. I then sort on this counter field and take the first N for my test positions. These are the N most popular positions.

Works well enough. You might find a better selection algorithm, but these positions seem to be working quite well for us.
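A sketch of the same pipeline in Python, using the python-chess library instead of Crafty plus sort/uniq; the file name and the cutoff N are placeholders:

Code:

from collections import Counter
import chess.pgn  # python-chess

counts = Counter()
with open("games.pgn") as pgn:
    while (game := chess.pgn.read_game(pgn)) is not None:
        board = game.board()
        for ply, move in enumerate(game.mainline_moves(), start=1):
            board.push(move)
            if ply == 24:  # position after ply 24, white to move
                # Drop the move counters so identical positions collapse.
                counts[" ".join(board.fen().split()[:4])] += 1
                break

for fen, n in counts.most_common(4000):  # the N most popular positions
    print(n, fen)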
In the first 40 to 50 positions you have a quasi-duplicate. Some time ago I downloaded your positions and went through the first ones out of curiosity. The positions are a French Exchange. They have a small difference: either one is black to move and the other white to move, or something small like that (or one position in which one side lost a tempo or so, I can't remember exactly). Maybe it is not terribly important considering the size you are dealing with, but I figured you might like to know. You may have more cases like this. If you like, I can try to dig it up. The positions are technically different, but you would discard them if you saw them. Since they are ordered by FEN, they are not exactly together. Now that I say this, I think one position has a pawn on g5 and the other on g6 (when black has yet to play g5).

I know it is very difficult to select these things. I started to manually choose positions from ECO. That would be a high-quality selection, but it takes forever. One day I will finish :-)

Miguel
I don't search the positions in the order they are in the file; I randomly extract them. I decided to do this to make myself resist the temptation of looking at the results after 500 games and making a quick decision that is wrong. Choosing positions randomly, sometimes things start off well and go south; sometimes they start off poorly and rise.

There are no perfect duplicates because of how it was generated, but there can be lots of similar positions, I suppose. As of right now, those positions are ordered lexically. I could randomize them and put them back on my ftp box if someone wants.
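For completeness, a one-off way to randomize such a file (the file names are placeholders):

Code:

import random

with open("positions.epd") as f:
    lines = f.readlines()
random.shuffle(lines)  # scatter lexically adjacent (similar) positions
with open("positions_shuffled.epd", "w") as f:
    f.writelines(lines)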