Should it work?
I have been thinking about ways to improve testing-time results. People usually run tournaments from the startup position, tournaments with a very limited set of positions (e.g. 32), or tournaments with many random positions. I assume everyone is doing this with a minimum of 1,000 to 4,000 games.
.... but ....
What about repeating the same tournament, with the same opponents and the same positions per opponent, assuming the set of positions is very large?
Example:
Game 1: against Crafty, black, position from FEN file 'myfenpositions.epd', position number 540
Game 2: against Critter, white, position from FEN file 'myfenpositions.epd', position number 3251
...
etc.
The idea is that the position numbers would always be the same, not chosen randomly, without repeating any FEN, but still varied enough.
The tournament file from the tournament manager would always be the same, with no need to recreate the tournament. The test would always repeat exactly.
Would the results of such tests be more accurate than randomly choosing the starting positions?
What is your opinion about this testing methodology?
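To make the scheme concrete, here is a minimal sketch of such a fixed schedule generator. The engine names and the 'myfenpositions.epd' file are just the ones from the example above; the pairing logic is an illustration of the idea, not the format of any particular tournament manager:

```python
import random

def make_schedule(opponents, num_positions, positions_per_opponent, seed=42):
    """Build a reproducible game schedule: each entry is (opponent, color,
    EPD position index).  A fixed seed means rerunning the tournament
    replays exactly the same games; the shuffle keeps the selection varied,
    and no position index is ever repeated."""
    rng = random.Random(seed)           # fixed seed -> same schedule every run
    indices = list(range(num_positions))
    rng.shuffle(indices)                # varied, but deterministic
    assert len(opponents) * positions_per_opponent <= num_positions, \
        "not enough positions to avoid repeats"
    schedule, it = [], iter(indices)
    for opp in opponents:
        for _ in range(positions_per_opponent):
            idx = next(it)
            # each position is played twice, once per color, to cancel bias
            schedule.append((opp, "white", idx))
            schedule.append((opp, "black", idx))
    return schedule

games = make_schedule(["Crafty", "Critter"], num_positions=4000,
                      positions_per_opponent=2)
```

Rerunning `make_schedule` with the same arguments always yields the identical list, which is exactly the "never recreate the tournament" property described above.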
-
- Posts: 620
- Joined: Fri Feb 08, 2008 10:44 am
- Location: Madrid - Spain
-
- Posts: 4840
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: What your opinion about this testing methodology?
Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
-
- Posts: 454
- Joined: Sat Apr 04, 2009 6:44 pm
- Location: Bulgaria
Re: What your opinion about this testing methodology?
I'm somewhere in between: testing with the same tournament, the same 4-5 opponents, and currently a set of 320 positions. 5 opponents x 2 colors x 320 positions = 3,200 games.
-
- Posts: 620
- Joined: Fri Feb 08, 2008 10:44 am
- Location: Madrid - Spain
Re: What your opinion about this testing methodology?
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of middle-game starting positions. The point is always repeating the same games with the same positions, but with enough positions that the engine plays a varied range of games.

Ferdy wrote: Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: What your opinion about this testing methodology?
The point is, the positions are selected once at random, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long time now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of newly choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Kempelen wrote: I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of middle-game starting positions. The point is always repeating the same games with the same positions, but with enough positions that the engine plays a varied range of games.

Ferdy wrote: Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
Sven
-
- Posts: 4840
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: What your opinion about this testing methodology?
Did I say a limited set of opening positions? Do not underestimate me when I say start from a smaller number of positions, just as I don't underestimate you when you say large. Even when you say very large, there is still a limit: how many exactly is very large? And from your first sentence, "improve testing time results", I understand that you also have limited resources.

Kempelen wrote: I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of middle-game starting positions. The point is always repeating the same games with the same positions, but with enough positions that the engine plays a varied range of games.

Ferdy wrote: Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
When I said there is a drawback, it is because your scheme is like this:
vs crafty use pos 1 to 100 or something
vs critter use pos 101 to 200
...
Now if you always run this test, you will probably improve your score vs Crafty on positions 1 to 100, and likewise vs Critter on positions 101 to 200. But the question is: will an engine tuned to play vs Crafty on positions 1 to 100 be equally good when it plays another engine on the same opening test set?
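One way around that coupling (a hypothetical variation, not something proposed in the thread) is to keep each cycle's schedule fixed but rotate which slice of the position set each opponent gets between test cycles, so that over enough cycles every opponent has seen every position:

```python
def rotated_assignment(num_opponents, num_positions, cycle):
    """Assign each opponent a contiguous slice of the position set, but
    shift the slices by one opponent every test cycle; after
    num_opponents cycles, every opponent has played every position."""
    per_opp = num_positions // num_opponents
    assignment = {}
    for opp in range(num_opponents):
        start = ((opp + cycle) % num_opponents) * per_opp
        assignment[opp] = list(range(start, start + per_opp))
    return assignment
```

Within one cycle the test is still perfectly repeatable; only between cycles does the opponent-to-positions mapping change.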
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: What your opinion about this testing methodology?
It seems pretty obvious that it lowers the error bar. In fact, the whole estimation model implicitly assumes that you do this.

Sven Schüle wrote: The point is, the positions are selected once at random, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long time now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of newly choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Kempelen wrote: I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of middle-game starting positions. The point is always repeating the same games with the same positions, but with enough positions that the engine plays a varied range of games.

Ferdy wrote: Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
Sven
Let's say that the score of engine A vs B is distributed under a probability law P(mu, sigma) with mean mu and standard deviation sigma. That means that, given equal chances from the starting position, the distribution of the result should be P(mu, sigma). However, if a position is chosen that favors A or B, then the distribution will be something like Q(position)·P(mu, sigma), where Q(position) is centered around 1 and is larger or smaller depending on whether A or B is favored. The fact that E(Q) = 1 may still ensure an unbiased estimator, but with a higher variance...
No need to be an expert in statistics to understand it, at least intuitively. You can write it down cleanly too, and it isn't hard!
PS: please no nitpicking over the details; I purposely kept the math notation oversimplified.
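The variance argument can be checked with a toy simulation. This is only a sketch under simplified assumptions (Gaussian position bias and game noise, made-up parameter values): it compares the spread of the measured strength difference between two engine versions when both are tested on the same fixed position set versus on freshly drawn sets.

```python
import random
import statistics

def measured_score(strength, biases, rng, noise=0.3):
    """Mean result over one position set: true strength plus each
    position's inherent bias plus independent per-game noise."""
    return statistics.fmean(strength + b + rng.gauss(0, noise) for b in biases)

def diff_spread(shared_positions, n_positions=100, n_trials=2000,
                position_spread=0.2, seed=7):
    """Std-dev, across many simulated test runs, of the measured
    difference between version B (true strength +0.05) and A (0.0)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        biases_a = [rng.gauss(0, position_spread) for _ in range(n_positions)]
        if shared_positions:
            biases_b = biases_a        # same fixed positions for both versions
        else:
            biases_b = [rng.gauss(0, position_spread) for _ in range(n_positions)]
        diffs.append(measured_score(0.05, biases_b, rng)
                     - measured_score(0.00, biases_a, rng))
    return statistics.stdev(diffs)
```

With shared positions, the per-position bias cancels in the difference, so `diff_spread(True)` comes out smaller than `diff_spread(False)`. Note the error bar of a single run is unchanged either way; the gain is specifically in comparisons between runs.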
-
- Posts: 454
- Joined: Sat Apr 04, 2009 6:44 pm
- Location: Bulgaria
Re: What your opinion about this testing methodology?
I forgot to mention that mine are not selected randomly. I have Axx Bxx Cxx Dxx Exx openings mixed exactly that way, and then again Axx Bxx..., but none of them are duplicated in the entire set.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: What your opinion about this testing methodology?
Your engine always gets optimized for the positions you test with. If you want to just kick butt at, say, a few Noomen positions, then this is the way to test.

Kempelen wrote: I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of middle-game starting positions. The point is always repeating the same games with the same positions, but with enough positions that the engine plays a varied range of games.

Ferdy wrote: Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
It isn't the holy grail, but if some guy grabs your engine and plays those positions against other engines, you'll beat them big time.
What you need is a mix of everything, and to innovate every few years.
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: What your opinion about this testing methodology?
lucasart wrote: It seems pretty obvious that it lowers the error bar. In fact, the whole estimation model implicitly assumes that you do this.

Sven Schüle wrote: The point is, the positions are selected once at random, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long time now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of newly choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Kempelen wrote: I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of middle-game starting positions. The point is always repeating the same games with the same positions, but with enough positions that the engine plays a varied range of games.

Ferdy wrote: Your methodology is favorable considering your goal. There is no point in testing 1. b3 e5 positions if they are not in your engine's repertoire. Only use test positions where you want your engine to be. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.
Perhaps start from a smaller number of positions, and as your engine improves on them, add other positions to be considered in its repertoire.
But I have a bad feeling about it; to me the engine should be able to handle all kinds of positions: blocked, open, full of pinned pieces, etc.
Sven
Let's say that the score of engine A vs B is distributed under a probability law P(mu, sigma) with mean mu and standard deviation sigma. That means that, given equal chances from the starting position, the distribution of the result should be P(mu, sigma). However, if a position is chosen that favors A or B, then the distribution will be something like Q(position)·P(mu, sigma), where Q(position) is centered around 1 and is larger or smaller depending on whether A or B is favored. The fact that E(Q) = 1 may still ensure an unbiased estimator, but with a higher variance...
No need to be an expert in statistics to understand it, at least intuitively. You can write it down cleanly too, and it isn't hard!
PS: please no nitpicking over the details; I purposely kept the math notation oversimplified.

I think this is not about positions favoring either side A or B; the selected positions have to be "balanced". Instead, it is all about the difference between:
a) always using the same set of starting positions (for each single "test tournament"), or
b) repeating the step of choosing a set of starting positions for each "test tournament".
The statement was then that method a) would result in lower error bars; that is not my own claim, but what I recall being mentioned by someone else in the past.
Sven
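The difference between a) and b) can be made precise with a small model; this is a sketch under simplifying assumptions (independent Gaussian terms, n games per run), not anything from the thread itself:

```latex
% Game on position i for engine version v:
%   X_i^{(v)} = \mu_v + b_i + \varepsilon_i^{(v)},
% where b_i is the position's inherent bias and \varepsilon_i^{(v)} is per-game noise.
% Estimated strength difference between two versions:
\widehat{\Delta} = \bar{X}^{(2)} - \bar{X}^{(1)}
% (a) same n positions in both runs: the b_i cancel in the difference,
\operatorname{Var}_{(a)}\!\bigl(\widehat{\Delta}\bigr) = \frac{2\sigma_\varepsilon^2}{n}
% (b) fresh positions for each run: the b_i do not cancel,
\operatorname{Var}_{(b)}\!\bigl(\widehat{\Delta}\bigr) = \frac{2\bigl(\sigma_b^2 + \sigma_\varepsilon^2\bigr)}{n}
```

So under these assumptions method a) gives the lower error bar on the comparison between versions, while the error bar of a single run is the same either way.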