What is your opinion about this testing methodology?

Discussion of chess software programming and technical issues.


Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

What is your opinion about this testing methodology?

Post by Kempelen »

Should it work?

I have been thinking about ways to make better use of my testing time. People usually run tournaments from the startup position, tournaments with a very limited set of positions (e.g. 32), or tournaments with a lot of random positions. I assume everyone does this with a minimum of 1,000 to 4,000 games.

.... but ....

what about repeating the same tournament, with the same opponents and the same positions per opponent? Assuming the set of positions is very large...

Example:
Game 1: against Crafty, black, position #540 from the FEN file 'myfenpositions.epd'
Game 2: against Critter, white, position #3251 from the FEN file 'myfenpositions.epd'
...
etc.

The idea is that the position numbers would always be the same, not chosen randomly each time, without repeating any FEN, but varied enough.
The tournament file for the tournament manager would always be the same, with no need to recreate the tournament. The test would repeat identically every time.

Would results be more comparable between tests than randomly choosing the starting positions?
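
Something like this could generate the fixed schedule (a minimal sketch in Python; the opponent names, counts, and seed are only illustrative, and the indices are meant to point into 'myfenpositions.epd'):

Code: Select all

import random

OPPONENTS = ["Crafty", "Critter"]   # illustrative opponent pool
NUM_POSITIONS = 4000                # number of FENs in 'myfenpositions.epd'
GAMES_PER_OPPONENT = 1000

# Fixed seed: the positions are drawn at random exactly once, then the
# schedule is frozen and replayed unchanged in every future test.
rng = random.Random(42)

schedule = []
for opponent in OPPONENTS:
    # sample without replacement, so no FEN repeats for this opponent
    indices = rng.sample(range(NUM_POSITIONS), GAMES_PER_OPPONENT)
    for game, idx in enumerate(indices):
        color = "black" if game % 2 else "white"
        schedule.append((opponent, color, idx))

# Every run produces the identical schedule, so two test tournaments
# differ only in the engine version under test, never in the openings.
for opponent, color, idx in schedule[:2]:
    print("against %s, %s, position number %d" % (opponent, color, idx))

The tournament manager would then simply replay this list, so the tournament file never needs to be recreated.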
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: What is your opinion about this testing methodology?

Post by Ferdy »

Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: What is your opinion about this testing methodology?

Post by Mincho Georgiev »

I'm somewhere in between: testing with the same tournament, the same opponents (4-5), and a fixed set of 320 positions currently. 5 (opponents) x 2 (colors) x 320 (positions) = 3,200 games.
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: What is your opinion about this testing methodology?

Post by Kempelen »

Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: What is your opinion about this testing methodology?

Post by Sven »

Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
The point is that the positions are selected at random once, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: What is your opinion about this testing methodology?

Post by Ferdy »

Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
Did I say a limited set of opening positions? Do not underestimate it when I say start from a smaller number of positions, just as I don't underestimate it when you say large. Even when you say very large there is still a limit: how many exactly is very large? Even from your first sentence ("make better use of testing time") I understand that you also have limited resources.
When I said there is a drawback, it is because your scheme is like this:
vs Crafty, use positions 1 to 100 or something
vs Critter, use positions 101 to 200
...
Now if you always use this test, you will probably improve your score vs Crafty on positions 1 to 100, and likewise vs Critter on positions 101 to 200. But the question is: will an engine tuned to play vs Crafty on positions 1 to 100 be equally good when it plays another engine on the same opening test set?
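
To make this concrete, the difference between the two schemes could be sketched like this (hypothetical names and counts, not from anyone's real setup):

Code: Select all

OPPONENTS = ["Crafty", "Critter", "EngineC"]   # made-up pool
NUM_POSITIONS = 300

# Block scheme as described above: each opponent "owns" one contiguous
# slice, so a tuning gain may be specific to that opponent/slice pairing.
blocks = {opp: list(range(i * 100, (i + 1) * 100))
          for i, opp in enumerate(OPPONENTS)}

# Interleaved alternative: position k goes to opponent k mod N, so every
# opponent samples the whole range and opponent-specific tuning is harder.
interleaved = {opp: [k for k in range(NUM_POSITIONS)
                     if k % len(OPPONENTS) == i]
               for i, opp in enumerate(OPPONENTS)}

print(blocks["Crafty"][:5])        # [0, 1, 2, 3, 4]
print(interleaved["Crafty"][:5])   # [0, 3, 6, 9, 12]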
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: What is your opinion about this testing methodology?

Post by lucasart »

Sven Schüle wrote:
Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
The point is that the positions are selected at random once, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
It seems pretty obvious that it lowers the error bar. In fact the whole estimation model implicitly assumes that you do this.

Let's say that the score of engine A vs B is distributed under a probability law P(mu, sigma) with mean mu and standard deviation sigma. That means that, given equal chances from the starting position, the distribution of the result should be P(mu, sigma). However, if a position is chosen that favors A or B, then the distribution will be something like Q(position)·P(mu, sigma), where Q(position) is centered around 1 and is larger or smaller depending on whether A or B is favored. The fact that E(Q) = 1 may still give an unbiased estimator, but with a higher variance...

No need to be an expert in statistics to understand it, at least intuitively. You can write it cleanly too, and it isn't hard!

PS: please no ball busting over the details; I purposely kept the math notation oversimplified.
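
For instance (a simplified sketch in my own notation, not a rigorous model):

Code: Select all

% One game from opening p: true strength difference mu, opening bias b(p),
% in-game noise eps; openings are balanced on average, E[b(p)] = 0.
\[
  S = \mu + b(p) + \varepsilon,
  \qquad \operatorname{Var}(b) = \sigma_b^2,
  \qquad \operatorname{Var}(\varepsilon) = \sigma_\varepsilon^2 .
\]
% Mean score over n games when the openings are freshly drawn per test:
\[
  \operatorname{Var}(\bar S) = \frac{\sigma_\varepsilon^2 + \sigma_b^2}{n} .
\]
% Two versions A and B measured on the SAME frozen openings share the
% b(p) terms, which cancel in the difference of the means (assuming
% independent in-game noise), so the comparison loses the sigma_b^2 term:
\[
  \operatorname{Var}(\bar S_A - \bar S_B)
    = \frac{\sigma_{\varepsilon,A}^2 + \sigma_{\varepsilon,B}^2}{n} .
\]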
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: What is your opinion about this testing methodology?

Post by Mincho Georgiev »

I forgot to mention that mine are not selected randomly. I have Axx, Bxx, Cxx, Dxx, Exx (ECO code) openings mixed in exactly that order, and then again Axx, Bxx, ..., but none of them are duplicated within the entire set.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: What is your opinion about this testing methodology?

Post by diep »

Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
Your engine always gets optimized for the positions you test with. If you just want to kick butt at, say, a few Noomen positions, then this is the way to test.

It isn't the holy grail, but if some guy grabs your engine and plays those positions against other engines, then you'll beat them big time.

What you need is a mix of everything, and to innovate every few years.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: What is your opinion about this testing methodology?

Post by Sven »

lucasart wrote:
Sven Schüle wrote:
Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
The point is that the positions are selected at random once, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
It seems pretty obvious that it lowers the error bar. In fact the whole estimation model implicitly assumes that you do this.

Let's say that the score of engine A vs B is distributed under a probability law P(mu, sigma) with mean mu and standard deviation sigma. That means that, given equal chances from the starting position, the distribution of the result should be P(mu, sigma). However, if a position is chosen that favors A or B, then the distribution will be something like Q(position)·P(mu, sigma), where Q(position) is centered around 1 and is larger or smaller depending on whether A or B is favored. The fact that E(Q) = 1 may still give an unbiased estimator, but with a higher variance...

No need to be an expert in statistics to understand it, at least intuitively. You can write it cleanly too, and it isn't hard!

PS: please no ball busting over the details; I purposely kept the math notation oversimplified.
I think this is not about positions favoring either side A or B; the selected positions have to be "balanced". Instead, it is all about the choice between:
a) always using the same set of starting positions (for each single "test tournament"), or
b) choosing a new set of starting positions for each "test tournament".

The statement was then that method a) results in lower error bars; that is not my own claim but what I recall someone else mentioning in the past.
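
The claim is easy to check with a toy simulation (made-up numbers, two equal-strength versions, and a stylized Gaussian score model rather than real games):

Code: Select all

import random
import statistics

POOL = 10000    # available openings
GAMES = 1000    # games per test tournament
TRIALS = 500    # repeated test tournaments

random.seed(1)
bias = [random.gauss(0, 0.15) for _ in range(POOL)]  # per-opening imbalance
frozen = random.sample(range(POOL), GAMES)           # method a): chosen once

def run_test(openings):
    # average score of the version under test; true difference is zero
    return sum(0.5 + bias[p] + random.gauss(0, 0.25)
               for p in openings) / len(openings)

# method a): both measurements use the same frozen opening set
diff_a = [run_test(frozen) - run_test(frozen) for _ in range(TRIALS)]
# method b): each measurement redraws its own openings
diff_b = [run_test(random.sample(range(POOL), GAMES)) -
          run_test(random.sample(range(POOL), GAMES))
          for _ in range(TRIALS)]

print("spread of measured difference, frozen openings :",
      statistics.stdev(diff_a))
print("spread of measured difference, redrawn openings:",
      statistics.stdev(diff_b))
# The frozen set shows the smaller spread: the opening-bias term is common
# to both measurements and cancels, which is the lower error bar of a).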

Sven