What is your opinion about this testing methodology?

Discussion of chess software programming and technical issues.


Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

What is your opinion about this testing methodology?

Post by Kempelen »

Should it work?

I have been thinking about ways to make better use of my testing time. People usually run tournaments from the startup position, tournaments with a very limited set of positions (e.g. 32), or tournaments with a lot of random positions. I assume everyone does this with a minimum of 1,000 to 4,000 games.

.... but ....

what about repeating the same tournament, with the same opponents and the same positions per opponent? Assuming the set of positions is very large...

Example:
Game 1: against Crafty, black, position #540 from the FEN file 'myfenpositions.epd'
Game 2: against Critter, white, position #3251 from the FEN file 'myfenpositions.epd'
...
etc.

The idea is that the position numbers would always be the same, not chosen randomly each time, without repeating any FEN, but varied enough.
The tournament file for the tournament manager would always be the same, with no need to recreate the tournament. The test would repeat identically every time.

Would results be more comparable between tests than randomly choosing the starting positions?
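
Something like this could generate the fixed schedule (a minimal sketch in Python; the opponent names, counts, and seed are only illustrative, and the indices are meant to point into 'myfenpositions.epd'):

Code: Select all

import random

OPPONENTS = ["Crafty", "Critter"]   # illustrative opponent pool
NUM_POSITIONS = 4000                # number of FENs in 'myfenpositions.epd'
GAMES_PER_OPPONENT = 1000

# Fixed seed: the positions are drawn at random exactly once, then the
# schedule is frozen and replayed unchanged in every future test.
rng = random.Random(42)

schedule = []
for opponent in OPPONENTS:
    # sample without replacement, so no FEN repeats for this opponent
    indices = rng.sample(range(NUM_POSITIONS), GAMES_PER_OPPONENT)
    for game, idx in enumerate(indices):
        color = "black" if game % 2 else "white"
        schedule.append((opponent, color, idx))

# Every run produces the identical schedule, so two test tournaments
# differ only in the engine version under test, never in the openings.
for opponent, color, idx in schedule[:2]:
    print("against %s, %s, position number %d" % (opponent, color, idx))

The tournament manager would then simply replay this list, so the tournament file never needs to be recreated.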
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: What is your opinion about this testing methodology?

Post by Ferdy »

Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: What is your opinion about this testing methodology?

Post by Mincho Georgiev »

I'm somewhere in between: testing with the same tournament, the same opponents (4-5), and a fixed set of 320 positions currently. 5 (opponents) x 2 (colors) x 320 (positions) = 3,200 games.
Kempelen
Posts: 620
Joined: Fri Feb 08, 2008 10:44 am
Location: Madrid - Spain

Re: What is your opinion about this testing methodology?

Post by Kempelen »

Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
Fermin Serrano
Author of 'Rodin' engine
http://sites.google.com/site/clonfsp/
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: What is your opinion about this testing methodology?

Post by Sven »

Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
The point is that the positions are selected at random once, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: What is your opinion about this testing methodology?

Post by Ferdy »

Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
Did I say a limited set of opening positions? Do not underestimate it when I say start from a smaller number of positions, just as I don't underestimate it when you say large. Even when you say very large there is still a limit: how many exactly is very large? Even from your first sentence ("make better use of testing time") I understand that you also have limited resources.
When I said there is a drawback, it is because your scheme is like this:
vs Crafty, use positions 1 to 100 or something
vs Critter, use positions 101 to 200
...
Now if you always use this test, you will probably improve your score vs Crafty on positions 1 to 100, and likewise vs Critter on positions 101 to 200. But the question is: will an engine tuned to play vs Crafty on positions 1 to 100 be equally good when it plays another engine on the same opening test set?
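
To make this concrete, the difference between the two schemes could be sketched like this (hypothetical names and counts, not from anyone's real setup):

Code: Select all

OPPONENTS = ["Crafty", "Critter", "EngineC"]   # made-up pool
NUM_POSITIONS = 300

# Block scheme as described above: each opponent "owns" one contiguous
# slice, so a tuning gain may be specific to that opponent/slice pairing.
blocks = {opp: list(range(i * 100, (i + 1) * 100))
          for i, opp in enumerate(OPPONENTS)}

# Interleaved alternative: position k goes to opponent k mod N, so every
# opponent samples the whole range and opponent-specific tuning is harder.
interleaved = {opp: [k for k in range(NUM_POSITIONS)
                     if k % len(OPPONENTS) == i]
               for i, opp in enumerate(OPPONENTS)}

print(blocks["Crafty"][:5])        # [0, 1, 2, 3, 4]
print(interleaved["Crafty"][:5])   # [0, 3, 6, 9, 12]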
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: What is your opinion about this testing methodology?

Post by lucasart »

Sven Schüle wrote:
Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
The point is that the positions are selected at random once, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
It seems pretty obvious that it lowers the error bar. In fact the whole estimation model implicitly assumes that you do this.

Let's say that the score of engine A vs B is distributed under a probability law P(mu, sigma) with mean mu and standard deviation sigma. That means that, given equal chances from the starting position, the distribution of the result should be P(mu, sigma). However, if a position is chosen that favors A or B, then the distribution will be something like Q(position)·P(mu, sigma), where Q(position) is centered around 1 and is larger or smaller depending on whether A or B is favored. The fact that E(Q) = 1 may still give an unbiased estimator, but with a higher variance...

No need to be an expert in statistics to understand it, at least intuitively. You can write it cleanly too, and it isn't hard!

PS: please no ball busting over the details; I purposely kept the math notation oversimplified.
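
For instance (a simplified sketch in my own notation, not a rigorous model):

Code: Select all

% One game from opening p: true strength difference mu, opening bias b(p),
% in-game noise eps; openings are balanced on average, E[b(p)] = 0.
\[
  S = \mu + b(p) + \varepsilon,
  \qquad \operatorname{Var}(b) = \sigma_b^2,
  \qquad \operatorname{Var}(\varepsilon) = \sigma_\varepsilon^2 .
\]
% Mean score over n games when the openings are freshly drawn per test:
\[
  \operatorname{Var}(\bar S) = \frac{\sigma_\varepsilon^2 + \sigma_b^2}{n} .
\]
% Two versions A and B measured on the SAME frozen openings share the
% b(p) terms, which cancel in the difference of the means (assuming
% independent in-game noise), so the comparison loses the sigma_b^2 term:
\[
  \operatorname{Var}(\bar S_A - \bar S_B)
    = \frac{\sigma_{\varepsilon,A}^2 + \sigma_{\varepsilon,B}^2}{n} .
\]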
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: What is your opinion about this testing methodology?

Post by Mincho Georgiev »

I forgot to mention that mine are not selected randomly. I have Axx, Bxx, Cxx, Dxx, Exx (ECO code) openings mixed in exactly that order, and then again Axx, Bxx, ..., but none of them are duplicated within the entire set.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: What is your opinion about this testing methodology?

Post by diep »

Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
Your engine always gets optimized for the positions you test with. If you just want to kick butt at, say, a few Noomen positions, then this is the way to test.

It isn't the holy grail, but if some guy grabs your engine and plays those positions against other engines, then you'll beat them big time.

What you need is a mix of everything, and to innovate every few years.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: What is your opinion about this testing methodology?

Post by Sven »

lucasart wrote:
Sven Schüle wrote:
Kempelen wrote:
Ferdy wrote: Your methodology is favorable considering your goal. There is no point testing 1. b3 e5 positions if that line is not in your engine's repertoire. Only use test positions where you want your engine to play. Of course there are drawbacks, but those can be overcome, as you say, by using a large number of selected positions.

Perhaps start from a smaller number of positions, and as your engine improves on those, add other positions to be considered part of its repertoire.

But I have a bad feeling about it: to me the engine should be able to handle all kinds of positions, whether blocked, open, full of pinned pieces, etc.
I think you misunderstand my idea. The goal is not to test only a limited set of opening positions, but a large and varied set of starting middle-game positions. The point is to always repeat the same games with the same positions, with enough positions that the engine plays a varied set of games.
The point is that the positions are selected at random once, but then the same positions are always used for testing. That's exactly what Bob has been doing for a long while now, and so have lots of other people, so it is not a new method but a kind of "de facto standard". I recall there were long discussions about the details a few years ago. Doing it that way, instead of choosing different positions at random each time, has been found to result in lower error bars, as far as I remember. I guess Bob and the other experts in statistics can explain the exact reasons.

Sven
It seems pretty obvious that it lowers the error bar. In fact the whole estimation model implicitly assumes that you do this.

Let's say that the score of engine A vs B is distributed under a probability law P(mu, sigma) with mean mu and standard deviation sigma. That means that, given equal chances from the starting position, the distribution of the result should be P(mu, sigma). However, if a position is chosen that favors A or B, then the distribution will be something like Q(position)·P(mu, sigma), where Q(position) is centered around 1 and is larger or smaller depending on whether A or B is favored. The fact that E(Q) = 1 may still give an unbiased estimator, but with a higher variance...

No need to be an expert in statistics to understand it, at least intuitively. You can write it cleanly too, and it isn't hard!

PS: please no ball busting over the details; I purposely kept the math notation oversimplified.
I think this is not about positions favoring either side A or B; the selected positions have to be "balanced". Instead, it is all about the choice between:
a) always using the same set of starting positions (for each single "test tournament"), or
b) choosing a new set of starting positions for each "test tournament".

The statement was then that method a) results in lower error bars; that is not my own claim but what I recall someone else mentioning in the past.
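
The claim is easy to check with a toy simulation (made-up numbers, two equal-strength versions, and a stylized Gaussian score model rather than real games):

Code: Select all

import random
import statistics

POOL = 10000    # available openings
GAMES = 1000    # games per test tournament
TRIALS = 500    # repeated test tournaments

random.seed(1)
bias = [random.gauss(0, 0.15) for _ in range(POOL)]  # per-opening imbalance
frozen = random.sample(range(POOL), GAMES)           # method a): chosen once

def run_test(openings):
    # average score of the version under test; true difference is zero
    return sum(0.5 + bias[p] + random.gauss(0, 0.25)
               for p in openings) / len(openings)

# method a): both measurements use the same frozen opening set
diff_a = [run_test(frozen) - run_test(frozen) for _ in range(TRIALS)]
# method b): each measurement redraws its own openings
diff_b = [run_test(random.sample(range(POOL), GAMES)) -
          run_test(random.sample(range(POOL), GAMES))
          for _ in range(TRIALS)]

print("spread of measured difference, frozen openings :",
      statistics.stdev(diff_a))
print("spread of measured difference, redrawn openings:",
      statistics.stdev(diff_b))
# The frozen set shows the smaller spread: the opening-bias term is common
# to both measurements and cancels, which is the lower error bar of a).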

Sven