search extensions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

mcostalba wrote:
elcabesa wrote: I'm looking at your patch, let me try to verify if I have understood it.

For PV nodes you do null-move verification with a different threshold; if it fails low and the verification search fails high -> extend the first move of the search by 1.
Yes, I do this.
"first move"? Why the first move? The original idea was to extend the move that fails high or becomes a new PV move...
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: search extensions

Post by jdart »

I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: search extensions

Post by Henk »

jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used to say that the Elo of engines cannot be trusted or compared to the Elo of humans, because engines almost never play against grandmasters.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

Henk wrote:
jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used if one says that ELO of engines can not be trusted or compared to ELO of humans for engines almost never play against grandmasters.
Why would one care about how you do against WEAKER opposition, so long as you play better against stronger opposition? However, many of us DO test against GMs all the time, just not in the same controlled environment that cluster testing provides.

However, depending only on self-testing will absolutely lead you down some false-positive/false-negative pathways. I saw several false positives during debugging self-test runs that were slapped silly by gauntlet testing...

I like self-testing to expose bugs, not to evaluate whether changes are good or bad beyond actually broken code.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: search extensions

Post by Henk »

bob wrote:
Henk wrote:
jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used if one says that ELO of engines can not be trusted or compared to ELO of humans for engines almost never play against grandmasters.
Why would one care about how you do against WEAKER opposition, just so long as you play better against stronger opposition. However, many of us DO test against GMs all the time. Just not in the same controlled environment as cluster testing provides.
Because there is no stronger human opposition. It might be that the best engines are not 400 Elo better than the top five grandmasters but only 150, unless they also play often against humans.

Self-play also applies at the group level. There are only two groups: the engine group and the human group. Humans against humans is self-play; engines against engines is also self-play.

Perhaps there could be more groups, for instance brute-force-like engines.
Eelco de Groot
Posts: 4565
Joined: Sun Mar 12, 2006 2:40 am

Re: search extensions

Post by Eelco de Groot »

Uri Blass wrote:
bob wrote:
lucasart wrote:
bob wrote: (1) simple singular extensions as found in robbolito/stockfish/who knows what other program. The idea is that (a) a move from the hash table is a candidate singular move. Ignoring a few details, you search every other move (except for the hash move) using an offset window, and if all other moves fail low against that window, this move gets extended. Never thought much of it, and at one point I removed it from Stockfish and my cluster testing (gauntlet NOT self-testing) suggested it was a zero gain/loss idea. I've tested this extensively on Crafty and reached the same conclusion, it doesn't help me at all. When I do self-testing, there were tunings that seemed to gain 5-10 Elo, but when testing against a gauntlet, they were either Elo losing or break-even. I gave up on this.
I also never managed to gain anything by SE in my engine. But in SF the gain is prodigious. More than 20 elo IIRC.

So you are saying that it's only a gain in self-play and just break-even against foreign opponents? I've heard so many claims like that, of patches that behave differently against foreign opponents than in self-play, but never seen any evidence myself.

It would be remarkable if this claim is correct. And would certainly question our whole SF testing methodology!
I tested the SE in stockfish a couple of years back. I found no significant gain, and reported same here.
It may be interesting if you test again, because I think that the value of singular extensions for stockfish became bigger later (at least based on stockfish-stockfish games).

Here is the latest test from the stockfish framework, almost one year ago.
Note that the stockfish team stopped the test in the middle, so we do not have an unbiased estimate, but it is probably safe to say that singular extensions give stockfish at least 10 elo at the 15+0.05 time control against the previous version.

http://tests.stockfishchess.org/tests/v ... 49c4e73429

ELO: -24.37 +-7.2 (95%) LOS: 0.0%
Total: 3342 W: 509 L: 743 D: 2090

The Stockfish framework at the moment is empty. Could we not run this test again, maybe at LTC, 60" + 0.05", if Robert requested it? And this time let it finish? If the expected difference is large, you could run a small number of games if necessary and still get a significant result. Lucas has deleted the testing branch, so we don't know exactly how it was tested. But if Bob could run the same test branch, but with a gauntlet, and still finds no significant gain, while the Stockfish framework does find a significant gain for Stockfish's version of singular extensions, then there has to be either a significant difference in testing methodology (assuming Robert uses gauntlet testing, that would be one difference, but there can be others) or an error in one of the tests.

If you don't do this, and keep talking about apparently very different results from way back in the past for one and the same engine (Stockfish), that makes the rest of the discussion a bit useless, I think.
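For readers not familiar with the singular-extension scheme Bob describes in the quote above (hash move as the candidate, all other moves searched against an offset window), a minimal sketch might look like the following. The names (tt_move, tt_score, singular_margin, search()), the reduced depth, and the margin value are my own illustrative assumptions, not Stockfish's actual implementation:

Code:

#include <vector>

// Illustrative stand-ins only; a real engine supplies these.
struct Move { int id; };
struct Position {
    std::vector<Move> legal_moves() const { return {}; }
    void do_move(Move)   {}
    void undo_move(Move) {}
};
static int search(Position&, int /*alpha*/, int /*beta*/, int /*depth*/) { return 0; }  // stub

// Singular-extension test as described in the quoted post: the hash (TT) move
// is the candidate; every other move is searched at reduced depth against a
// zero-width window offset below the TT score. If they all fail low, the TT
// move is "singular" and gets extended by one ply.
bool is_singular(Position& pos, Move tt_move, int tt_score, int depth,
                 int singular_margin = 50)   // assumed tuning constant
{
    int threshold = tt_score - singular_margin;
    for (Move m : pos.legal_moves()) {
        if (m.id == tt_move.id)
            continue;                               // skip the candidate itself
        pos.do_move(m);
        // Negamax null-window test of "score >= threshold" for move m.
        int score = -search(pos, -threshold, -threshold + 1, depth / 2);
        pos.undo_move(m);
        if (score >= threshold)
            return false;                           // some alternative also holds
    }
    return true;                                    // all alternatives fail low -> extend the TT move
}

With the stub search() this of course always returns true; it is only meant to show the shape of the offset-window test being discussed, so the two testing methodologies can be compared against the same idea.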
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

Henk wrote:
bob wrote:
Henk wrote:
jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used if one says that ELO of engines can not be trusted or compared to ELO of humans for engines almost never play against grandmasters.
Why would one care about how you do against WEAKER opposition, just so long as you play better against stronger opposition. However, many of us DO test against GMs all the time. Just not in the same controlled environment as cluster testing provides.
For there is no stronger human opposition. It might be that the best engines are not playing 400 ELO better than top five grandmasters but only 150 ELO unless they play often against humans as well.

Self-play also holds for a group. There are only two members the engine-group and the human-group. Humans against humans is self-play. Engines against engines is also self-play.

Perhaps there might be more groups for instance brute force like engines.
Humans against humans is anything but "self-play" unless you play yourself. I don't follow that. There is a big difference between two identical opponents playing each other and a group of "similar" opponents playing. I agree that a larger group is better, but with humans you introduce a lot of noise into the measurement: computers are stable in their playing level, while humans can be wildly variable.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: search extensions

Post by Henk »

bob wrote:
Henk wrote:
bob wrote:
Henk wrote:
jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used if one says that ELO of engines can not be trusted or compared to ELO of humans for engines almost never play against grandmasters.
Why would one care about how you do against WEAKER opposition, just so long as you play better against stronger opposition. However, many of us DO test against GMs all the time. Just not in the same controlled environment as cluster testing provides.
For there is no stronger human opposition. It might be that the best engines are not playing 400 ELO better than top five grandmasters but only 150 ELO unless they play often against humans as well.

Self-play also holds for a group. There are only two members the engine-group and the human-group. Humans against humans is self-play. Engines against engines is also self-play.

Perhaps there might be more groups for instance brute force like engines.
Humans against humans is anything but "self-play" unless you play yourself. I don't follow that. There is a big difference between two identical opponents playing and a group of "similar" opponents playing. I agree that a larger group is better, but with humans you introduce a lot of noise into the measurement, where computers are stable in their playing level but humans can be wildly variable.
Since there are groups of similar opponents, a test set should contain members that represent each group. For instance, engines are mostly tactical players, so you should have positional players too.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: search extensions

Post by bob »

Henk wrote:
bob wrote:
Henk wrote:
bob wrote:
Henk wrote:
jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used if one says that ELO of engines can not be trusted or compared to ELO of humans for engines almost never play against grandmasters.
Why would one care about how you do against WEAKER opposition, just so long as you play better against stronger opposition. However, many of us DO test against GMs all the time. Just not in the same controlled environment as cluster testing provides.
For there is no stronger human opposition. It might be that the best engines are not playing 400 ELO better than top five grandmasters but only 150 ELO unless they play often against humans as well.

Self-play also holds for a group. There are only two members the engine-group and the human-group. Humans against humans is self-play. Engines against engines is also self-play.

Perhaps there might be more groups for instance brute force like engines.
Humans against humans is anything but "self-play" unless you play yourself. I don't follow that. There is a big difference between two identical opponents playing and a group of "similar" opponents playing. I agree that a larger group is better, but with humans you introduce a lot of noise into the measurement, where computers are stable in their playing level but humans can be wildly variable.
For there are groups of similar opponents a test set should contain members which represent each group. For instance engines are mostly tactical players so you should have positional players too.
It's "good in theory, but impossible in reality." You need consistent results. Computers provide that quick easily. Humans, never. Over 30K games, a computer's Elo varies by +/- 3 Elo or so. A human, +/- 200 when you take the edge cases of being tired, depressed, sick, hungry, angry, etc...
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: search extensions

Post by Henk »

bob wrote:
Henk wrote:
bob wrote:
Henk wrote:
bob wrote:
Henk wrote:
jdart wrote:I have been using gauntlet testing almost exclusively for the past couple of years. I think with self-play you are not likely to find some weaknesses because your engine is not tuned to exploit them, while another engine may do that.

--Jon
This argument could also be used if one says that ELO of engines can not be trusted or compared to ELO of humans for engines almost never play against grandmasters.
Why would one care about how you do against WEAKER opposition, just so long as you play better against stronger opposition. However, many of us DO test against GMs all the time. Just not in the same controlled environment as cluster testing provides.
For there is no stronger human opposition. It might be that the best engines are not playing 400 ELO better than top five grandmasters but only 150 ELO unless they play often against humans as well.

Self-play also holds for a group. There are only two members the engine-group and the human-group. Humans against humans is self-play. Engines against engines is also self-play.

Perhaps there might be more groups for instance brute force like engines.
Humans against humans is anything but "self-play" unless you play yourself. I don't follow that. There is a big difference between two identical opponents playing and a group of "similar" opponents playing. I agree that a larger group is better, but with humans you introduce a lot of noise into the measurement, where computers are stable in their playing level but humans can be wildly variable.
For there are groups of similar opponents a test set should contain members which represent each group. For instance engines are mostly tactical players so you should have positional players too.
It's "good in theory, but impossible in reality." You need consistent results. Computers provide that quick easily. Humans, never. Over 30K games, a computer's Elo varies by +/- 3 Elo or so. A human, +/- 200 when you take the edge cases of being tired, depressed, sick, hungry, angry, etc...
Then the test set should contain engines that represent each group. For instance, a test set should not contain only tactically oriented engines but also strategic/positional ones.