Singular Extensions

bob · Post by **bob** » Tue Aug 03, 2010 5:24 pm

Don wrote:
Mangar wrote:
bob wrote:My comment is simply that such a thing is a result of poor testing, because one should _never_ test multiple changes at the same time.
Hi,

this is not always true. There might be single changes that don't improve anything on their own but a combination of them that does.
Recently I had the situation that 4 changes in reduction/extension, tested separately, didn't bring anything, but all together they gained a good amount of Elo. I think that testing only single issues gives you a high chance of getting stuck in a local maximum.

Greetings Volker
I suppose it's in the interpretation. I would view this as a single change or if you prefer a single "compound change" because it's all part of the same thing.

You and Bob bring up an interesting issue - can changes be tested in combination? H.G. Muller suggested something called orthogonal multi-testing many months (perhaps years) ago.

It may be that you CAN combine changes if you set up your testing accordingly, but you are still testing them individually as it is required that you separate them. You would not test 2 separate things combined into a single change unless you were convinced that it makes sense and you were looking specifically for interactions - but you could even do that with multi-testing. You could look for pair-wise interactions of everything you test for that matter.

The idea is sound. The implementation details are difficult. You have to be certain that there is no correlation between the two things you are testing, so that you can get away with the reduced number of games. If you are making a search change and an eval change, you might well get away with this. But I do not run that kind of test very often. I'm either working on search changes or testing eval changes, but I don't do both at the same time. That makes life far simpler.
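To make the "pair-wise interactions" idea concrete, here is a minimal sketch of how four measurements (baseline, +A, +B, +A+B) could be split into two main effects and an interaction term. The function name and the ratings are hypothetical and chosen only to show the arithmetic; this is not H.G. Muller's actual multi-testing scheme, and it says nothing about how many games each measurement needs.

```python
def factorial_effects(base, with_a, with_b, with_ab):
    """Split four Elo measurements into main effects and an interaction.

    All four arguments are Elo ratings measured against the same opponents:
    the baseline, baseline + A, baseline + B, and baseline + A + B.
    """
    effect_a = with_a - base               # gain from A alone
    effect_b = with_b - base               # gain from B alone
    additive = base + effect_a + effect_b  # prediction if A and B were independent
    interaction = with_ab - additive       # whatever the additive model misses
    return effect_a, effect_b, interaction


# Hypothetical ratings, chosen only to illustrate the arithmetic.
effect_a, effect_b, interaction = factorial_effects(
    base=2800.0, with_a=2801.0, with_b=2799.0, with_ab=2820.0
)
print(effect_a, effect_b, interaction)
```

A large interaction term only means something if it is bigger than the error bars on the four ratings that went into it.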

Mangar · Post by **Mangar** » Wed Aug 04, 2010 11:42 am

Hi Bob,

IMHO most things in search/eval are related, and it's hard to see how strongly they are related. There is a high risk of misjudging this relation.

I expect that one of the main reasons why newly written chess engines are able to get this strong is that there is a strong relation between different search terms and eval, coming from pruning techniques like LMR.
All engines whose search and eval had been heavily optimized without LMR have a huge drawback, as this optimization is counterproductive. I think that there is no way to optimize toward an LMR search only by proving one change after the other.

Greetings Volker

bob · Post by **bob** » Wed Aug 04, 2010 9:39 pm

Mangar wrote:Hi Bob,

IMHO most things in search/eval are related, and it's hard to see how strongly they are related. There is a high risk of misjudging this relation.

I expect that one of the main reasons why newly written chess engines are able to get this strong is that there is a strong relation between different search terms and eval, coming from pruning techniques like LMR.
All engines whose search and eval had been heavily optimized without LMR have a huge drawback, as this optimization is counterproductive. I think that there is no way to optimize toward an LMR search only by proving one change after the other.

Greetings Volker

There is certainly some correlation. But once an engine is running, I can't imagine adding feature A and getting nothing, then adding B and getting nothing, and then adding A+B and getting something significant. Of course, if the two changes are directly related, this would happen... in one place you evaluate pawn structure, then you use this information in another place. But those are obviously part of one big change.

But I can't visualize how +A is no better, +B is no better, but +AB is significantly better.

Mangar · Post by **Mangar** » Thu Aug 05, 2010 1:29 pm

Hi,

in "my" case the four changes were changes to move reduction. The underlying question was why Stockfish searches so much deeper than Spike. I tested "static nullmove pruning", "value based pruning", a more aggressive late move reduction even at the root, and far fewer extensions. Every single change gave a drop of 10-20 Elo in Spike. All together they gave me a gain of about 60 Elo.
My tests are far from perfect. A version is usually tested with 1200 games (50 different positions, 12 opponent engines, 60s + 1s time control). The +60 Elo is (for me) proven, as I have already tested about 20 single changes on top of the mentioned 4 changes (all of them with 1200 games) that landed in about the same Elo range.

Greetings Volker

Daniel Shawul · Post by **Daniel Shawul** » Thu Aug 05, 2010 2:12 pm

This pretty much sounds like a case of not enough games. All of your changes look positively correlated. You need to have something like an extension and a reduction together (i.e. negatively correlated), one increasing depth and the other reducing it, so the optimum could be anywhere. If they all go the same way, you could have tuned each and every change with the correct parameters to get a benefit out of it. Also, 1200 games is far too few.
Note that the factor you use in the combined test and in the individual tests is different. For example, if you tested with a factor of 1 for each of the 4 combined changes, then when you use them together the effect will be roughly 4x. Assuming the combination is a success, using a factor of 4x for each individual test could give you a boost. Hope I am clear enough.
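As a rough illustration of why 1200 games is thin, the sketch below estimates the Elo margin of a near-even match and the number of games needed for a target margin. It assumes a linear Elo-per-score approximation around 50%, a 2-sigma bound, and (by default) no draws; `elo_margin`, `games_needed`, `z` and `draw_ratio` are names made up for this example, not taken from any testing tool.

```python
import math

# Elo per unit of score near a 50% result: d/ds of 400*log10(s/(1-s)) at s = 0.5.
ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)


def elo_margin(games, z=2.0, draw_ratio=0.0):
    """Approximate +/- Elo margin for a near-even match of `games` games."""
    # Per-game score variance at a 50% score is 0.25 * (1 - draw_ratio).
    sigma_score = math.sqrt(0.25 * (1.0 - draw_ratio) / games)
    return z * sigma_score * ELO_PER_SCORE


def games_needed(target_margin, z=2.0, draw_ratio=0.0):
    """Smallest number of games whose margin is at most `target_margin` Elo."""
    per_game = z * math.sqrt(0.25 * (1.0 - draw_ratio)) * ELO_PER_SCORE
    return math.ceil((per_game / target_margin) ** 2)


print(round(elo_margin(1200), 1))  # margin for a 1200-game run
print(games_needed(10.0))          # games for a +/-10 Elo margin
```

Under these assumptions a 1200-game run carries a margin of roughly 20 Elo, which is the same size as the individual 10-20 Elo drops being discussed. A realistic draw ratio shrinks the margin somewhat but does not change the picture.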

Don · Post by **Don** » Thu Aug 05, 2010 8:53 pm

Daniel Shawul wrote:

What I'm saying is that all you do is talk. That makes me believe it won't matter how we do our testing, you would find it flawed and always be able to produce some reason why it's not the way you think it should be.

You claim that Bob mysteriously stopped his test, etc. Of course it was suspicious since it did not match what you expected.
What a stupid thing to say! If you can't read properly what people posted, just stop posting. I never ever said anything like that..
Recheck the threads and maybe you will find something like that in Ralph's post.

My mistake, yes it was something Ralph said. I take back all those bad things I said about you.

I have been running the singular test on Komodo and the results are not so hot after all. It shows as a small net improvement only.

Here is what I get:

Code:

 Total games played:     4324
     Total this run:     4324
Matches in progress:        4
           PGN file: sing.pgn
     Total run time: 75:38:48
   Games per minute:        0.95

  RANK      ELO     +/-     Tme/Gme  Tot Gms  PLAYER
-------  -------  -----  ----------  -------  ----------------
     1    3000.0   10.6     109.835     4324  komodo 1.2
     2    2994.4   10.6     108.736     4324  komodo 1.2-noSing

            1       2
        -----   -----
  1.       --    50.8    50.8 percent of 4324 games
  2.     49.2      --    49.2 percent of 4324 games

And since the error margins are relatively high I cannot even say with any confidence that this helps.

If this doesn't scale, then it may be a problem. This was run at roughly 60 seconds per game on a core 2 duo laptop.
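For readers who want to check the numbers, the sketch below converts a match score into an Elo difference with the usual logistic formula and puts a 2-sigma bound on it. The function names are hypothetical, and the error bar ignores draws, so it is an assumption about how such a margin can be computed rather than a description of what the tester that produced the table actually does.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a match score fraction (0 < score < 1)."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_error_bar(score, games, z=2.0):
    """Half-width of the Elo interval from a z-sigma bound on the score.

    Assumes a per-game variance of score * (1 - score), i.e. draws are ignored,
    which overstates the uncertainty when many games are drawn.
    """
    sigma = math.sqrt(score * (1.0 - score) / games)
    low = elo_from_score(score - z * sigma)
    high = elo_from_score(score + z * sigma)
    return (high - low) / 2.0


print(round(elo_from_score(0.508), 1))       # Elo gap implied by 50.8%
print(round(elo_error_bar(0.508, 4324), 1))  # approximate +/- in Elo
```

With 50.8% over 4324 games this gives a gap of about 5.6 Elo and a margin of about 10.6 Elo, consistent with the table: the margin is nearly twice the measured difference.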

Daniel Shawul · Post by **Daniel Shawul** » Thu Aug 05, 2010 9:57 pm

My mistake, yes it was something Ralph said. I take back all those bad things I said about you.

Thanks, this is the nicest thing I have heard in a while. I was contemplating taking a couple of months off CC due to the number of "flames" I seem to get myself into. I will take the break anyway due to other obligations.
I admit that I also misunderstood you. When you said "government cover up", I took it literally.
I sometimes miss sarcasm. English is my second language.

I have been running the singular test on Komodo and the results are not so hot after all. It shows as a small net improvement only.

Here is what I get:

Code:

 Total games played:     4324
     Total this run:     4324
Matches in progress:        4
           PGN file: sing.pgn
     Total run time: 75:38:48
   Games per minute:        0.95

  RANK      ELO     +/-     Tme/Gme  Tot Gms  PLAYER
-------  -------  -----  ----------  -------  ----------------
     1    3000.0   10.6     109.835     4324  komodo 1.2
     2    2994.4   10.6     108.736     4324  komodo 1.2-noSing

            1       2
        -----   -----
  1.       --    50.8    50.8 percent of 4324 games
  2.     49.2      --    49.2 percent of 4324 games

And since the error margins are relatively high I cannot even say with any confidence that this helps.

If this doesn't scale, then it may be a problem. This was run at roughly 60 seconds per game on a core 2 duo laptop.

Thanks for the update. I tested some variants of SE, and none of them seem to help (all tested at a hyper-blitz setting, which could be the problem). I hope one of the variants being tested by Bob gives something significant.

regards,
Daniel

Ralph Stoesser · Post by **Ralph Stoesser** » Thu Aug 05, 2010 11:28 pm

Don wrote:
Daniel Shawul wrote:

What I'm saying is that all you do is talk. That makes me believe it won't matter how we do our testing, you would find it flawed and always be able to produce some reason why it's not the way you think it should be.

You claim that Bob mysteriously stopped his test, etc. Of course it was suspicious since it did not match what you expected.
What a stupid thing to say! If you can't read properly what people posted, just stop posting. I never ever said anything like that..
Recheck the threads and maybe you will find something like that in Ralph's post.
My mistake, yes it was something Ralph said. I take back all those bad things I said about you.

I have been running the singular test on Komodo and the results are not so hot after all. It shows as a small net improvement only.

I said nothing about a mystery, nor a government cover up. That originated from your humor. I said (not literally) that it was no surprise to me that he does not want to look deeper into the scaling issue. Or should I say non-issue?

He stated in this thread that he's mainly interested in myth debunking, so it seemed natural that he will not look deeper into the issue "does ttSE scale with longer TC and if yes, how much", after the Elo gain had quadrupled compared to the 5+5 results.

Nothing to get angry about. After all, he can do what he wants with his cluster (as long as the government agents do not abort his tests to free up cpu time for their ww4 simulation).

bob · Post by **bob** » Fri Aug 06, 2010 3:16 am

Ralph Stoesser wrote:
bob wrote:
Ralph Stoesser wrote:
Daniel Shawul wrote: 5 + 5 gives enough depth so why ask for more ??

Because 10+10 was measured (comparatively much) stronger, with an increasing tendency? Don't you believe in holy cluster test results??

But suddenly the test was stopped ... surprise, surprise.
Where is this "much" stronger coming from? I got roughly +5 at one time control, +17 at another. That is not "much stronger".
Why speak roughly about exact measurements?
bob wrote:I've aborted this test. the ttSE version is +4 Elo stronger, maybe. I have started a 10min+10s match although with a lot fewer games. Report tomorrow although I am not sure how long it will run.
Code:
   1 Stockfish 1.8 64bit      2850    4    4 30193   82%  2550   20% 
   2 Stockfish 1.8noSE 64bit  2846    4    4 30246   82%  2551   21% 
TC 5+5: +4 Elo

bob wrote: I finally stopped the test last night, error bar was down to +/- 8, difference was +18 Elo. Not insignificant, but also not in line with the claims I had seen on freechess. One person there claimed +100 or so, which would be remarkable for any change.
TC 10+10: +18 Elo

In each case these are the latest results you reported yourself.

In absolute terms a +18 Elo difference may look tiny, but in relative terms it's much more than the 5+5 result. Roughly: TC doubled, Elo gain quadrupled.

Isn't that something worth looking deeper into?

Not particularly. I ran almost 1,000 games at 60+60 (on hardware about 2x faster to boot) before aborting, when asked to do an A/C stress test again... Elo was +16, but with a bigger error bar. At 540 games at a time, 3-4 hours per game per cpu, 1000 took something around 8 hours. 30,000 was not worth it... That +4 could be anywhere between 0 and +8, that +18 could have been anywhere between +10 and +26. Hard to get too excited when logic says that trend is impossible (2x = 4x, 4x = 16x, 16x = 256x). Pretty soon you are talking serious Elo.

Or fantasy...
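The "2x = 4x, 4x = 16x, 16x = 256x" remark can be spelled out with a few lines of arithmetic. This sketch takes the "TC doubled, Elo gain quadrupled" observation literally, i.e. it assumes the gain grows with the square of the time-control factor, starting from the +4 reported at 5+5. The function name is hypothetical, and the 20+20 and 80+80 rows are extrapolations that nobody in the thread measured.

```python
def projected_gain(base_gain, time_factor):
    """Elo gain if the gain grew with the square of the time-control factor."""
    return base_gain * time_factor ** 2


base_gain = 4  # reported gain at 5+5
for label, factor in [("10+10", 2), ("20+20", 4), ("60+60", 12), ("80+80", 16)]:
    print(label, projected_gain(base_gain, factor))
```

The 10+10 projection (+16) sits close to the measured +18, but the same rule projects several hundred Elo at 60+60, where the reported result was +16 with a wide error bar.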

bob · Post by **bob** » Fri Aug 06, 2010 3:27 am

Daniel Shawul wrote:
My mistake, yes it was something Ralph said. I take back all those bad things I said about you.
Thanks, this is the nicest thing I have heard in a while. I was contemplating taking a couple of months off CC due to the number of "flames" I seem to get myself into. I will take the break anyway due to other obligations.
I admit that I also misunderstood you. When you said "government cover up", I took it literally.
I sometimes miss sarcasm. English is my second language.
I have been running the singular test on Komodo and the results are not so hot after all. It shows as a small net improvement only.

Here is what I get:

Code:

 Total games played:     4324
     Total this run:     4324
Matches in progress:        4
           PGN file: sing.pgn
     Total run time: 75:38:48
   Games per minute:        0.95

  RANK      ELO     +/-     Tme/Gme  Tot Gms  PLAYER
-------  -------  -----  ----------  -------  ----------------
     1    3000.0   10.6     109.835     4324  komodo 1.2
     2    2994.4   10.6     108.736     4324  komodo 1.2-noSing

            1       2
        -----   -----
  1.       --    50.8    50.8 percent of 4324 games
  2.     49.2      --    49.2 percent of 4324 games

And since the error margins are relatively high I cannot even say with any confidence that this helps.

If this doesn't scale, then it may be a problem. This was run at roughly 60 seconds per game on a core 2 duo laptop.
Thanks for the update. I tested some variants of SE, and none of them seem to help (all tested at a hyper-blitz setting, which could be the problem). I hope one of the variants being tested by Bob gives something significant.

regards,
Daniel

So far, you can wish in one hand, crap in the other, and see which one fills up first.

So far, no improvement. Pretty easy to make it weaker of course, and not that hard to make it break even. But I'm not adding code that just makes it break even, just because it makes the PVs look a bit longer in tactical positions. I think this is where a lot of SE tests go wrong. Yep it looks pretty slick in those tactical positions. But it looks equally bad when burning all those nodes that nowadays might give us another ply or two which would help even more...

Still testing and tweaking the last idea I explained, but so far, no glory...

I'm getting more pessimistic and am beginning to think that the extra overhead is simply not worth it.
